Zarr is a powerful library for managing large-scale numerical datasets using chunked, compressed N-dimensional arrays. Designed for modern computing needs, it enables efficient parallel I/O across diverse storage systems while maintaining NumPy-like semantics. Key advancements in v3 include native cloud storage integration, asynchronous I/O, and a significant performance boost over previous versions.
Zarr supports both array-based data storage and flexible data structures, enabling seamless integration with high-performance computing systems.
You can install Zarr via pip:
pip install zarr
or via conda:
conda install --channel conda-forge zarr
Ensure you have the required dependencies installed for full functionality, such as h5py for HDF5 support or numcodecs for compression algorithms.
Verify installation:
import zarr
print(zarr.__version__) # Should output ≥3.0.3
Arrays are containers of items that share the same data type and size (in bits). The number of dimensions and the number of items in the container are described by the array's shape.
But what if the array size is too large to fit into memory? Here comes Zarr!
It divides the whole array into chunks (chunking):
Then it compresses each chunk (over 20 compressors are supported, e.g. Blosc, Zstd, Zlib):
Retrieve chunks only when needed:
This is how the array is laid out in storage: each chunk is stored as an entry in a key-value store, where keys follow a structured format like 0.0, 0.1, etc. The corresponding values hold the compressed chunk bytes, optimizing storage and retrieval efficiency.
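Once the data.zarr store from the example below has been written, you can see this layout for yourself by listing the store directory on disk. A minimal sketch (chunk keys appear as 0.0-style files in Zarr v2 stores, or under a c/ subdirectory in Zarr v3):
import os

store_path = 'data.zarr'  # local store created in the next example
# Print every key Zarr wrote: metadata files plus one entry per chunk.
for dirpath, _, filenames in os.walk(store_path):
    for name in filenames:
        print(os.path.relpath(os.path.join(dirpath, name), store_path))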
Chunking allows you to split large arrays into smaller, more manageable pieces. This is especially useful when working with datasets that don't fit into memory.
import zarr
import numpy as np

# Create a Zarr array with chunking
z = zarr.open('data.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100), dtype='float32')
z[:] = np.random.rand(1000, 1000)  # add some data
Here, the array is split into chunks of size 100x100. You can control the chunk size depending on your hardware capabilities and the nature of your data.
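To confirm how an array is chunked, the array object exposes both its overall shape and its chunk shape; a quick check on the z array created above:
print(z.shape)   # (1000, 1000)
print(z.chunks)  # (100, 100)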
Now we can retrieve a section of the data using slicing. Retrieving the data loads only those chunks that contain the requested elements.
z = zarr.open('data.zarr', mode='r')
# Load only a portion of the data
subset = z[200:300, 400:500]
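Only the chunks overlapping the requested slice are read from storage. The sketch below works out which chunk indices the slice touches using plain arithmetic on the chunk grid (illustrative only, not a Zarr API call):
# Rows 200:300 and columns 400:500 with (100, 100) chunks fall in chunk (2, 4).
row_chunks = range(200 // 100, (300 - 1) // 100 + 1)  # -> [2]
col_chunks = range(400 // 100, (500 - 1) // 100 + 1)  # -> [4]
print(list(row_chunks), list(col_chunks))
print(subset.shape)  # (100, 100)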
To see how efficient Zarr is, we can load the same data using different methods and compare the retrieval times.
The code below generates a 2D array of shape 5000x5000 and stores it in CSV, NumPy, and Zarr formats.
import numpy as np
import pandas as pd
import zarr
data = np.random.rand(5000, 5000)
# File paths
csv_file = "large_data.csv"
npy_file = "large_data.npy"
zarr_file = "large_data.zarr"
# Save data in different formats
pd.DataFrame(data).to_csv(csv_file, index=False, header=False) # CSV format
np.save(npy_file, data) # NumPy format
zarr_f = zarr.create(shape=(5000, 5000), chunks=(100, 100), dtype='float64', store=zarr_file)  # Zarr format
zarr_f[:] = data
print("Files saved successfully.")
Once all the files are saved successfully, we will retrieve 100 rows from each format and measure the retrieval time using Python's built-in time module.
import numpy as np
import pandas as pd
import zarr
import time
csv_file = "large_data.csv"
npy_file = "large_data.npy"
zarr_file = "large_data.zarr"

def measure_time(load_func, desc):
    start_time = time.time()
    subset = load_func()  # Load 100 rows
    end_time = time.time()
    print(f"{desc} retrieval time: {end_time - start_time:.4f} seconds")
# Measure retrieval times
print("\nRetrieving 100 rows from each format:")
measure_time(lambda: pd.read_csv(csv_file, header=None).values[:100, :], "CSV")
measure_time(lambda: np.load(npy_file)[:100, :], "NumPy")
measure_time(lambda: zarr.open(zarr_file, mode='r')[:100, :], "Zarr")
On my computer the output looks like this:
Retrieving 100 rows from each format:
CSV retrieval time: 8.6712 seconds
NumPy retrieval time: 0.1489 seconds
Zarr retrieval time: 0.1179 seconds
We can clearly see that Zarr takes the least time to retrieve the data. In this example the array is only 5000x5000; in practice, datasets are much larger, and that is where Zarr makes a big difference.
Zarr optimizes data storage through chunk-based compression, enabling efficient handling of large datasets. Here's how it works with practical examples:
Default Blosc Compression (LZ4 + Shuffle):
import zarr
import numpy as np
# Create 10,000x10,000 array with Blosc compression
arr = zarr.open('compressed.zarr', mode='w',
                shape=(10000, 10000), chunks=(1000, 1000),
                dtype='int32', compressor=zarr.Blosc())
arr[:] = np.random.randint(0, 100, (10000, 10000))  # 381.5MB raw → 3.1MB stored
Key Features:
# Zstandard compression with bit shuffle
high_comp = zarr.open('zstd.zarr', mode='w', shape=(10000, 1000), chunks=(1000, 1000),
                      dtype='int32', compressor=zarr.Blosc(cname='zstd', clevel=5, shuffle=2))
# Disabled compression for fast access
no_comp = zarr.open('raw.zarr', mode='w', shape=(10000, 1000), chunks=(1000, 1000),
                    dtype='int32', compressor=None)  # Stores 38.1MB raw → 38.1MB
The first example uses Zstandard (Zstd) compression with bit shuffle: Zstd is chosen as the compression algorithm (cname='zstd'), the compression level is set to 5 (clevel=5) to balance speed and compression ratio, and bit shuffle (shuffle=2) is applied to improve compression efficiency by rearranging bits so that repeated patterns are easier to find.
The second example disables compression (compressor=None), resulting in raw data storage for faster access. This is useful when storage space is not a concern but quick retrieval is needed.
comp_levels = {
    'fast': zarr.Blosc(clevel=1),      # Low compression, high speed
    'balanced': zarr.Blosc(clevel=5),  # Default
    'max': zarr.Blosc(clevel=9)        # High compression, slower
}
This dictionary defines different compression levels using the Blosc compressor:
clevel=1 → Low compression, but high-speed read/write performance.
clevel=5 → A middle-ground option, providing moderate compression and speed.
clevel=9 → Maximum compression for reducing file size, at the cost of slower performance.
Supported Algorithms:
| Compressor | Speed | Ratio | Use Case |
|---|---|---|---|
| Blosc+LZ4 | ★★★★★ | 20:1 | General purpose |
| Zstandard | ★★★★☆ | 120:1 | Archival storage |
| zlib | ★★☆☆☆ | 3:1 | Compatibility |
| LZMA | ★☆☆☆☆ | 200:1 | Maximum compression |
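As a sketch of how these compressors might be configured, the numcodecs package (which provides Zarr's codecs) exposes each one as a class; the objects below can be passed as the compressor when creating an array, with keyword names as in numcodecs:
from numcodecs import Blosc, Zstd, Zlib, LZMA

compressor_options = {
    'blosc_lz4': Blosc(cname='lz4', clevel=5, shuffle=Blosc.SHUFFLE),  # general purpose
    'zstd': Zstd(level=5),                                             # archival storage
    'zlib': Zlib(level=6),                                             # compatibility
    'lzma': LZMA(preset=6),                                            # maximum compression
}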
If you don't specify a compressor, by default Zarr uses the Zstandard compressor.
To disable compression, set compressors=None when creating an array, e.g.:
z = zarr.create_array(store='data/example-8.zarr', shape=(100000000,), chunks=(1000000,), dtype='int32', compressors=None)
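To double-check what compression (if any) ended up on an array, you can print its info report, which lists the configured codecs; a minimal sketch reopening the array created above:
z = zarr.open('data/example-8.zarr', mode='r')
print(z.info)  # the report shows the configured codecs (no compressor here, since it was disabled)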
Parallelism boosts data processing efficiency by dividing tasks and executing them concurrently. This is crucial for handling large datasets, taking maximum advantage of multi-core CPUs or distributed systems.
import zarr
import dask.array as da
# Create a Dask array backed by a Zarr store
zarr_store = zarr.open('data.zarr', mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))
dask_array = da.from_array(zarr_store, chunks=(1000, 1000))
# Perform a parallel operation
result = dask_array.sum().compute()
Using Dask, you can perform parallel operations on a Zarr-backed array, leveraging multi-core CPUs or distributed systems for large-scale computation.
In general, performing operations on data in RAM is fast. But when a dataset (possibly gigabytes or petabytes) is larger than the available RAM, we need to keep it in persistent storage, fetch only the parts that fit into memory, and operate on those.
We can store and retrieve such data from a CSV file, or we can use Zarr. While a CSV file is easier to inspect in a spreadsheet, it does not support chunked or parallel access, so Zarr is a better option for storing large datasets. Zarr combined with Dask can accelerate processing further.
The code below demonstrates the advantage of Zarr when processing large data.
import numpy as np
import pandas as pd
import zarr
import dask.array as da
import time
import os
import shutil
size = (5000, 5000)
csv_filename = "data_2.csv"
zarr_store_path = "data_2.zarr"
data = np.random.rand(*size)
# Save as CSV (slow traditional format)
pd.DataFrame(data).to_csv(csv_filename, index=False, header=False)
# Save as Zarr (optimized format)
zarr_store = zarr.open(zarr_store_path, mode='w', shape=size, chunks=(1000, 1000), dtype='float64')
zarr_store[:] = data
del data
#2. Read & Process Data Using CSV + Pandas
start_time = time.time()
csv_data = pd.read_csv(csv_filename, header=None).values
csv_result = np.sqrt(csv_data)  # Operation runs serially on the full in-memory array
end_time = time.time()
csv_time = end_time - start_time
print(f"Time taken using CSV + Pandas: {csv_time:.4f} seconds")
# Free up memory
del csv_data, csv_result
#3. Read & Process Data Using Zarr + Dask
start_time = time.time()
dask_array = da.from_zarr(zarr_store_path)
zarr_result = da.sqrt(dask_array).compute() #Performing operation in parallel
end_time = time.time()
zarr_time = end_time - start_time
print(f"Time taken using Zarr + Dask: {zarr_time:.4f} seconds")
# Cleanup files
os.remove(csv_filename)
shutil.rmtree(zarr_store_path)  # remove the Zarr store directory
if csv_time > zarr_time:
    print("\nZarr + Dask is faster! 🚀")
else:
    print("\nCSV + Pandas is faster. (Unlikely for large data!)")
In the code above, a 5000x5000 array is created and stored in both CSV and Zarr formats. The data is then read back from the CSV file, the square root of every element is computed, and the time taken is recorded. The same operation is then timed using Zarr and Dask. Running the code shows the difference Zarr makes when processing large data.
Below is the output of the code on my machine:
Time taken using CSV + Pandas: 7.5666 seconds
Time taken using Zarr + Dask: 0.6108 seconds
Zarr + Dask is faster! 🚀
Imagine you are at a traditional library that stores a huge collection of books, but there is only one librarian. If you ask how many books the library has, the librarian counts them one by one, which is slow: with only one person working, it takes a long time.
Now imagine a library whose bookshelves are organised into sections (chunks), each looked after by an assistant. Counting becomes much easier: each assistant counts the books in their own section and reports the number to the librarian, who sums the results and gives you the answer. Because multiple people work together (like multiple processor threads running in parallel), you get the answer faster.
Zarr stores a large dataset as separate chunks (files); Dask takes those chunks and processes them in parallel, as in the sketch below.
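A minimal sketch mirroring the analogy, assuming the data.zarr store from the earlier examples already exists on disk:
import dask.array as da

books = da.from_zarr('data.zarr')  # each chunk plays the role of one shelf section
total = books.sum().compute()      # "assistants" sum their sections in parallel
print(total)                       # the partial sums are then combined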
Zarr is a chunked, compressed, and scalable storage format for multi-dimensional arrays. It supports different storage backends, such as local files, cloud storage (Amazon S3, Google Cloud Storage), and databases.
Zarr organizes data into chunks and stores them separately. Each chunk is compressed and stored in a key-value store (local file system, cloud storage, etc.). Here's how it works:
It writes metadata files (.zarray, .zgroup, .zattrs) to describe the dataset, and stores each chunk under its own key.
Example: Using Zarr with Amazon S3 storage
import zarr
# Define a storage backend (Amazon S3)
store = zarr.storage.FSStore('s3://bucket-name/data.zarr')
# Create a Zarr array with chunking
arr = zarr.open(store, mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))
# Write some data
arr[:500, :500] = 42.0 # Assign value to a chunk
# Read the data back
print(arr[:500, :500])
This example shows how to use Zarr with Amazon S3 as a storage backend, but you can easily switch to other cloud providers or local file systems.
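For instance, here is a sketch of pointing the same code at Google Cloud Storage instead; it assumes the gcsfs package is installed so fsspec can resolve the gs:// URL, and the bucket name is a placeholder:
# Same pattern, different backend: only the store URL changes.
store = zarr.storage.FSStore('gs://bucket-name/data.zarr')
arr = zarr.open(store, mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))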
The .zarray metadata file stores information about the array shape, data type, chunk size, and compression:
{
    "shape": [10000, 10000],
    "chunks": [1000, 1000],
    "dtype": "f4",
    "compressor": {"id": "blosc", "clevel": 5}
}
Each chunk (e.g., 0.0, 0.1) is stored as a separate file/object. For example, chunk (0,0) might be stored as:
s3://bucket-name/data.zarr/0.0
s3://bucket-name/data.zarr/0.1
To store data locally instead of S3:
store = zarr.DirectoryStore('data.zarr') # Local folder storage
arr = zarr.open(store, mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))
To store data locally instead of using cloud storage, Zarr provides the DirectoryStore backend. This saves the array as a structured directory on disk, where each chunk is stored as a separate file. Metadata files (.zarray, .zattrs) are also saved in the directory, making it easy to access and manage. This method is useful for local processing, debugging, or when cloud storage is not needed.
Creating Zarr arrays is simple and efficient. You can create arrays from scratch, load them from existing data, or use other libraries like Dask to work with large datasets.
import zarr
# Create a new Zarr array
arr = zarr.create(shape=(1000, 1000), dtype='f4', chunks=(100, 100)) #This will create a 1000x1000 array with 100x100 chunks, suitable for efficient access and modification.
arr2 = zarr.array([1, 2, 3, 4, 5]) #Creating an Array from a List
arr3 = zarr.empty((5, 5), dtype='f4') #Creates an empty Zarr array of shape (5,5) with float32 data type
arr4 = zarr.zeros((4, 4), dtype='i4') #Creating a Zero-Filled Array
arr5 = zarr.ones((3, 3), dtype='f8') #Creating an Array with Ones
arr6 = zarr.full((2, 2), fill_value=7, dtype='i4') #Creating an array filled with a constant value
arr7 = zarr.open('resizable.zarr', mode='w', shape=(2, 2), dtype='i4') #Creating a persistent, resizable Zarr array on disk
arr7.resize((4, 2)) #Resizing the array so it can grow later
Zarr provides an easy-to-use API for reading and writing arrays to disk, including support for compression and chunking.
import zarr
# Open a Zarr array for writing
arr = zarr.open('data.zarr', mode='w', shape=(1000, 1000), dtype='f4', chunks=(100, 100))
# Write data to the array
arr[:] = 42
Here, data is written to all elements of the array with the value 42.
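To read the data back, reopen the same store in read mode; a minimal follow-up to the snippet above:
# Open the store read-only and check a few values
arr_r = zarr.open('data.zarr', mode='r')
print(arr_r[0, :5])  # -> [42. 42. 42. 42. 42.]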
Zarr-Python 3 establishes itself as a compelling solution for modern data challenges, offering chunked and compressed storage, flexible backends spanning local disks and cloud object stores, and straightforward parallel processing with tools like Dask. Together, these advancements make it well suited to a wide range of data-intensive domains.