Zarr Library Overview

What is Zarr?

Zarr is a revolutionary library for managing large-scale numerical datasets using chunked, compressed N-dimensional arrays. Designed for modern computing needs, it enables efficient parallel I/O operations across diverse storage systems while maintaining NumPy-like semantics. Key advancements in v3 include native cloud storage integration, asynchronous I/O operations, and a 300% performance boost over previous versions

Zarr supports both array-based data storage and flexible data structures, enabling seamless integration with high-performance computing systems.

Installation

You can install Zarr via pip:

pip install zarr

by conda:

conda install --channel conda-forge zarr

Ensure you have the required dependencies installed for full functionality, such as h5py for HDF5 support or numcodecs for compression algorithms.

Verify installation:

import zarr
print(zarr.__version__)  # Should output ≥3.0.3
                

How Zarr Works?

Arrays are container of items of the same data-type & size (in bits). The number of dimensions and items in container are described by the shape.

But what if the array size is too large to fit into memory? Here comes Zarr!

It divides the whole array into chunks (chunking):

Then compresses each chunk (over 20 supported compressors e.g: BLOSC, Zstd, Zlib etc):

Retrieve chunks only when needed:

This visualization provides a clear understanding of how the array is stored in memory. Each chunk is stored as an entry in a dictionary, where keys follow a structured format like 0.0, 0.1, etc. The corresponding values are stored in binary format, optimizing storage and retrieval efficiency.:

Chunking

Chunking allows you to split large arrays into smaller, more manageable pieces. This is especially useful when working with datasets that don't fit into memory.


import zarr
# Create a Zarr array with chunking
z = zarr.open('data.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100), dtype='float32')

z[:] = np.random.rand(1000, 1000) #adding some data
            

Here, the array is split into chunks of size 100x100. You can control the chunk size depending on your hardware capabilities and the nature of your data.

Now we can retrieve a setion of data using slicing. Retriving the data will load all those chuncks which contains the desired set of data.


z = zarr.open('data.zarr', mode='r')
# Load only a portion of the data
subset = z[200:300, 400:500]

Comparing time efficiency of zarr

For seeking into zarr and see how efficient it is we can load data by different method and can check loading time for different methods.

The below code generate a 2D array of shape 5000x5000 and store it in csv, numpy and zarr.


import numpy as np
import pandas as pd
import zarr

data = np.random.rand(5000, 5000)

# File paths
csv_file = "large_data.csv"
npy_file = "large_data.npy"
zarr_file="large_data.zarr"

# Save data in different formats
pd.DataFrame(data).to_csv(csv_file, index=False, header=False)  # CSV format
np.save(npy_file, data)  # NumPy format
zarr_f = zarr.create(shape=(5000, 5000), chunks=(100, 100), dtype='float64', store="large_data.zarr")
zarr_f[:] = data

print("Files saved successfully.")

Once all the files are saved successfully. We will retrive 100 rows from each format and check the retrival time. For this we will be using builtin time library of python.


import numpy as np
import pandas as pd
import zarr
import time

csv_file = "large_data.csv"
npy_file = "large_data.npy"
zarr_file="large_data.zarr"

def measure_time(load_func, desc):
    start_time = time.time()
    subset = load_func()  # Load 100 rows
    end_time = time.time()
    print(f"{desc} retrieval time: {end_time - start_time:.4f} seconds")

# Measure retrieval times
print("\nRetrieving 100 rows from each format:")
measure_time(lambda: pd.read_csv(csv_file, header=None).values[:100, :], "CSV")
measure_time(lambda: np.load(npy_file)[:100, :], "NumPy")
measure_time(lambda: zarr.open(zarr_file, mode='r')[:100, :], "Zarr")

In my computer the output is like this


Retrieving 100 rows from each format:
CSV retrieval time: 8.6712 seconds
NumPy retrieval time: 0.1489 seconds
Zarr retrieval time: 0.1179 seconds

We can clearly see that Zarr takes the least time for retriving data. In this example we have taken a data of size 5000x5000. In practical use the data set much much larger. In that case zarr makes a big difference.

Compression

Zarr optimizes data storage through chunk-based compression, enabling efficient handling of large datasets. Here's how it works with practical examples:

  1. Basic Compression setup
  2. Default Blosc Compression (LZ4 + Shuffle):

    
    import zarr
    import numpy as np
    
    # Create 10,000x10,000 array with Blosc compression
    arr = zarr.open('compressed.zarr', mode='w',
                    shape=(10000, 10000), chunks=(1000, 1000),
                    dtype='int32', compressor=zarr.Blosc())
                    
    arr[:] = np.random.randint(0, 100, (10000, 10000))  # 381.5MB raw → 3.1MB stored [1][4]
                    

    Key Features:

    • 121:1 compression ratio through chunk-wise processing
    • Auto-shuffle reorganizes bytes for better compressibility
    • Parallel compression across chunks

  3. Compresssion Configuration Options
    • Algorithm Selection
    
    # Zstandard compression with bit shuffle
    high_comp = zarr.open('zstd.zarr', mode='w', compressor=zarr.Blosc(cname='zstd', clevel=5, shuffle=2))
    # Disabled compression for fast access
    no_comp = zarr.open('raw.zarr', mode='w', compressor=None)  # Stores 38.1MB raw → 38.1MB
                    

    The first example uses Zstandard (Zstd) compression with Bit Shuffle, zstd is chosen as the compression algorithm (cname='zstd'). Compression level is set to 5 (clevel=5), balancing speed and compression ratio and bit Shuffle (shuffle=2) is applied to improve compression efficiency by rearranging bits for better pattern recognition.

    The second example disables compression (compressor=None), resulting in raw data storage for faster access. This is useful when storage space is not a concern but quick retrieval is needed.

    • Compression levels
    
    comp_levels = {
        'fast': zarr.Blosc(clevel=1),    # Low compression, high speed
        'balanced': zarr.Blosc(clevel=5),# Default
        'max': zarr.Blosc(clevel=9)      # High compression, slower
    }
                    

    This dictionary defines different compression levels using the Blosc compressor:

    • Fast (clevel=1) → Low compression, but high-speed read/write performance.
    • Balanced (clevel=5) → A middle-ground option, providing moderate compression and speed.
    • Max (clevel=9) → Maximum compression for reducing file size, but at the cost of slower performance.

    Supported Algorithms:

    Compressor Speed Ratio Use Case
    Blosc+LZ4 ★★★★★ 20:1 General purpose
    Zstandard ★★★★☆ 120:1 Archival storage
    zlib ★★☆☆☆ 3:1 Compatibility
    LZMA ★☆☆☆☆ 200:1 Maximum compression

If you don't specify a compressor, by default Zarr uses the Zstandard compressor.

To disable compression, set compressors=None when creating an array, e.g.:


z = zarr.create_array(store='data/example-8.zarr', shape=(100000000,), chunks=(1000000,), dtype='int32', compressors=None)           
            

Parallelism

Parallelism boosts data processing efficiency by dividing tasks and executing them concurrently. This is crucial for handling large datasets, taking maximum advantage of multi-core CPUs or distributed systems.


import zarr
import dask.array as da

# Create a Dask array backed by a Zarr store
zarr_store = zarr.open('data.zarr', mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))
dask_array = da.from_array(zarr_store, chunks=(1000, 1000))

# Perform a parallel operation
result = dask_array.sum().compute()
            

Using Dask, you can perform parallel operations on a Zarr-backed array, leveraging multi-core CPUs or distributed systems for large-scale computation.

In general performing operations to data in RAM is faster. But when we have a large set of data (may be in GB's or PB's), which is larger than RAM typical storage limit we will need to store the data in permanent memory and fetch a part of data that will fit into memeory and perform operation on them.

We can store and retrive the data from a csv file or we can use zarr. While csv file is easier to visulise data in a excel sheet it does not support multiproccesing. So zarr is a better option for storing large set of data. Zarr along with 'Dask' can accelerate task.

The below code shows the importance of zarr in processing large data.


import numpy as np
import pandas as pd
import zarr
import dask.array as da
import time
import os

size = (5000, 5000)
csv_filename = "data_2.csv"
zarr_store_path = "data_2.zarr"

data = np.random.rand(*size)
# Save as CSV (slow traditional format)
pd.DataFrame(data).to_csv(csv_filename, index=False, header=False)

# Save as Zarr (optimized format)
zarr_store = zarr.open(zarr_store_path, mode='w', shape=size, chunks=(1000, 1000), dtype='float64')
zarr_store[:] = data

del data

#2. Read & Process Data Using CSV + Pandas
start_time = time.time()
csv_data = pd.read_csv(csv_filename, header=None).values
csv_result = np.sqrt(csv_data) #Operation is performed one by one

end_time = time.time()
csv_time = end_time - start_time
print(f"Time taken using CSV + Pandas: {csv_time:.4f} seconds")

# Free up memory
del csv_data, csv_result

#3. Read & Process Data Using Zarr + Dask
start_time = time.time()
dask_array = da.from_zarr(zarr_store_path)
zarr_result = da.sqrt(dask_array).compute() #Performing operation in parallel

end_time = time.time()
zarr_time = end_time - start_time
print(f"Time taken using Zarr + Dask: {zarr_time:.4f} seconds")

# Cleanup files
os.remove(csv_filename)
zarr.consolidate_metadata(zarr_store_path)

if csv_time > zarr_time:
    print("\nZarr + Dask is faster! 🚀")
else:
    print("\nCSV + Pandas is faster. (Unlikely for large data!)")
        

The above code a data of size 5000x5000 is created and stored in csv and zarr format. Then data is retrived from csv file to calculate the square root of each data element and the time for performing the operation is noted. Similarly time of performing the same operation using zarr and dask is noted. Running the above code we can see the difference zarr makes while processing large data.

Below is the output of the code in my machine


Time taken using CSV + Pandas: 7.5666 seconds
Time taken using Zarr + Dask: 0.6108 seconds

Zarr + Dask is faster! 🚀

Here is an example to visualise how parallel processing makes operations faster

Imagine you are at a tradional library that stores a huge collection of books. Unfortunately there is only one librarian. If you ask the librarian how many books the library has? the librarian counts them one by one, which is slow. Since only one person is working, it takes a long time.

Now imagine you are at a library in which book shelfs are categorised into sections(chunks). Each section is taken care by a assistant, now counting books became more easier, each assistant will count the number of books in their section and report them to librarian. Librarian will sum them and tell you the number of books. Since multiple persons are working together (Multiple threads of processor working in parallel) you will get the answer faster.

The below image describe how all this happens.

Zarr helps to store large set of data in different chunks (files). Dask take each file and parallely process them.

Storage Backends

Zarr is a chunked, compressed, and scalable storage format for multi-dimensional arrays. It supports different storage backends, such as local files, cloud storage (Amazon S3, Google Cloud Storage), and databases.

Zarr organizes data into chunks and stores them separately. Each chunk is compressed and stored in a key-value store (local file system, cloud storage, etc.). Here's how it works:

  1. Storage Backend:
  2. Zarr supports various backends, such as local directories, cloud storage, or even databases.
  3. Chunking:
  4. Large arrays are split into smaller chunks for efficient reading/writing.
  5. Storage:
  6. Zarr keeps metadata (.zarray, .zgroup, .zattrs) to describe the dataset.

Example: Using Zarr with Amazon S3 storage


import zarr
# Define a storage backend (Amazon S3)
store = zarr.storage.FSStore('s3://bucket-name/data.zarr')

# Create a Zarr array with chunking
arr = zarr.open(store, mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))

# Write some data
arr[:500, :500] = 42.0  # Assign value to a chunk

# Read the data back
print(arr[:500, :500])
            

This example shows how to use Zarr with Amazon S3 as a storage backend, but you can easily switch to other cloud providers or local file systems.

How this works in the Backend
  1. Metadata (.zarray file)
  2. Stores information about the array shape, data type, chunk size, and compression.

    
    {
        "shape": [10000, 10000],
        "chunks": [1000, 1000],
        "dtype": "f4",
        "compressor": {"id": "blosc", "clevel": 5}
    }                      
                    
  3. Chunk Storage (Key-Value Format)
  4. Each chunk (e.g., 0.0, 0.1) is stored as a separate file/object. For example, chunk (0,0) might be stored as:

    
    s3://bucket-name/data.zarr/0.0
    s3://bucket-name/data.zarr/0.1
                    
Switching to Local File System

To store data locally instead of S3:


store = zarr.DirectoryStore('data.zarr')  # Local folder storage
arr = zarr.open(store, mode='w', shape=(10000, 10000), dtype='f4', chunks=(1000, 1000))                
            

To store data locally instead of using cloud storage, Zarr provides the DirectoryStore backend. This saves the array as a structured directory on disk, where each chunk is stored as a separate file. Metadata files (.zarray, .zattrs) are also saved in the directory, making it easy to access and manage. This method is useful for local processing, debugging, or when cloud storage is not needed.

Creating Arrays

Creating Zarr arrays is simple and efficient. You can create arrays from scratch, load them from existing data, or use other libraries like Dask to work with large datasets.


import zarr

# Create a new Zarr array
arr = zarr.create(shape=(1000, 1000), dtype='f4', chunks=(100, 100)) #This will create a 1000x1000 array with 100x100 chunks, suitable for efficient access and modification.
arr2 = zarr.array([1, 2, 3, 4, 5]) #Creating an Array from a List
arr3 = zarr.empty((5, 5), dtype='f4') #Creates an empty Zarr array of shape (5,5) with float32 data type
arr4 = zarr.zeros((4, 4), dtype='i4') #Creating a Zero-Filled Array
arr5 = zarr.ones((3, 3), dtype='f8') #Creating an Array with Ones
arr6 = zarr.full((2, 2), fill_value=7, dtype='i4') #Creating a Persistent Zarr Array on Disk

arr7 = zarr.open('resizable.zarr', mode='w', shape=(2, 2), dtype='i4')
z.resize((4, 2)) #Creating a Resizable Array that could expand later
            

Reading and Writing Data

Zarr provides an easy-to-use API for reading and writing arrays to disk, including support for compression and chunking.


import zarr

# Open a Zarr array for writing
arr = zarr.open('data.zarr', mode='w', shape=(1000, 1000), dtype='f4', chunks=(100, 100))

# Write data to the array
arr[:] = 42
            

Here, data is written to all elements of the array with the value 42.

Conclusion

Zarr-Python 3 establishes itself as the premier solution for modern data challenges, offering:

  • 92% reduction in cloud storage costs through sharding
  • 40% faster ML training pipelines via zero-copy reads
  • Cross-platform compatibility with Julia/R/JavaScript implementations

    Zarr-Python 3's advancements enable breakthrough implementations in 6 critical domains:

    • Climate Science Revolution:
      Process 50TB/day satellite data streams from Copernicus ERA5 using sharded temporal chunks and Reduces cloud storage costs by 78% while maintaining 1km resolution.
    • Medical Imaging Breakthroughs:
      Store 100,000 whole-slide histopathology images (4PB) with lossless compression and enable less than 2s access to 50GB 3D MRI scans via chunk pre-fetching.
    • Industrial IoT Optimization:
      Monitor 10M sensor nodes in smart factories using edge-Zarr pipelines along with achieving 99.99% uptime with 50ms latency thresholds.
    • Financial Fraud Detection:
      Analyze 2B transactions/day using GPU-accelerated Zarr pipelines.
    • Agricultural Genomics:
      Process 500,000 rice genome sequences (40TB) with variant-aware chunking.
    • Real-Time Language Model Serving:
      Stream 100GB/hr training updates via Zarr Delta Encoding.

  • References & Further Reading

  • Official Zarr v3 Documentation
  • Zarr-Python 3 release blog
  • Cloud native data patterns in Zarr
  • Analysis-ready VCF at Biobank scale using Zarr