xarray_beam.ChunksToZarr

class xarray_beam.ChunksToZarr(store, template=None, zarr_chunks=None, num_threads=None)

Write keyed chunks to a Zarr store in parallel.

Parameters:
  • store (WritableStore) –

  • template (Union[xarray.Dataset, beam.pvalue.AsSingleton, None]) –

  • zarr_chunks (Optional[Mapping[str, int]]) –

  • num_threads (Optional[int]) –
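
For a sense of how this transform is typically used, here is a minimal pipeline sketch. It is an illustration rather than part of the documented API surface: the store paths and chunk sizes are hypothetical placeholders.

    import apache_beam as beam
    import xarray
    import xarray_beam as xbeam

    # Hypothetical source dataset; any xarray.Dataset works.
    ds = xarray.open_zarr('source.zarr')

    with beam.Pipeline() as p:
        (
            p
            | xbeam.DatasetToChunks(ds, chunks={'time': 100})
            | xbeam.ChunksToZarr(
                'output.zarr',
                template=xbeam.make_template(ds),
            )
        )

Supplying the template up front (e.g., via xarray_beam.make_template) lets the Zarr store be set up eagerly, avoiding the expensive automatic discovery described under the template parameter below.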

__init__(store, template=None, zarr_chunks=None, num_threads=None)

Initialize ChunksToZarr.

Parameters:
  • store (Union[str, MutableMapping[str, bytes]]) – a string corresponding to a Zarr path or an existing Zarr store.

  • template (Optional[Union[Dataset, AsSingleton]]) –

    a lazy xarray.Dataset, already chunked with Dask (e.g., as created by xarray_beam.make_template), that matches the structure of the virtual combined dataset corresponding to the chunks fed into this PTransform. One or more variables are expected to be chunked with Dask; these variables have only their metadata written to Zarr, without array values. Three types of inputs are supported:

    1. If template is an xarray.Dataset, the Zarr store is set up eagerly.

    2. If template is a beam.pvalue.AsSingleton object representing the result of a prior step in a Beam pipeline, the Zarr store is set up as part of the pipeline.

    3. Finally, if template is None, the structure of the desired Zarr store is discovered automatically by inspecting the inputs to ChunksToZarr. This is the easiest option, but it can be quite expensive and slow for large datasets: Beam runners typically handle it by dumping a temporary copy of the complete dataset to disk. For best performance, supply the template explicitly (options 1 or 2); see the sketch following this parameter list.

  • zarr_chunks (Optional[Mapping[str, int]]) – chunking scheme to use for Zarr. If set, overrides the chunking scheme on already chunked arrays in template.

  • num_threads (Optional[int]) – the number of Dataset chunks to write in parallel per worker. More threads can increase throughput, but they also increase memory usage and make it harder for Beam runners to shard work. Note that each variable in a Dataset is already written in parallel, so this is most useful for Datasets with a small number of variables.
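
As referenced above, here is a sketch of option 1: building a template eagerly with xarray_beam.make_template and overriding its on-disk chunking with zarr_chunks. The dataset contents, store path, and chunk sizes are hypothetical.

    import numpy as np
    import xarray
    import xarray_beam as xbeam

    # Hypothetical dataset matching the structure of the combined chunks.
    ds = xarray.Dataset(
        {'temperature': (('time', 'x'), np.zeros((365, 10)))},
        coords={'time': np.arange(365)},
    )

    # make_template replaces each variable with a lazy Dask array, so
    # setting up the store writes metadata only, with no array values.
    template = xbeam.make_template(ds)

    to_zarr = xbeam.ChunksToZarr(
        'output.zarr',             # hypothetical store path
        template=template,
        zarr_chunks={'time': 90},  # on-disk chunking, overriding the template
        num_threads=8,             # write up to 8 chunks in parallel per worker
    )

For option 2, the same template can instead be produced by an earlier pipeline step and passed as beam.pvalue.AsSingleton, deferring store setup until the pipeline runs.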

Methods

__init__(store[, template, zarr_chunks, ...])

Initialize ChunksToZarr.

annotations()

default_label()

default_type_hints()

display_data()

Returns the display data associated with a pipeline component.

expand(pcoll)

from_runner_api(proto, context)

get_resource_hints()

get_type_hints()

Gets and/or initializes type hints for this object.

get_windowing(inputs)

Returns the window function to be associated with the transform's output.

infer_output_type(unused_input_type)

register_urn(urn, parameter_type[, constructor])

runner_api_requires_keyed_input()

to_runner_api(context[, has_parts])

to_runner_api_parameter(unused_context)

to_runner_api_pickled(unused_context)

type_check_inputs(pvalueish)

type_check_inputs_or_outputs(pvalueish, ...)

type_check_outputs(pvalueish)

with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

with_resource_hints(**kwargs)

Adds resource hints to the PTransform.

Attributes

label

pipeline

side_inputs