xarray_beam.DatasetToChunks

class xarray_beam.DatasetToChunks(dataset, chunks=None, split_vars=False, num_threads=None, shard_keys_threshold=200000)

Split one or more xarray.Datasets into keyed chunks.


__init__(dataset, chunks=None, split_vars=False, num_threads=None, shard_keys_threshold=200000)

Initialize DatasetToChunks.

Parameters:
  • dataset (DatasetOrDatasets) – dataset or datasets to split into (Key, xarray.Dataset) or (Key, [xarray.Dataset, …]) pairs.

  • chunks (Optional[Mapping[str, Union[int, Tuple[int, ...]]]]) – optional chunking scheme. Required if the dataset is not already chunked. If the dataset is already chunked with Dask, chunks takes precedence over the existing chunks.

  • split_vars (bool) – whether to split the dataset into separate records for each data variable or to keep all data variables together. This is recommended if you don’t need to perform joint operations on different dataset variables and individual variable chunks are sufficiently large.

  • num_threads (Optional[int]) – optional number of Dataset chunks to load in parallel per worker. More threads can increase throughput, but also increases memory usage and makes it harder for Beam runners to shard work. Note that each variable in a Dataset is already loaded in parallel, so this is most useful for Datasets with a small number of variables or when using split_vars=True.

  • shard_keys_threshold (int) – threshold at which to compute keys on Beam workers, rather than only on the host process. This is important for scaling pipelines to millions of tasks.

Methods

__init__(dataset[, chunks, split_vars, ...])

Initialize DatasetToChunks.

annotations()

default_label()

default_type_hints()

display_data()

Returns the display data associated with a pipeline component.

expand(pcoll)

from_runner_api(proto, context)

get_resource_hints()

get_type_hints()

Gets and/or initializes type hints for this object.

get_windowing(inputs)

Returns the window function to be associated with the transform's output.

infer_output_type(unused_input_type)

register_urn(urn, parameter_type[, constructor])

runner_api_requires_keyed_input()

to_runner_api(context[, has_parts])

to_runner_api_parameter(unused_context)

to_runner_api_pickled(unused_context)

type_check_inputs(pvalueish)

type_check_inputs_or_outputs(pvalueish, ...)

type_check_outputs(pvalueish)

with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

with_resource_hints(**kwargs)

Adds resource hints to the PTransform.

Attributes

label

pipeline

side_inputs