xarray_beam.Rechunk

class xarray_beam.Rechunk(dim_sizes, source_chunks, target_chunks, itemsize, min_mem=None, max_mem=1073741824)

Rechunk to an arbitrary new chunking scheme with bounded memory usage.

The approach taken here builds on Rechunker [1], but differs in two key ways:

  1. It is performed via collective Beam operations, instead of writing intermediates arrays to disk.

  2. It is performed collectively on full xarray.Dataset objects, instead of NumPy arrays.

[1] rechunker.readthedocs.io

Parameters:
  • dim_sizes (Mapping[str, int]) –

  • source_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) –

  • target_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) –

  • itemsize (int) –

  • min_mem (Optional[int]) –

  • max_mem (int) –

__init__(dim_sizes, source_chunks, target_chunks, itemsize, min_mem=None, max_mem=1073741824)

Initialize Rechunk().

Parameters:
  • dim_sizes (Mapping[str, int]) – size of the full (combined) dataset of all chunks.

  • source_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) – sizes of source chunks. Missing keys or values equal to -1 indicate “non-chunked” dimensions.

  • target_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) – sizes of target chunks, like source_keys. Keys must exactly match those found in source_chunks.

  • itemsize (int) – approximate number of bytes per xarray.Dataset element, after indexing out by all dimensions, e.g., 4 * len(dataset) for float32 data or roughly dataset.nbytes / np.prod(dataset.sizes).

  • min_mem (Optional[int]) – minimum memory that a single intermediate chunk must consume.

  • max_mem (int) – maximum memory that a single intermediate chunk may consume.

Methods

__init__(dim_sizes, source_chunks, ...[, ...])

Initialize Rechunk().

annotations()

default_label()

default_type_hints()

display_data()

Returns the display data associated to a pipeline component.

expand(pcoll)

from_runner_api(proto, context)

get_resource_hints()

get_type_hints()

Gets and/or initializes type hints for this object.

get_windowing(inputs)

Returns the window function to be associated with transform's output.

infer_output_type(unused_input_type)

register_urn(urn, parameter_type[, constructor])

runner_api_requires_keyed_input()

to_runner_api(context[, has_parts])

to_runner_api_parameter(unused_context)

to_runner_api_pickled(unused_context)

type_check_inputs(pvalueish)

type_check_inputs_or_outputs(pvalueish, ...)

type_check_outputs(pvalueish)

with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

with_resource_hints(**kwargs)

Adds resource hints to the PTransform.

Attributes

label

pipeline

side_inputs