xarray_beam.Rechunk
- class xarray_beam.Rechunk(dim_sizes, source_chunks, target_chunks, itemsize, min_mem=None, max_mem=1073741824)
Rechunk to an arbitrary new chunking scheme with bounded memory usage.
The approach taken here builds on Rechunker [1], but differs in two key ways:
It is performed via collective Beam operations, instead of writing intermediates arrays to disk.
It is performed collectively on full xarray.Dataset objects, instead of NumPy arrays.
[1] rechunker.readthedocs.io
- Parameters:
dim_sizes (Mapping[str, int]) –
source_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) –
target_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) –
itemsize (int) –
min_mem (Optional[int]) –
max_mem (int) –
- __init__(dim_sizes, source_chunks, target_chunks, itemsize, min_mem=None, max_mem=1073741824)
Initialize Rechunk().
- Parameters:
dim_sizes (Mapping[str, int]) – size of the full (combined) dataset of all chunks.
source_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) – sizes of source chunks. Missing keys or values equal to -1 indicate “non-chunked” dimensions.
target_chunks (Mapping[str, Union[int, Tuple[int, ...]]]) – sizes of target chunks, like source_keys. Keys must exactly match those found in source_chunks.
itemsize (int) – approximate number of bytes per xarray.Dataset element, after indexing out by all dimensions, e.g., 4 * len(dataset) for float32 data or roughly dataset.nbytes / np.prod(dataset.sizes).
min_mem (Optional[int]) – minimum memory that a single intermediate chunk must consume.
max_mem (int) – maximum memory that a single intermediate chunk may consume.
Methods
__init__
(dim_sizes, source_chunks, ...[, ...])Initialize Rechunk().
annotations
()default_label
()default_type_hints
()display_data
()Returns the display data associated to a pipeline component.
expand
(pcoll)from_runner_api
(proto, context)get_resource_hints
()get_type_hints
()Gets and/or initializes type hints for this object.
get_windowing
(inputs)Returns the window function to be associated with transform's output.
infer_output_type
(unused_input_type)register_urn
(urn, parameter_type[, constructor])runner_api_requires_keyed_input
()to_runner_api
(context[, has_parts])to_runner_api_parameter
(unused_context)to_runner_api_pickled
(unused_context)type_check_inputs
(pvalueish)type_check_inputs_or_outputs
(pvalueish, ...)type_check_outputs
(pvalueish)with_input_types
(input_type_hint)Annotates the input type of a
PTransform
with a type-hint.with_output_types
(type_hint)Annotates the output type of a
PTransform
with a type-hint.with_resource_hints
(**kwargs)Adds resource hints to the
PTransform
.Attributes
label
pipeline
side_inputs