Rewrite DataTree.to_netcdf and support netCDF4 in-memory #10624

shoyer · 2025-08-11T19:43:34Z

This PR includes a handful of significant changes:

1. It refactors the internal structure of `DataTree.to_netcdf()` and `DataTree.to_zarr()` to use lower-level interfaces rather than calling `Dataset` methods. This allows `compute=False` to be supported properly (and likely enables various other improvements).
2. Reading and writing in-memory data with netCDF4-python is now supported, including for `DataTree`.
3. I've added a new user-facing `load_datatree` function, for consistency with `load_dataset` and `load_dataarray`.
4. The `engine` argument in `DataTree.to_netcdf()` is now set consistently with `Dataset.to_netcdf()`, preferring `netcdf4` to `h5netcdf`.
5. Calling `Dataset.to_netcdf()` without a target now always returns a `memoryview` object, including when `engine='scipy'` is used (which currently returns `bytes`). This is a breaking change, rather than merely issuing a warning as is done in "Support for DataTree.to_netcdf to write to a file-like object or bytes" (#10571). I believe it probably makes sense to do this as a breaking change because (1) it offers significant performance benefits, (2) the default behavior without specifying an engine will already change (because `netcdf4` is preferred to the `scipy` backend), and (3) restoring the previous behavior is easy (by wrapping the `memoryview` with `bytes()`).
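To make the `memoryview` point concrete, here is a minimal standard-library sketch of the difference between returning a `memoryview` and returning `bytes`. It does not call xarray: `serialize_to_memory` is a hypothetical stand-in for a writer that serializes into an in-memory buffer, and the payload is made-up placeholder data, not a valid netCDF file.

```python
import io


def serialize_to_memory(payload: bytes) -> memoryview:
    """Hypothetical stand-in for a writer that returns its in-memory buffer.

    This is an assumption for illustration only, not xarray's implementation.
    """
    buffer = io.BytesIO()
    buffer.write(payload)
    # getbuffer() exposes the underlying buffer without copying it.
    return buffer.getbuffer()


# Placeholder payload (not a real netCDF file): 4 header bytes + 1024 zeros.
view = serialize_to_memory(b"CDF\x01" + bytes(1024))

# Slicing a memoryview yields another view onto the same buffer, not a copy.
header = view[:4]

# Wrapping the view in bytes() makes one explicit copy and gives back
# an immutable bytes object, matching the older return type.
legacy = bytes(view)
```

Code that relied on the old return type would only need the final `bytes(...)` wrap, which is the migration path described in point (3) above.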

Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

shoyer · 2025-08-11T19:51:54Z

> It refactors the internal structure of DataTree.to_netcdf() and DataTree.to_zarr() to use lower level interfaces, rather than calling Dataset methods. This allows for properly supporting compute=False (and likely various other improvements).

I am thinking I might try to split this into a separate PR, because it's unrelated to the netCDF4 in-memory changes.


github-actions bot added the topic-backends, topic-zarr, topic-DataTree, and io labels Aug 11, 2025

shoyer mentioned this pull request Aug 11, 2025

DataTree.to_zarr() is very slow writing to high latency store #9455

Open

shoyer force-pushed the netcdf4-memory branch from 30a759c to 86563b9 Compare August 11, 2025 22:08

Refactor to_netcdf() and to_zarr() internals

cce5477

shoyer mentioned this pull request Aug 12, 2025

Support compute=False from DataTree.to_netcdf #10625

Merged

2 tasks

Merge branch 'main' into to_netcdf-internals

a3689d4

kmuehlbauer mentioned this pull request Aug 12, 2025

Sanitize unlimited_dims when writing to_netcdf #10608

Merged

3 tasks

shoyer added 12 commits August 12, 2025 14:54

Merge branch 'main' into to_netcdf-internals

f5ba356

Merge branch 'main' into to_netcdf-internals

c4d57f1

Fixes per review

22d3387

Clean up comments

e68e186

Merge branch 'main' into to_netcdf-internals

b925465

Fix type for to_netcdf()

6d8ae1e

Add test and whats-new for cross-group redundant computation

d9da973

Fix test failure on CI (and add a better test)

205fdbe

grammar

e82c334

Merge branch 'to_netcdf-internals' into netcdf4-memory

24d3552

Tweaks

ca5feca

Merge branch 'main' into netcdf4-memory

1be418e

shoyer mentioned this pull request Aug 19, 2025

Improve consistency of default engine and return memoryview instead of bytes from to_netcdf() #10656

Open

2 tasks
