Skip to content

chore/handle numcodecs codecs #3376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Aug 13, 2025

This PR brings in all the codecs defined in numcodecs.zarr3. After this PR is merged, we can safely replace the numcodecs.zarr3 module with reexports from zarr python, or remove numcodecs.zarr3 entirely, thereby fixing our circular dependency problem.

This PR also changes the default config to ensure that the locally-defined codecs take priority over the same codec found in the numcodecs registry.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Aug 13, 2025
Copy link

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 97.11538% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.73%. Comparing base (7970f4a) to head (efd6828).

Files with missing lines Patch % Lines
src/zarr/codecs/numcodecs/_codecs.py 96.44% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3376      +/-   ##
==========================================
+ Coverage   94.69%   94.73%   +0.04%     
==========================================
  Files          79       81       +2     
  Lines        9520     9708     +188     
==========================================
+ Hits         9015     9197     +182     
- Misses        505      511       +6     
Files with missing lines Coverage Δ
src/zarr/codecs/__init__.py 100.00% <100.00%> (ø)
src/zarr/codecs/blosc.py 87.87% <ø> (-0.25%) ⬇️
src/zarr/codecs/bytes.py 96.72% <ø> (-0.16%) ⬇️
src/zarr/codecs/crc32c_.py 96.87% <ø> (-0.19%) ⬇️
src/zarr/codecs/gzip.py 90.90% <ø> (-0.52%) ⬇️
src/zarr/codecs/numcodecs/__init__.py 100.00% <100.00%> (ø)
src/zarr/codecs/sharding.py 94.46% <100.00%> (-0.02%) ⬇️
src/zarr/codecs/transpose.py 89.09% <ø> (-0.39%) ⬇️
src/zarr/codecs/vlen_utf8.py 77.19% <ø> (-1.15%) ⬇️
src/zarr/codecs/zstd.py 90.00% <ø> (-0.39%) ⬇️
... and 3 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


def _encode(self, chunk_data: Buffer, prototype: BufferPrototype) -> Buffer:
encoded = self._codec.encode(chunk_data.as_array_like())
if isinstance(encoded, np.ndarray): # Required for checksum codecs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we know statically which are checksum codecs without the isinstance check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.b., this was copy + pasted from numcodecs, but I think the answer is "no"

codec_name: str
codec_config: dict[str, JSON]

def __init_subclass__(cls, *, codec_name: str | None = None, **kwargs: Any) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would a codec definition look like without this magic? I'd be fine with repeating a few things if it meant we could avoid this (and IIUC some of the complexity in __repr__ and __init__ would go away too?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that I have rebased #3332 off of this PR, I am 100% going to demagic these codecs in that effort.

Comment on lines 254 to 255
def __init__(self, **codec_config: JSON) -> None:
super().__init__(**codec_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this do?

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 17, 2025

> Can you explain the different cases here?

Spinning this question out into the main thread -- from me, the general answer to questions like this will be "no", since I am only copy+pasting stuff from numcodecs. I haven't spent too much time figuring out what this code is doing. I do think @normanrz and @TomNicholas might be able to answer some of these questions though.

@TomAugspurger
Copy link
Contributor

Ah, I didn't realize this was mostly from numcodecs. I think that moots most of my comments aside from where in the public API we put these.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 17, 2025

yeah I should have made more clear that this is nearly all directly copy + pasted from numcodecs.zarr3

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Aug 21, 2025
@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 21, 2025

I think this is ready to go in (and it's necessary for #3332)

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 21, 2025

and important recent addition: I moved all of the invocations of register_codec to src/zarr/codecs/__init__.py. This ensures that all codecs get registered, regardless of whether they are part of zarr.codecs.__all__

@maxrjones
Copy link
Member

@d-v-b have you tested this with different numcodecs versions to make sure there's no unexpected issues with clobbering of codec registration?

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 21, 2025

@d-v-b have you tested this with different numcodecs versions to make sure there's no unexpected issues with clobbering of codec registration?

I'm don't think I expect any behavior to depend on numcodecs versions, happy to be corrected though. My understanding is that the codecs in numcodecs.zarr3 are not registered with the numcodecs registry, and instead are exposed via the entrypoints framework. That means we don't have to worry about anything interacting with numcodecs' own registry.

we have a test that checks our compatibility with these codecs, defined as dicts. In main these tests will pick up the codec class from numcodecs, but in this PR the version of the codec defined in zarr python is used instead. Is that kind of thing you are worried about?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants