fix chunk/shard iteration #3299
Merged
merged 47 commits into zarr-developers:main from fix/_iter_chunk_keys on Aug 22, 2025
Conversation

@d-v-b (Contributor) commented Jul 25, 2025

In main there are some routines for iterating over the chunks of an array, but these routines do not distinguish between chunks and shards (i.e., stored objects) for arrays with sharding.

This PR adds a separate set of shard-specific iteration routines to complement our chunk-specific iteration routines. Various bugs caused by iterating over chunks when shards were the intended iteration target have been fixed by these changes, notably bugs causing memory races when creating arrays via create_array (xref #3169).
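
As a rough illustration of why the distinction matters (a hypothetical sketch; the numbers and names below are made up, not the PR's API):

# With shape=(100,), shards=(50,), chunks=(10,), there are 10 chunks but only
# 2 stored objects. Scheduling one concurrent write per *chunk* means up to 5
# writers touch the same shard file; one write per *shard* avoids the race.
shape, shard, chunk = 100, 50, 10
n_chunks = -(-shape // chunk)   # 10 units of read granularity
n_shards = -(-shape // shard)   # 2 stored objects (write granularity)
writers_per_stored_object = shard // chunk  # 5 concurrent writers if we iterate by chunk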

I think this supersedes #3217. @bojidar-bg, I credited you as a co-author on one of these commits because your idea to change the iteration from chunks to shards was correct.

github-actions bot added the 'needs release notes' label Jul 25, 2025
return self.chunk_grid_shape

@property
def chunk_grid_shape(self) -> ChunkCoords:
d-v-b (Contributor, Author) commented:

I added a new property called chunk_grid_shape because cdata_shape is ambiguous. cdata_shape is still around, but it now uses chunk_grid_shape.

return tuple(starmap(ceildiv, zip(self.shape, self.chunks, strict=True)))

@property
def shard_grid_shape(self) -> ChunkCoords:
d-v-b (Contributor, Author) commented:

this complements chunk_grid_shape.
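
For context, a small worked example of what these two properties compute (a hypothetical array, not code from the PR):

from itertools import starmap

def ceildiv(a: int, b: int) -> int:
    # ceiling division without floats
    return -(-a // b)

shape = (25, 10)
chunks = (10, 10)
shards = (20, 10)

# chunk_grid_shape: one ceildiv per dimension over (shape, chunks)
assert tuple(starmap(ceildiv, zip(shape, chunks, strict=True))) == (3, 1)
# shard_grid_shape: the same computation over (shape, shards)
assert tuple(starmap(ceildiv, zip(shape, shards, strict=True))) == (2, 1)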

@bojidar-bg (Contributor) commented:
Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨

@d-v-b (Contributor, Author) commented Jul 25, 2025

> Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨

It turned out to be more than I expected 😅 .


else:
msg = f"Indexing order {order} is not supported at this time." # type: ignore[unreachable]
raise NotImplementedError(msg)


def iter_regions(
d-v-b (Contributor, Author) commented:

This is a new function for iterating over contiguous regions. When we support irregular chunking, we can overload the type of region_shape accordingly.
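
A minimal self-contained sketch of such a region iterator (a simplified stand-in; the actual signature in the PR may differ):

from collections.abc import Iterator
from itertools import product

def iter_regions_sketch(
    domain_shape: tuple[int, ...],
    region_shape: tuple[int, ...],
    *,
    trim_excess: bool = True,
) -> Iterator[tuple[slice, ...]]:
    """Yield tuples of slices covering domain_shape in region_shape-sized blocks."""
    grid = tuple(-(-d // r) for d, r in zip(domain_shape, region_shape, strict=True))
    for grid_position in product(*map(range, grid)):
        slices = []
        for g_pos, r_shape, d_shape in zip(grid_position, region_shape, domain_shape, strict=True):
            start = g_pos * r_shape
            stop = start + r_shape
            if trim_excess:
                stop = min(stop, d_shape)  # clip the final region to the array bounds
            slices.append(slice(start, stop))
        yield tuple(slices)

# e.g. list(iter_regions_sketch((5, 4), (2, 4)))[-1] == (slice(4, 5), slice(0, 4))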

codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.69%. Comparing base (9498336) to head (dfb4ee5).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3299      +/-   ##
==========================================
+ Coverage   94.62%   94.69%   +0.07%     
==========================================
  Files          79       79              
  Lines        9468     9520      +52     
==========================================
+ Hits         8959     9015      +56     
+ Misses        509      505       -4     
Files with missing lines Coverage Δ
src/zarr/core/array.py 97.44% <100.00%> (+0.33%) ⬆️
src/zarr/core/indexing.py 96.40% <100.00%> (+0.31%) ⬆️

@d-v-b d-v-b requested a review from a team July 26, 2025 21:02
@dstansby (Contributor) commented:
Just to make sure we're on the same page here, this is where we're heading in my mind, as something that can go in the user guide:

In zarr-python, a chunk of data represents a single file in storage. There is one exception: when a single sharding codec is part of the codecs. In this case a shard represents a single file in storage, and a chunk represents a unit of data the same size or smaller that can be read independently of other chunks in the shard. This distinction is made to allow zarr-python to optimize the creation and processing of sharded arrays.
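
For example (a sketch assuming the zarr-python 3 create_array API with a shards argument; exact keyword names are from memory and may differ):

import zarr

store = zarr.storage.MemoryStore()

# Without sharding: each (100, 100) chunk is a separate object in storage.
unsharded = zarr.create_array(
    store, name="a", shape=(1000, 1000), chunks=(100, 100), dtype="float32"
)

# With sharding: each stored object is a (500, 500) shard holding 25 chunks of
# (100, 100), each of which can still be read independently.
sharded = zarr.create_array(
    store, name="b", shape=(1000, 1000), chunks=(100, 100), shards=(500, 500), dtype="float32"
)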

@dstansby (Contributor) commented:
I think my earlier comments still need addressing here:

> It is a bit confusing that .shards now returns None with just chunks, where all the other shard-related properties return the equivalent chunk values. I would suggest returning self.chunks from .shards if there's no sharding, to match the other new properties, and adding a new .is_sharded: bool property to determine whether shards are being used (previously one would have done self.shards is None).

and

> I like the concept that even when shards are not set, the shard properties still return the chunk values, but since this is a new concept it should be accompanied by an appropriate explanation in the Sharding user guide section, and in the release notes (which could just be a link to the sharding user guide section).

Happy to try and review in full once we work these points out (especially the second one).
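
A hypothetical sketch of that suggestion, as an illustration only (class and attribute names are made up, not code from this PR):

class ArraySketch:
    """Illustrates the proposed fallback behaviour for the shard-related properties."""

    def __init__(self, chunks: tuple[int, ...], shard_shape: tuple[int, ...] | None = None):
        self.chunks = chunks
        self._shard_shape = shard_shape  # None when no sharding codec is configured

    @property
    def shards(self) -> tuple[int, ...]:
        # fall back to the chunk shape so all shard-related properties stay consistent
        return self._shard_shape if self._shard_shape is not None else self.chunks

    @property
    def is_sharded(self) -> bool:
        # replaces the old idiom of checking whether .shards is None
        return self._shard_shape is not None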

@d-v-b (Contributor, Author) commented Aug 11, 2025

> It is a bit confusing that .shards now returns None with just chunks, where all the other shard-related properties return the equivalent chunk values. I would suggest returning self.chunks from .shards if there's no sharding, to match the other new properties, and adding a new .is_sharded: bool property to determine whether shards are being used (previously one would have done self.shards is None).

I agree that it's confusing, and I also agree that the Array.shards attribute should make sense in light of the other various shard-related properties. But an is_sharded property might also be confusing, especially for an array where is_sharded == False but shards = <some tuple>. What about naming this property something like uses_sharding_codec? IMO we should be normalizing "all arrays are sharded" as much as possible across the entire stack, and to me this argues for treating the way an array is sharded (via the sharding codec, or not) as an implementation detail. But I don't feel strongly here except that whatever we do should not be confusing.

And sorry for all the pings recently but @zarr-developers/python-core-devs it would be great to get some other POVs here, since we are talking about some potential changes to the user-facing Array object.

For this PR, I'm totally fine making all the new shard methods private until we can decide on a good story for the array API. We can still use these methods to fix the various bugs related to mixing up chunks and shards. This would also address your second point, because we wouldn't need to add any new user-facing docs at this time.

@dcherian (Contributor) commented:
> What about naming this property something like uses_sharding_codec?

This is better.

> IMO we should be normalizing "all arrays are sharded" as much as possible across the entire stack, and to me this argues for treating the way an array is sharded (via the sharding codec, or not) as an implementation detail.

I'm 50/50 on this, given that historically Zarr has only had chunks, not shards; and that these are also kwargs to the constructor. It's going to be confusing regardless. Should we just call it shards_or_chunks :D

> I'm totally fine making all the new shard methods private until we can decide on a good story for the array API.

Let's do this ASAP, and punt on changing the meaning of the .shards property.

@dstansby (Contributor) commented:
Perhaps the recommendation to replace array.shards == None could be array.shards == array.chunks instead?

Certainly .shards shouldn't be changed in code that goes in 3.1.x now, so again I'd advocate for any new API being private, and only the minimum API added that's needed to fix the original issue.

@d-v-b (Contributor, Author) commented Aug 13, 2025

The new API is all private; let me know if we need to do anything else.

@d-v-b (Contributor, Author) commented Aug 21, 2025

ping @dstansby @dcherian, let me know if we need anything else here.

@d-v-b d-v-b requested a review from a team August 22, 2025 12:30
@maxrjones mentioned this pull request Aug 22, 2025
Comment on lines -137 to +141
->>> tuple(iter_grid((2,3)), origin=(1,1))
-((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3))
+>>> tuple(iter_grid((2,3), origin=(1,1)))
+((1, 1), (1, 2))

->>> tuple(iter_grid((2,3)), origin=(1,1), selection_shape=(2,2))
-((1, 1), (1, 2), (1, 3), (2, 1))
+>>> tuple(iter_grid((2,3), origin=(0,0), selection_shape=(2,2)))
+((0, 0), (0, 1), (1, 0), (1, 1))
Contributor commented:

hah that was so wrong earlier!

d-v-b (Contributor, Author) replied:

🤡

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
for g_pos, r_shape, d_shape in zip(grid_position, region_shape, domain_shape, strict=True):
start = g_pos * r_shape
stop = start + r_shape
if trim_excess:
Contributor commented:

just curious, is there a reason to not trim excess?

d-v-b (Contributor, Author) replied:

The chunks / shards written to storage are always fully-sized, no matter the shape of the array. trim_excess lets you control whether iteration is done over array-indexing space (trim_excess=True) or stored-object indexing space (trim_excess=False), for the same shape.
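
Concretely, a tiny sketch along one dimension (hypothetical helper, illustration only): for a domain of length 5 and regions of length 2, the final region differs:

def last_region(domain: int, region: int, *, trim_excess: bool) -> slice:
    # start of the final grid cell along one dimension
    start = (-(-domain // region) - 1) * region
    stop = min(start + region, domain) if trim_excess else start + region
    return slice(start, stop)

assert last_region(5, 2, trim_excess=True) == slice(4, 5)   # array-indexing space
assert last_region(5, 2, trim_excess=False) == slice(4, 6)  # stored-object indexing space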


# write chunks one at a time
-for idx, region in enumerate(arr._iter_chunk_regions()):
+for idx, region in enumerate(arr._iter_shard_regions()):
Contributor commented:

these tests are a bit circular in that the n*_initialized property uses this function under the hood.

Not a blocker, but would be good to think about making this test more independent.

d-v-b (Contributor, Author) replied:

that's a good point, maybe later we can simplify this function and make each shard 1x1x1, then iterate over array elements directly

)
else:
selection_shape_parsed = selection_shape
for d_s, r_s, o, ss in zip(
@dcherian (Contributor) commented Aug 22, 2025:

Again here this is reimplementing the logic being tested.

Would be good to add some property tests in the future, for example (see the sketch after this list):

  1. I think we know what the last index of the last slice should be, depending on trim_excess being True or False.
  2. we never receive a slice with start < corresponding index in origin.
  3. output shape == selection_shape (with some trim_excess dependency perhaps)
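
A hedged sketch of what such property tests could look like, written against a self-contained 1-D stand-in rather than the PR's actual function (the helper and its signature are assumptions):

from hypothesis import given, strategies as st

def iter_slices_1d(domain: int, region: int, *, trim_excess: bool = True):
    # 1-D stand-in for the region iterator under discussion
    n_regions = -(-domain // region)
    for i in range(n_regions):
        start = i * region
        stop = start + region
        yield slice(start, min(stop, domain) if trim_excess else stop)

@given(
    domain=st.integers(min_value=1, max_value=100),
    region=st.integers(min_value=1, max_value=100),
    trim=st.booleans(),
)
def test_region_slices(domain: int, region: int, trim: bool) -> None:
    slices = list(iter_slices_1d(domain, region, trim_excess=trim))
    # property 1: the last slice stops at the domain size when trimming, and at the
    # full-region boundary when not trimming
    expected_stop = domain if trim else -(-domain // region) * region
    assert slices[-1].stop == expected_stop
    # property 2: no slice starts before the origin (origin == 0 here)
    assert all(s.start >= 0 for s in slices)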

@dcherian (Contributor) left a review:

LGTM. I have some concerns about better testing these functions, but they can be addressed later.

@d-v-b d-v-b merged commit c9eefe6 into zarr-developers:main Aug 22, 2025
31 checks passed
lumberbot-app bot commented Aug 22, 2025

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Checkout the backport branch and update it:
git checkout 3.1.x
git pull
  2. Cherry-pick the first parent of this PR's merge commit on top of the older branch:
git cherry-pick -x -m1 c9eefe663b34b019e9673bcaa37a88aae6158280
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #3299: fix chunk/shard iteration'
  4. Push to a named branch:
git push YOURFORK 3.1.x:auto-backport-of-pr-3299-on-3.1.x
  5. Create a PR against branch 3.1.x; I would have named this PR:

"Backport PR #3299 on branch 3.1.x (fix chunk/shard iteration)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@d-v-b d-v-b deleted the fix/_iter_chunk_keys branch August 22, 2025 15:21