-
-
Notifications
You must be signed in to change notification settings - Fork 356
fix chunk/shard iteration #3299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nd add a failing test
Co-authored-by: Bojidar Marinov <bojidar.marinov.bg@gmail.com>
src/zarr/core/array.py
Outdated
return self.chunk_grid_shape | ||
|
||
@property | ||
def chunk_grid_shape(self) -> ChunkCoords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new property called chunk_grid_shape
because cdata_shape
is ambiguous. cdata_shape
is still around, but it now uses chunk_grid_shape
src/zarr/core/array.py
Outdated
return tuple(starmap(ceildiv, zip(self.shape, self.chunks, strict=True))) | ||
|
||
@property | ||
def shard_grid_shape(self) -> ChunkCoords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this complements chunk_grid_shape
.
…nto fix/_iter_chunk_keys
Ooh, that looks like a much more complete implementation of what I did in that other PR! Kudos; I wouldn't have been able to push things this far ✨✨ |
It turned out to be more than I expected 😅 . |
src/zarr/core/indexing.py
Outdated
|
||
else: | ||
msg = f"Indexing order {order} is not supported at this time." # type: ignore[unreachable] | ||
raise NotImplementedError(msg) | ||
|
||
|
||
def iter_regions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a new function for iterating over contiguous regions. When we support irregular chunking, we can overload the type of region_shape
accordingly.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3299 +/- ##
==========================================
+ Coverage 94.62% 94.69% +0.07%
==========================================
Files 79 79
Lines 9468 9520 +52
==========================================
+ Hits 8959 9015 +56
+ Misses 509 505 -4
🚀 New features to boost your workflow:
|
Just to make sure we're on the same page here, this is where we're heading in my mind, as something that can go in the user guide:
|
I think my earlier comments still needs addressing here:
and
Happy to try and review in full once we work these points out (especially the second one). |
I agree that it's confusing, and I also agree that the And sorry for all the pings recently but @zarr-developers/python-core-devs it would be great to get some other POVs here, since we are talking about some potential changes to the user-facing For this PR, i'm totally fine making all the new |
This is better.
I'm 50/50 on this, given that historically Zarr has only had chunks, not shards; and that these are also kwargs to the constructor. It's going to be confusing regardless. Should we just call it
Let's do this ASAP, and punt on the changing the meaning of the |
Perhaps the recommendation to replace Certainly |
the new API is all private, let me know if we need to do anything else |
…x/_iter_chunk_keys
…x/_iter_chunk_keys
>>> tuple(iter_grid((2,3)), origin=(1,1)) | ||
((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)) | ||
>>> tuple(iter_grid((2,3), origin=(1,1))) | ||
((1, 1), (1, 2)) | ||
|
||
>>> tuple(iter_grid((2,3)), origin=(1,1), selection_shape=(2,2)) | ||
((1, 1), (1, 2), (1, 3), (2, 1)) | ||
>>> tuple(iter_grid((2,3), origin=(0,0), selection_shape=(2,2))) | ||
((0, 0), (0, 1), (1, 0), (1, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah that was so wrong earlier!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤡
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
for g_pos, r_shape, d_shape in zip(grid_position, region_shape, domain_shape, strict=True): | ||
start = g_pos * r_shape | ||
stop = start + r_shape | ||
if trim_excess: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, is there a reason to not trim excess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the chunks / shards written to storage are always fully-sized, no matter the shape of the array. trim excess lets you control whether iteration is done over array-indexing space (trim_excess=True) or stored objects indexing space (trim_excess=False), for the same shape
|
||
# write chunks one at a time | ||
for idx, region in enumerate(arr._iter_chunk_regions()): | ||
for idx, region in enumerate(arr._iter_shard_regions()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests are a bit circular in that the n*_initialized
property uses this function under the hood.
Not a blocker, but would be good to think about making this test more independent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point, maybe later we can simplify this function and make each shard 1x1x1, then iterate over array elements directly
) | ||
else: | ||
selection_shape_parsed = selection_shape | ||
for d_s, r_s, o, ss in zip( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again here this is reimplementing the logic being tested.
Would be good to add some property tests in the future, for example,
- I think we know what the last index of the last slice should be, depending on
trim_excess
being True or False. - we never receive a slice with
start
< corresponding index in origin. output shape == selection_shape
(with some trim_excess dependency perhaps)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I have some concerns about better testing these functions, but they can be addressed later.
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
In
main
there are some routines for iterating over the chunks of an array, but these routines do not distinguish between chunks and shards (i.e., stored objects) for arrays with sharding.This PR adds a separate set of shard-specific iteration routines to complement our chunk-specific iteration routines. Various bugs related to iterating over chunks, when shards were the intended iteration target, have been fixed by these changes, notably bugs causing memory races when creating arrays via
create_array
(xref ##3169)I think this supersedes #3217, @bojidar-bg I credited you as a co-author on one of these commits because your idea to change the iteration from chunks to shards was correct.