Implement new label format for large disks #17573
This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Wasabi, Inc.
Sponsored-by: Klara, Inc.
This seems to be a bit more polished version of the series I saw a few months ago, which I think bodes well - nothing bad or surprising seen since!
So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.
Any thoughts or guidance on how to use all this new space? I don't really have any at this stage, and I don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw off limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!
log_must create_pool -f $TESTPOOL "$DSK"0
log_must zdb -l "$DSK"0
Since user_large_label and uses_old_label just call zdb -l anyway, I assume this is just here (and in large_label_002_pos) to get the output into the log for debugging?
Yeah, if the test fails, it's nice to not have to modify and rerun just to see what the zdb output actually looked like, or if that was the command that failed directly, or we just failed to find the string we wanted.
My hope was to get this integrated sooner rather than later. As you said, it would be good to have time to find any issues or make improvements before a new feature needs it and drives a lot of sudden adoption of something that hasn't had as much time to mature. Plus, while there aren't any new headline features in this patch, we do still get the benefit of a longer uberblock history. For a lot of data recovery jobs, that alone can prove quite helpful, since it provides a much longer window of TXGs to roll back to when trying to recover specific data.

The thing that's needed to move it forward is reviews, pretty much. I think the bones are good, though I'm sure there are tweaks to be made, and I think it's ready for more eyes on it.
Right now, we store the checkpoint uberblock in the MOS. This works mostly fine for the intended use cases. However, if your pool is rendered totally unimportable (by a bug in the import code related to a new feature that causes it to panic, or by really specifically timed corruption, for example), it can become impossible to roll back. Storing a copy of the checkpoint uberblock in the label as a backup, along with a new import flag, a zhack command, or some other way to do the rollback, might be useful.

One thing this PR does include is storing a copy of the pool config in the label. That isn't currently used for anything except debugging, but it could be handy for importing pools with badly damaged or missing top-level vdevs.

Another idea that came to me is storing a compression dictionary for use with zstd. zstd has a dictionary mode where, rather than storing the dictionary inline with the data, it can use an external pre-programmed dictionary. It might be helpful (especially for smaller recordsizes) to generate dictionaries based on ZFS metadata, or to allow users to generate them based on their own data, and then use them for compression and decompression. If they're used to store metadata, they may need to be accessed before the MOS is readable, so storing them in the label might help.

We currently use the 3.5 MiB reserved space to do raidz expansion; that space is sufficient because eventually the raidz expansion can use its own previously allocated space as working room. If we ever wanted to try to implement raidz width increases (increasing the parity of existing blocks, for example), we would need more space; the larger labels might provide enough scratch space for that.

I agree it would be nice if we never had to do this again; we don't want to come back in ten years and say "hey actually now we need 64GiB labels, whoops!". One thing that I think works well in the current design to prevent that is the table of contents; that structure can contain not only information about the different sections in the label, but also about label extensions or other features that the label is using.
Sponsored by: [Wasabi, Inc.; Klara, Inc.]
Motivation and Context
As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.
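To make the shrinking uberblock history concrete, here is a rough back-of-the-envelope sketch. It is not code from this PR: the constants (a 128 KiB uberblock ring per label, slot sizes clamped between 1 KiB and 8 KiB) reflect my reading of the existing label layout and should be treated as assumptions, and the function names are hypothetical. It also ignores any slots reserved for other purposes such as MMP.

```python
# Back-of-the-envelope sketch of uberblock slots per label in the
# existing (pre-patch) layout. Constants are assumptions, not taken
# from this PR.
UBERBLOCK_RING_BYTES = 128 << 10  # assumed size of the uberblock ring
UBERBLOCK_SHIFT = 10              # assumed minimum slot size: 1 KiB
MAX_UBERBLOCK_SHIFT = 13          # assumed maximum slot size: 8 KiB


def uberblock_slot_shift(ashift: int) -> int:
    """Slot size (as a shift) follows the sector size, within the clamp."""
    return min(max(ashift, UBERBLOCK_SHIFT), MAX_UBERBLOCK_SHIFT)


def uberblock_slots(ashift: int) -> int:
    """Number of uberblock slots that fit in the ring for this ashift."""
    return UBERBLOCK_RING_BYTES >> uberblock_slot_shift(ashift)


def needs_partial_write(ashift: int) -> bool:
    """True when one sector is larger than one uberblock slot."""
    return ashift > uberblock_slot_shift(ashift)


if __name__ == "__main__":
    for ashift in (9, 12, 13, 14, 16):
        print(
            f"ashift={ashift:2d} sector={1 << ashift:6d} B "
            f"slots={uberblock_slots(ashift):3d} "
            f"partial_write={needs_partial_write(ashift)}"
        )
```

Under these assumptions, a 512-byte-sector disk gets 128 slots, a 4 KiB-sector disk gets 32, and anything with sectors larger than 8 KiB is stuck at 16 slots that can no longer be written as whole sectors.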
Description
This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards.
It also contains zdb and zhack support for the new label type, as well as tests that verify basic functionality of the new label. Currently, the size of the disk is used as a rubric for whether or not to enable the new label type, but that is open to change.
How Has This Been Tested?
In addition to the tests added in this PR, I also ran the ZFS test suite with the tunable set below the size of the disks in use, so that the new label format was enabled. Some tests failed, but only for space estimation reasons, which could be corrected with fixes to the tests. I also did some ztest runs with the new label format.