Implement new label format for large disks #17573

pcd1193182 · 2025-07-28T17:58:37Z

Sponsored by: [Wasabi, Inc.; Klara, Inc.]

Motivation and Context

As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.
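To make the trend concrete, here is a rough sketch of the arithmetic. It assumes the existing label layout: a 128 KiB uberblock ring per label, with each uberblock slot sized to the sector size but clamped between 1 KiB and 8 KiB. The constants are restated from the current headers as I understand them, not taken from this patch, and the function names are hypothetical.

```python
# Sketch of how many uberblocks fit in one label under the existing layout.
# Assumed constants (restated from the current on-disk format, not this PR):
UBERBLOCK_RING_BYTES = 128 * 1024  # space reserved for the uberblock ring
UBERBLOCK_SHIFT = 10               # smallest uberblock slot: 1 KiB
MAX_UBERBLOCK_SHIFT = 13           # largest uberblock slot: 8 KiB


def uberblock_slot_shift(ashift):
    """Slot size (as a power of two) used for a vdev with this ashift."""
    return min(max(ashift, UBERBLOCK_SHIFT), MAX_UBERBLOCK_SHIFT)


def uberblock_slots(ashift):
    """Number of uberblocks one label's ring can hold."""
    return UBERBLOCK_RING_BYTES >> uberblock_slot_shift(ashift)


def needs_partial_write(ashift):
    """True when a sector is larger than the largest uberblock slot."""
    return ashift > MAX_UBERBLOCK_SHIFT


if __name__ == "__main__":
    for ashift in (9, 12, 13, 14, 16):
        print(
            f"ashift={ashift:2d} sector={1 << ashift:6d} B "
            f"slots={uberblock_slots(ashift):3d} "
            f"partial_write={needs_partial_write(ashift)}"
        )
```

Under these assumptions a 512-byte-sector disk keeps 128 uberblocks per label, a 4 KiB-sector disk keeps 32, and anything past 8 KiB sectors is stuck at 16 slots that are each smaller than a sector.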

Description

This patch contains the logic for a new, larger label format intended to support disks with large sector sizes. With a larger label we can store more uberblocks and other critical pool metadata, and we can use the extra space to enable new ZFS features in the future. This initial commit does not add new capabilities, but provides the framework for them.

It also contains zdb and zhack support for the new label type, as well as tests that verify its basic functionality. Currently, the size of the disk is used as the criterion for whether to enable the new label type, but that is open to change.

How Has This Been Tested?

In addition to the tests added in this PR, I ran the ZFS test suite with the tunable set below the size of the disks in use. Some tests failed, but only for space estimation reasons, which could be corrected with fixes to the tests. I also did some ztest runs with the new label format.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

robn

This seems to be a somewhat more polished version of the series I saw a few months ago, which I think bodes well: nothing bad or surprising seen since!

So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.

Any thoughts or guidance on how to use all this new space? I don't really have any at this stage, and I don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw off limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!

robn · 2025-09-26T05:16:56Z

tests/zfs-tests/tests/functional/large_label/large_label_001_pos.ksh

+
+log_must create_pool -f $TESTPOOL "$DSK"0
+
+log_must zdb -l "$DSK"0


Since user_large_label and uses_old_label just call zdb -l anyway, I assume this is just here (and in large_label_002_pos) to get the output into the log for debugging?

Yeah, if the test fails, it's nice to not have to modify and rerun it just to see what the zdb output actually looked like, and whether that command itself failed or we just failed to find the string we wanted.

pcd1193182 · 2025-10-01T20:54:40Z

> So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.

My hope was to get this integrated sooner rather than later. As you said, it would be good to have time to find any issues or make improvements before a new feature needs it and drives a lot of sudden adoption of something that hasn't had much time to mature. Plus, while there aren't any new headline features in this patch, we do still get the benefit of a longer uberblock history. For a lot of data recovery jobs, that alone can prove quite helpful, since it provides a much longer window of TXGs to roll back to when trying to recover specific data.

The thing that's needed to move it forward is reviews, pretty much. I think the bones are good, though I'm sure there are tweaks to be made, and I think it's ready for more eyes on it.

> Any thoughts or guidance on how to use all this new space? I don't really have any at this stage, and I don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw off limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!

Right now, we store the checkpoint uberblock in the MOS. This works mostly fine for the intended use cases. However, if your pool is rendered totally unimportable (by a bug in the import code related to a new feature that causes it to panic, or by really specifically timed corruption, for example), it can be impossible to roll back. Storing a copy of the checkpoint uberblock in the label as a backup, along with a new import flag, zhack command, or other way to do the rollback, might be useful. One thing this PR does include is storing a copy of the pool config in the label. That isn't currently used for anything except debugging, but it could be handy for importing a pool with badly damaged or missing top-level vdevs.

Another idea that came to me is storing a compression dictionary for use with zstd. zstd has a dictionary mode where, rather than storing the dictionary inline with the data, it can use an external pre-programmed dictionary. It might be helpful (especially for smaller recordsizes) to generate dictionaries based on ZFS metadata, or allow users to generate them based on their own data, and then use them for compression and decompression. If they're used to compress metadata, they may need to be accessed before the MOS is readable, so storing them in the label might help.
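As a rough illustration of why an external dictionary helps with small records, here is a minimal sketch. It deliberately swaps in zlib's preset-dictionary support from the Python standard library instead of zstd, purely so the example is self-contained; the dictionary contents and the sample record are made up for illustration and have nothing to do with real ZFS metadata.

```python
import zlib

# Hypothetical shared dictionary: byte strings we expect to recur in small
# records. In the idea above, something like this would live in the label.
SHARED_DICT = (
    b'{"type": "dnode", "checksum": "fletcher4", '
    b'"compress": "lz4", "bonus": "sa"}'
)


def compress(record, zdict=None):
    """Compress one record, optionally against a preset dictionary."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(record) + comp.flush()


def decompress(blob, zdict=None):
    """Decompress one record; the same dictionary must be supplied."""
    dec = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return dec.decompress(blob) + dec.flush()


# A small record that shares most of its bytes with the dictionary.
record = (
    b'{"type": "dnode", "checksum": "fletcher4", '
    b'"compress": "lz4", "bonus": "none"}'
)

plain = compress(record)
with_dict = compress(record, SHARED_DICT)

if __name__ == "__main__":
    print("raw:", len(record))
    print("no dictionary:", len(plain))
    print("with dictionary:", len(with_dict))
```

The dictionary-compressed record can only be read back by passing the same dictionary to `decompress`, which is the bootstrapping concern above: whatever holds the dictionary has to be readable before the data that depends on it.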

We currently use the 3.5 MiB reserved space to do raidz expansion; that space is sufficient because eventually the raidz expansion can use its own previously allocated space as working room. If we ever wanted to try to implement raidz width increases (increasing the parity of existing blocks, for example), we would need more space; the larger labels might provide enough scratch space for that.

I agree it would be nice if we never had to do this again; we don't want to come back in ten years and say "hey actually now we need 64GiB labels, whoops!". One thing that I think works well in the current design to prevent that is the table of contents; that structure can contain not only information about the different sections in the label, but about label extensions or other features that the label is using.

pcd1193182 force-pushed the new_label branch 4 times, most recently from e039970 to c20fcf4, July 31, 2025 19:27

behlendorf added the "Status: Design Review Needed" label (architecture or design is under discussion) Jul 31, 2025

amotin reviewed Aug 8, 2025

allanjude reviewed Aug 22, 2025

pcd1193182 force-pushed the new_label branch 3 times, most recently from 821000e to 8c65661, August 28, 2025 23:55

pcd1193182 force-pushed the new_label branch 3 times, most recently from 8567011 to 6718ba5, September 15, 2025 17:07

Paul Dagnelie added 3 commits September 25, 2025 12:32

Rob's feedback

9777ff4

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

mav feedback

f2adf61

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

pcd1193182 force-pushed the new_label branch from 6718ba5 to f2adf61, September 25, 2025 20:07

robn reviewed Sep 26, 2025
