
UEFI installer not installing grub correctly (at least on systems where / is RAID)
Open, MediumPublic

Description

I've been trying to install trixie onto sretest2010 (which was set up in T394357), and one of the problems I'm finding is that the installer isn't installing grub correctly, leading to a system that can't boot (or boots back into the installer). I also found this with another ms-be node (ms-be1083). These use LVM RAID1 for /.

The failure mode is that on reimage the node reboots after the installer has completed and fails into the grub rescue mode with an error like:

error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found.

The problem is that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure that the first partition of each of the two SSDs gets blanked), then after the first install the system will attempt to boot from disk, fail, and boot back into the installer - and then succeed after that.
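For reference, a minimal sketch of what "blanking the first partition" can look like from a rescue shell; sda1/sdb1 are assumptions for the two SSDs' ESPs - always confirm with lsblk before wiping anything:

sudo wipefs -a /dev/sda1 /dev/sdb1                 # drop the vfat signatures from both ESPs
sudo dd if=/dev/zero of=/dev/sda1 bs=1M count=16   # and/or zero the start of each partition
sudo dd if=/dev/zero of=/dev/sdb1 bs=1M count=16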

As best I can tell, the installation is not correctly ensuring that the first (i.e. /boot/efi) partition on both SSDs is written to (not surprisingly, I guess, given only one of them gets mounted), and so if the installer writes onto the "wrong" SSD and the system boots off the other one, then it has the wrong mduuid embedded.

When watching the installer, it does say that it's doing "grub-install sdm sdn" or similar, so it _ought_ to be attempting to write to both disks. Likewise, if you manage to get one of these systems to boot from the rescue prompt and then run grub-install from the booted system, it then seems to work reliably.
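For the record, a rough sketch of what "run grub-install from the booted system" looks like when both ESPs need populating (assuming /dev/sda1 is mounted at /boot/efi and /dev/sdb1 is the second, unmounted ESP):

sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
sudo mount /dev/sdb1 /mnt
sudo grub-install --target=x86_64-efi --efi-directory=/mnt --bootloader-id=debian   # populate the second ESP too
sudo umount /mnt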

It's not a problem on BIOS-booted systems (I think because there isn't a mounted /boot/efi involved?), but it is going to be a real problem if/when we start trying to reimage a bunch of these swift backends that boot via UEFI.

Event Timeline

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

Does /boot even need to be on a separate partition for UEFI booting?

Does /boot even need to be on a separate partition for UEFI booting?

No, however, the UEFI ESP partition does need to be on a separate partition with a FAT32 filesystem. The EFI firmware searches each drive for such a partition to discover EFI boot files. Debian only installs Grub on the ESP, so Grub in turn needs to be able to read the Linux kernel out of /boot. Grub does not care whether /boot is a separate partition or co-mingled with /, the main requirement is that the partition's filesystem is supported by Grub.
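A quick way to check this: the firmware locates the ESP by its partition type GUID rather than by any mount point, so something like the following (device names are examples) shows whether a disk carries one:

# the ESP partition type GUID is c12a7328-f81f-11d2-ba4b-00a0c93ec93b
lsblk -o NAME,PARTTYPE,FSTYPE,SIZE,MOUNTPOINT /dev/sda /dev/sdb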

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

I spent some time trying to debug the woes with this host, but the behavior is very strange.

Things I tried:

  1. Reset the BIOS to optimized defaults
  2. Re-installed the same version of the BIOS, while discarding all settings except SMBIOS
  3. Issued a cold reset to the BMC

None of my actions changed the behavior of the BMC; notably, issuing a reset /system1/pwrmgtsvc1 or a stop /system1/pwrmgtsvc1 command does not seem to have any effect.
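For completeness, the equivalent cold reset can usually also be issued out-of-band via IPMI (the BMC address and credentials below are placeholders):

ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> mc reset cold   # cold-resets the BMC itself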

To recap, it seems that we have two problems:

  1. For some mysterious reason, sretest2010 seems to have stopped working correctly at the BMC level (resets not happening, etc.). This is not great since we cannot easily test reimages, so we need to fix this problem first. Let's keep all BMC-related investigations in T394357.
  2. I had no problems installing the OS in T394357; the host was set with the standard-efi + raid1-2dev-efi configs before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185973. And I don't recall this issue happening in any of the previous UEFI installs, so I am wondering whether it is, for some reason, related to the partman early_command?
elukey triaged this task as Medium priority. Mon, Sep 29, 2:51 PM

@elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

@elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

@MatthewVernon I totally agree; what I meant is to find a way to narrow down the possible source of the problem, not to dismiss your request :) Basically I'd like to test the host with its previous "standard" recipe again (once the hardware works) to figure out whether I missed the problem just by luck, or whether it is also reproducible with standard recipes. We'll find a solution; reimaging in this condition is not great and painful.

Change #1190674 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

Change #1190674 merged by MVernon:

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

A couple of notes, so I have a record of what I've done, and in case they're of any help!

I've just re-imaged ms-be1086 and ms-be1087 (both UEFI), and blanked the partition mounted as /boot/efi before reimage (which subsequently proceeded without problems). In both cases, after reimage there is a partition labelled as EFI System Partition on both system disks, e.g.:

mvernon@ms-be1086:~$ sudo blkid /dev/sda1 /dev/sdb1
/dev/sda1: UUID="B2CC-32D0" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="2b78ca7b-2ff2-443e-bb43-3bcc6db6dfbd"
/dev/sdb1: UUID="B2CB-38A7" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="96277097-e3e2-4a08-be8a-93ae18d50c4c"

But the not-mounted partition has an empty filesystem (i.e. you can mount it, but there is nothing in it). When watching the reimage, the installer does say something to the effect of "Running grub-install /dev/sda /dev/sdb" towards the end of the install process; but the result is seemingly an empty FS.
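To make the "empty FS" claim concrete, this is roughly the check I mean (sdb1 assumed to be whichever ESP is not mounted as /boot/efi):

mvernon@ms-be1086:~$ sudo mount /dev/sdb1 /mnt
mvernon@ms-be1086:~$ sudo ls -A /mnt        # no output, i.e. an empty filesystem
mvernon@ms-be1086:~$ sudo umount /mnt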

I looked at a system with standard-efi.cfg and raid1-2dev-efi.cfg - an-test-coord1002. As expected, it has /dev/sdb2 mounted as /boot/efi and also an EFI System Partition on /dev/sda2. I mounted it and compared the contents:

mvernon@an-test-coord1002:~$ sudo ls -l /mnt/EFI/debian/grubx64.efi
-rwxr-xr-x 1 root root 167936 Aug 21 22:20 /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo ls -l /boot/efi/EFI/debian/grubx64.efi
-rwx------ 1 root root 167936 Aug 22 17:25 /boot/efi/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /mnt/EFI/debian/grubx64.efi
a2119e99fceafce1de3488c5ddbde073  /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /boot/efi/EFI/debian/grubx64.efi
92f592110127ebea4829165012cff37e  /boot/efi/EFI/debian/grubx64.efi

The MOTD on this system tells me Debian GNU/Linux 12 auto-installed on Fri Aug 22 17:25:47 UTC 2025., which I think tells us that one EFI partition was set up during the install and the other at some other time. This would tend to support the theory that the installer is not currently writing the new EFI system partition to both system disks.

I had a look for obvious differences between the ms-be preseeding and the standard-efi + raid1-2dev-efi setups. raid1-2dev-efi sets d-i grub-installer/only_debian boolean false with a comment referring to Debian #666974, which is long-closed. And the partitioning makes a small biosgrub partition (which I don't think is necessary any more).
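For reference, that directive as it appears in the recipe looks like this (preseed syntax; the comment here paraphrases the one in the file):

# Debian #666974 is long-closed, so this override may no longer be needed:
d-i grub-installer/only_debian boolean false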

I also found a couple of notes on the Arch wiki and the Debian wiki about the issues with EFI on systems doing software RAID1 for their system disks.

Finally, given the current problems with sretest2010 (T394357), I've delayed returning ms-be1088 to service so @elukey can do some more investigations with it.

@MatthewVernon thanks for the write-up! FYI, Jesse is working on T376949, which should address your concerns about the EFI partition not being replicated. The thing that I don't get is why you see the "error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found." error, because we never really got anything like that before.

My best theory on that is that one install run writes the EFI partition to one disk (embedding that install's UUID), then a subsequent install run writes to the other disk (embedding the new UUID), leaving two EFI partitions the firmware can pick between, each looking for a different UUID.
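One hedged way to test this theory would be to compare what each ESP's grub image searches for (assuming the second ESP is mounted at /mnt; whether the mduuid shows up as a plain string depends on how the image was built):

sudo strings /boot/efi/EFI/debian/grubx64.efi | grep -i uuid   # embedded search hints, if visible
sudo strings /mnt/EFI/debian/grubx64.efi | grep -i uuid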

Trying to summarize the problem:

  • We know that the Debian installer doesn't copy the EFI partition to all the disks in a software RAID setup. We opened T376949 for this, since so far the only issue we had arose from disk failures (the disk with the populated EFI partition breaks, and the other one can't boot); a manual sync sketch follows this list.
  • I checked dse-k8s-worker1014, which runs with raid1-2dev-efi.cfg, but the non-mounted EFI partition on the other disk is not populated. So an-test-coord1002 (mentioned above) has probably been done manually by someone.
  • I checked with Matthew and this issue is not always reproducible: sometimes it happens, sometimes things go fine.
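Until T376949 lands, a possible manual workaround is sketched below (assuming sda1 is the populated ESP and sdb1 the empty one; the debian-fallback label is just an example, and note that the dd clones the filesystem UUID too):

sudo dd if=/dev/sda1 of=/dev/sdb1 bs=1M conv=fsync   # byte-for-byte copy of the ESP
sudo efibootmgr -c -d /dev/sdb -p 1 -L debian-fallback -l '\EFI\debian\grubx64.efi'   # register a second boot entry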

I was also interested in this part:

The problem is that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure that the first partition of each of the two SSDs gets blanked), then after the first install the system will attempt to boot from disk, fail, and boot back into the installer - and then succeed after that.

This seems to me a special case of the main problem reported, since the wipe seems to lead to a cleaner boot failure that triggers another PXE install. Shouldn't we have seen this issue more broadly across our fleet? It doesn't seem to be specific to some hosts, unless the disk controller model of the swift hosts plays a role during boot.

Mentioned in SAL (#wikimedia-operations) [2025-10-08T15:16:57Z] <elukey> reboot ms-be1088 as a test for T404356

I checked ms-be1088's boot properties and the disk boot option is debian(SATA,Port:0), which IIUC is set by the Debian installer. It would be interesting to inspect this value when the issue occurs, to understand whether it changed.
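From a booted system, efibootmgr shows the firmware's view of this, so the value would be easy to capture before/after a failing reimage (the debian(SATA,Port:0) label should correspond to one of the Boot#### entries):

sudo efibootmgr -v   # lists BootOrder plus each Boot#### entry with its device path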

Matthew told me that ms-be2078 can be used for testing the reimage with UEFI, it is a Dell node with Legacy settings (so it needs to be reprovisioned, and its partman recipe needs to be updated).

Change #1194880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Change #1194880 merged by Elukey:

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Tests on ms-be2078 are blocked by T406964 :(

While checking the BIOS etc. settings for ms-be2078 (Dell), I noticed that the RAID controller's config utility explicitly shows which disk is marked as the boot device (identified by a serial/port combination), whereas I didn't find the same thing on ms-be1088 (Supermicro). I tried to follow T371400#10279452; I found a SAS 3816 config utility but didn't manage to get to the same level of detail, so there may be something I am missing.

The next step is to test multiple reimages on ms-be2078 and see if we can repro; I have the feeling that what Matthew reported is a Supermicro-specific problem.