
UEFI installer not installing grub correctly (at least on systems where / is RAID)
Open, MediumPublic

Description

I've been trying to install trixie onto sretest2010 (which was set up in T394357), and one of the problems I'm finding is that the installer isn't installing grub correctly, leading to a system that can't boot (or boots back into the installer). I also found this with another ms-be node (ms-be1083). These use LVM RAID1 for /.

The failure mode is that on reimage the node reboots after the installer has completed and fails into the grub rescue mode with an error like:

error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found.

The problem is that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure that the first partition of each of the two SSDs gets blanked), then after the first install the system will attempt to boot from disk, fail, and boot back into the installer - and then succeed after that.
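For reference, a minimal sketch of what "blanking the first partition" can look like from a rescue shell; sda1/sdb1 are assumptions for the two SSDs' ESPs - always confirm with lsblk before wiping anything:

sudo wipefs -a /dev/sda1 /dev/sdb1                 # drop the vfat signatures from both ESPs
sudo dd if=/dev/zero of=/dev/sda1 bs=1M count=16   # and/or zero the start of each partition
sudo dd if=/dev/zero of=/dev/sdb1 bs=1M count=16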

As best I can tell, the installation is not correctly ensuring that the first (i.e. /boot/efi) partition on both SSDs is written to (not surprisingly, I guess, given only one of them gets mounted), and so if the installer writes onto the "wrong" SSD and the system boots off the other one, then it has the wrong mduuid embedded.

When watching the installer, it does say that it's doing "grub-install sdm sdn" or similar, so it _ought_ to be attempting to write to both disks. Likewise, if you manage to get one of these systems to boot from the rescue prompt and then run grub-install from the booted system, it then seems to work reliably.
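For the record, a rough sketch of what "run grub-install from the booted system" looks like when both ESPs need populating (assuming /dev/sda1 is mounted at /boot/efi and /dev/sdb1 is the second, unmounted ESP):

sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
sudo mount /dev/sdb1 /mnt
sudo grub-install --target=x86_64-efi --efi-directory=/mnt --bootloader-id=debian   # populate the second ESP too
sudo umount /mnt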

It's not a problem on BIOS-booted systems (I think because there isn't a mounted /boot/efi involved?), but it is going to be a real problem if/when we start trying to reimage a bunch of these swift backends that boot via UEFI.

Event Timeline

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

Does /boot even need to be on a separate partition for UEFI booting?

Does /boot even need to be on a separate partition for UEFI booting?

No, however, the UEFI ESP partition does need to be on a separate partition with a FAT32 filesystem. The EFI firmware searches each drive for such a partition to discover EFI boot files. Debian only installs Grub on the ESP, so Grub in turn needs to be able to read the Linux kernel out of /boot. Grub does not care whether /boot is a separate partition or co-mingled with /, the main requirement is that the partition's filesystem is supported by Grub.
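A quick way to check this: the firmware locates the ESP by its partition type GUID rather than by any mount point, so something like the following (device names are examples) shows whether a disk carries one:

# the ESP partition type GUID is c12a7328-f81f-11d2-ba4b-00a0c93ec93b
lsblk -o NAME,PARTTYPE,FSTYPE,SIZE,MOUNTPOINT /dev/sda /dev/sdb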

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

I spent some time trying to debug the woes with this host, but the behavior is very strange.

Things I tried:

  1. Reset the BIOS to optimized defaults
  2. Re-installed the same version of the BIOS, while discarding all settings except SMBIOS
  3. Issued a cold reset to the BMC

None of my actions changed the behavior of the BMC; notably, issuing a reset /system1/pwrmgtsvc1 or a stop /system1/pwrmgtsvc1 command does not seem to have any effect.
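For completeness, the equivalent cold reset can usually also be issued out-of-band via IPMI (the BMC address and credentials below are placeholders):

ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> mc reset cold   # cold-resets the BMC itself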

To recap, it seems that we have two problems:

  1. For some mysterious reason, sretest2010 seems to have stopped working correctly at the BMC level (resets not happening, etc.). This is not great since we cannot easily test reimages, so we need to fix this problem first. Let's keep all BMC-related investigations in T394357.
  2. I had no problems installing the OS in T394357; the host was set with the standard-efi + raid1-2dev-efi configs before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185973. And I don't recall this issue happening in any of the previous UEFI installs, so I am wondering whether it is, for some reason, related to the partman early_command?
elukey triaged this task as Medium priority. Mon, Sep 29, 2:51 PM

@elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

@elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* nodes, it's going to be a real pain for any OS upgrade (which is getting urgent given we're still on bullseye...). Which is not to say the problem doesn't lie with something I wrote in the partman setup!

@MatthewVernon I totally agree; what I meant is to find a way to narrow down the possible source of the problem, not to dismiss your request :) Basically I'd like to test the host with its previous "standard" recipe again (once the hardware works) to figure out whether I missed the problem just by luck, or whether it is also reproducible with standard recipes. We'll find a solution; reimaging in this condition is not great and painful.

Change #1190674 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

Change #1190674 merged by MVernon:

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

A couple of notes, so I have a record of what I've done, and in case they're of any help!

I've just re-imaged ms-be1086 and ms-be1087 (both UEFI), and blanked the partition mounted as /boot/efi before reimage (which subsequently proceeded without problems). In both cases, after reimage there is a partition labelled as EFI System Partition on both system disks, e.g.:

mvernon@ms-be1086:~$ sudo blkid /dev/sda1 /dev/sdb1
/dev/sda1: UUID="B2CC-32D0" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="2b78ca7b-2ff2-443e-bb43-3bcc6db6dfbd"
/dev/sdb1: UUID="B2CB-38A7" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="96277097-e3e2-4a08-be8a-93ae18d50c4c"

But the not-mounted partition has an empty filesystem (i.e. you can mount it, but there is nothing in it). When watching the reimage, the installer does say something to the effect of "Running grub-install /dev/sda /dev/sdb" towards the end of the install process; but the result is seemingly an empty FS.
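To make the "empty FS" claim concrete, this is roughly the check I mean (sdb1 assumed to be whichever ESP is not mounted as /boot/efi):

mvernon@ms-be1086:~$ sudo mount /dev/sdb1 /mnt
mvernon@ms-be1086:~$ sudo ls -A /mnt        # no output, i.e. an empty filesystem
mvernon@ms-be1086:~$ sudo umount /mnt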

I looked at a system with standard-efi.cfg and raid1-2dev-efi.cfg - an-test-coord1002. As expected, it has /dev/sdb2 mounted as /boot/efi and also an EFI System Partition on /dev/sda2. I mounted it and compared the contents:

mvernon@an-test-coord1002:~$ sudo ls -l /mnt/EFI/debian/grubx64.efi
-rwxr-xr-x 1 root root 167936 Aug 21 22:20 /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo ls -l /boot/efi/EFI/debian/grubx64.efi
-rwx------ 1 root root 167936 Aug 22 17:25 /boot/efi/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /mnt/EFI/debian/grubx64.efi
a2119e99fceafce1de3488c5ddbde073  /mnt/EFI/debian/grubx64.efi
mvernon@an-test-coord1002:~$ sudo md5sum /boot/efi/EFI/debian/grubx64.efi
92f592110127ebea4829165012cff37e  /boot/efi/EFI/debian/grubx64.efi

The MOTD on this system tells me Debian GNU/Linux 12 auto-installed on Fri Aug 22 17:25:47 UTC 2025., which I think tells us that one EFI partition was set up during the install and the other at some other time. This would tend to support the theory that the installer is not currently writing the new EFI system partition to both system disks.

I had a look for obvious differences between the ms-be preseeding and the standard-efi + raid1-2dev-efi setups. raid1-2dev-efi sets d-i grub-installer/only_debian boolean false with a comment referring to Debian #666974, which is long-closed. And the partitioning makes a small biosgrub partition (which I don't think is necessary any more).
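For reference, that directive as it appears in the recipe looks like this (preseed syntax; the comment here paraphrases the one in the file):

# Debian #666974 is long-closed, so this override may no longer be needed:
d-i grub-installer/only_debian boolean false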

I also found a couple of notes on the Arch wiki and the Debian wiki about the issues with EFI on systems doing software RAID1 for their system disks.

Finally, given the current problems with sretest2010 (T394357), I've delayed returning ms-be1088 to service so @elukey can do some more investigations with it.

@MatthewVernon thanks for the write-up! FYI, Jesse is working on T376949, which should address your concerns about the EFI partition not being replicated. The thing that I don't get is why you see the "error: disk `mduuid/3207fa1071e844ffdc954a0ec74fddbd' not found." error, because we never really got anything like that before.

My best theory on that is that one install run writes the EFI partition to one disk (embedding that install's UUID), then a subsequent install run writes to the other disk (embedding the new UUID), leaving two EFI partitions the firmware can pick between, each looking for a different UUID.
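One hedged way to test this theory would be to compare what each ESP's grub image searches for (assuming the second ESP is mounted at /mnt; whether the mduuid shows up as a plain string depends on how the image was built):

sudo strings /boot/efi/EFI/debian/grubx64.efi | grep -i uuid   # embedded search hints, if visible
sudo strings /mnt/EFI/debian/grubx64.efi | grep -i uuid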

Trying to summarize the problem:

  • We know that the Debian installer doesn't copy the EFI partition to all the disks in a software RAID setup. We opened T376949 for this, since so far the only issue we had arose from disk failures (the disk with the populated EFI partition breaks, and the other one can't boot); a manual sync sketch follows this list.
  • I checked dse-k8s-worker1014, which runs with raid1-2dev-efi.cfg, but the non-mounted EFI partition on the other disk is not populated. So an-test-coord1002 (mentioned above) has probably been done manually by someone.
  • I checked with Matthew and this issue is not always reproducible: sometimes it happens, sometimes things go fine.
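Until T376949 lands, a possible manual workaround is sketched below (assuming sda1 is the populated ESP and sdb1 the empty one; the debian-fallback label is just an example, and note that the dd clones the filesystem UUID too):

sudo dd if=/dev/sda1 of=/dev/sdb1 bs=1M conv=fsync   # byte-for-byte copy of the ESP
sudo efibootmgr -c -d /dev/sdb -p 1 -L debian-fallback -l '\EFI\debian\grubx64.efi'   # register a second boot entry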

I was also interested in this part:

The problem is that the mduuid is from a previous install. Alternatively, if you've wiped enough disks correctly (the key thing being to make sure that the first partition of each of the two SSDs gets blanked), then after the first install the system will attempt to boot from disk, fail, and boot back into the installer - and then succeed after that.

This seems to me a special case of the main problem reported, since the wipe seems to lead to a cleaner boot failure that triggers another PXE install. Shouldn't we have seen this issue more broadly across our fleet? It doesn't seem to be specific to some hosts, unless the disk controller model of the swift hosts plays a role during boot.

Mentioned in SAL (#wikimedia-operations) [2025-10-08T15:16:57Z] <elukey> reboot ms-be1088 as a test for T404356

I checked ms-be1088's boot properties and the disk boot option is debian(SATA,Port:0), which IIUC is set by the Debian installer. It would be interesting to inspect this value when the issue occurs, to understand whether it changed.
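From a booted system, efibootmgr shows the firmware's view of this, so the value would be easy to capture before/after a failing reimage (the debian(SATA,Port:0) label should correspond to one of the Boot#### entries):

sudo efibootmgr -v   # lists BootOrder plus each Boot#### entry with its device path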

Matthew told me that ms-be2078 can be used for testing the reimage with UEFI, it is a Dell node with Legacy settings (so it needs to be reprovisioned, and its partman recipe needs to be updated).

Change #1194880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Change #1194880 merged by Elukey:

[operations/puppet@production] preseed: set ms-be2078 for UEFI

https://gerrit.wikimedia.org/r/1194880

Tests on ms-be2078 are blocked by T406964 :(

While checking the BIOS etc. settings for ms-be2078 (Dell), I noticed that the RAID controller's config utility explicitly shows which disk is marked as the boot device (identified by a serial/port combination), whereas I didn't find the same thing on ms-be1088 (Supermicro). I tried to follow T371400#10279452; I found a SAS 3816 config utility but didn't manage to get to the same level of detail, so there may be something I am missing.

The next step is to test multiple reimages on ms-be2078 and see if we can repro; I have the feeling that what Matthew reported is a Supermicro-specific problem.