
VM live migration failing for many/most VMs
Closed, Resolved · Public

Description

I'm trying to drain cloudvirts for reboots and live migration is behaving badly.

Some VMs migrate just fine

Most of the time the VM is partially migrated and started on the destination host (so I can see it in virsh and ps), but the VM on the old host is never stopped and nova reports 'migrating' status forever. This produces a whole lot of warnings about the VM running on a host where it doesn't belong.
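
For reference, a quick way to see which VMs are stuck like this (a sketch, assuming the standard openstack CLI with admin credentials; --status and --all-projects are ordinary server-list filters):

  # list every VM that nova still reports as migrating, across all projects
  openstack server list --all-projects --status MIGRATING --long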

Possible contributing factors:

  • I restarted all nodes in the control plane earlier

Attempted fixes:

Event Timeline

Restricted Application added a subscriber: Aklapper.
[None req-e95a879c-553e-4469-a2b3-ec2d3c0168c0 novaadmin admin - - default default] [instance: b54ae99c-fce8-4f0a-bb32-33d5a45e44b2] Live Migration failure: internal error: process exited while connecting to monitor: 2025-01-30T23:02:31.013230Z qemu-system-x86_64: -blockdev {"driver":"rbd","pool":"eqiad1-compute","image":"b54ae99c-fce8-4f0a-bb32-33d5a45e44b2_disk","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-compute","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: error connecting: Connection timed out: libvirt.libvirtError: internal error: process exited while connecting to monitor: 2025-01-30T23:02:31.013230Z qemu-system-x86_64: -blockdev {"driver":"rbd","pool":"eqiad1-compute","image":"b54ae99c-fce8-4f0a-bb32-33d5a45e44b2_disk","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-compute","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: error connecting: Connection timed out

Indeed, when the VM starts up on the new host it tries to contact cloudcephmons that no longer exist, so this is related to T383583.
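
A sketch of how to confirm that from the hypervisor side (assumes virsh on the cloudvirt and that the rbd disk source in the domain XML lists the mon endpoints; <instance-uuid> is a placeholder):

  # show the ceph mon endpoints baked into the running domain definition
  sudo virsh dumpxml <instance-uuid> | grep -A4 "protocol='rbd'"

If the old 10.64.20.6x addresses show up there, the destination qemu will time out exactly as in the log above.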

Need to figure out which (if any) of the following will resolve the issue (rough CLI equivalents are sketched after the list):

  • reboot (from w/in the VM)
  • hard reboot from horizon
  • cold migrate

And then sort out VMs that need this from those that don't, and schedule things.
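
Rough CLI equivalents of the last two options, for when we get to scheduling (a sketch, assuming the standard openstack client; the confirm step after a cold migration may differ between client versions):

  # hard reboot: tears down and recreates the qemu process (same as the Horizon button)
  openstack server reboot --hard <instance-uuid>

  # cold migrate: stop the VM, move it to another host, then confirm
  openstack server migrate <instance-uuid>
  openstack server resize confirm <instance-uuid>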

andrew@cloudcumin1001:~$ sudo cumin --force  "cloudvirt1*" 'ps -ef | grep 10.64.20.68'

shows the scope of the issue. Need to add some sed to that to extract a list of VM IDs.

andrew@cloudcumin1001:~$ sudo cumin --force "cloudvirt1*" "ps -ef | grep 10.64.20.68 | grep -v grep | sed 's/^.*-uuid //' | sed 's/ .*//'"
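
An equivalent single-grep version of that extraction (a sketch; assumes GNU grep on the cloudvirts and that each affected qemu process has '-uuid <instance-uuid>' on its command line):

  # per host: print just the instance UUIDs of VMs still pointing at the old mon
  ps -ef | grep 10.64.20.68 | grep -v grep | grep -oP '(?<=-uuid )[0-9a-f-]+' | sort -u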

Reboot from within the VM does not work; a hard reboot seems to work (tested on a single VM so far).

Change #1116059 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[cloud/wmcs-cookbooks@main] wmcs.toolforge.k8s.reboot: always do reboot --hard

https://gerrit.wikimedia.org/r/1116059

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T14:29:06Z] <andrewbogott> rebooting all k8s-nfs worker nodes for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T15:01:48Z] <andrewbogott> rebooting all k8s (non-nfs) worker nodes for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T15:14:39Z] <andrewbogott> hard rebooting all VMs for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T12:07:57Z] <andrewbogott> rebooting all servers for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T12:20:41Z] <andrewbogott> hard rebooted 6 workers for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T13:45:14Z] <andrewbogott> cold-migrating all remaining VMs in T385264 except for 'integration' and 'tools' VMs

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T14:06:50Z] <andrewbogott> cold-migrating tools-proxy-8 for T385264; will cause a brief toolforge outage

Change #1116059 abandoned by Andrew Bogott:

[cloud/wmcs-cookbooks@main] wmcs.toolforge.k8s.reboot: always do reboot --hard

Reason:

This has served its purpose; we don't actually want to do this most of the time.

https://gerrit.wikimedia.org/r/1116059