
VM live migration failing for many/most VMs
Closed, Resolved · Public

Description

I'm trying to drain cloudvirts for reboots and live migration is behaving badly.

Some VMs migrate just fine

Most of the time the VM is partially migrated and started on the destination host (so I can see it in virsh and ps), but the VM on the old host is never stopped and nova reports 'migrating' status forever. This produces a whole lot of warnings about the VM running on a host where it doesn't belong.
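
For reference, a quick way to see which VMs are stuck like this (a sketch, assuming the standard openstack CLI with admin credentials; --status and --all-projects are ordinary server-list filters):

  # list every VM that nova still reports as migrating, across all projects
  openstack server list --all-projects --status MIGRATING --long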

Possible contributing factors:

  • I restarted all nodes in the control plane earlier

Attempted fixes:

Event Timeline

Restricted Application added a subscriber: Aklapper.
[None req-e95a879c-553e-4469-a2b3-ec2d3c0168c0 novaadmin admin - - default default] [instance: b54ae99c-fce8-4f0a-bb32-33d5a45e44b2] Live Migration failure: internal error: process exited while connecting to monitor: 2025-01-30T23:02:31.013230Z qemu-system-x86_64: -blockdev {"driver":"rbd","pool":"eqiad1-compute","image":"b54ae99c-fce8-4f0a-bb32-33d5a45e44b2_disk","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-compute","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: error connecting: Connection timed out: libvirt.libvirtError: internal error: process exited while connecting to monitor: 2025-01-30T23:02:31.013230Z qemu-system-x86_64: -blockdev {"driver":"rbd","pool":"eqiad1-compute","image":"b54ae99c-fce8-4f0a-bb32-33d5a45e44b2_disk","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-compute","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: error connecting: Connection timed out

Indeed, when the VM starts up on the new host it tries to contact cloudcephmons that no longer exist, so this is related to T383583.
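
A sketch of how to confirm that from the hypervisor side (assumes virsh on the cloudvirt and that the rbd disk source in the domain XML lists the mon endpoints; <instance-uuid> is a placeholder):

  # show the ceph mon endpoints baked into the running domain definition
  sudo virsh dumpxml <instance-uuid> | grep -A4 "protocol='rbd'"

If the old 10.64.20.6x addresses show up there, the destination qemu will time out exactly as in the log above.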

Need to figure out which (if any) of the following will resolve the issue (rough CLI equivalents are sketched after the list):

  • reboot (from w/in the VM)
  • hard reboot from horizon
  • cold migrate

And then sort out VMs that need this from those that don't, and schedule things.
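
Rough CLI equivalents of the last two options, for when we get to scheduling (a sketch, assuming the standard openstack client; the confirm step after a cold migration may differ between client versions):

  # hard reboot: tears down and recreates the qemu process (same as the Horizon button)
  openstack server reboot --hard <instance-uuid>

  # cold migrate: stop the VM, move it to another host, then confirm
  openstack server migrate <instance-uuid>
  openstack server resize confirm <instance-uuid>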

andrew@cloudcumin1001:~$ sudo cumin --force  "cloudvirt1*" 'ps -ef | grep 10.64.20.68'

shows the scope of the issue. Need to add some sed to that to extract a list of VM IDs.

andrew@cloudcumin1001:~$ sudo cumin --force "cloudvirt1*" "ps -ef | grep 10.64.20.68 | grep -v grep | sed 's/^.*-uuid //' | sed 's/ .*//'"
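
An equivalent single-grep version of that extraction (a sketch; assumes GNU grep on the cloudvirts and that each affected qemu process has '-uuid <instance-uuid>' on its command line):

  # per host: print just the instance UUIDs of VMs still pointing at the old mon
  ps -ef | grep 10.64.20.68 | grep -v grep | grep -oP '(?<=-uuid )[0-9a-f-]+' | sort -u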

Reboot from within the VM does not work; a hard reboot seems to work (tested on a single VM so far).

Change #1116059 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[cloud/wmcs-cookbooks@main] wmcs.toolforge.k8s.reboot: always do reboot --hard

https://gerrit.wikimedia.org/r/1116059

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T14:29:06Z] <andrewbogott> rebooting all k8s-nfs worker nodes for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T15:01:48Z] <andrewbogott> rebooting all k8s (non-nfs) worker nodes for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-01T15:14:39Z] <andrewbogott> hard rebooting all VMs for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T12:07:57Z] <andrewbogott> rebooting all servers for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T12:20:41Z] <andrewbogott> hard rebooted 6 workers for T385264

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T13:45:14Z] <andrewbogott> cold-migrating all remaining VMs in T385264 except for 'integration' and 'tools' VMs

Mentioned in SAL (#wikimedia-cloud) [2025-02-06T14:06:50Z] <andrewbogott> cold-migrating tools-proxy-8 for T385264; will cause a brief toolforge outage

Change #1116059 abandoned by Andrew Bogott:

[cloud/wmcs-cookbooks@main] wmcs.toolforge.k8s.reboot: always do reboot --hard

Reason:

This has served its purpose; we don't actually want to do this most of the time.

https://gerrit.wikimedia.org/r/1116059