OpenStack Compute (nova)

Live migrations failing due to remote host identification change

Bug #1969971 reported by Paul Goins on 2022-04-22

This bug affects 3 people

Affects                                Status       Importance  Assigned to         Milestone
OpenStack Compute (nova)               New          Undecided   Unassigned
OpenStack Nova Cloud Controller Charm  In Progress  Undecided   Edward Hope-Morley

Bug Description

I've encountered a cloud where, for some reason (maybe a redeploy of a compute; I'm not sure), I'm hitting this error in nova-compute.log on the source node for an instance migration:

2022-04-22 10:21:17.419 3776 ERROR nova.virt.libvirt.driver [-] [instance: <REDACTED INSTANCE UUID>] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:<REDACTED FINGERPRINT>.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /root/.ssh/known_hosts:97
remove with:
ssh-keygen -f "/root/.ssh/known_hosts" -R "<REDACTED IP>"
RSA host key for <REDACTED IP> has changed and you have requested strict checking.
Host key verification failed.: Connection reset by peer: libvirt.libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

This interferes with instance migration.

There is a workaround:
* From the source node, manually ssh to the destination node as both the root and nova users.
* Manually clear the offending known_hosts entries reported by the SSH command.
* Verify that once cleared, the root and nova users are able to successfully connect via SSH.
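
The clearing step in the workaround above can be sketched in Python. This is only an illustration, assuming an unhashed known_hosts file; the function name drop_host_entries is hypothetical, and the ssh-keygen -R command printed in the error output remains the normal way to do this.

```python
def drop_host_entries(known_hosts_text, host):
    """Return known_hosts content without the lines whose host field names `host`.

    Assumption: host names are stored in plain text (not hashed). A line that
    lists several names (e.g. "node1,10.0.0.6") is dropped as a whole.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        fields = stripped.split()
        # A line may start with a marker such as @cert-authority or @revoked,
        # in which case the host field is the second one.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        if host in host_field.split(","):
            continue  # stale entry for this host: drop it
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

The filtered text would then be written back for both the root and nova users before re-testing the SSH connection.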

Obviously, this is cumbersome in the case of clouds with high numbers of compute nodes. It'd be better if the charm was able to avoid this issue.

Alex Kavanagh (ajkavanagh) wrote on 2022-04-22:

Nova-cc has an action to redo all of the host keys when redeploying etc. Check out the "clear-unit-knownhost-cache" action. Also check whether hostname caching is on (config "cache-known-hosts=true"). If this is set to true (the default), then changes in hosts or DNS resolution will result in stale information on the nova-compute units.

If it's neither of those things, then we have a bug.

Paul Goins (vultaire) wrote on 2022-04-22:

Thanks Alex - I feel like I've seen that action before but forgot about it.

Confirmed that cache-known-hosts=true. I'll see if the action fixes things; probably it will.

Paul Goins (vultaire) wrote on 2023-08-30:

Hello Alex - I found this bug in the wake of an issue we had on another cloud, and while it manifested in a slightly different way this time, the end result is the same: migrations failing because of issues with the SSH known_hosts file not being fully prepared to allow prompt-less SSH access.

First, let me say: I think the problem is *partially* addressed by the config change and action you mention. However, it wasn't enough for this particular cloud; I have evidence that improvements may be needed.

On this cloud, after the migration problem was reported to us, we set cache-known-hosts=false to turn off hostname caching, and followed that by the clear-unit-knownhost-cache action. And it looks like that works as expected. Here is a sanitized version of the output from the clear-unit-knownhost-cache action:

$ juju show-action-output 12345
UnitId: nova-cloud-controller/1
id: "97791"
results:
  Stderr: |
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    [...]
  units-updated: '[{''nova-compute-kvm/1'': ''<REDACTED>''}, [...]
status: completed
timing:
  completed: 2023-08-29 17:20:20 +0000 UTC
  enqueued: 2023-08-29 17:19:39 +0000 UTC
  started: 2023-08-29 17:19:39 +0000 UTC

We can see clearly that the script pulled the private-address IP and also the hostname and created entries against both - which is exactly what we want.

However, here's the nuance: the hostname matches neither what's in "openstack hypervisor list" nor what's in "openstack host list".

# Again, sanitized
$ openstack hypervisor list
+----+-------------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname     | Hypervisor Type | Host IP       | State |
+----+-------------------------+-----------------+---------------+-------+
|  1 | site2-rack3-node15.maas | QEMU            | 10.1.2.15     | up    |
[...]
+----+-------------------------+-----------------+---------------+-------+

As you can see above, there's a .maas domain suffix. That wouldn't have been pre-seeded - and indeed, instance migrations fail without those entries since the hostname field in the relations doesn't match the hostnames used in OpenStack.

So - I think we have a bug here with regards to how hostnames are handled in the known_hosts file generation process.
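
A minimal sketch of the mismatch, using the sanitized values from this report. The helper name is hypothetical; it only compares the names nova reports against the names that were seeded into known_hosts.

```python
def hosts_missing_from_known_hosts(nova_hosts, known_hosts_names):
    """Return the nova host names that have no matching known_hosts entry."""
    known = set(known_hosts_names)
    return [host for host in nova_hosts if host not in known]


# Names seeded by the action (from the action output above).
seeded = ["10.1.2.15", "site2-rack3-node15"]
# Name reported by "openstack hypervisor list".
nova_hosts = ["site2-rack3-node15.maas"]

print(hosts_missing_from_known_hosts(nova_hosts, seeded))
# prints ['site2-rack3-node15.maas']
```

Any name returned here is one for which SSH would still fail host key verification.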

Nishant Dash (dash3) wrote on 2023-09-21:

I can confirm I see this across multiple deployments.
From what I understand, n-c-c is pulling the hostname from the relation data of `--endpoint cloud-compute`, which has plain hostnames, whereas nova uses the FQDN when running commands, for example during a resize.

1. n-c-c endpoint
nova-compute/x:
        in-scope: true
        data:
          availability_zone: zone2
          egress-subnets: ip/32
          hostname: hostname

2. Performing a resize fails with a host key verification failure:
oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\nCommand: scp -C -r hostname.maas:/var/lib/nova/instances/_base/xyzw /var/lib/nova/instances/_base/abcdefgh\nExit code: 1\nStdout: \'\'\nStderr: \'Host key verification failed.

The scp command above uses the FQDN, which is exactly what Paul outlined: both `hypervisor list` and `host list` report the FQDN.

Additionally, I see both focal ussuri and jammy yoga affected.

Giuseppe Petralia (peppepetra) wrote on 2023-09-21 (last edit on 2023-09-21):

The issue described in comment #3 is affecting both:

- focal-ussuri (charm nova-cloud-controller ussuri/stable rev. 680)

- jammy-yoga (charm nova-cloud-controller yoga/stable rev. 634)

Nova-cloud-controller configures prompt-less SSH access only for the hostname and private IP of each compute.

But nova then uses the FQDN for the "scp" needed by resizes and live migrations, so both fail.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2023-09-21:

Maybe this patch has to be ported over to NCC? https://github.com/openstack/charm-nova-compute/commit/2bad8a0522622e9da621a28912faa42efa27d033

Giuseppe Petralia (peppepetra) wrote on 2023-09-21 (last edit on 2023-09-21):

We have verified that host in nova.conf already uses the FQDN.

Also, the issue occurs only when resizing or migrating VMs after the original image was deleted.

The error only occurs when the resize or migration includes copying the base file from the original host at

/var/lib/nova/instances/_base

When the original image has been deleted from glance, nova falls back to copying it from the host:

https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9452

and it uses instance.host as source, which is the FQDN of the compute node:

https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9620

And the charm is not configuring prompt-less ssh for the FQDN.

Giuseppe Petralia (peppepetra) wrote on 2023-09-22:

Update on this issue.

n-c-c is configuring the host keys of each node for the following entries:

* private-address of the compute node on the cloud-compute relation, which in our env is the internal space

* hostname on cloud-compute relation data (which is the hostname w/o domain)

* reverse lookup entry of the private-address, which in maas environments returns the fqdn with the interface name at the beginning:

  ```
  >>> import charmhelpers.contrib.openstack.utils as ch_utils
  >>> print(ch_utils.get_hostname("192.168.52.50"))
  bond0.123.my-host.maas
  ```

The correct entry is only returned for a reverse lookup on the oam space, which is the boot interface.
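
The reverse-lookup behaviour described above can be illustrated with a small standard-library sketch. It uses socket.gethostbyaddr as a stand-in and is not the charmhelpers implementation; the function names are hypothetical, and "my-host.maas" is assumed to be the name nova uses for this node.

```python
import socket


def reverse_name(address, lookup=socket.gethostbyaddr):
    """Return the primary name from a reverse lookup, or None if it fails."""
    try:
        name, _aliases, _addresses = lookup(address)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    return name.rstrip(".")


def reverse_matches(address, nova_host, lookup=socket.gethostbyaddr):
    """True if the reverse lookup of `address` yields the name nova uses."""
    return reverse_name(address, lookup=lookup) == nova_host


# Stub resolver reproducing the result shown above, so the example runs
# without depending on real DNS.
def stub_lookup(address):
    return ("bond0.123.my-host.maas", [], [address])


print(reverse_matches("192.168.52.50", "my-host.maas", lookup=stub_lookup))
# prints False
```

A False result means the reverse-lookup entry in known_hosts does not cover the name nova will connect to.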

Edward Hope-Morley (hopem) wrote on 2023-10-06:

The nova-cloud-controller charm will create hostname, fqdn and ip address entries for each compute host. It does so using the 'private-address' and 'hostname' settings on the cloud-compute relation. private-address will be the address resolvable from libvirt-migration-network (if configured), otherwise the unit private-address.

Here comes the problem: the hostname added to known_hosts will be from relation 'hostname' BUT the hostname fqdn will be resolved from private-address. This means that if Nova compute is configured to use network X for its management network and libvirt-migration-network is set to a different network, the fqdn in known_hosts will be from the latter. This is all good until nova-compute needs to do a vm resize and the image used to build the vm no longer exists in Glance, at which point Nova will use the instance.host from the database to perform an scp from source to destination, and this fails because this hostname (fqdn from the management network) is not in known_hosts.

This is something that Nova should ultimately have support for but in the interim the suggestion is that nova-cloud-controller always adds the management network fqdn to known_hosts.
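
As a rough sketch of the interim suggestion, and not the charm's actual code: the set of names written to known_hosts could be the relation hostname, the private-address, the fqdn resolved from the private-address, plus the management network fqdn. All names, addresses and the function below are hypothetical.

```python
def known_hosts_names(relation_hostname, private_address, resolve,
                      management_fqdn=None):
    """Build the ordered, de-duplicated list of names to seed in known_hosts.

    `resolve` is any callable mapping an address to an fqdn (or None).
    """
    names = [relation_hostname, resolve(private_address), private_address]
    if management_fqdn:
        names.append(management_fqdn)
    # dict.fromkeys keeps first-seen order while removing duplicates;
    # unresolved (None) results are skipped.
    return list(dict.fromkeys(name for name in names if name))


# Hypothetical data: private-address sits on the migration network.
resolver = {"10.20.0.5": "node1.migration.example"}.get

before = known_hosts_names("node1", "10.20.0.5", resolver)
after = known_hosts_names("node1", "10.20.0.5", resolver,
                          management_fqdn="node1.mgmt.example")

print("node1.mgmt.example" in before)  # prints False
print("node1.mgmt.example" in after)   # prints True
```

With only the current inputs, the management fqdn that Nova uses for the scp is absent; adding it explicitly closes that gap.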
