OpenStack Compute (nova)

Comment 17 for bug 1837877

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-07-25: Re: Error message reveals ceph information

#17

Ugh, OK. So in this particular case, the instance goes to ERROR state from the reboot failure here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/manager.py#L3537

The exception is re-raised and the fault (libvirtError) gets recorded by the wrap_instance_fault decorator:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/manager.py#L3448

This is what sets the message field on the InstanceFault:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L109

This is likely where the libvirtError gets turned into the message:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L78

Trying to play whack-a-mole with all of the places that could raise unexpected exceptions from lower layers, like the hypervisor in this case, is going to be a losing game. We should probably just restrict the fault message to admin-only since we don't audit them and this is python so explicitly handling typed exceptions isn't a thing.

Deciding *where* to obfuscate the message is the question I think. We could say that for any non-NovaException types (where format_message() is implemented), we could just include the message if context.is_admin here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L78

But that won't fix old compute nodes until you patch them. It also doesn't mean that NovaExceptions handled here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L73

Wouldn't also expose something from a lower level because we have several exceptions like this:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/exception.py#L267

Where whatever lower level error happens gets turned into the "reason" value.

Given that, it probably makes most sense to just add a new policy rule that is checked in the API to determine if the fault message can be shown here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/api/openstack/compute/views/servers.py#L553

And the policy rule would default to admin-only. A policy rule is better since deployments could tailor it so non-admin support personal could still see those details but not the tenant user that owns the server.

If we do that, I'd also fix the logic here so we rely on policy rather than this janky code logic:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/api/openstack/compute/views/servers.py#L562

Ugh, OK. So in this particular case, the instance goes to ERROR state from the reboot failure here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/manager.py#L3537

The exception is re-raised and the fault (libvirtError) gets recorded by the wrap_instance_fault decorator:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/manager.py#L3448

This is what sets the message field on the InstanceFault:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L109

This is likely where the libvirtError gets turned into the message:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L78

But that won't fix old compute nodes until you patch them. It also doesn't mean that NovaExceptions handled here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/compute/utils.py#L73

Wouldn't also expose something from a lower level because we have several exceptions like this:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/exception.py#L267

Where whatever lower level error happens gets turned into the "reason" value.

Given that, it probably makes most sense to just add a new policy rule that is checked in the API to determine if the fault message can be shown here:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/api/openstack/compute/views/servers.py#L553

If we do that, I'd also fix the logic here so we rely on policy rather than this janky code logic:

https://github.com/openstack/nova/blob/2c0cb71fb0ac0d502dc9fed24211e1ef15407b8f/nova/api/openstack/compute/views/servers.py#L562