nova compute service does not reset instance with task_state in rebooting_hard

Bug #1999674 reported by Pierre-Samuel LE STANG
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
When a user requests a hard reboot of a running instance while nova-compute is unavailable (service stopped or host down), the instance can, under certain conditions, stay in the rebooting_hard task_state after nova-compute starts again.

The issue occurs when the RabbitMQ message-ttl on the queue is lower than the time needed to bring nova-compute back up: the reboot message expires and is dropped before nova-compute can consume it.
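
The timing condition can be illustrated with a minimal sketch. This is only a model of the race described above, not RabbitMQ or nova code; the `QueuedMessage` class and `is_delivered` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuedMessage:
    """A reboot request waiting in the compute host's queue (hypothetical model)."""
    published_at: float  # seconds: when the API published the message
    ttl: float           # seconds: the queue's message-ttl


def is_delivered(msg: QueuedMessage, consumer_back_at: float) -> bool:
    """Return True if the consumer reconnects before the message expires."""
    return consumer_back_at - msg.published_at < msg.ttl


# Hard reboot requested at t=10s while nova-compute is down, message-ttl of 60s.
msg = QueuedMessage(published_at=10.0, ttl=60.0)
print(is_delivered(msg, consumer_back_at=40.0))   # compute back after 30s
print(is_delivered(msg, consumer_back_at=100.0))  # compute back after 90s
```

In the second case the message is gone by the time nova-compute returns, so nothing ever processes the request and the task_state set when the reboot was requested is never cleared.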

Steps to reproduce
==================

Prerequisites:
* Set a low message-ttl (for example, 60 seconds) in RabbitMQ
* Have a running instance on a host

First case: failure of the nova-compute service
1/ stop the nova-compute service on the host
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ start the nova-compute service
5/ check the instance task_state and status

Second case: failure of the host
1/ hard shutdown the host (for example, a power supply issue)
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ restart the host
5/ check the instance task_state and status

Expected result
===============
Since the message was lost, we expect nova-compute to reset the instance state to active on startup, so that the user can take other actions on the instance.

Actual result
=============
The instance is stuck in the rebooting_hard task_state and the user is blocked.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867807

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867832

Changed in nova:
status: New → In Progress

sean mooney (sean-k-mooney) wrote :

If we know the compute service is down, we should probably reject this in the API.

If the message bus disconnects, we expect the mandatory flag to detect that at the RPC level.

I have not triaged this fully, but I'm not sure resetting the state is the correct approach.

Pierre-Samuel LE STANG (pslestang) wrote :

We don't have the compute service status in real time, so it's hard to rely on it.

REBOOTING_HARD is also a transient status, so it makes sense to handle it along with the other transient statuses.

Arnaud Morin (arnaud-morin) wrote :

The message bus disconnection appears only after a timeout, so nova-compute will be reported down only after a defined period of time.

If an API call requesting a hard reboot is made during that window, nova still sends the message to nova-compute over the message bus.

But if the message TTL is too short, the message can be dropped by the queue system (RabbitMQ) before nova-compute is up again.

In that scenario, the only possible action is to reset-state the instance from an admin context.

We (OVHcloud) are going to have this patch downstream, but we think this would be nice to consider having it upstream as well.

I don't think we can rely on nova-api knowing that nova-compute is down or that the message bus is disconnected, because this may not always be true.

Moreover, we already reset state when status is PAUSING, UNPAUSING, etc., why not for REBOOTING_HARD?

Is there any better approach you can see?
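
The idea discussed in this thread, clearing REBOOTING_HARD at startup the same way as other transient states, could be sketched roughly as follows. This is a simplified illustration, not the proposed patch and not nova's actual code: the `Instance` class, the `TRANSIENT_TASK_STATES` set and the `clear_stale_task_state` function are hypothetical, and the real compute manager would also have to consider power state and persistence.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of transient task states to clear at start-up.
# The names follow the states mentioned in this report; the set itself
# is an assumption made for illustration.
TRANSIENT_TASK_STATES = {"pausing", "unpausing", "rebooting_hard"}


@dataclass
class Instance:
    """Minimal stand-in for an instance record (hypothetical)."""
    uuid: str
    vm_state: str
    task_state: Optional[str]


def clear_stale_task_state(instance: Instance) -> bool:
    """Clear a transient task_state left behind by a lost request.

    Returns True if the instance was modified, False otherwise.
    """
    if instance.task_state in TRANSIENT_TASK_STATES:
        instance.task_state = None
        return True
    return False


stuck = Instance(uuid="hypothetical-uuid", vm_state="active",
                 task_state="rebooting_hard")
clear_stale_task_state(stuck)
print(stuck.task_state)  # None
```

With the task_state cleared, the user could issue a new action on the instance without an admin having to run reset-state.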
