Ironic node stuck in locked state

Bug #2021995 reported by Scott Solkhon
This bug affects 1 person
Affects         Status      Importance  Assigned to     Milestone
Ironic          Confirmed   Undecided   Jay Faulkner    -
Sushy           Confirmed   High        Unassigned      -
proliantutils   New         Undecided   Unassigned      -

Bug Description

If an Ironic node is stuck in a PXE loop, i.e. the node is available but its power has been left on for whatever reason, the following will happen:

- Node boots
- Fails to get an IP / PXE (no active port in Neutron / DHCP)
- Node reboots
- Fails to get an IP / PXE
- and so on...

In Ironic, the power status check runs, sees that the node's power state has changed, and tries to update it. I can see this in the logs:

May 25, 2023 @ 19:36:19.477 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:32:32.996 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 19:32:32.996 Failed to change power state of node <node-uuid> to 'power on', attempt 2 of 3.: ironic.common.exception.PowerStateFailure: Failed to set node power state to power on.
May 25, 2023 @ 19:32:32.995 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:23:13.880 Failed to change power state of node <node-uuid> to 'power on', attempt 1 of 3.: ironic.common.exception.PowerStateFailure: Failed to set node power state to power on.
May 25, 2023 @ 19:23:13.880 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 19:23:13.880 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:12:24.744 Successfully set node <node-uuid> power state to power on by power on.
May 25, 2023 @ 19:10:17.252 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:10:17.252 The node <node-uuid> operation of 'power on' is completed in 240 seconds.
May 25, 2023 @ 19:06:04.971 Successfully set node <node-uuid> power state to power on by power on.
May 25, 2023 @ 18:24:27.062 The node <node-uuid> operation of 'power on' is completed in 378 seconds.
May 25, 2023 @ 18:24:27.062 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 18:17:11.648 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 18:17:11.648 Failed to change power state of node <node-uuid> to 'power on', attempt 1 of 3.: ironic.common.exception.PowerStateFailure: Failed to set node power state to power on.

After some time this seemed to stop, and the power state sync stopped running for a few days, until I picked up this issue.
The output of the node on the CLI is:

+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| allocation_uuid | None |
| automated_clean | None |
| bios_interface | ilo |
| boot_interface | ilo-pxe |
| clean_step | {} |
| conductor | <redacted> |
| conductor_group | <redacted> |
| driver | ilo |
| instance_uuid | None |
| last_error | None |
| maintenance | False |
| maintenance_reason | None |
| management_interface | ilo |
| network_data | {} |
| network_interface | neutron |
| owner | None |
| power_interface | ilo |
| power_state | power on |
| provision_state | available |
| provision_updated_at | 2023-05-25T02:25:38+00:00 |
| rescue_interface | no-rescue |
| reservation | <redacted> <- there was an active reservation here |
| target_power_state | power on |
| target_provision_state | None |
| updated_at | 2023-05-25T18:32:01+00:00 |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

As this node is available and not in maintenance, Nova / Placement tries to build on it, and I could see it failing in the Nova logs:

May 30, 2023 @ 16:55:26.365 Failed to reserve node <redacted> when provisioning the instance <redacted>: openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://<redacted>:6385/v1/nodes/<redacted>, Node <redacted> is locked by host <redacted>, please retry after the current operation is completed.
May 30, 2023 @ 16:55:26.364 [instance: <redacted>] Claim successful on node <redacted>

So, in summary:

- The node looks like it's available
- The node is technically available, although it keeps trying to PXE boot and gets nothing from Neutron DHCP
- The node has a stale lock
- If we try to build an instance on the node we get an error, but no state change in Ironic

I could disable the power status checks / the automatic power on or off when the value reported by the BMC doesn't match what Ironic expects, to prevent this from happening, but I am also curious how Ironic ended up in a situation where the lock was never released on this node, and how it ended up powered on when it should be powered off at the end of a cleaning cycle. Maybe we could add a periodic task to check for stale locks and release them? I know that could open up a can of worms, but it's something that should be considered.

The only way for me to release this lock was to restart the conductor.
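
As a stopgap on the detection side, here is a minimal sketch (illustrative only, not a proposed Ironic feature) that lists nodes still holding a conductor reservation while sitting in the stable 'available' state, via the standard /v1/nodes API. IRONIC_URL, TOKEN and the microversion header are placeholders for your own deployment:

    import requests

    IRONIC_URL = 'https://ironic.example.com:6385'  # placeholder endpoint
    TOKEN = '<keystone-token>'                      # placeholder auth token

    resp = requests.get(
        IRONIC_URL + '/v1/nodes',
        params={'fields': 'uuid,provision_state,reservation,maintenance'},
        headers={'X-Auth-Token': TOKEN,
                 # adjust to a microversion your deployment supports
                 'X-OpenStack-Ironic-API-Version': '1.72'},
        timeout=30,
    )
    resp.raise_for_status()

    # A reservation held while a node sits in 'available' with nothing in
    # flight matches the "stale lock" symptom above; whether it really is
    # stale still needs a human (or, as noted, a conductor restart).
    for node in resp.json()['nodes']:
        if node['reservation'] and node['provision_state'] == 'available':
            print('possible stale lock: %s held by %s'
                  % (node['uuid'], node['reservation']))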

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

I'm talking with Scott and trying to reproduce this.

Changed in ironic:
assignee: nobody → Jay Faulkner (jason-oldos)
status: New → Triaged
Revision history for this message
Jay Faulkner (jason-oldos) wrote :

The logs are inconclusive as to whether the power status loop kept running: logging settings are such that a log line is only printed when something changes state, and historically the only node impacted by the power loop on that conductor was the one that failed here.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ironic/+/885549

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

So I've taken a look at this, and I think there are two distinct possibilities:

1) Somehow the iLO was in an unknown/bad state, and in that entire sequence we somehow short-circuit out of ilo_common in the ironic drivers. This could also just be our interpretation of the internal wait code. https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/ilo/power.py

2) Or more likely, and a distinct issue in itself, the proliantutils code could hang on issuing the requests call.

https://opendev.org/x/proliantutils/src/branch/master/proliantutils/rest/v1.py#L161-L178

In that code, no timeout value is set. With no timeout, if BMC connectivity has been lost at any point while the three-way handshake for the socket was still initializing, the call could be stuck forever, which would align with the issue overall. If bandit were being run against proliantutils and it understood how the calls were being made, it would raise https://bandit.readthedocs.io/en/latest/plugins/b113_request_without_timeout.html as a warning on this code.

The lock would never be released and the node management would be halted until the conductor was restarted.
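
For illustration, a minimal sketch of the kind of change point 1 below implies, assuming a plain python-requests call; the helper name and timeout value are hypothetical and not proliantutils' actual API:

    import requests

    # Hypothetical helper, not proliantutils' real code: without timeout= a
    # request to a BMC that stops responding mid-handshake can block forever;
    # with it, requests raises requests.exceptions.Timeout and the caller can
    # fail the power sync attempt instead of holding the node lock forever.
    BMC_REQUEST_TIMEOUT = 60  # seconds; illustrative value

    def bmc_get(url, **kwargs):
        kwargs.setdefault('timeout', BMC_REQUEST_TIMEOUT)
        return requests.get(url, **kwargs)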

I believe the right course of action is to:

1) Add an explicit timeout in proliantutils for all API calls which utilize python-requests.
2) Add additional logging around any unexpected states being detected and potentially back-off the power sync/state change assertion at that point if detected.
3) Simplify the power status sync code... it just feels a bit complicated and makes it difficult to debug.
4) Simplify the proliantutils/rest/v1.py file to also leverage explicit handlers (i.e. not use getattr) off of python-requests *AND* run bandit as part of CI.

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

There are two cases that could've occurred to cause this kind of failure:

1: A temporary network failure to the BMC could cause the conductor to hang indefinitely due to the lack of timeouts in the proliantutils library (when connecting to remote BMCs over a network).

2: Due to the reboot-looping nature of the node, it's possible we never reached the POST state indicated here ==> https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/ilo/power.py#L140 and the node is instead returning some different status.

Fixing #1 will require significant changes in proliantutils, but should be undertaken as lack of timeouts can cause all kinds of lockups and failure scenarios that are unrecoverable without a conductor restart.

There is no real fix for #2 without more specific information, if it's even the bug, but to help find out whether it's happening, https://review.opendev.org/c/openstack/ironic/+/885549 will begin logging the actual responses from the iLO in the failure scenario so we can troubleshoot more deeply without reproducing the error on the hardware.

Another fix to attempt, if this is a new enough device (it probably is if you're on iLO 9 or later), would be to run the redfish driver instead, which properly implements timeouts to prevent #1, and uses standard states, which would prevent #2.

Changed in ironic:
status: Triaged → Confirmed
Revision history for this message
Mallikarjuna Reddy Kolagatla (mallikarjuna.kolagatla) wrote :

Any idea which Proliant hardware this issue is seen on? Is it Proliant Gen9 or Proliant Gen10 hardware?

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

I believe it's a Proliant Gen10, from my conversations with Scott. Not sure of the exact model.

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

>>2) Or more likely, and a distinct issue in itself, the proliantutils code could hang on issuing the requests call.
>>https://opendev.org/x/proliantutils/src/branch/master/proliantutils/rest/v1.py#L161-L178
From preliminary investigation,

This code is used only for Proliant Gen9.
For Proliant Gen10, proliantutils relies on sushy for the connection calls to the hardware. All DMTF implementations done earlier in sushy are used directly by proliantutils for Proliant Gen10. If any new implementation was done in sushy in the recent past, it may not have been picked up in proliantutils.

Check this:
https://opendev.org/x/proliantutils/src/branch/master/proliantutils/redfish/connector.py.

So in this https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/ilo/power.py#L140, it is quite possible that the Ironic node doesn't get the power state, as the iLO is looping through reboots before it completes POST.

@JayF, it would help if you could confirm the hardware (Gen9/Gen10) so the triage can conclude.

Revision history for this message
Aaron S (as0) wrote :

Hi Jay, Nisha -

I work with Scott and can confirm this issue is observed on HP Gen10+ hardware - we haven't observed it with Gen9 AFAIK.

best,
Aaron

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

@Aaron/@Scott,

Would it be possible to come on IRC so that we can get detailed info if required?

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

@Aaron,

Are you saying that the issue is seen with Gen10+, or do we need to wait for Scott to confirm?

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

Aaron, Scott and I can all be trusted for first-party information on this bug. It's a Gen 10 :)

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

Hi Scott/JayF,

Please let us know the iLO firmware version for the node so that we can initiate the discussion with firmware team.

regards
Nisha

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

I re-read the issue. To me there are two problems:
- iLO sometimes fails to power on. ----> Unable to reproduce the issue so far. The iLO firmware version will probably help us reproduce it. And is the issue always reproducible on that iLO version?
- The node reservation is still not released for some reason, but the node is in the available state, so Placement tries to deploy to it again and that fails with a conflict saying "Node is locked by <redacted>". This is an Ironic issue.

Revision history for this message
Scott Solkhon (scott.solkhon) wrote :

@Nisha - Thanks for helping look into this. The iLO firmware version for this particular node is 2.70. I can't say if it's tied to a specific version, as I haven't done a deep dive every time we have seen this issue, but generally all of the iLOs in this environment are running the same version, so I imagine it could have happened to any of them.

Thanks,
Scott

Revision history for this message
Doug Szumski (dszumski) wrote :

Some observations on this, all strictly for Gen10 HW:

- I agree that all calls are going via Sushy, so we should get the 60-second default request timeout. I see "Using RedfishOperations for method ..." in the debug logs in all cases.
- With debug logs enabled, I see `timeout: 60` in the requests to the iLO.
- Nodes which have got 'stuck' commonly seem to have HW issues (e.g. critical health issues preventing POST, memory warnings which may affect POST). This isn't always the case, but it's probably not a coincidence.
- The iLO driver's wait-for-state-change loop, unlike for example the redfish driver's, doesn't have a timeout, only a maximum number of retries, i.e. it uses FixedIntervalLoopingCall. If something did get stuck in there, you could see how it could lock up, but since all requests appear to go via Sushy, and have a 60-second timeout set, it's not clear what /could/ lock up.
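
For reference, a minimal sketch of that retry-bounded (but not time-bounded) pattern, using illustrative names rather than the iLO driver's actual code; if the per-attempt BMC query below ever blocks without its own socket timeout, the loop never advances:

    from oslo_service import loopingcall

    MAX_RETRIES = 6
    INTERVAL = 10  # seconds between attempts

    def wait_for_power_on(query_power_state):
        # query_power_state is a stand-in for the BMC call; the loop is bounded
        # by a retry count, not wall-clock time, so one hung call stalls it.
        attempts = {'count': 0}

        def _check():
            attempts['count'] += 1
            if query_power_state() == 'power on':
                raise loopingcall.LoopingCallDone(retvalue=True)
            if attempts['count'] >= MAX_RETRIES:
                raise loopingcall.LoopingCallDone(retvalue=False)

        timer = loopingcall.FixedIntervalLoopingCall(_check)
        return timer.start(interval=INTERVAL).wait()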

Thanks all for the discussion and the patch to improve logging. We can try applying it to get more data.

Revision history for this message
John Garbutt (johngarbutt) wrote :

So this is fun... not 100% sure, but I think the timeout in sushy is not used when blocking=false, i.e. we do not actually pass the timeout down to the requests library, AFAICT:
https://github.com/openstack/sushy/blob/f03d77ebcb7199fa08a82d5b34d7a6d677d86e0f/sushy/connector.py#L153
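
If so, a simplified illustration (not sushy's actual connector code) of the difference would be a call path that only forwards the timeout to requests when blocking, versus one that bounds every request:

    import requests  # session below is expected to be a requests.Session

    def _op_buggy(session, method, url, blocking=False, timeout=60, **kwargs):
        # Only the blocking path is bounded; the common non-blocking path can
        # still hang forever on a dead BMC socket.
        if blocking:
            kwargs['timeout'] = timeout
        return session.request(method, url, **kwargs)

    def _op_fixed(session, method, url, blocking=False, timeout=60, **kwargs):
        # Bound every request, regardless of the blocking flag.
        kwargs.setdefault('timeout', timeout)
        return session.request(method, url, **kwargs)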

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

That's certainly a thing we should fix, and could be a cause of the observed behavior. Nice catch.

Changed in sushy:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/sushy 4.5.1

This issue was fixed in the openstack/sushy 4.5.1 release.

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

Hi Jayf, Scott,

The iLO firmware team is looking into this issue and is requesting AHS logs from an iLO where the issue was hit. It would be great if you could attach the logs for their reference so that they can speed up the fix.

(Retrying the power-on operation is just a workaround for this issue and doesn't always fix it.)

Regards
Nisha

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/sushy 4.1.6

This issue was fixed in the openstack/sushy 4.1.6 release.

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

I have created a workaround patch in proliantutils for this issue.
https://review.opendev.org/c/x/proliantutils/+/893827

The final fix for this issue will come from ilo firmware hopefully.

Revision history for this message
Nisha Agarwal (agarwalnisha1980) wrote :

Secondly, the sushy workaround is incomplete, and that incomplete fix itself requires a further fix.
https://bugs.launchpad.net/sushy/+bug/2028214
