Ironic node stuck in locked state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ironic | Confirmed | Undecided | Jay Faulkner | ||
Sushy | Confirmed | High | Unassigned | ||
proliantutils | New | Undecided | Unassigned | ||
Bug Description
If an Ironic node is stuck in a PXE loop, i.e. the node is available but its power state has been left on for whatever reason, the following happens:
- Node boots
- Fails to get an IP / PXE (no active port in Neutron / DHCP)
- Node reboots
- Fails to get an IP / PXE
- and so on...
In Ironic, the power status check runs, sees that the node's power state has changed, and tries to update it. I can see this in the logs:
May 25, 2023 @ 19:36:19.477 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:32:32.996 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 19:32:32.996 Failed to change power state of node <node-uuid> to 'power on', attempt 2 of 3.: ironic.
May 25, 2023 @ 19:32:32.995 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:23:13.880 Failed to change power state of node <node-uuid> to 'power on', attempt 1 of 3.: ironic.
May 25, 2023 @ 19:23:13.880 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 19:23:13.880 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:12:24.744 Successfully set node <node-uuid> power state to power on by power on.
May 25, 2023 @ 19:10:17.252 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 19:10:17.252 The node <node-uuid> operation of 'power on' is completed in 240 seconds.
May 25, 2023 @ 19:06:04.971 Successfully set node <node-uuid> power state to power on by power on.
May 25, 2023 @ 18:24:27.062 The node <node-uuid> operation of 'power on' is completed in 378 seconds.
May 25, 2023 @ 18:24:27.062 During sync_power_state, node <node-uuid> state 'power off' does not match expected state. Changing hardware state to 'power on'.
May 25, 2023 @ 18:17:11.648 iLO failed to change state to power on within 600 sec for node <node-uuid>
May 25, 2023 @ 18:17:11.648 Failed to change power state of node <node-uuid> to 'power on', attempt 1 of 3.: ironic.
After some time this seemed to stop, and the power state sync did not run for a few days until I picked up this issue.
The output of the node on the CLI is:
+------
| Field | Value |
+-----
| allocation_uuid | None |
| automated_clean | None |
| bios_interface | ilo |
| boot_interface | ilo-pxe |
| clean_step | {} |
| conductor | <redacted> |
| conductor_group | <redacted> |
| driver | ilo |
| instance_uuid | None |
| last_error | None |
| maintenance | False |
| maintenance_reason | None |
| management_
| network_data | {} |
| network_interface | neutron |
| owner | None |
| power_interface | ilo |
| power_state | power on |
| provision_state | available |
| provision_
| rescue_interface | no-rescue |
| reservation | <redacted> <- there was an active reservation here |
| target_power_state | power on |
| target_
| updated_at | 2023-05-
+------
As this node is available and not in maintenance, Nova / Placement tries to build on it, and I could see the failure in the Nova logs:
May 30, 2023 @ 16:55:26.365 Failed to reserve node <redacted> when provisioning the instance <redacted>: openstack.
May 30, 2023 @ 16:55:26.364 [instance: <redacted>] Claim successful on node <redacted>
So, in summary:
- Node looks like it's available
- Node is technically available, although it keeps trying to PXE boot and doesn't get anything from Neutron DHCP
- Node has a stale lock
- If we try to build an instance on the node, we get an error but no state change in Ironic
I could disable the power status checks, or the automatic power on/off when the value reported by the BMC differs from what Ironic expects, to prevent this from happening, but I am also curious as to how Ironic ended up in a situation where the lock was never released on this node, and how the node ended up powered on, because it should be powered off at the end of a cleaning cycle. Maybe we could add a periodic task to check for stale locks and release them? I know that could open up a can of worms, but it's something that should be considered.
The only way for me to release this lock was to restart the conductor.
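To make the stale-lock idea more concrete, here is a minimal Python sketch of the detection half only. It is not Ironic code: the function name, the one-hour threshold, and the sample node data are all hypothetical, and it assumes node records shaped like the CLI output above (`uuid`, `reservation`, `updated_at`). It also assumes `updated_at` is a usable proxy for how long a reservation has been held, which may not be true in practice. It only reports candidates and does not release anything.

```python
from datetime import datetime, timedelta, timezone


def find_stale_reservations(nodes, now, max_age=timedelta(hours=1)):
    """Return UUIDs of nodes that hold a reservation but were last updated
    more than max_age ago.

    Assumption: updated_at approximates when the lock was taken. The
    threshold is arbitrary and chosen only for illustration.
    """
    stale = []
    for node in nodes:
        # Nodes without a reservation are not locked, so skip them.
        if node.get("reservation") is None:
            continue
        updated_at = datetime.fromisoformat(node["updated_at"])
        if now - updated_at > max_age:
            stale.append(node["uuid"])
    return stale


# Hypothetical sample data, not taken from the affected deployment.
nodes = [
    {"uuid": "node-a", "reservation": "conductor-1",
     "updated_at": "2023-05-25T19:36:19+00:00"},
    {"uuid": "node-b", "reservation": None,
     "updated_at": "2023-05-25T19:36:19+00:00"},
    {"uuid": "node-c", "reservation": "conductor-1",
     "updated_at": "2023-05-30T16:50:00+00:00"},
]
now = datetime(2023, 5, 30, 16, 55, 26, tzinfo=timezone.utc)
stale = find_stale_reservations(nodes, now)
print(stale)
```

Only `node-a` is flagged here: `node-b` has no reservation and `node-c` was updated a few minutes before `now`. Deciding whether a flagged lock is actually safe to release is the hard part, and this sketch does not attempt it.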
Changed in ironic: | |
status: | Triaged → Confirmed |
I'm talking with Scott and trying to reproduce this.