Comment 4 for bug 2021995

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

So I've taken a look at this, and I think there are two distinct possibilities:

1) Somehow the ilo was in an unknown/bad state, and in that entire sequence we somehow short cirucit out of ilo_common in the ironic drivers. This could also just be our interpretation of the internal wait code. https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/ilo/power.py

2) Or more likely, and a distinct issue in itself, the proliantutils code could hang on issuing the requests call.

https://opendev.org/x/proliantutils/src/branch/master/proliantutils/rest/v1.py#L161-L178

In that, there is no timeout value being set, and as there is no timeout, if the BMC connectivity has been lost at any point and the three way handshake for the socket was still initializing, then the call could be stuck forever which would align with the issue overall. If bandit was being run against proliantutils and it understood the way the handling was being used, it would be raising https://bandit.readthedocs.io/en/latest/plugins/b113_request_without_timeout.html as a warning on the code.

The lock would never be released and the node management would be halted until the conductor was restarted.

I believe the right course of action is to:

1) Add a explicit timeout into proliantutils for all api calls which utilize python-requests.
2) Add additional logging around any unexpected states being detected and potentially back-off the power sync/state change assertion at that point if detected.
3) Simplify the power status sync code... it just feels a bit complicated and makes it difficult debug.
4) Simplify the proliantutils/rest/v1.py file to also leverage explicit handlers (i.e. not use getattr) off of python-requests *AND* run bandit as part of CI.