Comment 4 for bug 1237688

Revision history for this message
aeva black (tenbrae) wrote :

Ruby,

Good questions. I'll add some of my current thoughts --

> if it isn't possible to transition to a different power state, we should leave the power state as it had been,

Do we set the power state back to what it was? If so, how do we know what it was?
Or, do we set the power state to what ever it actually is when the attempted transition is over?

> unless the transition may have caused the state to be in some indeterminate state.

Hardware is flaky; assume we may get timeouts and unexpected responses from get_power_state()

> In this particular case of an invalid target power state, the power state should remain what it had been set to, since it didn't change.

Yes, but then how do we inform the user of the failed input? Remember, the RPC call for set_power_state is cast -- so it is asynchronous, so the API will always return "202 in progress". If, upon processing the invalid target power state, the conductor sets the target power state back to NOSTATE, and sets the power state to what it is (or was), then the user will never know there was an invalid request.

> wrt target_power_state: if the transaction finished (either successfully or not), it should be reset to NULL.

If the transition failed (finished unsuccessfully) how do we inform the user of this?

> But what happens in the case where the target_power_state is power_on, and there is a timeout waiting for the power to be turned on? Do we reset it, or leave it as power_on?

Another good question!

So, perhaps we just need to add a "last_error" field to the node table, and change how we handle ERROR state slightly? Eg:

* power_state -- always represents current power state. Any power operation sets this back to "actual" when done (whether successful or not). Also, we should add a periodic_task to sync this. Set to ERROR only when unable to get the power state from a node.
* target_power_state -- represents the requested destination of a state transition. Cleared when the transition window is over (whether successful or not).
* last_error -- general string field used to store the last error from any requested operation (eg, whether that was to change power state, or deploy a node, or anything else). Cleared when any new operation is started.

I'll step through a few example cases:

* request to change to valid state: 1. target_power_state set; 2. driver called and performs change; 3. driver returns new state; 4. manager saves it in db.power_state and clears target_power_state.

* request to change to valid state, but fails: 1. same; 2. same; 3. driver raises exception; 4. manager saves it in last_error, calls driver.get_power_state and sets power_state, then clears target_power_state

* request to change to invalid state: 1. target power state set to invalid state; 2. driver called, but raises error; 3. manager copies error to last_error field, clears target_power_state, and calls driver.get_power_state to reset actual power_state field.

* request to change when current state is ERROR: 1. manager calls driver.get_power_state; 2. if successful, driver sets power_state and proceeds as per any other request; if unsuccessful, manager sets last_error, eg, to "unable to change power state: failed to get current power state: %(exception)s"

How does this approach sound? Did I miss anything?