I don't think this properly addresses the problem. The root issue is that in a multi-node task, if one node fails, every node in the task is marked as failed; the same happens when the task itself fails. If the task was run on a production cloud, that sets the entire cloud to failed, and the orchestrator then wants to re-run all tasks on all nodes to recover.
This is further compounded by the pending state being removed when a task starts rather than when it completes.
Bottom line: only the node(s) that actually failed in a task should be marked as error, and only the tasks that did not complete should be scheduled to run the next time changes are deployed.
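
Roughly what I mean, as a minimal sketch of the expected behaviour (the `Node` class and `apply_task_results` function are made up for illustration, not taken from the actual codebase):

```python
# Hypothetical sketch: per-node result handling for a multi-node task.
PENDING, READY, ERROR = "pending", "ready", "error"


class Node:
    def __init__(self, name):
        self.name = name
        self.status = PENDING
        self.pending_tasks = set()   # tasks still owed to this node


def apply_task_results(task_name, nodes, results):
    """results maps node name -> True (succeeded) / False (failed)."""
    for node in nodes:
        if results.get(node.name):
            node.status = READY
            # pending is cleared only on completion, not when the task starts
            node.pending_tasks.discard(task_name)
        else:
            # only the node(s) that actually failed are marked as error;
            # the task stays pending for them, so the next deploy re-runs
            # just the unfinished work instead of every task on every node
            node.status = ERROR


if __name__ == "__main__":
    nodes = [Node("node-1"), Node("node-2"), Node("node-3")]
    for n in nodes:
        n.pending_tasks.add("configure-storage")

    apply_task_results("configure-storage", nodes,
                       {"node-1": True, "node-2": False, "node-3": True})

    for n in nodes:
        print(n.name, n.status, sorted(n.pending_tasks))
    # node-1 ready []                        <- done, nothing to re-run
    # node-2 error ['configure-storage']     <- only this node needs it again
    # node-3 ready []
```

With that kind of per-node bookkeeping, a single node failure never marks the whole cloud as failed, and a subsequent deployment only picks up the tasks that are still pending.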