Following the steps in the bug description above, the issue was reproduced in an AIO-DX environment with the following timeline (measured on the host where the pod(s) were scheduled):
t=0s: Finished controller manifest
t=8s: Started worker manifest
t=37s: Start of k8s-pod-recovery
t=38s: Finished worker manifest
t=63s: Started creating "restart-on-reboot" labeled pod(s)
t=281s: Same labeled pod(s) verified, not restarted
The pod(s) are not restarted because the query for labeled pods to recover returns an empty set when k8s-pod-recovery is launched: in the timeline above the labeled pod(s) are only created at t=63s, after k8s-pod-recovery started at t=37s.
Moving the handling of labeled pods to after they have reached a stable state allows them to be restarted correctly:
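The ordering problem can be illustrated with a small model. This is only a sketch, not the k8s-pod-recovery code: the function `labeled_pods_at` and the pod name `labeled-pod-0` are hypothetical, and the times are taken from the failing timeline above.

```python
def labeled_pods_at(query_time_s, pod_created_at_s):
    """Return the labeled pods that already exist at query_time_s.

    pod_created_at_s maps a pod name to its creation time in seconds.
    """
    return sorted(
        name
        for name, created_s in pod_created_at_s.items()
        if created_s <= query_time_s
    )


# Hypothetical pod name; creation time from the failing timeline (t=63s).
pods = {"labeled-pod-0": 63}

# Query issued when k8s-pod-recovery starts (t=37s): the pod does not exist yet.
early_result = labeled_pods_at(37, pods)

# Query issued after the pod has been created: the pod is found.
late_result = labeled_pods_at(64, pods)

print(early_result, late_result)
```

With the early query the set is empty, so there is nothing to restart.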
t=0s: Finished controller manifest
t=9s: Started worker manifest
t=66s: Start of k8s-pod-recovery
t=67s: Finished worker manifest
t=73s: Started creating "restart-on-reboot" labeled pod(s)
t=190s: Labeled pod(s) restarted
t=408s: New labeled pod(s) verified
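One possible way to express "handle labeled pods only once they are stable" is a polling loop, sketched below. This is an assumption-laden illustration and not the actual change: `wait_for_stable_pods`, its parameters, and the injected `list_pods` callable are hypothetical, and "stable" is taken here to mean that the same non-empty set of pods is seen on several consecutive polls.

```python
import time


def wait_for_stable_pods(list_pods, required_stable_polls=3,
                         interval_s=1.0, timeout_s=300.0, sleep=time.sleep):
    """Poll list_pods() until it returns the same non-empty set of pod
    names on required_stable_polls consecutive polls.

    Returns the stable list of names, or [] if timeout_s elapses first.
    """
    stable_polls = 0
    previous = None
    waited_s = 0.0
    while waited_s <= timeout_s:
        current = sorted(list_pods())
        if current and current == previous:
            stable_polls += 1
        else:
            # First sighting of a non-empty set counts as one poll;
            # an empty set resets the counter.
            stable_polls = 1 if current else 0
        if stable_polls >= required_stable_polls:
            return current
        previous = current
        sleep(interval_s)
        waited_s += interval_s
    return []


def make_fake_lister(snapshots):
    """Build a list_pods() stand-in that replays snapshots, repeating the last."""
    state = {"i": 0}

    def list_pods():
        i = min(state["i"], len(snapshots) - 1)
        state["i"] += 1
        return snapshots[i]

    return list_pods


# The labeled pod (hypothetical name) shows up only on the third poll.
lister = make_fake_lister([[], [], ["labeled-pod-0"]])
stable = wait_for_stable_pods(lister, sleep=lambda _s: None)
print(stable)
```

The restart would then be applied to the returned pods. Injecting `list_pods` and `sleep` keeps the sketch independent of any particular Kubernetes client and makes it easy to exercise without a cluster.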