I've looked at the logs of all the "red" jobs.
There are 3 main causes of failure:
1 QEMU-related problems (for example, a hang during copying).
2 After migration, nailgun starts answering (i.e. working), but not all nodes have time to update their status before the test starts.
3 The busy indicator /tmp/notready is present only for the last couple of minutes of the whole migration.
So a new flag, /tmp/migration-done, was added. It appears exactly when the migration process has finished and nailgun on cone starts working.
In addition, the test code needs to check node status after the flag appears. This cannot be done inside the script itself, because the cluster configuration and the current state may differ, whereas during testing the cluster configuration is known.
Together, these changes avoid cases 2 and 3.
As for case 1, there is only a general recommendation: update the software on the Jenkins slave host, and increase memory and CPUs on the compute node.
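
For cases 2 and 3, the wait in the test code could look roughly like the sketch below. It is only an illustration, not the actual test code: the helper names (wait_for, nodes_ready, wait_for_migration), the "online" key in the node dicts, and the timeout values are all assumptions. The caller supplies flag_exists (for example, a check for /tmp/migration-done over SSH) and get_nodes (for example, a query to nailgun).

```python
import time


def wait_for(predicate, timeout, interval=5.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def nodes_ready(nodes, expected_count):
    """True when every expected node is reported and online.

    `nodes` is assumed to be a list of dicts with an "online" key;
    the real node representation may differ.
    """
    return len(nodes) == expected_count and all(
        bool(node.get("online")) for node in nodes
    )


def wait_for_migration(flag_exists, get_nodes, expected_count,
                       timeout=3600, **poll_kwargs):
    """Wait for the migration-done flag, then for all nodes to report in.

    flag_exists: callable returning True once /tmp/migration-done exists.
    get_nodes:   callable returning the current list of nodes.
    expected_count: number of nodes the test knows the cluster has.
    """
    # Step 1: migration finished and nailgun is answering again.
    if not wait_for(flag_exists, timeout, **poll_kwargs):
        return False
    # Step 2: the nodes have had time to update their status.
    return wait_for(
        lambda: nodes_ready(get_nodes(), expected_count),
        timeout, **poll_kwargs
    )
```

The expected node count is passed in by the test because only the test knows the cluster configuration; the migration script cannot make this check itself.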