Devstack gate launch jobs timeout confusing Jenkins and devstack gate
Bug #1204625 reported by Clark Boylan
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Core Infrastructure | Fix Released | High | Clark Boylan |
Bug Description
If the devstack gate VM launch jobs time out, it is possible for the devstack gate DB and Jenkins to get out of sync, confusing both. This has led to more than one job being run on the same host, and to hosts being deleted out from under running jobs.
Relevant logs can be found at http://
It has been suggested that the devstack gate pool manager should be a daemon, so that it can properly track state without needing to hand it off across distinct processes (Jenkins jobs). I have also lowered the ready node count per d-g AZ to 15 to reduce the average number of slaves that must be spun up by d-g.
Changed in openstack-ci:
assignee: nobody → James E. Blair (corvus)
assignee: James E. Blair (corvus) → nobody
The issue here was that after a timeout, nodes may have been added to Jenkins while still marked BUILDING in the d-g database. When d-g then attempted to add those BUILDING nodes to Jenkins, any node already present in Jenkins would produce an error and be deleted. By that point, jobs may have already started running on the node, resulting in all kinds of bad test failures.
This was corrected in https://review.openstack.org/#/c/38674/. The fix was to check the error returned when attempting to add a node to Jenkins and, if the error indicated that the node already existed, to ignore it and continue processing that host. Eventually this would mark the node as READY, which is the correct state.
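The fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual devstack-gate code: the class names, the `NodeExistsError` exception, the in-memory "database", and the state strings are all hypothetical stand-ins for the real Jenkins client and d-g database.

```python
# Hypothetical sketch: treat "node already exists" as success when
# registering a BUILDING node with Jenkins, instead of erroring out
# and deleting a node that may already be running a job.

class NodeExistsError(Exception):
    """Raised when Jenkins already has a slave with this name (illustrative)."""


class FakeJenkins:
    """Stand-in for a Jenkins master's node registry."""

    def __init__(self):
        self.nodes = set()

    def add_node(self, name):
        if name in self.nodes:
            raise NodeExistsError(name)
        self.nodes.add(name)


BUILDING, READY = "BUILDING", "READY"


def register_node(jenkins, db, name):
    """Add a BUILDING node to Jenkins, tolerating prior registration.

    Before the fix, the NodeExistsError path caused the node to be
    deleted even though a job might already be running on it. After
    the fix, the error is ignored and the node proceeds to READY.
    """
    try:
        jenkins.add_node(name)
    except NodeExistsError:
        # Node was already added (e.g. before a launch-job timeout);
        # ignore the error and keep processing this host.
        pass
    db[name] = READY
    return db[name]
```

Registering the same node twice (the timeout scenario) now converges on READY instead of deleting a busy host.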