Resource cleanup fails to properly handle broken resources.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Rally | Triaged | Low | Unassigned |
Bug Description
When OpenStack is unable to remove a resource after a test (or cannot remove it quickly enough), Rally silently waits 600 seconds and then proceeds as if nothing happened. Subsequent invocations of "rally task start" also wait out the 600-second timeout, without any message whatsoever.
Recently we had a situation where a VM created by Rally tests got stuck in the "Deleting" state, and neither Rally nor the OpenStack command-line tools were able to remove it. We run Rally periodically, and this caused subsequent executions of "rally task start" to wait for the full 600-second timeout before reporting success. Before we cleaned up the instance in the OpenStack database, we investigated Rally a bit, and it seems there is a cleanup procedure that tries to remove all instances in the tenant used by Rally. What is more, this procedure has a hardcoded timeout of 600 s (I think it's in rally/benchmark
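To illustrate the behaviour described above, here is a minimal sketch of a delete-then-poll cleanup loop with a hardcoded timeout that is silently swallowed. This is not Rally's actual code; the function and parameter names are made up for illustration.

```python
import time

CLEANUP_TIMEOUT = 600   # hardcoded, not user-configurable (the complaint above)
POLL_INTERVAL = 1

def cleanup(resources, delete, is_gone, clock=time.monotonic, sleep=time.sleep):
    """Delete every resource, polling until each disappears.

    On timeout the loop simply breaks and moves on -- no log message,
    no error -- which matches the silent 600-second wait described above.
    """
    for res in resources:
        delete(res)
        deadline = clock() + CLEANUP_TIMEOUT
        while not is_gone(res):
            if clock() >= deadline:
                break          # timeout: proceed as if nothing happened
            sleep(POLL_INTERVAL)
```

With a resource stuck in "Deleting", `is_gone` never returns True, so every run burns the full 600 seconds per stuck resource before continuing.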
The expected behaviour would be to:
a) expose this timeout somewhere, so that users can configure it in a way that suits their environment;
b) rename resources that Rally fails to delete to something meaningful (e.g. failed_
c) provide a way to limit the number of "unremovable" resources, defined in the config file, to prevent Rally from creating too many of them. If the number of renamed resources reaches the threshold, Rally should fail the task, IMO.
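The three proposals above could be sketched roughly as follows. All names here (the function, the `failed_` prefix handling, the `max_unremovable` option) are illustrative assumptions, not actual Rally configuration options.

```python
class CleanupError(Exception):
    """Raised when too many resources could not be removed (proposal c)."""

def cleanup_with_limits(resources, delete_and_wait, rename,
                        timeout=600, max_unremovable=5):
    """Try to delete each resource within a *configurable* timeout (a).

    Resources that survive deletion are renamed with a failed_ prefix (b),
    and the task aborts once the number of stragglers exceeds the
    configured threshold (c). Returns the number of failed deletions.
    """
    failed = 0
    for res in resources:
        if delete_and_wait(res, timeout):
            continue                        # deleted successfully
        rename(res, "failed_" + res)        # (b) mark it clearly
        failed += 1
        if failed > max_unremovable:        # (c) fail fast
            raise CleanupError(
                "%d resources could not be removed" % failed)
    return failed
```

The key design point is that a stuck resource costs at most one (configurable) timeout once, after which it is renamed out of the way and no longer slows down later runs.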
Thanks
pr
Changed in rally:
importance: Undecided → Low
milestone: none → 0.1.0
Changed in rally:
milestone: 0.1.0 → none
Changed in rally:
status: New → Triaged
Changed in rally:
status: Triaged → Fix Committed
status: Fix Committed → New
Changed in rally:
status: New → Triaged
We also encountered a similar situation. We have a set of scenarios which we run using a deployment with an existing tenant and registered users, and we often run these scenarios multiple times overnight. Sometimes during testing, a number of servers end up in an error state and cannot be deleted via Rally or the OpenStack CLI. Because they were created by the tenant registered with the Rally deployment, any subsequent scenario that uses that tenant and specifies Nova in its cleanup context will attempt to delete these "bad" servers; this has often added 30 minutes or more to scenarios that take less than a minute to execute.
Is there some way, in the existing Rally source or configuration, to limit the cleanup process to only the resources created by the current scenario? If not, I suggest that this would be a nice feature.
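One possible approach to the feature requested above is a per-run name prefix: tag every resource a scenario creates with a unique prefix, and have cleanup only touch names carrying that prefix. This is a hypothetical sketch; Rally's actual naming scheme may differ.

```python
import uuid

def run_prefix():
    """Generate a unique name prefix for one scenario run (illustrative)."""
    return "rally-%s-" % uuid.uuid4().hex[:8]

def owned_by_run(server_names, prefix):
    """Select only servers whose name carries this run's prefix.

    Stuck "bad" servers left over from earlier runs keep their old
    prefixes, so they are skipped instead of re-triggering the
    long deletion timeout on every subsequent run.
    """
    return [name for name in server_names if name.startswith(prefix)]
```

With this scheme, the 30-minute penalty from undeletable servers would be paid at most once, by the run that created them.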
Best,
Chris Kirkland