[Ocata] resource tracker does not validate placement allocation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Invalid | Undecided | Unassigned |
Ocata | Confirmed | Low | Unassigned |
Bug Description
On stable/ocata we hit a serious scheduler problem that forced us to upgrade to a newer release. I could not find any existing report for it, so I am leaving this here for whoever runs into the same issue later.
The problem we encountered is as follows:
- The conductor tries to schedule 2 instances onto one compute node.
- At that point the compute node has enough resources in compute_nodes, so the scheduler chooses it.
- The resource tracker in nova-compute claims the resources against placement.
- Placement answers one of the requests with 409, because there were several concurrent requests.
- [BUG here] The resource tracker in nova-compute does not check the return code from placement, so the allocation is only increased by the share of one instance (see the sketch after this list).
- After that, compute_nodes in the scheduler is full, but the allocation in placement still has free capacity.
- [User sees weirdness here] Since the scheduler (which consults placement) still saw free capacity, another instance could be built on a compute node that is actually full. The result is that the compute node is over-provisioned.
- OOM occurs. (We are tight on memory; with a different resource policy an admin would see a different side effect.)
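To make the failure mode concrete, here is a minimal sketch, not the actual nova code, of what a correct claim should do: treat a 409 from placement as a failed claim rather than ignoring it. The endpoint, headers, and body shape below are assumptions loosely based on the Ocata-era placement allocations API.

```python
import requests

PLACEMENT_URL = "http://placement.example.com"  # hypothetical endpoint
HEADERS = {"x-auth-token": "<token>", "content-type": "application/json"}


def claim_allocation(consumer_uuid, rp_uuid, resources):
    """PUT an allocation for one instance and fail the claim on 409.

    The body format follows the early (Ocata-era) placement allocations
    shape; all names here are illustrative assumptions.
    """
    body = {
        "allocations": [
            {"resource_provider": {"uuid": rp_uuid}, "resources": resources}
        ]
    }
    resp = requests.put(
        "%s/allocations/%s" % (PLACEMENT_URL, consumer_uuid),
        json=body,
        headers=HEADERS,
    )
    if resp.status_code == 409:
        # Concurrent update: another request changed the provider's state.
        # The bug described above is that this outcome was silently ignored;
        # a correct tracker must fail (or retry) the claim instead.
        return False
    resp.raise_for_status()
    return True
```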
I found that this is already fixed from Pike onward, where the scheduler makes the allocation first and nova-compute just checks compute_nodes. But it was very hard for me to find the root cause, and it required a lot of digging into the scheduler's history, so I hope this report is helpful for anyone who hits the same problem.
I am not sure it should be fixed, since Ocata is quite old, though we could fix it by changing the function (nova/scheduler
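For comparison, a minimal sketch of the Pike-and-later ordering described above: the scheduler claims the allocation in placement before casting to the compute node, so a lost race simply moves on to the next candidate. All names here (placement.claim, rp_uuid, candidate_hosts) are illustrative assumptions, not the real nova identifiers.

```python
import uuid


def schedule_and_build(placement, instance_resources, candidate_hosts):
    """Claim in placement first; only build on a host whose claim succeeded."""
    consumer_uuid = str(uuid.uuid4())
    for host in candidate_hosts:
        # A failed claim (e.g. 409) means another request won the race on
        # this provider; try the next candidate instead of over-committing.
        if placement.claim(consumer_uuid, host.rp_uuid, instance_resources):
            return host, consumer_uuid
    raise RuntimeError("No valid host: all placement claims failed")
```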
Thanks.
I checked, and on stable/ocata nova does ignore the error from placement in the reported case, so I marked this Confirmed for Ocata. The same issue is not valid for newer branches. Ocata is in extended maintenance, so the project does not focus on fixing issues there, but you can still persuade your OpenStack vendor to fix the problem upstream.