2020-01-28 07:18:17
Yang Youseok
description
On stable/ocata, we hit a serious scheduler problem that forced us to upgrade to a newer release (we moved from ocata to stein). I could not find any existing issue report for it, so I am leaving this here for whoever meets the issue later.
The problem we encountered is as follows:
- The conductor tries to schedule 2 instances onto one compute node.
- nova-compute at that time has enough resources in compute_nodes, so the scheduler chooses that nova-compute.
- The resource tracker in nova-compute claims the resources against placement.
- Placement answers one of the requests with 409 Conflict, since there were several concurrent requests.
- [BUG here] The resource tracker in nova-compute does not check the return code from placement, so the 'allocation' is increased only by the share of one instance (see the sketch after this list).
- After that, compute_nodes in the scheduler is full, but the allocation in placement still has a free slot.
- [User meets weirdness here] Since there is a free slot on the scheduler side, an instance can be placed on a compute node that is actually full. The result is that the compute node is over-provisioned.
- OOM occurs. (We were tight on memory; an admin with a different resource policy would see a different side effect.)
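Below is a minimal, self-contained sketch of that race (toy code, not nova source), assuming a trivial stand-in for placement's PUT /allocations/{consumer_uuid} endpoint; the names placement_put_allocations, Response, and buggy_claim are all illustrative, not real nova identifiers.

    class Response:
        def __init__(self, status_code):
            self.status_code = status_code

    _generation = 0  # toy stand-in for the resource provider generation

    def placement_put_allocations(expected_generation):
        """Toy stand-in for PUT /allocations/{consumer_uuid}: placement
        answers 409 Conflict when the provider generation changed under
        the caller, i.e. when requests race."""
        global _generation
        if expected_generation != _generation:
            return Response(409)  # a concurrent claim already bumped it
        _generation += 1          # record the allocation
        return Response(204)

    def buggy_claim(expected_generation):
        # The bug described above: the response is thrown away, so the
        # resource tracker proceeds even when placement rejected the claim.
        placement_put_allocations(expected_generation)
        return True

    # Two concurrent requests both read generation 0. The second PUT gets
    # 409, yet both claims "succeed": two instances land on the node while
    # placement recorded only one allocation, leaving a phantom free slot.
    generation_seen_by_both = 0
    print(buggy_claim(generation_seen_by_both))  # True (PUT -> 204)
    print(buggy_claim(generation_seen_by_both))  # True (PUT -> 409, lost)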
I found this is already fixed from pike onward, where the scheduler makes the allocation first and nova-compute just checks compute_nodes. But it was very hard for me to find the root cause, and it took a lot of digging through the scheduler's history, so I hope this is helpful for anyone who meets the problem.
I am not sure it should be fixed, since ocata is quite old, but it could be fixed by changing the function (_allocate_for_instance() in nova/scheduler/client/report.py) to catch the 409 conflict, similar to the function added later (put_allocations()).
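As an illustration only (a hedged sketch, not the actual nova patch), the change could look like the following, reusing the toy placement stand-in from the sketch above; the bounded retry count is an arbitrary assumption, mirroring the retry style of put_allocations():

    def fixed_claim():
        """Retry the PUT with a freshly read generation on 409 Conflict,
        and report success only when placement actually recorded the
        allocation. In the real race another thread can still bump the
        generation between the read and the PUT, hence the retry loop."""
        for _ in range(3):  # bounded retries: an assumption
            resp = placement_put_allocations(_generation)  # fresh read
            if resp.status_code == 204:
                return True   # allocation recorded; the claim is valid
            if resp.status_code != 409:
                break         # unexpected error: do not retry
        return False          # claim failed; this node must not be used

With this shape, the resource tracker would fail the claim instead of silently over-committing, and the scheduler could retry on another node.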
Thanks.