HAProxy keepalives causing Keystone timeouts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | New | Undecided | Fuel Library (Deprecated) |
Bug Description
In one of the customer's deployments (MOS 5.0.1 with some packages installed from 5.1) we faced the following failure scenario:
- API requests to Keystone start timing out
- Consequently, a queue of requests to other services (Cinder, Nova) accumulates; the cloud stops reacting to new requests and stops executing already placed ones
According to the customer, the cloud usage pattern prior to the issue occurring was as follows: "Bear in mind that customer [meaning cloud end-users] is using the API heavily and they're deploying/
The support team enabled debug-level logging for Cinder and Keystone with the goal of investigating in more detail what happens when the issue occurs again. This never happened, since the customer addressed the issue with the following changes:
"we have fixed keystone issue by
a) haproxy configuration - disable keepalives:
option http-server-close
b) disable keystone logging
This change has fixed both issues:
1) timeout of keystone
2) stalled connections of cinder-
The customer explained the rationale behind this change as follows:
"keepalive connections were getting broken after some number of processed requests - timing out, blocking connections. Actually there is no reason to use keepalives in our scenario. We need to balance requests between controllers and keepalives were breaking that.
CPU usage immediately went down after the change as well."
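For reference, a minimal sketch of what the customer's two changes might look like. The listener name, bind addresses, server names, and exact file layout are assumptions for illustration only; they are not taken from the customer's environment.

```
# HAProxy: hypothetical Keystone frontend/backend stanza.
listen keystone-api
    bind 192.168.0.10:5000
    balance roundrobin
    # Disable keepalives: close the server-side connection after each
    # response, so every new request is re-balanced across controllers
    # instead of sticking to one long-lived connection.
    option http-server-close
    server controller-1 192.168.0.11:5000 check
    server controller-2 192.168.0.12:5000 check
    server controller-3 192.168.0.13:5000 check
```

```
# /etc/keystone/keystone.conf: turn off verbose/debug logging,
# as the customer described (exact option names assumed from the
# standard oslo.log settings of that era).
[DEFAULT]
debug = False
verbose = False
```

Note that `option http-server-close` still allows client-side keepalives; it only forces the connection to the backend server to be closed after each request, which is what re-enables per-request load balancing.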
My goals in filing this bug are:
1) Suggest that the QA team build load-testing scenarios matching the load pattern the customer described
2) Suggest that the Dev team evaluate the case and the feasibility of adopting the config change that helped the customer resolve the issue (disabling keepalives in HAProxy)
From the Mirantis side, the following people have detailed technical context on this issue:
- Miroslav Anashkin <email address hidden>
- Tomasz Jaroszewski <email address hidden>
- Sergii Golovatiuk <email address hidden>
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
Looks like it's indeed a duplicate of https://bugs.launchpad.net/bugs/1413104.
But what about test coverage? It's somewhat embarrassing to only detect such a problem on a relatively small (15-node) customer deployment.