Haproxy keepalives causing Keystone timeouts

Bug #1417317 reported by Dmitriy Novakovskiy
Affects: Fuel for OpenStack
Status: New
Importance: Undecided
Assigned to: Fuel Library (Deprecated)

Bug Description

In one customer deployment (MOS 5.0.1 with some packages installed from 5.1) we encountered the following failure scenario:

- API requests to Keystone start timing out
- Consequently, requests to other services (Cinder, Nova) pile up in a queue; the cloud stops responding to new requests and stops executing those already submitted

According to the customer, the cloud usage pattern prior to the issue was the following: "Bear in mind that customer [meaning cloud end-users] is using the API heavily and they're deploying/destroying instances by bundles of 10s. That means 100s of consecutive requests."

The support team enabled debug-level logging for Cinder and Keystone with the goal of investigating in more detail what happens when the issue occurs again. That never happened, because the customer addressed the issue with the following changes:

"we have fixed keystone issue by

a) haproxy configuration - disable keepalives
  option http-server-close
b) disable keystone logging

This change has fixed both issues:

1) timeout of keystone
2) stalled connections of cinder-api/nova-api"
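
For illustration, a minimal sketch of what the corresponding Keystone section of haproxy.cfg could look like with keepalives disabled (the listener name, addresses and ports are examples only; the actual Fuel-generated configuration may differ):

  listen keystone-1
    bind 192.168.0.2:5000
    balance roundrobin
    # close the server-side connection after every response, so each new
    # request is balanced across controllers again instead of sticking to
    # the controller that handled the previous request on the same connection
    option http-server-close
    server controller-1 192.168.0.3:5000 check
    server controller-2 192.168.0.4:5000 check
    server controller-3 192.168.0.5:5000 check

With option http-server-close HAProxy keeps the client-side connection open but closes the server-side one after each response, which is what re-enables per-request balancing.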

The customer explained the rationale behind this change as follows:

"keepalive connections were getting broken after some number of processed requests - timing out, blocking connections. Actually there is no reason so use keepalives in our scenario. We need to balance request between controllers and keepalives were breaking that.

CPU usage immediately went down after the change as well."

So, my goals in filing this bug are:

1) Suggest that the QA team build load testing scenarios based on the load pattern the customer described (a sketch of one possible scenario follows this list)
2) Suggest that the Dev team evaluate the case and the feasibility of adopting the configuration change that helped the customer resolve the issue (disabling keepalives in HAProxy)
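
As a starting point for (1), a minimal sketch of such a scenario, assuming Rally is used for the load testing; the flavor and image names are placeholders, and the runner numbers simply mirror the "bundles of 10s" / "100s of consecutive requests" pattern described above:

  {
      "NovaServers.boot_and_delete_server": [
          {
              "args": {
                  "flavor": {"name": "m1.micro"},
                  "image": {"name": "TestVM"}
              },
              "runner": {
                  "type": "constant",
                  "times": 200,
                  "concurrency": 10
              },
              "context": {
                  "users": {"tenants": 2, "users_per_tenant": 5}
              }
          }
      ]
  }

Running such a scenario against the HAProxy VIP with keepalives enabled, and then again with option http-server-close, should show whether the Keystone timeouts and the stalled cinder-api/nova-api connections are reproducible.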

From Mirantis side, the following people have detailed technical context on this issue:
- Miroslav Anashkin <email address hidden>
- Tomasz Jaroszewski <email address hidden>
- Sergii Golovatiuk <email address hidden>

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
Dmitriy Novakovskiy (dnovakovskiy) wrote:

Looks like it's indeed a duplicate of https://bugs.launchpad.net/bugs/1413104.

But what about coverage with tests? It's kind of embarrassing to detect such a problem on a relatively small (15-node) customer deployment.
