HAProxy keepalives causing Keystone timeouts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | New | Undecided | Fuel Library (Deprecated) |
Bug Description
In one of the customer's deployments (MOS 5.0.1 with some packages installed from 5.1) we faced the following failure scenario:
- API requests to Keystone start timing out
- Consequently, a queue of requests to other services (Cinder, Nova) accumulates; the cloud stops reacting to new requests and stops executing already placed ones
According to the customer, the cloud usage pattern prior to the issue occurring was as follows: "Bear in mind that customer [meaning cloud end-users] is using the API heavily and they're deploying/
The support team enabled debug-level logging for Cinder and Keystone with the goal of investigating in more detail what happens when the issue occurs again. This never happened, since the customer addressed the issue with the following changes:
"we have fixed keystone issue by
a) haproxy configuration - disable keepalives:
option http-server-close
b) disable keystone logging
This change has fixed both issues:
1) timeout of keystone
2) stalled connections of cinder-
The customer explained the rationale behind this change as follows:
"keepalive connections were getting broken after some number of processed requests - timing out, blocking connections. Actually there is no reason to use keepalives in our scenario. We need to balance requests between controllers and keepalives were breaking that.
CPU usage immediately went down after the change as well."
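For reference, a minimal sketch of what the customer's two changes might look like. The listener name, bind addresses, server names, and exact file layout are assumptions for illustration only; they are not taken from the customer's environment.

```
# HAProxy: hypothetical Keystone frontend/backend stanza.
listen keystone-api
    bind 192.168.0.10:5000
    balance roundrobin
    # Disable keepalives: close the server-side connection after each
    # response, so every new request is re-balanced across controllers
    # instead of sticking to one long-lived connection.
    option http-server-close
    server controller-1 192.168.0.11:5000 check
    server controller-2 192.168.0.12:5000 check
    server controller-3 192.168.0.13:5000 check
```

```
# /etc/keystone/keystone.conf: turn off verbose/debug logging,
# as the customer described (exact option names assumed from the
# standard oslo.log settings of that era).
[DEFAULT]
debug = False
verbose = False
```

Note that `option http-server-close` still allows client-side keepalives; it only forces the connection to the backend server to be closed after each request, which is what re-enables per-request load balancing.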
My goals in filing this bug are:
1) Suggest that the QA team build load-testing scenarios matching the load pattern the customer described
2) Suggest that the Dev team evaluate the case and the feasibility of adopting the config change that helped the customer resolve the issue (disabling keepalives in HAProxy)
From the Mirantis side, the following people have detailed technical context on this issue:
- Miroslav Anashkin <email address hidden>
- Tomasz Jaroszewski <email address hidden>
- Sergii Golovatiuk <email address hidden>
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
Looks like it's indeed a duplicate of https://bugs.launchpad.net/bugs/1413104.
But what about test coverage? It's somewhat embarrassing to only detect such a problem on a relatively small (15-node) customer deployment.