Disabling management net on a single swift proxy node leads to a very long swift response time
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
6.0.x | Won't Fix | High | MOS Maintenance |
Mirantis OpenStack | Fix Released | High | Vladimir Kuklin |
Bug Description
Version: 6.1, ISO #474.
Full version available at http://
Steps to reproduce:
1. Install environment with Swift with 3 controllers and 1 compute node
2. Connect to some controller and disable management network here using the following command:
iptables -I INPUT -i br-mgmt -j DROP && iptables -I OUTPUT -o br-mgmt -j DROP
3. Connect to _another_ controller and execute 10 times command 'swift list' here.
Sometimes the command takes a very long time: more than a minute. On average, when this happens, the response returns in about 70 seconds. It may happen on every invocation, or on every 2nd or 3rd one, depending on circumstances I do not understand.
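The measurement in step 3 can be scripted. A minimal sketch (assumes the `swift` CLI is installed and OpenStack credentials are already exported, e.g. via `source openrc`; the loop only times the calls, it does not check their output):

```shell
# Run 'swift list' 10 times and report how long each call takes,
# to spot the ~70 s outliers described above.
for i in $(seq 1 10); do
    start=$(date +%s)
    swift list > /dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    echo "run $i: ${elapsed}s"
done
```

Any run that takes noticeably longer than a second or two is a candidate for the haproxy-retry behaviour analysed below.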
Analysis:
The issue occurs when haproxy routes the user's Swift request to the firewalled node. Swift on that node tries to validate the user's token and times out, because it cannot reach Keystone's admin URL (which is on the management net). Haproxy waits one minute for a response and then resends the request to another node. As a result, the request takes slightly more than a minute to be processed.
A similar issue would affect other OpenStack components, but haproxy detects that all services on the node except Swift are dead. Haproxy detects service failure by connecting to each service's endpoint, which listens on the management (br-mgmt) network, i.e. the firewalled one. Swift's endpoint listens on the storage interface (br-storage), so haproxy believes Swift is still alive on the firewalled node.
In general, the problem is that haproxy's health checks are too 'weak': checking that the service's port is reachable is not enough. We probably need to temporarily disable a service on a node if it keeps failing.
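One possible direction, sketched below as a hypothetical fragment of the Swift backend section in haproxy's configuration (all names and addresses are illustrative, not taken from the snapshot): replace the bare TCP port check with an HTTP probe of Swift's `/healthcheck` middleware. On its own this still probes the storage interface, so to catch this particular failure the check would also need to exercise an address on br-mgmt; the fragment only illustrates the stronger check itself.

```
# Illustrative sketch, not the deployed config.
listen swift-proxy
    bind 10.0.0.10:8080
    mode http
    balance roundrobin
    # HTTP check against Swift's healthcheck middleware instead of
    # a plain TCP connect; 'fall 3' marks the server down after
    # three consecutive failures.
    option httpchk GET /healthcheck
    server node-1 192.168.1.2:8080 check inter 2000 rise 2 fall 3
    server node-2 192.168.1.3:8080 check inter 2000 rise 2 fall 3
```

This does not by itself implement the "temporarily disable a constantly failing service" idea, but `fall`/`rise` thresholds are haproxy's built-in mechanism for keeping a flapping backend out of rotation.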
Attached is a snapshot of the environment in which the management interface of one node (node-2) was firewalled. In the haproxy log of node-1 you can see how the swift requests were handled. Also, in the swift-proxy log of node-2 you can find swift trying to connect to keystone. The snapshot can be downloaded at: https:/
Changed in mos: | |
importance: | Undecided → High |
milestone: | none → 6.1 |
description: | updated |
description: | updated |
description: | updated |
tags: | added: low-hanging-fruit |
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando) |
tags: | removed: low-hanging-fruit |
Changed in mos: | |
status: | New → Triaged |
importance: | Undecided → High |
assignee: | nobody → MOS Swift (mos-swift) |
milestone: | none → 7.0 |
summary: |
- Disabling management net on a single swift node leads to a very long swift response time
+ Disabling management net on a single swift proxy node leads to a very long swift response time |
Changed in fuel: | |
assignee: | Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin) |
Changed in fuel: | |
assignee: | Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando) |
Changed in mos: | |
assignee: | MOS Swift (mos-swift) → Fuel Library Team (fuel-library) |
assignee: | Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin) |
status: | Triaged → Fix Committed |
tags: | added: on-verification |
Library people, please take a look at the issue: can you suggest a fix viable for 6.1? If not, I suggest moving the issue to 7.0, as it is not a very common failure scenario.