Fuel for OpenStack

Raise max file descriptors and process file limits for corosync

Bug #1272840 reported by Matthew Mosesohn on 2014-01-26

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Committed	High	Roman Vyalov	Fuel for OpenStack 4.1

Bug Description

Corosync will hang if it runs out of file descriptors. We should raise our limits to keep this from recurring.

Additionally, when corosync runs out of file descriptors, it writes a huge volume of messages to /var/log/messages, which could cause a worse problem for a running MySQL DB or Fuel Master if either runs out of disk space.

Dmitry Pyzhov (dpyzhov) on 2014-02-03

Changed in fuel:
status:	New → Confirmed

Vladimir Kuklin (vkuklin) on 2014-02-03

Changed in fuel:
status:	Confirmed → Triaged
importance:	Undecided → High

Mike Scherbakov (mihgen) on 2014-02-04

Changed in fuel:
assignee:	Matthew Mosesohn (raytrac3r) → Fuel Library Team (fuel-library)

Dmitry Borodaenko (angdraug) wrote on 2014-02-05:

Is corosync the only service where this proved to be a problem?

Matthew Mosesohn (raytrac3r) wrote on 2014-02-05:

So far, yes, just Corosync encounters this issue.

Dmitry Borodaenko (angdraug) on 2014-02-05

tags:	added: ubuntu
tags:	removed: ubuntu

Dmitry Borodaenko (angdraug) on 2014-02-06

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)

Dmitry Borodaenko (angdraug) wrote on 2014-02-06:

0001-increase-file-descriptors-limit-in-the-init-script.patch (857 bytes, text/plain)

Dmitry Borodaenko (angdraug) wrote on 2014-02-06:

OSCI-1043 raised to update corosync packages with a patched init script.

Pavel Vaylov (pvaylov) wrote on 2014-02-06:

Folks, we really need a fix for deployed environments and the master node.

- This is a blocker issue that may lead to a 50-second delay when running nova commands.
- It may also lead to extremely high load on the controller node.
- A corosync instance affected by the issue becomes unusable.

lsof -n -p $(pidof crmd) | wc -l reports 1098

The default limit is 1024.
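The comparison above can also be made from /proc, which avoids one caveat of the lsof count: lsof output includes a header line and entries that are not descriptors (such as mapped libraries and the working directory), so its line count can overstate real usage. A minimal sketch, assuming a Linux host; fd_usage is a hypothetical helper name and is demonstrated on the current shell rather than on crmd:

```shell
# fd_usage is a hypothetical helper; pass the PID of the process to inspect.
fd_usage() {
    pid=$1
    # Each entry in /proc/<pid>/fd corresponds to one open descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    # In /proc/<pid>/limits, the 4th field of the "Max open files" row is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$soft"
}

# Shown on the current shell; on a controller one would pass "$(pidof crmd)" instead.
fd_usage "$$"
```

Reading /proc/<pid>/limits shows the limits the running process actually has, which is useful for checking whether a later change took effect.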

Tried to raise the limits "hot", on the running system:

su -m hacluster -c "ulimit -Sn 4096"
su -m hacluster -c "ulimit -Hn 10240"

That did not help.

Then tried editing /etc/security/limits.conf, adding:

hacluster soft nofile 4096
hacluster hard nofile 10240

and rebooting the node.

That did not help either.
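A likely reason the su attempt had no effect: a ulimit setting applies only to the shell that runs it and to processes started from that shell afterwards. Each su -c call above starts a shell, changes that shell's limit, and exits, so an already-running crmd never sees the change. A minimal sketch of that behavior, assuming a Linux host with a POSIX-style shell; the value 256 is arbitrary and chosen only for the demonstration:

```shell
# Soft nofile limit of the current shell before anything is changed.
before=$(ulimit -Sn)

# Change the soft limit inside a child shell and print what the child sees.
# The child exits right away, taking its modified limit with it.
child=$(sh -c 'ulimit -Sn 256; ulimit -Sn')

# The parent shell (and any process already running) keeps its old limit.
after=$(ulimit -Sn)

echo "before=$before child=$child after=$after"
```

This is why the workaround discussed in this thread places the ulimit call in the script that launches the daemon: the limit has to be set in the process that starts corosync, before it starts.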

Workaround:

Insert "ulimit -n 1024000" into the start command of the init script, just before corosync starts.

We have not tested this, though.

One more addition: the bug description does not include the line from crmd.log where crmd complains about too many open files:

2014-02-05T18:22:08.928963+00:00 err: error: qb_ipcs_us_connection_acceptor: Could not accept client connection: Too many open files (24)

Questions:

- Why did the issue affect only one controller?
- Is there a fix that does not require restarting any services?

Roman Alekseenkov (ralekseenkov) on 2014-02-06

tags:

added: customer-found

Matthew Mosesohn (raytrac3r) wrote on 2014-02-06:

The limits issue is supposed to be fixed in libqb 0.16 according to this thread:
http://oss.clusterlabs.org/pipermail/pacemaker/2013-November/020087.html

We need to update corosync, pacemaker, and crmd in sync with libqb, or else it won't work.

Matthew Mosesohn (raytrac3r) wrote on 2014-02-06:

OSCI-1049 created to address the upgrade.
The patch Dmitry B proposed, which raises the limits when launching crmd by updating corosync itself, may prove useful if there are other FD leaks, but there is definitely an FD leak fix that went into libqb 0.16 here:
https://github.com/ClusterLabs/libqb/commit/b327dbec7380e7de6896f9bb6cb1ca58677f4ed8

Dmitry Borodaenko (angdraug) on 2014-02-06

Changed in fuel:
status:	Triaged → In Progress

Matthew Mosesohn (raytrac3r) wrote on 2014-02-07:

We shouldn't be putting this in /etc/init.d/, but in /etc/default/corosync (Ubuntu) and /etc/sysconfig/corosync (CentOS).

I did a manual test and both do seem to work, but from a packager's perspective, these customizations belong in the vendor customization dir and not in the init script. Also, it wouldn't be a pain to configure this via puppet (as we probably should), instead of patching corosync.
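A minimal sketch of what such a defaults file might contain, assuming the init script sources /etc/default/corosync or /etc/sysconfig/corosync before launching the daemon (worth verifying against the packaged script). The variable names WANT_NOFILE and nofile_status are hypothetical; the value is the one suggested earlier in this thread:

```shell
# Hypothetical fragment for /etc/default/corosync (Ubuntu) or /etc/sysconfig/corosync (CentOS).
# Assumption: the init script sources this file as root before starting corosync.
WANT_NOFILE=1024000

# Raise the open-file limit for this shell and everything it starts afterwards.
# Raising past the current hard limit needs root, so report instead of aborting on failure.
if ulimit -n "$WANT_NOFILE" 2>/dev/null; then
    nofile_status=raised
else
    nofile_status=unchanged
    echo "could not raise nofile limit to $WANT_NOFILE" >&2
fi
```

Keeping the setting in the defaults file means a package update that replaces the init script does not silently drop it, and the file is simple to manage from puppet.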

Andrew Woodward (xarses) on 2014-02-07

Changed in fuel:
importance:	High → Critical

Vladimir Kuklin (vkuklin) on 2014-02-09

Changed in fuel:
importance:	Critical → High

Vladimir Kuklin (vkuklin) wrote on 2014-02-10:

Guys, we need to either update libqb along with corosync and pacemaker, or cherry-pick the corresponding patch.

Vladimir Kuklin (vkuklin) on 2014-02-14

Changed in fuel:
assignee:	Dmitry Borodaenko (dborodaenko) → Roman Vyalov (r0mikiam)

Vladimir Kuklin (vkuklin) wrote on 2014-02-18:

#10

Fixed by updating the libqb library.

Changed in fuel:
status:	In Progress → Fix Committed

mauro (maurof) wrote on 2014-03-27:

#11

Hi all,
I'm running Fuel 4.0 and can hardly move to a newer release right now. Could you kindly suggest a "hot" workaround that fixes the problem without patching?

Based on the discussion above (Pavel's comment), the only option that fits my needs seems to be:
1) edit /etc/init.d/corosync
2) add a "ulimit -n 1024000" line before any "start" command in the script
3) restart corosync

Can you confirm this workaround? Is there any other choice?

This bug heavily affects the system: each time it occurs, I need to restart all three controller nodes in sequence to recover.

Many thanks for the support,

mauro

Andrew Woodward (xarses) on 2014-04-04

tags:

added: ha