Raise max file descriptors and process file limits for corosync

Bug #1272840 reported by Matthew Mosesohn
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Committed | High | Roman Vyalov |

Bug Description

Corosync will hang if it runs out of file descriptors. We should raise our limits to keep this from recurring.

Additionally, when corosync runs out of file descriptors, it spews a huge volume of messages into /var/log/messages, which could create a worse problem for a running MySQL DB or the Fuel Master if either runs out of disk space.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: New → Confirmed
Changed in fuel:
status: Confirmed → Triaged
importance: Undecided → High
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Library Team (fuel-library)
Dmitry Borodaenko (angdraug) wrote :

Is corosync the only service where this proved to be a problem?

Matthew Mosesohn (raytrac3r) wrote :

So far, yes, just Corosync encounters this issue.

tags: added: ubuntu
tags: removed: ubuntu
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Dmitry Borodaenko (angdraug) wrote :

OSCI-1043 raised to update corosync packages with a patched init script.

Pavel Vaylov (pvaylov) wrote :

Folks, we really need a fix for deployed environments and the master node.

- This is a blocker issue that may lead to a 50-second delay when running nova commands.
- This issue may also lead to extremely high load on the controller node.
- A corosync instance affected by the issue becomes unusable.

lsof -n -p $(pidof crmd) | wc -l reports 1098

The default limit is 1024.
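
To compare a process's open descriptor count against its own limit without lsof, something like the following could be used. It is only a sketch: the fd_usage helper is a hypothetical name, and it assumes a Linux /proc filesystem and enough privileges to read the target process's entries.

```shell
# Hypothetical helper: print "<open fds> <soft nofile limit>" for a PID.
# Assumes Linux, where /proc/<pid>/fd holds one entry per open descriptor
# and /proc/<pid>/limits has a "Max open files" row (soft limit in field 4).
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $soft"
}
```

For the case in this comment it might be called as fd_usage "$(pidof crmd)", run as root.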

Tried to raise the limits "hot" (on the running system):

su -m hacluster -c "ulimit -Sn 4096"
su -m hacluster -c "ulimit -Hn 10240"

But didn't get lucky.

Tried editing /etc/security/limits.conf, adding:

hacluster soft nofile 4096
hacluster hard nofile 10240

then rebooted the node.

But didn't get lucky.
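
For context on why neither attempt changed the running crmd: a limit set with ulimit applies only to the shell that sets it and to processes started from that shell afterwards, and /etc/security/limits.conf is applied by pam_limits to PAM sessions such as logins, which does not necessarily cover daemons launched from an init script at boot. The sketch below illustrates the inheritance behaviour; child_nofile_limit is a hypothetical helper name and 200 is an arbitrary value chosen for the demonstration.

```shell
# Hypothetical helper: set a soft nofile limit inside a subshell, then
# report the limit seen by a child process started from that subshell.
child_nofile_limit() {
    ( ulimit -S -n "$1" && sh -c 'ulimit -S -n' )
}

# The calling shell's own limit is untouched, because the change was
# confined to the subshell and the child it started.
before=$(ulimit -S -n)
child=$(child_nofile_limit 200)
after=$(ulimit -S -n)
echo "child=$child parent_before=$before parent_after=$after"
```

The same reasoning suggests that a ulimit call only helps when it runs in the shell that then starts corosync.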

Workaround:

Insert "ulimit -n 1024000" into the start command of the init script, just before corosync starts.

But we didn't test it.

One more addition: the bug description does not include the line from crmd.log where crmd complains about too many open files:

2014-02-05T18:22:08.928963+00:00 err: error: qb_ipcs_us_connection_acceptor: Could not accept client connection: Too many open files (24)

Questions:

 - Why did the issue affect only one controller?
 - Is there a fix that does not require restarting any services?

tags: added: customer-found
Matthew Mosesohn (raytrac3r) wrote :

The limits issue is supposed to be fixed in libqb 0.16 according to this thread:
http://oss.clusterlabs.org/pipermail/pacemaker/2013-November/020087.html

We need to update corosync, pacemaker, and crmd in sync with libqb, or else it won't work.

Matthew Mosesohn (raytrac3r) wrote :

OSCI-1049 created to address upgrade.
The patch Dmitry B proposed, which raises the limits when launching crmd by updating corosync itself, may prove useful if there are other FD leaks, but there is definitely an FD leak fix that landed in libqb 0.16 here:
https://github.com/ClusterLabs/libqb/commit/b327dbec7380e7de6896f9bb6cb1ca58677f4ed8

Changed in fuel:
status: Triaged → In Progress
Matthew Mosesohn (raytrac3r) wrote :

We shouldn't be putting this in /etc/init.d/, but in /etc/default/corosync (Ubuntu) and /etc/sysconfig/corosync (CentOS).

I did a manual test and both seem to work, but from a packager's perspective, these customizations belong in the vendor customization directory and not in the init script. Also, it wouldn't be a pain to configure this via Puppet (as we probably should) instead of patching corosync.
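
As a rough illustration of this approach, the defaults file could carry a line like the one below. This assumes the packaged init script sources the file before starting the daemon, as the manual test mentioned above suggests, and the value is simply the one proposed earlier in this thread, not a tested recommendation.

```shell
# /etc/default/corosync (Ubuntu) or /etc/sysconfig/corosync (CentOS)
# Assumption: the init script sources this file before launching corosync,
# so the raised limit is inherited by corosync and the processes it starts.
ulimit -n 1024000
```

Keeping the setting in the defaults file means a package upgrade that replaces the init script would not silently drop it.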

Andrew Woodward (xarses)
Changed in fuel:
importance: High → Critical
Changed in fuel:
importance: Critical → High
Vladimir Kuklin (vkuklin) wrote :

Guys, we need to either update libqb along with corosync and pacemaker, or cherry-pick the corresponding patch.

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Roman Vyalov (r0mikiam)
Vladimir Kuklin (vkuklin) wrote :

Fixed by updating the libqb library.

Changed in fuel:
status: In Progress → Fix Committed
mauro (maurof) wrote :

Hi all,
Since I'm running Fuel 4.0 and can hardly move to a newer release right now, could you kindly suggest a "hot" workaround to fix the problem without patching?

Based on your discussion (Pavel), the only option that fits my needs is to:
1) edit "/etc/init.d/corosync"
2) add a "ulimit -n 1024000" line before any "start" command in the script
3) restart corosync

Can you confirm this workaround? Is there any other choice?

This bug heavily affects the system: each time it occurs, I need to restart all three controller nodes in sequence to recover.

Many thanks for the support,

mauro

Andrew Woodward (xarses)
tags: added: ha