gdm leaking filehandles, causing "too many open files"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gdm (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
At a client's site (a large mining company here in Queensland) we have an Ubuntu 10.04 virtual machine running MacroView SCADA in a DRBD high availability cluster. Workstations connect to the server via XDMCP for control of the local plant.
In the past they ran two discrete Ubuntu servers, which used to run reliably for months at a time. Since the switch to the high availability cluster, gdm refuses to accept new logins after about 6 weeks of operation until the gdm service is restarted. Restarting gdm disconnects everyone currently logged in, so every 6 weeks everyone loses control of the processing plant for a few seconds while people re-connect.
The logs show the following:
Sep 20 08:23:38 cgmv1 gdm-binary[31050]: CRITICAL: could not add display to access file: Too many open files
Sep 20 08:23:38 cgmv1 gdm-binary[31050]: WARNING: Unable to set up access control for display 1691
Sep 20 08:23:38 cgmv1 gdm-binary[31050]: WARNING: GdmDisplay: display lasted 0.010690 seconds
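"Too many open files" is the error text for a process hitting its per-process descriptor limit, so it is worth checking how close gdm-binary is to that limit before it starts refusing logins. Below is a minimal sketch, assuming a Linux /proc filesystem and a POSIX shell. fd_usage is a hypothetical helper name I made up for illustration, not something shipped with gdm.

```shell
#!/bin/sh
# Sketch: print "<open descriptors> <soft limit>" for a given PID.
# fd_usage is a hypothetical helper name, not part of gdm.
fd_usage() {
    pid="$1"
    # Each entry under /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # The fourth field of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open $limit"
}

# Example, run as root, assuming pidof returns a single PID:
#   fd_usage "$(pidof gdm-binary)"
```

If the first number creeps towards the second over days or weeks, the daemon is heading for the failure shown in the log above.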
A search suggests this is the file-handle leak that was reported to Red Hat back in February 2010:
https:/
Comment #2 of that bug has a patch that allegedly fixes the problem. I have hand-applied the patch against the latest stable source package of gdm (2.30.2.is.2.30.0), which I have attached here and am in the process of testing.
Okay, I can confirm this would appear to fix the problem.
Test procedure:
1. Log in remotely to the affected machine using ssh and run the following command:
$ sudo watch ls -l /proc/$( pidof gdm-binary )/fd
2. Start a remote X session with that host (through Xnest, plain X, whatever)... observe that new files are opened.
3. Close the X session and observe that the files are not removed from the list. In my case, it looked like this:
Vanilla Ubuntu gdm build.
After initial start-up.
total 0
lrwx------ 1 root root 64 2012-09-20 14:34 0 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:34 1 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:34 2 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:34 3 -> socket:[78942]
lr-x------ 1 root root 64 2012-09-20 14:34 4 -> pipe:[78945]
l-wx------ 1 root root 64 2012-09-20 14:34 5 -> pipe:[78945]
lr-x------ 1 root root 64 2012-09-20 14:34 6 -> inotify
lr-x------ 1 root root 64 2012-09-20 14:34 7 -> pipe:[78952]
l-wx------ 1 root root 64 2012-09-20 14:34 8 -> pipe:[78952]
lrwx------ 1 root root 64 2012-09-20 14:34 9 -> /var/run/gdm/auth-for-gdm-omI8J6/database
lr-x------ 1 root root 64 2012-09-20 14:34 10 -> pipe:[78971]
l-wx------ 1 root root 64 2012-09-20 14:34 11 -> pipe:[78971]
lr-x------ 1 root root 64 2012-09-20 14:34 12 -> pipe:[78937]
lrwx------ 1 root root 64 2012-09-20 14:34 13 -> socket:[78973]
lrwx------ 1 root root 64 2012-09-20 14:34 14 -> socket:[78974]
After a few logins...
total 0
lrwx------ 1 root root 64 2012-09-20 14:38 0 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:38 1 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:38 2 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:38 3 -> socket:[78942]
lr-x------ 1 root root 64 2012-09-20 14:38 4 -> pipe:[78945]
l-wx------ 1 root root 64 2012-09-20 14:38 5 -> pipe:[78945]
lr-x------ 1 root root 64 2012-09-20 14:38 6 -> inotify
lr-x------ 1 root root 64 2012-09-20 14:38 7 -> pipe:[78952]
l-wx------ 1 root root 64 2012-09-20 14:38 8 -> pipe:[78952]
lrwx------ 1 root root 64 2012-09-20 14:38 9 -> /var/run/gdm/auth-for-gdm-omI8J6/database
lr-x------ 1 root root 64 2012-09-20 14:38 10 -> pipe:[78971]
l-wx------ 1 root root 64 2012-09-20 14:38 11 -> pipe:[78971]
lr-x------ 1 root root 64 2012-09-20 14:38 12 -> pipe:[78937]
lrwx------ 1 root root 64 2012-09-20 14:38 13 -> socket:[78973]
lrwx------ 1 root root 64 2012-09-20 14:38 14 -> socket:[78974]
lrwx------ 1 root root 64 2012-09-20 14:38 15 -> /var/run/gdm/auth-for-gdm-bqKAEb/database
lrwx------ 1 root root 64 2012-09-20 14:38 16 -> /var/run/gdm/auth-for-gdm-QC37Jk/database
lrwx------ 1 root root 64 2012-09-20 14:38 17 -> /var/run/gdm/auth-for-vrtadmin-4oPw6L/database
lrwx------ 1 root root 64 2012-09-20 14:38 18 -> /var/run/gdm/auth-for-gdm-ePg83T/database
lrwx------ 1 root root 64 2012-09-20 14:38 19 -> /var/run/gdm/auth-for-vrtadmin-YMcsem/database
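To put a number on the leak rather than eyeballing the listing, the descriptors pointing under /var/run/ can be counted before and after a session. Below is a minimal sketch, assuming a Linux /proc filesystem and a POSIX shell. count_fds_matching is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch: count the descriptors of a process whose link target
# contains a given substring.
# count_fds_matching is a hypothetical helper name, not part of gdm.
count_fds_matching() {
    pid="$1"
    pattern="$2"
    n=0
    for fd in /proc/"$pid"/fd/*; do
        # Skip entries that vanish or cannot be read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *"$pattern"*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Example, run as root, assuming pidof returns a single PID:
#   count_fds_matching "$(pidof gdm-binary)" /var/run/
```

In the vanilla listings above, this count goes from 1 after start-up to 6 after a few logins.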
Now, build gdm with the given patch here, re-start gdm, and try again with the same procedure. In my case, I see:
Patched gdm build.
After initial start-up.
total 0
lrwx------ 1 root root 64 2012-09-20 14:49 0 -> /dev/null
lrwx------ 1 root root 64 2012-09-20 14:49 1 -> /dev/null
lrwx------ 1 root root 64 2012-09...