Bacula Storage-daemon dies with segfault if a File-daemon can’t be contacted
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
bacula (Ubuntu) | Confirmed | Medium | Unassigned | | |
Bug Description
Binary package hint: bacula
Error message in daemon.log:
bacula-sd: Bacula interrupted by signal 11: Segmentation violation
Error messages given by the bacula-director:
Not accessible clients:
17-Aug 04:37 cxl05010-dir JobId 869: Warning: bsock.c:129 Could not connect to Client: CXW11010-fd on 192.168.11.10:9102. ERR=Connection timed out
Retrying ...
17-Aug 04:40 cxl05010-dir JobId 869: Fatal error: bsock.c:135 Unable to connect to Client: CXW11010-fd on 192.168.11.10:9102. ERR=Connection timed out
17-Aug 04:42 cxl05010-dir JobId 870: Warning: bsock.c:129 Could not connect to Client: CXW11011-fd on 192.168.11.11:9102. ERR=Connection timed out
Retrying ...
17-Aug 04:45 cxl05010-dir JobId 870: Fatal error: bsock.c:135 Unable to connect to Client: CXW11011-fd on 192.168.11.11:9102. ERR=Connection timed out
17-Aug 04:45 cxl05010-dir JobId 870: Error: openssl.c:86 TLS read/write failure.: ERR=error:
Subsequent backup-jobs which failed:
17-Aug 04:45 cxl05010-dir JobId 0: Warning: bsock.c:129 Could not connect to Storage daemon on cxb24010.
17-Aug 04:15 cxl05006-fd JobId 860: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
17-Aug 04:50 cxl05010-dir JobId 871: Warning: bsock.c:129 Could not connect to Storage daemon on cxb24010.
17-Aug 05:15 cxl05010-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on cxb24010.
17-Aug 11:01 cxl05010-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on cxb24010.
This error occurs if one or more File-daemons can’t be contacted by the bacula-director.
The Storage-daemon dies when the job whose File-daemon is unreachable is canceled by the director.
The biggest problem is that a single client whose File-daemon can’t be reached crashes the Storage-daemon, so all running and subsequent backups fail.
The segfault occurred several times: twice because the IP address of the client had changed while it wasn’t changed in the bacula-fd.conf, and once when the client was shut down.
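As a workaround until the crash itself is fixed, the clients could be probed before their jobs are started, so that a stale address or a powered-off machine is noticed early. The following is only a minimal sketch, not part of Bacula: the function names `fd_reachable` and `unreachable_clients` are made up for illustration, and the check only tests whether a TCP connection to the File-daemon port (9102, as seen in the director log) can be opened. It does not perform TLS or Bacula authentication.

```python
import socket


def fd_reachable(host, port=9102, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unroutable hosts.
        return False


def unreachable_clients(clients, timeout=5.0):
    """Return the names of all clients whose File-daemon port cannot be reached.

    `clients` maps a client name to an (address, port) tuple.
    """
    return [
        name
        for name, (addr, port) in clients.items()
        if not fd_reachable(addr, port, timeout)
    ]
```

Called with something like `unreachable_clients({"CXW11010-fd": ("192.168.11.10", 9102)})`, it would return the list of clients that should be skipped or fixed before the director tries to back them up.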
Performance should not be a problem: the RAID array can sustain a throughput to disk of 650 MB/s. Network capacity should be no problem either, as the storage server has a 10 Gbit connection and the clients have 1 Gbit each. CPU and memory load on the storage server is low too (about 10% of one CPU core per backup job, about 900 MB of RAM usage).
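To put the figures side by side, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 Gbit/s = 125 MB/s), ignores protocol overhead, and takes three clients sending at full line rate as an illustrative worst case. The helper name `gbit_to_mb_per_s` is made up for this sketch.

```python
def gbit_to_mb_per_s(gbit_per_s):
    """Convert a link speed in Gbit/s to MB/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


RAID_MB_PER_S = 650                        # disk throughput stated in the report
server_link = gbit_to_mb_per_s(10)         # storage server uplink: 1250.0 MB/s
client_link = gbit_to_mb_per_s(1)          # one client at line rate: 125.0 MB/s

concurrent_jobs = 3                        # assumed number of simultaneous jobs
aggregate = concurrent_jobs * client_link  # 375.0 MB/s

# True if neither the disks nor the server uplink would be saturated.
has_headroom = aggregate < RAID_MB_PER_S and aggregate < server_link
```

Under these assumptions the aggregate client traffic stays well below both the disk throughput and the server uplink, which supports the view that the crash is not load related.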
To make sure it is not a temporary issue, the bacula-director and the Storage-daemon were restarted.
To make sure it does not depend on the number of concurrent jobs, we tested it with three jobs at a time.
To make sure it is not a hardware-related issue, another Storage-daemon system was installed and tested.
We could always reproduce the crash of the Storage-daemon on other hardware under similar conditions.
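When reproducing the crash, it helps to know which clients the director failed to reach shortly before the Storage-daemon died. The sketch below pulls them out of director messages in the format quoted in this report. The regular expression is written against those sample lines only and is an assumption about the message layout, not something taken from the Bacula sources. The name `failed_clients` is made up for illustration.

```python
import re

# Matches the "Could not connect" / "Unable to connect" client messages
# as they appear in the director output quoted in this report.
CONNECT_FAILURE = re.compile(
    r"JobId (?P<jobid>\d+): (?P<level>Warning|Fatal error): bsock\.c:\d+ "
    r"(?:Could not|Unable to) connect to Client: (?P<client>\S+) "
    r"on (?P<addr>[\d.]+):(?P<port>\d+)"
)


def failed_clients(lines):
    """Collect unreachable clients from an iterable of director log lines.

    Returns a dict mapping client name to its address, the affected job ids,
    and whether a fatal (final) connection error was logged for it.
    """
    clients = {}
    for line in lines:
        m = CONNECT_FAILURE.search(line)
        if m is None:
            continue  # Storage-daemon errors and unrelated lines are skipped
        entry = clients.setdefault(
            m["client"],
            {"address": f"{m['addr']}:{m['port']}", "jobids": [], "fatal": False},
        )
        jobid = int(m["jobid"])
        if jobid not in entry["jobids"]:
            entry["jobids"].append(jobid)
        if m["level"] == "Fatal error":
            entry["fatal"] = True
    return clients
```

Feeding it the director log lines quoted above should yield entries for CXW11010-fd and CXW11011-fd, while the "Could not connect to Storage daemon" lines are ignored.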
Additional Info:
ProblemType: Bug
Uname: 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:28:05 UTC 2010 x86_64
Architecture: x86_64
Package the bug was found in: bacula-sd (5.0.1-1ubuntu1)
SourcePackage: bacula
Release: Ubuntu 10.04 LTS
Changed in bacula (Ubuntu):
importance: Undecided → Medium
Changed in bacula (Ubuntu):
status: New → Confirmed
I see the same behavior on our bacula server:
Dec 21 01:24:54 wopr bacula-sd: Bacula interrupted by signal 11: Segmentation violation
Dec 21 01:24:54 wopr kernel: [904270.954102] __ratelimit: 6 callbacks suppressed
Dec 21 01:24:54 wopr kernel: [904270.954107] bacula-sd[18266] general protection ip:43222b sp:7f6717d9e650 error:0 in bacula-sd[400000+53000]
Dec 21 01:24:54 wopr kernel: [904270.954692] bacula-sd[15837] general protection ip:43222b sp:7f671559a650 error:0 in bacula-sd[400000+53000]
21-Dec 00:49 wopr-dir JobId 13387: Fatal error: Network error with FD during Backup: ERR=Connection timed out
21-Dec 00:49 wopr.sepaq.com-sd JobId 13387: JobId=13387 Job="mars.2010-12-20_22.02.11_16" marked to be canceled.
21-Dec 00:49 wopr-dir JobId 13387: Fatal error: No Job status returned from FD.
21-Dec 01:24 jabba-fd JobId 13393: Fatal error: backup.c:1019 Network send error to SD. ERR=Broken pipe
21-Dec 01:55 wopr-dir JobId 13393: Error: Bacula wopr-dir 5.0.1 (24Feb10): 21-Dec-2010 01:55:05
21-Dec 01:24 wopr.sepaq.com-sd: ERROR in lock.c:268 Failed ASSERT: dev->blocked()
21-Dec 01:24 gaia-fd JobId 13381: Fatal error: backup.c:1019 Network send error to SD. ERR=Broken pipe
This is on Ubuntu 10.04 x86_64 with bacula 5.0.1-1ubuntu1.