Bacula Storage-daemon dies with segfault if a File-daemon can’t be contacted

Bug #622742 reported by Philipp Lindt
This bug affects 2 people
Affects: bacula (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Binary package hint: bacula

Error message in daemon.log:
bacula-sd: Bacula interrupted by signal 11: Segmentation violation

Error messages given by the bacula-director:

Not accessible clients:
17-Aug 04:37 cxl05010-dir JobId 869: Warning: bsock.c:129 Could not connect to Client: CXW11010-fd on 192.168.11.10:9102. ERR=Connection timed out
Retrying ...
17-Aug 04:40 cxl05010-dir JobId 869: Fatal error: bsock.c:135 Unable to connect to Client: CXW11010-fd on 192.168.11.10:9102. ERR=Connection timed out

17-Aug 04:42 cxl05010-dir JobId 870: Warning: bsock.c:129 Could not connect to Client: CXW11011-fd on 192.168.11.11:9102. ERR=Connection timed out
Retrying ...
17-Aug 04:45 cxl05010-dir JobId 870: Fatal error: bsock.c:135 Unable to connect to Client: CXW11011-fd on 192.168.11.11:9102. ERR=Connection timed out
17-Aug 04:45 cxl05010-dir JobId 870: Error: openssl.c:86 TLS read/write failure.: ERR=error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac

Subsequent backup-jobs which failed:
17-Aug 04:45 cxl05010-dir JobId 0: Warning: bsock.c:129 Could not connect to Storage daemon on cxb24010.consultix.admin:9103. ERR=Connection refused
(Note that the JobId is 0.)

17-Aug 04:15 cxl05006-fd JobId 860: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe

17-Aug 04:50 cxl05010-dir JobId 871: Warning: bsock.c:129 Could not connect to Storage daemon on cxb24010.consultix.admin:9103. ERR=Connection refused

17-Aug 05:15 cxl05010-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on cxb24010.consultix.admin:9103. ERR=Connection refused

17-Aug 11:01 cxl05010-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on cxb24010.consultix.admin:9103. ERR=Connection refused

This error occurs if one (or more) File-daemons can’t be contacted by the bacula-director.

The Storage-daemon dies when the job whose file-daemon is unreachable is canceled
by the director.

The biggest problem is that a single client whose file-daemon can’t be reached will crash the storage-daemon, and all running and subsequent backups will then fail.

The segfault occurred several times: twice because the IP address of the client had changed while
it wasn’t changed in the bacula-fd.conf, and once when the client was shut down.
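
As a possible stopgap until the crash itself is fixed, the reachability of each file-daemon could be checked before its job is started. The following is only a minimal sketch: the helper name fd_reachable is hypothetical, and it only tests whether a TCP connection to the file-daemon port (9102 in the logs above) can be opened. It does not speak the Bacula protocol and says nothing about TLS or authentication.

```python
import socket


def fd_reachable(host: str, port: int = 9102, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper, not part of Bacula: it only checks that something
    accepts connections on the file-daemon port.
    """
    try:
        # create_connection raises OSError (timeout, refused, no route, ...)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling fd_reachable("192.168.11.10") for the first client in the logs above would be expected to return False while that client is unreachable. How such a check is wired into the job schedule is left open here.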

Performance should not be a factor: the RAID array can sustain a throughput to disk of 650 MByte/s, and the network capacity should be no problem either, as the storage server has a 10 Gbit connection and the clients have a 1 Gbit one each. CPU and memory load on the storage server is low too (about 10% of one CPU core per backup job, about 900 MB of RAM usage).

To make sure it is not a temporary issue, the bacula-director and the storage-daemon were restarted.
To make sure it does not depend on the number of concurrent jobs, we tested it with three jobs at a time.
To make sure it is not a hardware-related issue, another storage-daemon system was installed and tested.

We could always reproduce the crash of the storage-daemon on other hardware under similar conditions.

Additional Info:

ProblemType: Bug
Uname: 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:28:05 UTC 2010 x86_64
Architecture: x86_64
Package the bug was found in: bacula-sd (5.0.1-1ubuntu1)
SourcePackage: bacula
Release: Ubuntu 10.04 LTS

Mathias Gug (mathiaz)
Changed in bacula (Ubuntu):
importance: Undecided → Medium
jraby (raby-jean) wrote :

I see the same behavior on our bacula server:

Dec 21 01:24:54 wopr bacula-sd: Bacula interrupted by signal 11: Segmentation violation
Dec 21 01:24:54 wopr kernel: [904270.954102] __ratelimit: 6 callbacks suppressed
Dec 21 01:24:54 wopr kernel: [904270.954107] bacula-sd[18266] general protection ip:43222b sp:7f6717d9e650 error:0 in bacula-sd[400000+53000]
Dec 21 01:24:54 wopr kernel: [904270.954692] bacula-sd[15837] general protection ip:43222b sp:7f671559a650 error:0 in bacula-sd[400000+53000]

21-Dec 00:49 wopr-dir JobId 13387: Fatal error: Network error with FD during Backup: ERR=Connection timed out
21-Dec 00:49 wopr.sepaq.com-sd JobId 13387: JobId=13387 Job="mars.2010-12-20_22.02.11_16" marked to be canceled.
21-Dec 00:49 wopr-dir JobId 13387: Fatal error: No Job status returned from FD.
21-Dec 01:24 jabba-fd JobId 13393: Fatal error: backup.c:1019 Network send error to SD. ERR=Broken pipe
21-Dec 01:55 wopr-dir JobId 13393: Error: Bacula wopr-dir 5.0.1 (24Feb10): 21-Dec-2010 01:55:05
21-Dec 01:24 wopr.sepaq.com-sd: ERROR in lock.c:268 Failed ASSERT: dev->blocked()
21-Dec 01:24 gaia-fd JobId 13381: Fatal error: backup.c:1019 Network send error to SD. ERR=Broken pipe

This is on 10.04 x86_64 bacula 5.0.1-1ubuntu1
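
Both kernel lines above report the same instruction pointer, so the two bacula-sd threads appear to have faulted at the same place. A minimal sketch for pulling the fault address out of such lines follows; the regular expression and the function name fault_offset are illustrative assumptions, written only against the two log lines quoted in this comment.

```python
import re

# Matches kernel lines of the form quoted above, e.g.
#   bacula-sd[18266] general protection ip:43222b sp:... error:0 in bacula-sd[400000+53000]
SEGV_LINE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\] general protection "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object, pid, offset of ip within the mapping), or None if no match."""
    m = SEGV_LINE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    # Sanity check: the instruction pointer should lie inside the reported mapping.
    if not base <= ip < base + size:
        return None
    return m["obj"], int(m["pid"]), ip - base


line = ("Dec 21 01:24:54 wopr kernel: [904270.954107] bacula-sd[18266] "
        "general protection ip:43222b sp:7f6717d9e650 error:0 "
        "in bacula-sd[400000+53000]")
obj, pid, offset = fault_offset(line)
print(f"{obj} pid {pid}: fault at offset {offset:#x}")
```

Assuming bacula-sd is a non-relocated executable mapped at 0x400000, as the base in the log suggests, the reported ip (0x43222b) could be passed to addr2line -e together with a binary that has matching debug symbols to locate the faulting function.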

Tom Ellis (tellis)
Changed in bacula (Ubuntu):
status: New → Confirmed
Chuck Short (zulcss) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please try to obtain a backtrace following the instructions at http://wiki.ubuntu.com/DebuggingProgramCrash and upload the backtrace (as an attachment) to the bug report. This will greatly help us in tracking down your problem.

Philipp Lindt (plindt) wrote :

I have added a gdb-backtrace of the last segfault.

Philipp Lindt (plindt) wrote :