Corosync init script doesn't shut down properly, causing split brain

Bug #505981 reported by halfgaar
This bug affects 1 person
Affects           Status   Importance  Assigned to  Milestone
openais (Ubuntu)  Triaged  Medium      Unassigned

Bug Description

Binary package hint: openais

When corosync is stopped with /etc/init.d/corosync stop, it does not shut down cleanly, and after the restart the underlying DRBD resource is in a split-brain state.

Bug is discussed here:

https://bugzilla.redhat.com/show_bug.cgi?id=525589

The init script makes a mention of that bug and it indeed seems that their 'fix' is included, yet I still get a split brain whenever I stop and start corosync, or when I reboot the machine.

It happens to me when I restart the master server, whereas the Red Hat report seems to say it happens when you restart the slave.

OpenAIS version: 1.0.0-4
Arch: i386
Ubuntu 9.10

drbd8-utils: 2:8.3.3-0ubuntu1

Ante Karamatić (ivoks) wrote : Re: [Ubuntu-ha] [Bug 505981] [NEW] Corosync init script doesn't shut down properly, causing split brain

On 11.01.2010 17:21, halfgaar wrote:

> The init script makes a mention of that bug and it indeed seems that
> their 'fix' is included, yet I still get a split brain whenever I stop
> and start corosync, or when I reboot the machine.

Unfortunately, that fix doesn't solve the issue. As a workaround, I put
my node in standby before stopping corosync.

IIRC, I've included that workaround in the init script. I'll check
whether that is actually the case.


Ante Karamatić (ivoks)
Changed in openais (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
halfgaar (wiebe-halfgaar) wrote :

This is what the init script says now:

do_stop()
{
        # Return
        # 0 if daemon has been stopped
        # 1 if daemon was already stopped
        # 2 if daemon could not be stopped
        # other if a failure occurred
        # Workaround for a shutdown bug in pacemaker
        # (https://bugzilla.redhat.com/show_bug.cgi?id=525589)
        if [ -r /usr/sbin/crm ]; then
                crm node standby
                start-stop-daemon --stop --quiet --retry=QUIT/5/QUIT/15 --pidfile $PIDFILE
                RETVAL="$?"
        else
                start-stop-daemon --stop --quiet --signal=QUIT --retry=5 --pidfile $PIDFILE
                RETVAL="$?"
        fi
        [ "$RETVAL" = 2 ] && return 2
        # Many daemons don't delete their pidfiles when they exit.
        rm -f $PIDFILE
        return "$RETVAL"
}

It does put the node on standby, but that isn't enough, apparently. Does this have to do with it being a background operation?

Also, when nodes are put in standby like that, they don't automatically start when corosync starts. So, the nodes are left offline when the machine boots.

Plus, it'd be better to check for -x, as opposed to -r.
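
A minimal sketch of how two of the points above (the possibly asynchronous standby and the -x check) might be tried out in do_stop. It is not the packaged script and has not been verified as a fix. It assumes the DRBD 8.3 /proc/drbd format, where the local role appears first in "ro:<local>/<peer>". The helper names drbd_is_primary and wait_for_demotion, the variables CRM, DRBD_STATUS and WAIT_SECS, and the default PIDFILE path are all hypothetical. It does not address nodes staying in standby after boot.

```shell
#!/bin/sh
# Hypothetical sketch only; not the packaged Ubuntu init script.
CRM="${CRM:-/usr/sbin/crm}"
PIDFILE="${PIDFILE:-/var/run/corosync.pid}"   # assumed path
DRBD_STATUS="${DRBD_STATUS:-/proc/drbd}"      # overridable for testing
WAIT_SECS="${WAIT_SECS:-60}"                  # assumed upper bound

# Succeeds while any local DRBD resource is still Primary.
# Assumes the DRBD 8.3 status format: "ro:<local role>/<peer role>".
drbd_is_primary()
{
        [ -r "$DRBD_STATUS" ] && grep -q 'ro:Primary/' "$DRBD_STATUS"
}

# Polls once per second until no local resource is Primary.
# Returns 0 once demoted, 1 if WAIT_SECS runs out first.
wait_for_demotion()
{
        remaining="$WAIT_SECS"
        while drbd_is_primary; do
                [ "$remaining" -le 0 ] && return 1
                sleep 1
                remaining=$((remaining - 1))
        done
        return 0
}

do_stop()
{
        # -x rather than -r: crm has to be executable, not just readable.
        if [ -x "$CRM" ]; then
                "$CRM" node standby
                # If standby is applied in the background, give the cluster
                # time to demote DRBD before the daemon is killed.
                wait_for_demotion || \
                        echo "warning: DRBD still Primary after ${WAIT_SECS}s" >&2
                start-stop-daemon --stop --quiet --retry=QUIT/5/QUIT/15 --pidfile "$PIDFILE"
                RETVAL="$?"
        else
                start-stop-daemon --stop --quiet --signal=QUIT --retry=5 --pidfile "$PIDFILE"
                RETVAL="$?"
        fi
        [ "$RETVAL" = 2 ] && return 2
        # Many daemons don't delete their pidfiles when they exit.
        rm -f "$PIDFILE"
        return "$RETVAL"
}
```

If the split brain disappears with the wait in place, that would support the idea that the standby request returns before the resources have actually been stopped; if it persists, the cause lies elsewhere.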

halfgaar (wiebe-halfgaar) wrote :

What might be useful to add is that I'm testing a cluster setup on two old machines: one 400 MHz with 192 MB RAM and a slow 2 GB disk, and one AMD 1600+ with 512 MB RAM and a slow, old 4 GB disk. Perhaps the problem only shows up on my machines because they are so slow. It's still a bug, of course, but it gives some additional insight.

I'll also try to patch the init script to work around the pacemaker problem a bit.
