Ubuntu
openais package

Corosync init script doesn't shut down properly, causing split brain

Bug #505981 reported by halfgaar on 2010-01-11

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	openais (Ubuntu)	Triaged	Medium	Unassigned

Bug Description

Binary package hint: openais

When shutting down corosync (/etc/init.d/corosync stop), it does not shut down cleanly, and after a restart the underlying DRBD resource is in a split-brain state.

The bug is discussed here:

https://bugzilla.redhat.com/show_bug.cgi?id=525589

The init script mentions that bug, and their 'fix' does seem to be included, yet I still get a split brain whenever I stop and start corosync, or when I reboot the machine.

It happens to me when I restart the master server, but they seem to be saying it happens when you restart the slave.

OpenAIS version: 1.0.0-4
Arch: i386
Ubuntu 9.10

drbd8-utils: 2:8.3.3-0ubuntu1

Ante Karamatić (ivoks) wrote on 2010-01-11: Re: [Ubuntu-ha] [Bug 505981] [NEW] Corosync init script doesn't shut down properly, causing split brain

On 11.01.2010 17:21, halfgaar wrote:

> The init script makes a mention of that bug and it indeed seems that
> their 'fix' is included, yet I still get a split brain whenever I stop
> and start corosync, or when I reboot the machine.

Unfortunately, that fix doesn't solve the issue. As a workaround, I put
my node offline before stopping corosync.

IIRC, I've included that workaround in the init script. I'll check
whether that is the case.

Status: Triaged

Ante Karamatić (ivoks) on 2010-01-11

Changed in openais (Ubuntu):
status:	New → Triaged
importance:	Undecided → Medium

halfgaar (wiebe-halfgaar) wrote on 2010-01-11:

This is what the init script says now:

do_stop()
{
        # Return
        # 0 if daemon has been stopped
        # 1 if daemon was already stopped
        # 2 if daemon could not be stopped
        # other if a failure occurred
        # Workaround for a shutdown bug in pacemaker
        # (https://bugzilla.redhat.com/show_bug.cgi?id=525589)
        if [ -r /usr/sbin/crm ]; then
                crm node standby
                start-stop-daemon --stop --quiet --retry=QUIT/5/QUIT/15 --pidfile $PIDFILE
                RETVAL="$?"
        else
                start-stop-daemon --stop --quiet --signal=QUIT --retry=5 --pidfile $PIDFILE
                RETVAL="$?"
        fi
        [ "$RETVAL" = 2 ] && return 2
        # Many daemons don't delete their pidfiles when they exit.
        rm -f $PIDFILE
        return "$RETVAL"
}

It does put the node in standby, but apparently that isn't enough. Could this be because the standby change is carried out in the background?
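
One plausible reading of the question above: `crm node standby` only records the standby request in the cluster configuration and returns before Pacemaker has actually demoted and stopped the resources, so start-stop-daemon may signal corosync while DRBD is still Primary on this node. This is an assumption, not something confirmed in this report. The sketch below shows one way to wait for the standby to take effect before stopping corosync. `wait_until` and `drbd_not_primary` are hypothetical helpers, and the DRBD resource name `r0` is an assumed placeholder.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not part of the Ubuntu init script.

# Poll a check command once per second until it succeeds or the limit
# (in seconds) expires. Usage: wait_until LIMIT COMMAND [ARGS...]
wait_until()
{
        limit="$1"
        shift
        elapsed=0
        while [ "$elapsed" -lt "$limit" ]; do
                # Stop polling as soon as the check command succeeds.
                if "$@" >/dev/null 2>&1; then
                        return 0
                fi
                sleep 1
                elapsed=$((elapsed + 1))
        done
        # One final attempt once the limit has been reached.
        "$@" >/dev/null 2>&1
}

# Succeeds once the local side of the DRBD resource (assumed to be named
# "r0") is no longer Primary. drbdadm role prints "local/peer", e.g.
# "Primary/Secondary". Errors are treated as "not yet safe".
drbd_not_primary()
{
        role=$(drbdadm role r0 2>/dev/null) || return 1
        case "$role" in
                Primary/*) return 1 ;;
                *) return 0 ;;
        esac
}
```

In do_stop, the wait would sit between `crm node standby` and the start-stop-daemon call, for example `wait_until 60 drbd_not_primary || echo "DRBD still Primary" >&2`. The 60-second limit is an arbitrary assumption; whether waiting like this actually prevents the split brain has not been tested here.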

Also, when nodes are put in standby like that, they don't automatically come back online when corosync starts, so they are left in standby after the machine boots.
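
A minimal sketch of how the start path could undo the standby set by do_stop, assuming Pacemaker's `cibadmin -Q` is a usable readiness check and that `crm node online` clears the standby flag for the local node. `bring_node_online` and its default 30-second limit are hypothetical.

```sh
#!/bin/sh
# Sketch only: bring_node_online is a hypothetical helper for do_start.
# Waits (up to LIMIT seconds, default 30) for the CIB to answer queries,
# then clears this node's standby flag. Usage: bring_node_online [LIMIT]
bring_node_online()
{
        limit="${1:-30}"
        tries=0
        # cibadmin -Q fails while the cluster stack is still starting up.
        until cibadmin -Q >/dev/null 2>&1; do
                tries=$((tries + 1))
                if [ "$tries" -ge "$limit" ]; then
                        echo "CIB not reachable; node left in standby" >&2
                        return 1
                fi
                sleep 1
        done
        crm node online
}
```

Note the trade-off: this makes every standby transient, so a node that an administrator deliberately put in standby would also come back online after a restart. A more careful version might record (for example in a hypothetical state file written by do_stop) whether the init script itself set the standby flag.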

Plus, it'd be better to check for -x, as opposed to -r.
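
The suggested check, sketched: `-x` tests exactly what the script goes on to do (execute crm), whereas `-r` would also pass for a file that is readable but not executable. `CRM_BIN` and `have_crm` are hypothetical names introduced for illustration; the default path is the one used in the init script above.

```sh
#!/bin/sh
# Sketch only: CRM_BIN and have_crm are hypothetical names.
# The default path matches the one in the init script.
CRM_BIN="${CRM_BIN:-/usr/sbin/crm}"

# Succeeds only if crm exists and is executable, not merely readable.
have_crm()
{
        [ -x "$CRM_BIN" ]
}
```

In do_stop, `if [ -r /usr/sbin/crm ]; then` would then become `if have_crm; then`.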

halfgaar (wiebe-halfgaar) wrote on 2010-01-13:

It might be useful to add that I'm testing a cluster setup on two old machines: one 400 MHz with 192 MB RAM and a slow 2 GB disk, and one AMD 1600+ with 512 MB RAM and a slow old 4 GB disk. Perhaps the problem only shows up on my machines because they are so slow. It's still a bug, of course, but it gives some additional insight.
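
If slow hardware is a factor, one relevant detail is the retry schedule in do_stop: `--retry=QUIT/5/QUIT/15` sends QUIT, waits up to 5 seconds, sends QUIT again and waits up to 15 more; if the process is still running at the end of the schedule, start-stop-daemon gives up with exit status 2. The sketch below makes those two waits configurable. `STOP_WAIT_FIRST`, `STOP_WAIT_SECOND` and `retry_schedule` are hypothetical names (the settings could, for instance, live in /etc/default/corosync); the defaults reproduce the current schedule. Whether longer waits avoid the split brain on these machines is an untested assumption.

```sh
#!/bin/sh
# Sketch only: hypothetical settings and helper.
# Prints a start-stop-daemon --retry schedule. Defaults reproduce the
# schedule currently used by do_stop (QUIT/5/QUIT/15).
retry_schedule()
{
        first="${STOP_WAIT_FIRST:-5}"
        second="${STOP_WAIT_SECOND:-15}"
        # Send QUIT and wait up to $first seconds, then send QUIT again
        # and wait up to $second seconds before giving up (status 2).
        echo "QUIT/$first/QUIT/$second"
}
```

Usage in do_stop: `start-stop-daemon --stop --quiet --retry="$(retry_schedule)" --pidfile "$PIDFILE"`.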

I'll also try to patch the init script to work around the pacemaker problem a bit.

Remote bug watches

redhat-bugs #525589 [CLOSED CURRENTRELEASE]