Galera

Bug #843752
Comment #3

Henrik Ingo (hingo) wrote on 2011-09-08: Re: [Bug 843752] Re: Galera doesn't detect that a node has diverged, same grastate.dat on nodes having different data

On Wed, Sep 7, 2011 at 4:51 PM, Alex Yurchenko
<email address hidden> wrote:
> I think you're gradually arriving at the understanding we have now. And
> this is good, because what you have reported as a bug is a useful
> feature which you have masterfully abused ;)

No, I understand the feature but this is still a bug.

> And not only that. For example you want to avoid state transfer
> altogether if you shut down and then restart an idle cluster (if you're
> doing bulk upgrade)

My proposal does that. In all cases, the UUID must only be changed if
a node is not idle.

> Conceptually, UUID:seqno pair is a database state identifier and
> _ideally_ should be persistently stored in the database, grastate file
> is just a workaround.

Actually, as far as I'm concerned if you store it in mysql/var and I
can access the value with SQL, it is part of the database. Since you
zero it out when the server is running, and write to it only on clean
shutdown, there are also no crash safety issues as far as I can see.

> And if you shut down the server at a given
> uuid:seqno, then when you start it, like it or not, it has the same
> state, and, therefore, should have the same ID.

This is correct.

> In the case that you provided you had three servers with the same
> initial database state and, naturally, when you started them they had
> the same initial uuid:seqno - that's where they left off. Starting one
> node, changing its database (making it most updated) and then shutting
> it down and starting another without prior synchronization with the most
> updated node is a grave misuse of a database cluster, and it is not
> specific to Galera cluster.

I can think of realistic scenarios where this could happen. For
instance, starting from a cleanly shut down cluster, you accidentally
copy-paste "-g gcomm://" to all nodes when you really wanted to start
a single cluster. Whether you then issue exactly the same number of
transactions on each node is up to chance, but it can happen.

I think I managed to create this kind of state early on in my testing,
which is how I found grastate.dat and then came up with this as a test
case.
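The scenario can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Node` class, `start_standalone`, and the idea that every commit advances seqno by exactly one are hypothetical simplifications, not Galera code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node: a state ID (uuid, seqno) plus the data it holds."""
    uuid: str
    seqno: int
    rows: list = field(default_factory=list)

    def commit(self, row: str) -> None:
        # Assumption: each committed transaction advances seqno by one.
        self.rows.append(row)
        self.seqno += 1

    def state_id(self) -> tuple:
        return (self.uuid, self.seqno)


def start_standalone(uuid: str, seqno: int) -> Node:
    # Hypothetical helper: the node bootstraps its own cluster
    # from the state ID saved at the last clean shutdown.
    return Node(uuid, seqno)


# Both nodes were shut down cleanly at the same state ID, then each
# one was mistakenly started as its own cluster.
a = start_standalone("cluster-uuid", 100)
b = start_standalone("cluster-uuid", 100)

# The same number of transactions, but different ones, on each node.
a.commit("row written on A")
b.commit("row written on B")

same_id = a.state_id() == b.state_id()
same_data = a.rows == b.rows
print(same_id, same_data)
```

In this model the two state IDs compare equal although the data differs, which is the divergence described above.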

> While there might be a way to protect against such a misuse (like
> incremental digest), we don't think we want to focus on it at the
> moment. In any case database state ID is a state ID, and changing it
> arbitrarily would defeat its purpose.

I agree that this is not a high priority bug, as it rarely happens and
is due to user action. My point is only that UUID:seqno does not
correctly identify two diverged states in all cases, even though that
is the intent.

Since you already calculate a checksum for each transaction that is
replicated, you could actually just XOR those into a checksum saved in
grastate.dat. It could replace or be used alongside seqno. This is
surely a much more correct approach than trying to detect changes in
cluster configuration and changing UUID as I proposed first. In fact,
it would even work correctly in the opposite situation: Suppose I have
two nodes in the same state, but not connected. I run the same
mysqldump against both. Then I connect them to each other to form a
cluster - since both seqno and the checksum match, Galera would
*correctly* note that the databases are in the same state.
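A minimal sketch of the XOR idea, under stated assumptions: `StateDigest` and `txn_checksum` are hypothetical names, and CRC32 stands in for whatever checksum is actually computed per replicated transaction.

```python
import zlib


def txn_checksum(payload: bytes) -> int:
    # Stand-in for the per-transaction checksum; CRC32 is an assumption.
    return zlib.crc32(payload)


class StateDigest:
    """Running state identifier: seqno plus XOR of transaction checksums."""

    def __init__(self, seqno: int = 0, digest: int = 0):
        self.seqno = seqno
        self.digest = digest

    def apply(self, payload: bytes) -> None:
        # Fold the checksum of each applied transaction into the digest.
        self.seqno += 1
        self.digest ^= txn_checksum(payload)

    def matches(self, other: "StateDigest") -> bool:
        return self.seqno == other.seqno and self.digest == other.digest


# Diverged case: same starting point, same number of transactions,
# but different transactions on each node.
node_a = StateDigest(seqno=100)
node_b = StateDigest(seqno=100)
node_a.apply(b"INSERT INTO t VALUES (1)")
node_b.apply(b"INSERT INTO t VALUES (2)")
diverged_detected = not node_a.matches(node_b)

# Opposite case: two unconnected nodes load the same dump.
dump = [b"CREATE TABLE t (id INT)", b"INSERT INTO t VALUES (1)"]
node_c = StateDigest()
node_d = StateDigest()
for stmt in dump:
    node_c.apply(stmt)
    node_d.apply(stmt)
same_state = node_c.matches(node_d)

print(diverged_detected, same_state)
```

One property of plain XOR worth keeping in mind: it does not depend on the order in which transactions are applied, and applying the same transaction twice cancels out of the digest. Pairing the digest with seqno, as suggested above, catches the second case but not the first.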

>> A possible tweak to my proposed solution would be for the selected
>> quorum to keep the same UUID, but the non-quorum partition would need
>> to
>> change UUID (if seqno increases).
>
> Note that by supplying a trivial cluster address you implicitly tell
> the node that it has the quorum (no other cluster members).

Yes, but it is an easy user error to make.

These new options should perhaps be a separate thread. Anyway:

>> --sst-force : Discard the grastate.dat on this node and do SST from
>> the primary component.
>> --sst-disable : Don't do SST. If grastate.dat does not match with the
>> primary component being joined, report error and shut down, but don't
>> delete my data.
>> --sst-ondemand or --sst-initial : Do SST if needed when joining
>> cluster.
>>
>> (Currently the last one is default behavior.)
>
> So I guess the last switch is not needed then.

There should still be a switch for the default as well. For instance,
I may set a non-default choice in my.cnf and then want to override it
by giving the default choice on the command line, or via SQL in the
mysql console (ok, that doesn't make sense in this case).

I now realize my style is wrong above. MySQL style of course would be
to have a variable, such as:
sst-on-startup=force
sst-on-startup=disable
sst-on-startup=ondemand

(ondemand is a bit misleading; "ifneeded" is perhaps more correct.
ondemand is also used for the query cache setting, but there the
meaning is different: the demand comes from the user, not from a
state.)
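The intended semantics of the three values can be sketched as a small decision function. This is purely illustrative: `sst-on-startup` is only a proposal in this thread, and `startup_action` and its return values are hypothetical names.

```python
def startup_action(mode: str, local_state: tuple, cluster_state: tuple) -> str:
    """Decide what a joining node does, given the proposed sst-on-startup mode.

    local_state and cluster_state are (uuid, seqno) pairs.
    Returns one of "sst", "join", or "abort".
    """
    if mode not in ("force", "disable", "ifneeded"):
        raise ValueError(f"unknown sst-on-startup value: {mode!r}")

    if mode == "force":
        # Discard the local saved state and always take a full state transfer.
        return "sst"

    if local_state == cluster_state:
        # States match: join without any state transfer.
        return "join"

    if mode == "disable":
        # Mismatch: report an error and shut down, keeping local data intact.
        return "abort"

    # "ifneeded" (the current default behavior): transfer state on mismatch.
    return "sst"


primary = ("cluster-uuid", 120)
stale = ("cluster-uuid", 100)

print(startup_action("force", primary, primary))
print(startup_action("disable", stale, primary))
print(startup_action("ifneeded", stale, primary))
print(startup_action("ifneeded", primary, primary))
```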

> You can obviously achieve the --sst-force functionality by simply
> removing grastate.dat. --sst-disable does not seem to have a compelling
> use case, since you normally can't expect states to match on a working
> cluster. But we might consider implementing such switches if there is
> some real need for this.

I'm able to do that. I just thought that if it happens commonly, a
command line switch would be more user friendly.

NDB nodes have an --initial parameter, which deletes all data on that
node before it joins the cluster.

henrik

--
<email address hidden>
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559
