mysql shutdown is very slow in Joining state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Galera |
New
|
Undecided
|
Unassigned | ||
MySQL patches by Codership |
New
|
Undecided
|
Unassigned | ||
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC |
New
|
Undecided
|
Unassigned |
Bug Description
If a node is in "Joining" state and we execute mysqladmin shutdown, it performs very slow.
It seems that the node still tries to apply all events from cache file, but fails, and throws a lot of spam into error.log:
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026734, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026778, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026823, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026867, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026913, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026956, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7027011, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027059, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027103, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027145, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027197, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027235, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027296, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027335, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027389, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027435, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027480, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027529, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027574, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027617, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027668, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027719, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027759, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027808, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027849, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027896, -77 (File descriptor in bad state)
130129 22:47:50 [Warning] WSREP: Failed to report last committed 7027937, -77 (File descriptor in bad state)
and it goes through all gcache.page.0000* files.
I think it is better to just cancel all apply threads when there is shutdown signal.
1) That is from galera worked thread in galera_ service_ thd.cpp
======= ======= =======
ssize_ t const ret(st- >gcs_.set_ last_applied( data.last_ committed_ ));
if (!exit)
{
if (data.act_ & A_LAST_COMMITTED)
{
if (gu_unlikely(ret < 0))
log_ warn << "Failed to report last committed "
<< data.last_ committed_ << ", " << ret
<< " (" << strerror (-ret) << ')';
/ / @todo: figure out what to do in this case
else
log_ debug << "Reported last committed: "
<< data.last_ committed_ ;
{
}
{
}
}
}
==================
kill_mysql and kill_server in turn call wsrep_stop_ replication, client_ connections( TRUE), but the TRUE
which call wsrep_close_
here indicates a graceful shutdown where it waits for all threads
to exit and then there is also
/* wait until appliers have stopped */ wait_appliers_ close(thd) ;
wsrep_
So, may be something can be changed here if possible. (The client_ connections( FALSE) to not
unireg_abort calls wsrep_close_
wait for close).
2) Having said that, I noticed that after it transitions to
JOINER and is undergoing IST, it doesn't handle SIGQUIT/SIGKILL
signals well, until it has reached a JOINED state.
You can see it here -- http:// sprunge. us/hPGU -- during " InnoDB:
DEBUG: update_statistics for sbtest/sbtest1. " it won't stop with
mysqladmin shutdown or SIGQUIT.