Node starting (w/o bootstrap) that can't find primary just hangs

Bug #1413258 reported by Jay Janssen
This bug affects 1 person
Affects                  Status        Importance  Assigned to  Milestone
Percona XtraDB Cluster   (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)
  5.5                    Invalid       Undecided   Unassigned
  5.6                    Fix Released  Undecided   Unassigned

Bug Description

If I start a node normally and there is no cluster to find, the node waits in NON-PRIMARY until something happens. I've discovered that

wsrep_provider_options = "pc.wait_prim = false"

does not modify this behavior.

I cannot log into the node to do anything (such as setting pc.bootstrap=true), nor can I seem to shut such a node down gracefully. The only signal it seems to respect is kill -9:

[root@node3 mysql]# ps axf
...
18979 ? Ss 0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
19382 ? Sl 0:01 \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/u
18981 ? Ss 0:00 /bin/bash -ue /usr/bin/mysql-systemd start-post 18979
19874 ? S 0:00 \_ sleep 1

[root@node3 mysql]# killall mysqld
18979 ? Ss 0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
19382 ? Sl 0:01 \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/u
18981 ? Ss 0:00 /bin/bash -ue /usr/bin/mysql-systemd start-post 18979
19951 ? S 0:00 \_ sleep 1

[root@node3 mysql]# killall -9 mysqld
...
150121 15:13:28 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended

This behavior seems crappy all the way around:
1. We should have a way to modify this behavior
2. If we're going to the trouble of starting mysqld, we should be able to log in and set it primary
3. We should have a way to gracefully terminate such a process

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Jay,

Tested with latest galera 3.x HEAD:

/pxc56d/bin/mysqld --defaults-file=/pxc56/etc/my.cnf.local --basedir=/pxc56d --user=mysql --wsrep-cluster-address=gcomm://NOTEXIST --wsrep-cluster-name=NOCLUSTER --wsrep-debug --skip-grant-tables
2015-01-26 12:14:54 0 [Warning] WSREP: wsrep_sst_receive_address is set to '127.0.0.1:4001' which makes it impossible for another host to reach this one. Please set it to the address which this node can be connected at by other cluster members.
2015-01-26 12:14:54 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-01-26 12:14:54 55221 [Note] WSREP: Setting wsrep_ready to 0
2015-01-26 12:14:54 55221 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-01-26 12:14:54 55221 [Note] WSREP: wsrep_load(): loading provider library '/pxc56/lib/libgalera_smm.so'
2015-01-26 12:14:54 55221 [Note] WSREP: wsrep_load(): Galera 3.9(rXXXX) by Codership Oy <email address hidden> loaded successfully.
2015-01-26 12:14:54 55221 [Note] WSREP: CRC-32C: using hardware acceleration.
2015-01-26 12:14:54 55221 [Note] WSREP: Found saved state: 5d314bdc-855b-11e4-9699-bbe796316f18:-1
2015-01-26 12:14:54 55221 [Note] WSREP: Passing config to GCS: base_host = 127.0.0.1; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.version = 1; evs.view_forget_timeout = PT24H; gcache.dir = /pxc56/datadir/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /pxc56/datadir//galera.cache; gcache.page_size = 128M; gcache.size = 500M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://127.0.0.1:4010; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum =
2015-01-26 12:14:54 55221 [Note] WSREP: Service thread queue flushed.
2015-01-26 12:14:54 55221 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2015-01-26 12:14:54 55221 [Note] WSREP: wsrep_sst_grab()
2015-01-26 12:14:54 55221 [Note] WSREP: Start replication
2015-01-26 12:14:54 55221 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2015-01-26 12:14:54 55221 [Note] WSREP: protonet asio version 0
2015-01-26 12:14:54 55221 [Note] WSREP: Using CRC-32C for message checksums.
2015-01-26 12:14:54 55221 [Note] WSREP: backend: asio
2015-01-26 12:14:54 55221 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2015-01-26 12:14:54 55221 [Note] WSREP: restore pc from disk failed
2015-01-26 12:14:54 55221 [Note] WSREP: GMCast version 0
2015-01-26 12:14:55 55221 [Warning] WSREP: Failed to r...

Jay Janssen (jay-janssen) wrote : Re: [Bug 1413258] Re: Node starting (w/o bootstrap) that can't find primary just hangs

Raghu,
  Is the fix in experimental at this point?


Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Jay,

It is not, but it can be added easily. Let me know which package format you need it for: rpm or deb.

Also note that the build will include the other fixes that are already on top of 3.8, so it will be 3.8 plus a whole lot of other fixes, not just 3.8 plus this fix.

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1792
