Vladimir, I have tested changing the resource binary in the Pacemaker configuration to mysqld_safe and back to mysqld - there is no noticeable difference in cluster recovery time.
My concern is that we have to wait for the p_mysqld resource start timeout before MySQL recovery begins. It would be great if we could improve the OCF script to start cluster recovery immediately, without waiting for the start timeout of a MySQL instance that was never supposed to start (because no other MySQL instances exist in the Galera cluster for it to join). In my view, the current behaviour is not very logical (and it has therefore drawn criticism from the customer).
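To make the request concrete, here is a rough sketch of the fail-fast check I have in mind for the agent's start path. This is only an illustration, not the actual agent code: should_attempt_join and the peer list passed to it are hypothetical stand-ins for whatever mechanism the OCF script really uses (e.g. crm_node or attrd queries) to detect live Galera members.

```shell
#!/bin/sh
# Hypothetical fail-fast check for a Galera OCF start operation.
# peers_running: space-separated list of peers currently reporting a
# live mysqld (in a real agent this would come from cluster attributes).
should_attempt_join() {
    peers_running="$1"
    if [ -z "$peers_running" ]; then
        # No live Galera members to join: fail immediately instead of
        # letting mysqld wait out the full resource start timeout.
        echo "no live cluster members; failing fast" >&2
        return 1   # would map to OCF_ERR_GENERIC in a real agent
    fi
    return 0       # would map to the normal join path (SST/IST)
}

# Demonstration of both branches:
if should_attempt_join "" 2>/dev/null; then echo "join"; else echo "fast-fail"; fi
if should_attempt_join "node-6 node-7"; then echo "join"; else echo "fast-fail"; fi
```

With such a check, the scenario in the logs below (node-5 killed while no other member is up) would return control to Pacemaker within seconds rather than after the full start timeout.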
I also know of cases where a customer increased the p_mysqld resource start timeout to as much as 20 minutes because of a large MySQL database (usually a Zabbix database), so that replication of the entire database could complete during MySQL cluster recovery. In such cases restoring the database can take quite a significant time.
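For reference, the kind of timeout increase mentioned above is typically applied with pcs along these lines; this is just a CLI fragment illustrating the 20-minute example from the text, and the exact value must be tuned to the actual replication time of the database in question.

```shell
# Inspect the current configuration of the resource, including
# operation timeouts (older pcs versions use "show", newer "config").
pcs resource show p_mysqld

# Raise the start timeout to 20 minutes (1200s), as some customers do
# for large databases whose full replication takes that long.
pcs resource update p_mysqld op start timeout=1200s
```

Note that raising this timeout makes the wasted wait described above even longer, which is exactly why a fail-fast check in the OCF script would help.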
Please let me know whether it is possible to improve the OCF script and fix the behaviour described above.
I have also gathered "pcs status" output to show clearly what this behaviour looks like:
Thu Dec 29 11:27:27 UTC 2016
------> MySQL killed on all nodes
Thu Dec 29 11:27:37 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]
Thu Dec 29 11:28:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-6.test.domain.local node-7.test.domain.local ]
Stopped: [ node-5.test.domain.local ]
Thu Dec 29 11:28:17 UTC 2016
-----> MySQL started only on node-5
Thu Dec 29 11:28:24 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
Started: [ node-6.test.domain.local ]
Stopped: [ node-5.test.domain.local ]
Thu Dec 29 11:28:39 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
p_mysqld (ocf::fuel:mysql-wss): FAILED node-6.test.domain.local
Stopped: [ node-5.test.domain.local ]
Thu Dec 29 11:28:55 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
Started: [ node-6.test.domain.local ]
Stopped: [ node-5.test.domain.local ]
Thu Dec 29 11:29:11 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-6.test.domain.local node-7.test.domain.local ]
Stopped: [ node-5.test.domain.local ]
Thu Dec 29 11:33:03 UTC 2016
------> MySQL stopped on node-5 by Pacemaker
Clone Set: clone_p_mysqld [p_mysqld]
Stopped: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]
Thu Dec 29 11:33:13 UTC 2016
------> Pacemaker is starting MySQL one by one
Thu Dec 29 11:33:35 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-5.test.domain.local ]
Stopped: [ node-6.test.domain.local node-7.test.domain.local ]
Thu Dec 29 11:34:06 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-5.test.domain.local node-7.test.domain.local ]
Stopped: [ node-6.test.domain.local ]
Thu Dec 29 11:34:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]
------> MySQL started and everything is OK