Fuel for OpenStack

Bug #1652008
Comment #2

Aleksei Shishkin (ashishkin) wrote on 2016-12-29:

Vladimir, I tested switching the resource binary in the Pacemaker configuration to mysqld_safe and back to mysqld; there is no noticeable difference in cluster recovery time.

My concern is that we have to wait for the p_mysqld resource start timeout to expire before MySQL recovery begins. It would be great if we could improve the OCF script so that cluster recovery starts immediately, without waiting for the start timeout of a MySQL instance that was never going to start (because there are no other MySQL instances in the Galera cluster for it to join). From my point of view, the current behaviour is not very logical, and that is why it drew criticism from the customer.
I also know of cases where the customer increased the p_mysqld resource start timeout to as much as 20 minutes because of a large MySQL database (usually caused by the Zabbix database), so that replication of the entire database could complete during MySQL cluster recovery. In such cases, restoring the database can take a significant amount of time.

Please let me know whether it is possible to improve the OCF script and fix the behavior described above.

I have also gathered «pcs status» output to show clearly what this behavior looks like:

Thu Dec 29 11:27:27 UTC 2016
-----> MySQL killed on all nodes

Thu Dec 29 11:27:37 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

Thu Dec 29 11:28:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-6.test.domain.local node-7.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:17 UTC 2016
-----> MySQL started only on node-5

Thu Dec 29 11:28:24 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:39 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-6.test.domain.local
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:55 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:29:11 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-6.test.domain.local node-7.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:33:03 UTC 2016
------> MySQL stopped on node-5.test.domain.local by Pacemaker

Clone Set: clone_p_mysqld [p_mysqld]
     Stopped: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

Thu Dec 29 11:33:13 UTC 2016
-----> Pacemaker is starting MySQL one by one

Thu Dec 29 11:33:35 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local ]
     Stopped: [ node-6.test.domain.local node-7.test.domain.local ]

Thu Dec 29 11:34:06 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-7.test.domain.local ]
     Stopped: [ node-6.test.domain.local ]

Thu Dec 29 11:34:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

------> MySQL started and everything is OK
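
To put numbers on the timeline above, here is a minimal Python sketch that computes the gaps between the annotated events. The timestamps are copied from the samples above; the event labels and the helper names (`parse`, `elapsed_seconds`, `intervals`) are only illustrative. Reading the gap between MySQL starting on node-5 and Pacemaker stopping it as the start timeout wait is an interpretation of the description above, not something the output itself confirms.

```python
from datetime import datetime

# Timestamp format used in the samples above, e.g. "Thu Dec 29 11:27:27 UTC 2016".
FMT = "%a %b %d %H:%M:%S UTC %Y"

# Annotated events from the timeline (labels are paraphrased).
EVENTS = [
    ("MySQL killed on all nodes", "Thu Dec 29 11:27:27 UTC 2016"),
    ("MySQL started only on node-5", "Thu Dec 29 11:28:17 UTC 2016"),
    ("MySQL stopped on node-5 by Pacemaker", "Thu Dec 29 11:33:03 UTC 2016"),
    ("Pacemaker is starting MySQL one by one", "Thu Dec 29 11:33:13 UTC 2016"),
    ("MySQL started on all nodes", "Thu Dec 29 11:34:16 UTC 2016"),
]


def parse(stamp):
    """Parse one timestamp line into a datetime."""
    return datetime.strptime(stamp, FMT)


def elapsed_seconds(start, end):
    """Whole seconds between two timestamp strings."""
    return int((parse(end) - parse(start)).total_seconds())


def intervals(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    return [
        (a_label, b_label, elapsed_seconds(a_stamp, b_stamp))
        for (a_label, a_stamp), (b_label, b_stamp) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for a_label, b_label, secs in intervals(EVENTS):
        print(f"{secs:4d} s  {a_label} -> {b_label}")
    total = elapsed_seconds(EVENTS[0][1], EVENTS[-1][1])
    print(f"{total:4d} s  total, from kill to full recovery")
```

With these timestamps, the full recovery takes 409 seconds, of which 286 seconds pass between MySQL starting on node-5 and Pacemaker stopping it; the actual one-by-one start takes only 63 seconds.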
