Full reassemble of Galera cluster fails in case of epoch divergence

Bug #1388779 reported by Dennis Dmitriev
This bug affects 3 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  High        Bogdan Dobrelya
5.1.x               Invalid        High        Bogdan Dobrelya
6.0.x               Won't Fix      High        Bogdan Dobrelya
6.1.x               Won't Fix      High        Bogdan Dobrelya
7.0.x               Won't Fix      High        Bogdan Dobrelya
8.0.x               Won't Fix      High        Bogdan Dobrelya
Mitaka              Won't Fix      High        Sergii Golovatiuk
Newton              Won't Fix      High        Bogdan Dobrelya

Bug Description

Regularly observed on the system test 'ceph_ha_restart', this time on: http://jenkins-product.srt.mirantis.net:8080/view/6.0_swarm/job/6.0_fuelmain.system_test.ubuntu.thread_3/15/

Steps to reproduce:
            1. Create cluster (Ubuntu, nova-network flat-dhcp, Ceph for images and volumes)
            2. Add 3 nodes with controller and ceph OSD roles
            3. Add 1 node with ceph OSD role
            4. Add 2 nodes with compute and ceph OSD roles
            5. Deploy the cluster
            6. Reset all nodes.
            7. Check cluster status with 'crm status' and pacemaker logs on all controllers.

If mysql fails to start after the nodes are reset, pacemaker hangs waiting for the mysql status for 475 sec. This delays cluster reassembly for other resources such as rabbitmq.

Related bug about rabbitmq: https://bugs.launchpad.net/fuel/+bug/1383247

More detailed information collected while the CI test was running:

`crm status` right after the nodes were rebooted (Nov 3 09:43:15 2014): http://paste.openstack.org/show/128612/

Pacemaker logs taken from ssh session:
- from controller-1: http://paste.openstack.org/show/128613/
- from controller-2: http://paste.openstack.org/show/128614/
- from controller-3: http://paste.openstack.org/show/128618/

`crm status` before the timeout of the test (Nov 3 09:49:12 2014): http://paste.openstack.org/show/128619/

The pacemaker logs from controller-2 and controller-3 contain the following warning:
"<28>Nov 3 09:48:19 node-6 lrmd[2021]: warning: operation_finished: p_mysql_start_0:4218 - timed out after 475000ms"

Pacemaker performed no operations for ~8 minutes on all controllers until this timeout appeared, so the rabbitmq resource wasn't processed either.
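
The timeout warning quoted above can be located in a pacemaker log with a small script. This is only a minimal sketch, not part of Fuel or Pacemaker: the function name `find_lrmd_timeouts` and the regular expression are assumptions derived solely from the log lines quoted in this report.

```python
import re

# Pattern derived from the lrmd warnings quoted in this report (an assumption,
# not an official Pacemaker log format specification).
TIMEOUT_RE = re.compile(
    r"(?P<host>\S+) lrmd\[\d+\]:\s+warning: operation_finished: "
    r"(?P<op>\S+):(?P<pid>\d+) - timed out after (?P<ms>\d+)ms"
)


def find_lrmd_timeouts(lines):
    """Return (host, operation, pid, timeout_seconds) for each timeout warning."""
    found = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            found.append(
                (
                    match.group("host"),
                    match.group("op"),
                    int(match.group("pid")),
                    int(match.group("ms")) / 1000,
                )
            )
    return found


if __name__ == "__main__":
    # Sample line copied from the warning quoted in the bug description.
    sample = [
        "<28>Nov 3 09:48:19 node-6 lrmd[2021]: warning: operation_finished: "
        "p_mysql_start_0:4218 - timed out after 475000ms",
    ]
    for host, op, pid, seconds in find_lrmd_timeouts(sample):
        print(f"{host}: {op} (PID {pid}) timed out after {seconds:.0f} s")
```

Running it over pacemaker.log from each controller would show which start operations hit the 475 sec limit and when.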

Dennis Dmitriev (ddmitriev) wrote :
Changed in fuel:
importance: Undecided → High
tags: added: ha pacemaker
Bogdan Dobrelya (bogdando) wrote :

There are no "timed out after 475000ms" log records in the snapshot. Please make sure you created the snapshot properly.

Changed in fuel:
status: New → Incomplete
Dennis Dmitriev (ddmitriev) wrote :

Looks like some logs weren't collected after "2014-11-03 09:40:" despite the fact that the diagnostic snapshot was made at 09:50.

I've reverted the env and manually collected all logs from nodes (node-3 and node-5 are not loaded).
Please look at the pacemaker.log on the controller nodes node-1, node-4 and node-6.

Changed in fuel:
status: Incomplete → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

According to mysqld.log from node-6, something went wrong, and the Galera cluster then kept trying to reassemble for 8 minutes after pcs ordered p_mysql to start (http://pastebin.com/ca3hEccL, see FATAL):

<27>Nov 3 09:41:05 node-6 mysqld: 2014-11-03 09:41:05 6763 [ERROR] WSREP: Local state seqno (8961) is greater than group seqno (8955): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)

From then on, it only reported "MySQL is not running" / "mysqld.pid of MySQL server not found" until pcs timed out at 9:48:19.
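
The condition named in the WSREP error above (local state seqno greater than group seqno) can be illustrated with a short sketch. It assumes that grastate.dat consists of `key: value` lines with `#` comments; the function names and the sample file contents, including the UUID, are hypothetical, and only the two seqno values come from the log line.

```python
def parse_grastate(text):
    """Parse 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    if "seqno" in state:
        state["seqno"] = int(state["seqno"])
    return state


def states_diverged(local_seqno, group_seqno):
    """Mirror the condition in the WSREP error: local seqno ahead of the group."""
    return local_seqno > group_seqno


# Hypothetical file contents; the UUID is made up, the seqno is from the log.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    6f3a1c2e-0000-0000-0000-000000000000
seqno:   8961
cert_index:
"""

if __name__ == "__main__":
    local = parse_grastate(SAMPLE)
    # 8955 is the group seqno reported in the same log line.
    print(states_diverged(local["seqno"], 8955))
```

A node in this state refuses to join, which is consistent with mysqld exiting and the start operation later timing out.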

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergii Golovatiuk (sgolovatiuk)
summary: - Pacemaker freezes while it is waiting a resource status
+ Full reassemble of Galera cluster failure
Andrey Sledzinskiy (asledzinskiy) wrote : Re: Full reassemble of Galera cluster failure

Problem is also reproduced on 5.1.1 ISO
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "37"
  build_id: "2014-11-24_21-01-00"
  astute_sha: "dade74af41d4972fe05a1c16ae1db2a2e60c6715"
  fuellib_sha: "444339cae21c369c1d95e96c1059d4099077138e"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "24b956d739e8e9d8f728701522c6fa8364526c45"
  fuelmain_sha: "3797f2f2eff42812715840293a618de73fdef26f"

pacemaker log on node-2:
<29>Nov 25 06:04:17 node-2 attrd[1818]: notice: attrd_perform_update: Sent update 21: pingd=1000
<28>Nov 25 06:11:40 node-2 lrmd[1817]: warning: child_timeout_callback: p_mysql_start_0 process (PID 3531) timed out
<28>Nov 25 06:11:40 node-2 lrmd[1817]: warning: operation_finished: p_mysql_start_0:3531 - timed out after 475000ms
<29>Nov 25 06:11:41 node-2 crmd[1820]: notice: process_lrm_event: LRM operation vip__public_old_start_0 (call=70, rc=0, cib-update=24, confirmed=true) ok
<28>Nov 25 06:19:35 node-2 lrmd[1817]: warning: child_timeout_callback: p_mysql_start_0 process (PID 17034) timed out
<28>Nov 25 06:19:35 node-2 lrmd[1817]: warning: operation_finished: p_mysql_start_0:17034 - timed out after 475000ms

Logs are attached

Andrey Sledzinskiy (asledzinskiy) wrote :
Sergii Golovatiuk (sgolovatiuk) wrote :

According to galeracluster.com/2013/10/order-of-business/, this case cannot be resolved automatically: the nodes have committed completely different transactions, so their diverged states cannot be reconciled.

Mike Scherbakov (mihgen) wrote :

Sergii,
so what do we do then? Recommend anything in docs/release notes, or what?

Sergii Golovatiuk (sgolovatiuk) wrote :

I am going to produce a patch today that fixes the issue.

Changed in fuel:
status: Confirmed → In Progress
Mike Scherbakov (mihgen) wrote :

Sergii, I do not see a patch in the bug. Did you forget to link it to the bug?..

Vladimir Kuklin (vkuklin) wrote :
summary: - Full reassemble of Galera cluster failure
+ Full reassemble of Galera cluster fails in case epoch divergence
summary: - Full reassemble of Galera cluster fails in case epoch divergence
+ Full reassemble of Galera cluster fails in case of epoch divergence
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/139596
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=134f70e8332d9a52f17565e23e8b14f092efb735
Submitter: Jenkins
Branch: master

commit 134f70e8332d9a52f17565e23e8b14f092efb735
Author: Sergii Golovatiuk <email address hidden>
Date: Fri Dec 5 11:27:12 2014 +0100

    Fix MySQL destructive test

    * I7f50e32ced83af8f32fd7907cb5fc723055da121 separated pid directories for
      Pacemaker managed resources moving PID files to $HA_RSCTMP/$__SCRIPT_NAME
      This patch changes test to use name rather than PID
    * Change pkill to pkill -x to kill mysqld rather than all processes
      that match the mysqld pattern

    Closes-Bug: 1399605
    Related-Bug: 1388779
    Change-Id: I34d540200f0ce2ff17b30e20a43f4d13c86b7492
    Signed-off-by: Sergii Golovatiuk <email address hidden>
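
The difference between `pkill` and `pkill -x` mentioned in this commit message can be modeled in a few lines. This is an illustrative sketch only: `pkill_matches` and the process list are hypothetical, and Python's `re` stands in for the extended regular expressions that pkill uses, which behave the same for a literal pattern such as `mysqld`.

```python
import re


def pkill_matches(pattern, process_names, exact=False):
    """Model which process names a pkill pattern would select.

    exact=False models plain `pkill PATTERN` (regex found anywhere in the name);
    exact=True models `pkill -x PATTERN` (the whole name must match).
    """
    matcher = re.fullmatch if exact else re.search
    return [name for name in process_names if matcher(pattern, name)]


if __name__ == "__main__":
    # Hypothetical process names on a controller node.
    names = ["mysqld", "mysqld_safe", "sshd"]
    print(pkill_matches("mysqld", names))
    print(pkill_matches("mysqld", names, exact=True))
```

With the exact match, a wrapper such as `mysqld_safe` is no longer selected along with `mysqld`.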

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/137105
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2447e44989e3af8dc1b4f4652b1180f25cb5322b
Submitter: Jenkins
Branch: master

commit 2447e44989e3af8dc1b4f4652b1180f25cb5322b
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Nov 25 16:43:00 2014 +0100

    Kill xtrabackup spawned processes

    * Send KILL signals to all processes spawned by mysqld
    * Refactor OCF script to make it more simple and more understandable

    Closes-Bug: 1388779

    Change-Id: Ie3f6149d3f97f5e8cf3f1f3329b9e335e551b3e6
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139674

Changed in fuel:
status: Fix Committed → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/139674
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0524973e9a4470a8269542c8327dfaec750831b3
Submitter: Jenkins
Branch: master

commit 0524973e9a4470a8269542c8327dfaec750831b3
Author: Sergii Golovatiuk <email address hidden>
Date: Fri Dec 5 17:19:28 2014 +0100

    Fix type in pkill function

    In Different --pgroup behaves differently. -g behaves exactly the same

    Change-Id: I3254f333b9caf3d417ec595896c6695fad617f1e
    Closes-Bug: 1388779

Changed in fuel:
status: Confirmed → Fix Committed
Sergii Golovatiuk (sgolovatiuk) wrote :

This issue was affected by an issue in the corosync package. It doesn't affect 5.1.

Bogdan Dobrelya (bogdando) wrote :

This issue can be easily reproduced by recurring network partitions; see the duplicate bug details. It cannot be fixed, AFAICT; perhaps it should be documented as a known issue. Only manual recovery is possible.

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/313273

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/312911
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=03991059adcc99989431c27f38c5ffa4ae35b486
Submitter: Jenkins
Branch: master

commit 03991059adcc99989431c27f38c5ffa4ae35b486
Author: Bogdan Dobrelya <email address hidden>
Date: Thu May 5 13:43:24 2016 +0200

    Rework SST check, fix possible masters search

    * Fix racing of monitoring with SST
    * Fix printf multilines sorting
      Expected: printf -- '%s\n' ${a} | sort -u (returns a sorted multiline)
      Actual: printf -- '%s\n' "$a" | sort -u (returns a single string)
    * Fix possible masters search, by the greatest SEQNO found for a
      majority UUID
    (These block each other in CI and must be fixed at once)

    Closes-bug: #1574999
    Closes-bug: #1578278
    Closes-bug: #1388779

    Change-Id: I3d0d376e6bef3ccc3e738731b71f4dd60a59e653
    Signed-off-by: Bogdan Dobrelya <email address hidden>
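
The "possible masters search" rule named in this commit message (greatest SEQNO found for a majority UUID) might look roughly like the sketch below. It is not the OCF script's code: the function name, the input shape, and the node values are hypothetical, and "majority UUID" is interpreted here simply as the most common UUID among the nodes.

```python
from collections import Counter


def pick_bootstrap_candidate(node_states):
    """Pick the node with the greatest seqno among nodes sharing the majority UUID.

    node_states maps a node name to a (uuid, seqno) tuple.
    Returns None when no node states are given.
    """
    if not node_states:
        return None
    # Count how many nodes report each cluster state UUID.
    uuid_counts = Counter(uuid for uuid, _ in node_states.values())
    majority_uuid, _ = uuid_counts.most_common(1)[0]
    # Keep only nodes that belong to the majority UUID.
    candidates = {
        node: seqno
        for node, (uuid, seqno) in node_states.items()
        if uuid == majority_uuid
    }
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical values; the UUID strings are placeholders.
    states = {
        "node-1": ("uuid-a", 8955),
        "node-4": ("uuid-a", 8955),
        "node-6": ("uuid-a", 8961),
    }
    print(pick_bootstrap_candidate(states))
```

A node reporting a different UUID is ignored even if its seqno is higher, since seqnos are only comparable within the same cluster state UUID.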

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/313273
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=0197b09fc6ef25657a81a8398b6da0ea1b0603af
Submitter: Jenkins
Branch: stable/mitaka

commit 0197b09fc6ef25657a81a8398b6da0ea1b0603af
Author: Bogdan Dobrelya <email address hidden>
Date: Thu May 5 13:43:24 2016 +0200

    Rework SST check, fix possible masters search

    * Fix racing of monitoring with SST
    * Fix printf multilines sorting
      Expected: printf -- '%s\n' ${a} | sort -u (returns a sorted multiline)
      Actual: printf -- '%s\n' "$a" | sort -u (returns a single string)
    * Fix possible masters search, by the greatest SEQNO found for a
      majority UUID
    (These block each other in CI and must be fixed at once)

    Fuel-CI: disable

    Closes-bug: #1574999
    Closes-bug: #1578278
    Closes-bug: #1388779

    Change-Id: I3d0d376e6bef3ccc3e738731b71f4dd60a59e653
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 03991059adcc99989431c27f38c5ffa4ae35b486)
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    Signed-off-by: Sergii Golovatiuk <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/315989

tags: added: tech-debt
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/316802

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/317978

Bogdan Dobrelya (bogdando) wrote :

This is a corner case that requires manual resolution; it cannot be fixed automatically without data loss. Such decisions must be made by ops, and the recovery steps must be done manually.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/8.0)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/317978

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/7.0)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/7.0
Review: https://review.openstack.org/316802

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/6.1)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/315989

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/374219

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/374219
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f9a2d479f3687157d2b17a927a09ce5f995522d6
Submitter: Jenkins
Branch: stable/7.0

commit f9a2d479f3687157d2b17a927a09ce5f995522d6
Author: Denis Puchkin <email address hidden>
Date: Wed Sep 21 17:38:54 2016 +0300

    Backport mysql OCF from stable/mitaka

    backport mysql ocf script from stable/mitaka

    Closes-bug: #1524826
    Closes-bug: #1542256
    Closes-bug: #1572239
    Closes-bug: #1572557
    Closes-bug: #1572601
    Closes-bug: #1574747
    Closes-bug: #1574497
    Closes-bug: #1576244
    Closes-bug: #1574999
    Closes-bug: #1578278
    Closes-bug: #1388779
    Closes-bug: #1574999
    Closes-bug: #1576244
    Closes-bug: #1583173
    Closes-bug: #1585125

    Change-Id: I1cc6f95884a8fbd5c3418ede89bdf9ec6864bdc8

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/377597

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/377597
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b3873f5f5a0bb1526b1269f163223ae48d6e21f5
Submitter: Jenkins
Branch: stable/8.0

commit b3873f5f5a0bb1526b1269f163223ae48d6e21f5
Author: Denis Puchkin <email address hidden>
Date: Tue Sep 27 13:20:25 2016 +0300

    Backport mysql OCF from stable/mitaka

    backport mysql ocf script from stable/mitaka

    Closes-bug: #1524826
    Closes-bug: #1542256
    Closes-bug: #1572239
    Closes-bug: #1572557
    Closes-bug: #1572601
    Closes-bug: #1574747
    Closes-bug: #1574497
    Closes-bug: #1576244
    Closes-bug: #1574999
    Closes-bug: #1578278
    Closes-bug: #1388779
    Closes-bug: #1574999
    Closes-bug: #1576244
    Closes-bug: #1583173
    Closes-bug: #1585125

    Change-Id: I1cc6f95884a8fbd5c3418ede89bdf9ec6864bdc8
