Removing one ovn-central unit doesn't cluster/leave SB and NB clusters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
charm-interface-ovsdb | New | Undecided | Martin Kalcok |
charm-ovn-central | Fix Committed | High | Martin Kalcok |
Bug Description
charm ovn-central rev. 7
After removing one ovn-central unit, its server was not removed from the cluster.
Node had IP: 10.10.240.102
# ovs-appctl -t /var/run/
670a
Name: OVN_Southbound
Cluster ID: 5ff3 (5ff30308-
Server ID: 670a (670afac0-
Address: ssl:10.
Status: cluster member
Role: leader
Term: 7537
Leader: self
Vote: self
Election timer: 4000
Log: [54683521, 54683813]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->ef67 (->e7aa) ->3d49 <-3d49 <-ef67
Servers:
670a (670a at ssl:10.
ef67 (ef67 at ssl:10.
e7aa (e7aa at ssl:10.
3d49 (3d49 at ssl:10.
We had to remove it manually with:
# ovs-appctl -t /var/run/
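To spot a stale member in status output like the one above, the entries under `Servers:` can be listed and compared against the addresses of the units that still exist. The following is a minimal sketch, assuming the status text is piped on stdin. `list_servers` is a hypothetical helper name, and the sample addresses are invented for illustration because the addresses in the report are truncated.

```shell
#!/bin/sh
# Hypothetical helper: print "<server-id> <address>" for every entry in the
# "Servers:" section of cluster status text read from stdin.
list_servers() {
    awk '
        /^Servers:/ { in_servers = 1; next }   # entries follow this header
        in_servers && NF >= 4 {
            addr = $4
            gsub(/[()]/, "", addr)             # strip the closing parenthesis
            print $1, addr
        }
    '
}

# Illustrative input only: IDs as in the report, addresses made up.
sample='Status: cluster member
Servers:
    670a (670a at ssl:192.0.2.1:6644) (self)
    ef67 (ef67 at ssl:192.0.2.2:6644)
    e7aa (e7aa at ssl:192.0.2.3:6644)'

printf '%s\n' "$sample" | list_servers
```

Any address in that list that no longer belongs to a live unit is a candidate for manual removal.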
summary:
- Removing one ovn-central doesn't remove the server from the Southbound cluster
+ Removing one ovn-central unit doesn't remove the server from the Southbound cluster
summary:
- Removing one ovn-central unit doesn't remove the server from the Southbound cluster
+ Removing one ovn-central unit doesn't cluster/leave SB and NB clusters
Changed in charm-interface-ovsdb:
assignee: nobody → Martin Kalcok (martin-kalcok)
Subscribing field-medium.
This issue can cause major outages when sb and nb clusters can't elect a leader because of stale ovn-central units in the cluster.
Starting from 3 ovn-central units, we had to remove 2 of them (because of hardware maintenance) and then added two back. We didn't manually run cluster/leave. Neither of the two Raft clusters could elect a leader, because both SB and NB had 4 members, of which 2 were down, and the 5th unit could not join the cluster.
The 3 remaining ovn-central units were /2, /3 and /4.
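The failed election described above follows from Raft's majority rule: a leader needs votes from more than half of the configured members, and stale members still count toward that total. A small arithmetic sketch of this rule (the function names `quorum` and `has_quorum` are made up for illustration):

```shell
#!/bin/sh
# Smallest number of members that forms a strict majority of N members.
quorum() { echo $(( $1 / 2 + 1 )); }

# has_quorum MEMBERS DOWN: succeeds when the live members form a majority.
has_quorum() {
    [ $(( $1 - $2 )) -ge "$(quorum "$1")" ]
}

# The situation from the comment: 4 configured members, 2 of them down.
if has_quorum 4 2; then
    echo "4 members, 2 down: leader election possible"
else
    echo "4 members, 2 down: no quorum (need $(quorum 4), have 2)"
fi
```

With 4 configured members the majority is 3, so 2 live servers cannot elect a leader. Without a leader no membership change can be committed, which is consistent with the 5th unit being unable to join.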
To recreate the clusters we followed these steps:
Recovery steps:
1. Stop all units:
juju run-action ovn-central/2 pause --wait
juju run-action ovn-central/3 pause --wait
juju run-action ovn-central/4 pause --wait
2. Create standalone databases on ovn-central/2:
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnsb_db.db /var/lib/ovn/ovnsb_db.db
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnnb_db.db /var/lib/ovn/ovnnb_db.db
3. Create clusters:
ovsdb-tool create-cluster /var/lib/ovn/ovnsb_db.db /tmp/standalone_ovnsb_db.db ssl:<ovn-central-2-ip>:6644
ovsdb-tool create-cluster /var/lib/ovn/ovnnb_db.db /tmp/standalone_ovnnb_db.db ssl:<ovn-central-2-ip>:6643
4. Resume ovn-central/2
5. Join cluster from ovn-central/3:
ovsdb-tool --cid=<new-sb-cid-took-from-ovn-central-2> join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound ssl:<ovn-central-3-ip>:6644 ssl:<ovn-central-2-ip>:6644
ovsdb-tool --cid=<new-nb-cid-took-from-ovn-central-2> join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound ssl:<ovn-central-3-ip>:6643 ssl:<ovn-central-2-ip>:6643
6. Resume ovn-central/3:
juju run-action ovn-central/3 resume --wait
7. Fix leader-set:
juju run -u ovn-central/leader leader-set nb_cid="<new-nb-cid-took-from-ovn-central-2>"
juju run -u ovn-central/leader leader-set sb_cid="<new-sb-cid-took-from-ovn-central-2>"
8. Resume ovn-central/4:
juju run-action ovn-central/4 resume --wait