vCenter-as-compute: After reboot of nodes, the docker controller container keeps restarting

Bug #1716577 reported by Sarath
Affects              Status    Importance  Assigned to  Milestone
Juniper Openstack (status tracked in Trunk)
  R4.0               Invalid   High        Unassigned
  Trunk              Invalid   High        Unassigned

Bug Description

Version: 4.0.1.0-74-mitaka
Topology: 3-node HA with multiple computes (multi-cluster ESXi) and KVM

After a reboot of the nodes, the docker controller container keeps restarting, so the services keep flapping.

root@5a10s30:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6b74c5311fc8 10.87.36.15:5100/contrail_4_0_1_74_vc_new-contrail-analytics:4.0.1-74 "/bin/sh -c /entry..." 8 hours ago Up 4 hours analytics
33987f17160e 10.87.36.15:5100/contrail_4_0_1_74_vc_new-contrail-analyticsdb:4.0.1-74 "/bin/sh -c /entry..." 8 hours ago Up 4 hours analyticsdb
65380c0716f0 10.87.36.15:5100/contrail_4_0_1_74_vc_new-contrail-controller:4.0.1-74 "/bin/sh -c /entry..." 8 hours ago Up About a minute controller
root@5a10s30:~# openstack-status | grep active | wc -l
Warning keystonerc not sourced
19
root@5a10s30:~# docker exec -it controller bash
root@5a10s30(controller):/# contrail-status
== Contrail Control ==
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Config ==
contrail-api:0 initializing (Database:RabbitMQ[] connection down)
contrail-config-nodemgr active
contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup

== Contrail Config Database==
contrail-database: active

== Contrail Web UI ==
contrail-webui active
contrail-webui-middleware active

root@5a10s30(controller):/# exit
root@5a10s30:~# docker exec -it analytics bash
root@5a10s30(analytics):/# contrail-status
== Contrail Analytics ==
contrail-alarm-gen initializing (Redis-UVE:10.87.36.10:6381[None], Redis-UVE:10.87.36.11:6381[None], Database:RabbitMQ[], ApiServer:Config[], Redis-UVE:10.87.36.12:6381[None], Zookeeper:AlarmGenerator[] connection down)
contrail-analytics-api initializing (ApiServer, UvePartitions:UVE-Aggregation[Partitions:0] connection down)
contrail-analytics-nodemgr initializing (NTP state unsynchronized.)
contrail-collector active
contrail-query-engine active
contrail-snmp-collector initializing (ApiServer:SNMP[] connection down)
contrail-topology initializing (ApiServer:Config[] connection down)

root@5a10s30(analytics):/# exit
root@5a10s30:~# docker exec -it analyticsdb bash
root@5a10s30(analyticsdb):/# contrail-status
== Contrail Database ==
contrail-database: active

contrail-database-nodemgr initializing (NTP state unsynchronized.)
kafka active

root@5a10s30(analyticsdb):/# exit
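
Note that the docker ps output above already shows the symptom: the controller container is only "Up About a minute" while analytics and analyticsdb have been up for hours. A quick way to watch the container cycling (a sketch with plain docker commands, not part of the original capture; RestartCount only increments when Docker's own restart policy is doing the restarting) is:

docker inspect -f '{{.RestartCount}} {{.State.StartedAt}}' controller
docker logs --tail 50 controller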

Revision history for this message
kamlesh parmar (kparmar) wrote :

Assigning to sachin.

The rabbitmq-server is not up after reboot of the host.

== Contrail Config ==
contrail-api:0 initializing (Database:RabbitMQ[] connection down)
contrail-config-nodemgr active
contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup

== Contrail Config Database==
contrail-database: active

== Contrail Web UI ==
contrail-webui active
contrail-webui-middleware active

root@5a10s31(controller):/# rabbitmqctl cluster_status
Cluster status of node rabbit@5a10s31 ...
Error: unable to connect to node rabbit@5a10s31: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@5a10s31]

rabbit@5a10s31:
  * connected to epmd (port 4369) on 5a10s31
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on 5a10s31
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-389@5a10s31'
- home dir: /var/lib/rabbitmq
- cookie hash: gKJB0lQWSF4zN4sMvpNz8g==
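
The diagnostics show that epmd on the host is reachable but no 'rabbit' node has registered with it. A few quick checks inside the controller container (a sketch using standard rabbitmq/epmd commands, not output captured from this setup):

epmd -names                      # lists Erlang nodes registered with epmd; 'rabbit' should appear once the broker is up
rabbitmqctl status               # fails with the same nodedown error while rabbitmq-server is stopped
service rabbitmq-server status   # shows whether the init script believes the broker is running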

Revision history for this message
Ignatious Johnson Christopher (ijohnson-x) wrote :

I see external_rabbitmq_servers populated in the /etc/contrailctl/controller.conf.

external_rabbitmq_servers = 10.87.36.10, 10.87.36.11, 10.87.36.12

In that case the internal ansible playbook will not configure/start rabbitmq in the controller container, so there is no rabbit available for contrail-api to connect to, and the controller container restarts.

Please do not set external_rabbitmq_servers if you need to use the rabbitmq inside the controller container.
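
For illustration (a sketch only; the exact file layout depends on the contrailctl template in use), leaving the option unset or commented out in /etc/contrailctl/controller.conf lets the container bring up its own rabbitmq:

# To use the rabbitmq inside the controller container, do not set this:
# external_rabbitmq_servers = 10.87.36.10, 10.87.36.11, 10.87.36.12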

Thanks,
Ignatious

Revision history for this message
Sarath (nsarath) wrote :

Please find logs @

nsarath@ubuntu-build04:/auto/cores/1716577$ ls -ltrd *
-rwxrwxrwx 1 nsarath test 346511360 Sep 12 12:32 A-Ctrl-log.tar
-rwxrwxrwx 1 nsarath test 357427200 Sep 12 12:36 B-Ctrl-log.tar
-rwxrwxrwx 1 nsarath test 285317120 Sep 12 12:36 C-Ctrl-log.tar
-rwxrwxrwx 1 nsarath test 72919040 Sep 12 12:41 A-Anal-log.tar
-rwxrwxrwx 1 nsarath test 230318080 Sep 12 12:59 A-AnalDB-log.tar
-rwxrwxrwx 1 nsarath test 44871680 Sep 12 13:08 C-Vcplugin-log.tar
-rwxrwxrwx 1 nsarath test 2570240 Sep 12 13:09 D-LB-log.tar
-rwxrwxrwx 1 nsarath test 67000320 Sep 12 13:10 C-Anal-log.tar
-rwxrwxrwx 1 nsarath test 61696000 Sep 12 13:10 B-Anal-log.tar
-rwxrwxrwx 1 nsarath test 41932800 Sep 12 13:10 D-Vcplugin-log.tar
-rwxrwxrwx 1 nsarath test 165079040 Sep 12 13:10 C-AnalDB-log.tar
-rwxrwxrwx 1 nsarath test 896819200 Sep 12 13:11 C-Openstack-log.tar
-rwxrwxrwx 1 nsarath test 896819200 Sep 12 13:11 A-Openstack-log.tar
-rwxrwxrwx 1 nsarath test 896819200 Sep 12 13:11 B-Openstack-log.tar

Revision history for this message
kamlesh parmar (kparmar) wrote :

Do we need to do anything in the configuration to bring the rabbitmq cluster back to an operational state automatically after an unplanned shutdown of the cluster nodes?

The 5a10s31 node is not starting rabbitmq-server, with this in the logs:

BOOT FAILED
===========

Timeout contacting cluster nodes: [rabbit@5a10s29ctrl,rabbit@5a10s30ctrl].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: [rabbit@5a10s29ctrl,rabbit@5a10s30ctrl]

rabbit@5a10s29ctrl:
  * unable to connect to epmd (port 4369) on 5a10s29ctrl: address (cannot connect to host/port)

rabbit@5a10s30ctrl:
  * unable to connect to epmd (port 4369) on 5a10s30ctrl: address (cannot connect to host/port)

current node details:
- node name: rabbit@5a10s31ctrl

From the rabbitmq-server documentation, it seems some administrative action is required in the case of an unplanned shutdown:
https://www.rabbitmq.com/man/rabbitmqctl.1.man.html

force_boot

Ensure that the node will start next time, even if it was not the last to shut down.
Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.
In such a case you can invoke rabbitmqctl force_boot while the node is down. This will tell the node to unconditionally start next time you ask it to. If any changes happened to the cluster after this node shut down, they will be lost.
If the last node to go down is permanently lost then you should use rabbitmqctl forget_cluster_node --offline in preference to this command, as it will ensure that mirrored queues which were mastered on the lost node get promoted.
For example:
rabbitmqctl force_boot
This will force the node not to wait for other nodes next time it is started.
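
Applied to this setup, the documented recovery would look roughly like this on the node that refuses to boot (a sketch based on the quoted documentation, with the node names from the log above; force_boot is meant to be run while the node is down):

service rabbitmq-server stop     # make sure the broker is down
rabbitmqctl force_boot           # start without waiting for rabbit@5a10s29ctrl / rabbit@5a10s30ctrl
service rabbitmq-server start
rabbitmqctl cluster_status       # verify the other nodes rejoin as they come back up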

Revision history for this message
kamlesh parmar (kparmar) wrote :

This seems to fix the rabbitmq cluster:

1) Stop rabbitmq-server.
2) root@5a10s31:~# rm -rf /var/lib/rabbitmq/mnesia/
3) Kill the epmd process.
4) Start rabbitmq-server and check rabbitmqctl cluster_status; it should show all cluster nodes.
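
Put together as commands (a sketch of the steps above, run as root on the affected node inside the controller container; the mnesia path is the one from step 2):

service rabbitmq-server stop
rm -rf /var/lib/rabbitmq/mnesia/
pkill epmd                       # or: epmd -kill
service rabbitmq-server start
rabbitmqctl cluster_status       # should show all cluster nodes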

Revision history for this message
Sarath (nsarath) wrote :

RN: When nodes playing both the Contrail and OpenStack roles are rebooted, the controller container may keep restarting with contrail services flapping. The workaround is below:
1) Stop rabbitmq-server.
2) rm -rf /var/lib/rabbitmq/mnesia/
3) Kill the epmd process.
4) Start rabbitmq-server and check rabbitmqctl cluster_status; it should show all cluster nodes.

Sarath (nsarath)
Changed in juniperopenstack:
status: Incomplete → Won't Fix
kamlesh parmar (kparmar)
Changed in juniperopenstack:
status: Won't Fix → Invalid
Revision history for this message
Abhay Joshi (abhayj) wrote :

From: Ignatious Johnson <email address hidden>
Date: Tuesday, September 12, 2017 at 11:36 PM
To: Sachchidanand Vaidya <email address hidden>, Kamlesh Parmar <email address hidden>, Sarathbabu Narasimhan <email address hidden>
Cc: Sachin Bansal <email address hidden>, Jeba Paulaiyan <email address hidden>, Abhay Joshi <email address hidden>, Rudra Rugge <email address hidden>
Subject: Re: Bug #1716577

Hi Sachchidanand,

Kamlesh explained it clearly, as below:

force_boot

Ensure that the node will start next time, even if it was not the last to shut down.
Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.
In such a case you can invoke rabbitmqctl force_boot while the node is down. This will tell the node to unconditionally start next time you ask it to. If any changes happened to the cluster after this node shut down, they will be lost.
If the last node to go down is permanently lost then you should use rabbitmqctl forget_cluster_node --offline in preference to this command, as it will ensure that mirrored queues which were mastered on the lost node get promoted.
For example:
rabbitmqctl force_boot
This will force the node not to wait for other nodes next time it is started.

Possible customer cases follow:

1. A setup where SM provisions the openstack node/rabbitmq and also the controller container on the same nodes - only in our lab.
2. The customer will not provision rabbit on the same node when using an external rabbitmq server (they will maintain a separate rabbit cluster and manage power outages properly).
3. The customer can skip the external rabbitmq server and provision/use rabbit in the controller container; in this case, when the controller nodes reboot, the internal ansible playbook will cluster/start the rabbitmq properly.

So we don’t need to worry about this bug.

Thanks,
Ignatious

Revision history for this message
Sarath (nsarath) wrote :

From: Sarathbabu Narasimhan
Sent: Wednesday, September 13, 2017 12:20 PM
To: Ignatious Johnson <email address hidden>; Sachchidanand Vaidya <email address hidden>; Kamlesh Parmar <email address hidden>
Cc: Sachin Bansal <email address hidden>; Jeba Paulaiyan <email address hidden>; Abhay Joshi <email address hidden>; Rudra Rugge <email address hidden>
Subject: RE: Bug #1716577

Okay, I agree that use cases 2 and 3 highlighted by Ignatious don't see this problem.
But, as per the supported topologies and published documents below,
https://github.com/Juniper/contrail-server-manager/wiki/Sample-JSONs-for-a-Six-Node-Contrail-HA-and-OpenStack-HA-Cluster

the field may take use case 1 as a supported topology (the same 3 nodes doing HA for both openstack & contrail) and deploy it, even though they never did so in the past. But they can move to use case 2 or 3 if they don't want to see this problem.

Thanks
*Sarath

Revision history for this message
Sarath (nsarath) wrote :

RN: In a topology where 3-node HA for both the Contrail and OpenStack roles is played by the same 3 nodes, a reboot may leave the controller container and the contrail services restarting continuously. The workaround is given below:
1) Stop rabbitmq-server.
2) rm -rf /var/lib/rabbitmq/mnesia/
3) Kill the epmd process.
4) Start rabbitmq-server and check rabbitmqctl cluster_status; it should show all cluster nodes.

tags: added: releasenote