controller node couldn't reach rabbitmq of fuel node during deployment

Bug #1431386 reported by Leontiy Istomin
This bug affects 5 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Triaged | Medium | Fuel Sustaining |
6.1.x | Won't Fix | High | Fuel Python (Deprecated) |
7.0.x | Won't Fix | High | Fuel Python (Deprecated) |
Mitaka | Won't Fix | Medium | Fuel Python (Deprecated) |
Newton | Triaged | Medium | Fuel Sustaining |

Bug Description

api: '1.0'
astute_sha: 1be5b9b827f512d740fe907c7ff72486d4030938
auth_required: true
build_id: 2015-03-02_14-00-04
build_number: '154'
feature_groups:
- mirantis
fuellib_sha: b17e3810dbca407fca2a231c26f553a46e166343
fuelmain_sha: baf24424a4e056c6753913de5f8c94851903f718
nailgun_sha: f034fbb4b68be963e4dc5b5d680061b54efbf605
ostf_sha: 103d6cf6badd57b791cfaf4310ec8bd81c7a8a46
production: docker
python-fuelclient_sha: 3ebfa9c14a192d0298ff787526bf990055a23694
release: '6.1'

I tried to deploy the following configuration with 200 baremetal nodes:
Baremetal, Ubuntu, HA, Neutron-gre,Ceph-all, Debug, 6.1_154
Controllers:3 Computes:197

I got the following error during cluster deployment:
Mcollective problem with nodes [{"uid"=>"1", "error"=>"Node not answered by RPC."}], please check log for details
The nodes haven't even been provisioned. They are still in bootstrap.

From mcollective log of controller node with uid=1:
ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.20.0.2:61613: es_oldrecv: receive failed: Connection reset by peer

Revision history for this message
Leontiy Istomin (listomin) wrote :
description: updated
Changed in fuel:
milestone: none → 6.1
assignee: nobody → Vladimir Sharshov (vsharshov)
importance: Undecided → High
Revision history for this message
Leontiy Istomin (listomin) wrote :

rabbitmq seems to be OK:

[root@fuel ~]# dockerctl shell rabbitmq
[root@ee1f879b6011 ~]# rabbitmqctl status
Status of node rabbit@ee1f879b6011 ...
[{pid,1510},
 {running_applications,
     [{rabbitmq_stomp,"Embedded Rabbit Stomp Adapter","3.3.5"},
      {rabbitmq_management,"RabbitMQ Management Console","3.3.5"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.3.5"},
      {rabbit,"RabbitMQ","3.3.5"},
      {os_mon,"CPO CXC 138 46","2.2.7"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.3.5"},
      {webmachine,"webmachine","1.10.3-rmq3.3.5-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.3.5-git680dba8"},
      {amqp_client,"RabbitMQ AMQP Client","3.3.5"},
      {xmerl,"XML parser","1.2.10"},
      {inets,"INETS CXC 138 49","5.7.1"},
      {mnesia,"MNESIA CXC 138 12","4.5"},
      {sasl,"SASL CXC 138 11","2.1.10"},
      {stdlib,"ERTS CXC 138 10","1.17.5"},
      {kernel,"ERTS CXC 138 10","2.14.5"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:6:6] [rq:6] [async-threads:30] [kernel-poll:true]\n"},
 {memory,
     [{total,208516560},
      {connection_procs,8229392},
      {queue_procs,34821296},
      {plugins,10775712},
      {other_proc,17252640},
      {mnesia,5254824},
      {mgmt_db,15881128},
      {msg_index,2102624},
      {other_ets,3191376},
      {binary,20071880},
      {code,17784544},
      {atom,1625361},
      {other_system,71525783}]},
 {alarms,[]},
 {listeners,[{clustering,41055,"::"},{amqp,5672,"::"},{stomp,61613,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3355982233},
 {disk_free_limit,50000000},
 {disk_free,9421512704},
 {file_descriptors,
     [{total_limit,102300},
      {total_used,217},
      {sockets_limit,92068},
      {sockets_used,214}]},
 {processes,[{limit,1048576},{used,6397}]},
 {run_queue,0},
 {uptime,64870}]
...done.

[root@ee1f879b6011 ~]# rabbitmqctl list_queues
Listing queues ...
amq.gen-2wXBIlw1FIzdmdxzF-sG3Q 0
amq.gen-9eKeOiM17ib56w8WpHxvZw 0
amq.gen-G_HXOSSZYDDM4NSBAsBPUg 0
amq.gen-TIXr-6FXTbBHboSv9e4frw 0
amq.gen-pONboF2R3fXeOOTTfkYsBw 0
amq.gen-yKvDYTBIbt0XbrRrtObLsQ 0
amq.gen-ym_nLWc_QnljJYUqqsAJ0A 0
nailgun 0
naily 0
...done.

description: updated
Revision history for this message
Łukasz Oleś (loles) wrote :

What's the status of node-1?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Łukasz Oleś (loles) wrote :

From the mcollective log on node-1:

2015-03-11T20:30:42.657620+00:00 debug: 20:30:42.737132 #8391] DEBUG -- : rabbitmq.rb:225:in `receive' Waiting for a message from RabbitMQ
2015-03-11T20:30:42.657878+00:00 debug: 20:30:42.737740 #8391] DEBUG -- : runnerstats.rb:49:in `received' Incrementing total stat
2015-03-11T20:30:42.657993+00:00 debug: 20:30:42.737853 #8391] DEBUG -- : pluginmanager.rb:83:in `[]' Returning cached plugin security_plugin with class MCollective::Security::Psk
2015-03-11T20:30:42.658165+00:00 warning: 20:30:42.738090 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: Failed to handle message: TypeError: incompatible marshal file format (can't be read)
2015-03-11T20:30:42.658258+00:00 info: format version 4.8 required; 89.111 given
2015-03-11T20:30:42.658417+00:00 warning: 20:30:42.738166 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/libexec/mcollective/mcollective/security/psk.rb:22:in `load'
2015-03-11T20:30:42.658472+00:00 warning: 20:30:42.738228 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/libexec/mcollective/mcollective/security/psk.rb:22:in `decodemsg'
2015-03-11T20:30:42.658671+00:00 warning: 20:30:42.738287 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/message.rb:178:in `decode!'
2015-03-11T20:30:42.658759+00:00 warning: 20:30:42.738345 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/runner.rb:121:in `receive'
2015-03-11T20:30:42.658861+00:00 warning: 20:30:42.738402 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/runner.rb:55:in `run'
2015-03-11T20:30:42.659087+00:00 warning: 20:30:42.738461 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/runner.rb:53:in `loop'
2015-03-11T20:30:42.659205+00:00 warning: 20:30:42.738518 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/runner.rb:53:in `run'
2015-03-11T20:30:42.659321+00:00 warning: 20:30:42.738586 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/unix_daemon.rb:30:in `daemonize_runner'
2015-03-11T20:30:42.659421+00:00 warning: 20:30:42.738650 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/unix_daemon.rb:13:in `daemonize'
2015-03-11T20:30:42.659586+00:00 warning: 20:30:42.738709 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/unix_daemon.rb:5:in `fork'
2015-03-11T20:30:42.659692+00:00 warning: 20:30:42.738767 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/unix_daemon.rb:5:in `daemonize'
2015-03-11T20:30:42.659785+00:00 warning: 20:30:42.738825 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/lib/ruby/site_ruby/1.8/mcollective/unix_daemon.rb:20:in `daemonize_runner'
2015-03-11T20:30:42.659969+00:00 warning: 20:30:42.738889 #8391] WARN -- : psk.rb:22:in `decodemsg' PLMC10: /usr/sbin/mcollectived:47

Revision history for this message
Sergey Galkin (sgalkin) wrote :

node-1 is in the error state:

id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
----|----------|------------------|---------|-------------|-------------------|-------------------|---------------|--------|---------
1 | error | Untitled (07:e2) | 2 | 10.20.1.1 | 0c:c4:7a:1f:07:e2 | controller | | True | 2

It is accessible:

[root@fuel ~]# ssh 10.20.1.1
Warning: Permanently added '10.20.1.1' (RSA) to the list of known hosts.
Last login: Thu Mar 12 18:03:15 2015 from 10.20.0.2
[root@bootstrap ~]#

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The main problem in this env is on node-1: it has different credentials for mcollective.

On node-1

plugin.rabbitmq.pool.1.user = mcollective
plugin.rabbitmq.pool.1.password= FGDVcrol

On other nodes:

plugin.rabbitmq.pool.1.user = mcollective
plugin.rabbitmq.pool.1.password= lXZ1QEWf

It looks like the first node was booted from another master node and was not rebooted afterwards.

I simply rebooted node-1, and the node got the correct config.

mco ping before:
199 replies max: 256.11 min: 71.64 avg: 171.61

after:
200 replies max: 422.21 min: 126.38 avg: 316.06
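
The comparison above can be automated when there are many nodes. The following is a minimal sketch, not part of Fuel: it assumes the mcollective settings of each node have already been collected as text (presumably from /etc/mcollective/server.cfg, which the report does not name), and the helper names and the node-2/node-3 labels are hypothetical. It flags every node whose rabbitmq password differs from the majority.

```python
from collections import Counter

# Setting name taken from the config excerpts in this report.
PASSWORD_KEY = "plugin.rabbitmq.pool.1.password"


def read_setting(cfg_text, key):
    """Return the value of `key` from `key = value` style text, or None."""
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return value.strip()
    return None


def find_outliers(configs, key=PASSWORD_KEY):
    """Return the nodes whose value for `key` differs from the most common one."""
    values = {node: read_setting(text, key) for node, text in configs.items()}
    majority, _count = Counter(values.values()).most_common(1)[0]
    return sorted(node for node, value in values.items() if value != majority)


# Hypothetical input: node name -> config text gathered beforehand (e.g. over ssh).
configs = {
    "node-1": "plugin.rabbitmq.pool.1.user = mcollective\n"
              "plugin.rabbitmq.pool.1.password= FGDVcrol\n",
    "node-2": "plugin.rabbitmq.pool.1.user = mcollective\n"
              "plugin.rabbitmq.pool.1.password= lXZ1QEWf\n",
    "node-3": "plugin.rabbitmq.pool.1.user = mcollective\n"
              "plugin.rabbitmq.pool.1.password= lXZ1QEWf\n",
}

print(find_outliers(configs))
```

With the values quoted in this comment, only node-1 is reported. The majority vote is a simplification; comparing against the credentials the current master node actually generated would be more reliable.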

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Here is the list of steps to reproduce the bug:

1. Fuel-master was set up; let's call it fuel-master-1.
2. Fuel-master-1 discovered VMs, which received everything needed to work with fuel-master-1 (including the rabbitmq credentials).
3. For some reason fuel-master-1 was reinstalled as a new fuel-master-2 (which generated new rabbitmq credentials).
4. The VMs discovered by fuel-master-1 connected to fuel-master-2, because fuel-master-2 has the same IP addresses and nailgun-agent doesn't require any authentication.
5. The rabbitmq credentials, which mcollective requires, no longer matched.
6. Deployment fails.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Bug summary is too generic, please rename to narrow down the matching symptoms.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bug can be worked around by rebooting the bootstrap nodes after master node replacement, or by fixing the mcollective config on already deployed nodes in case of spontaneous master node replacement.
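
For the second workaround, fixing the config on an already deployed node amounts to rewriting one setting. Below is a minimal sketch under stated assumptions: `replace_setting` is a hypothetical helper, the text is assumed to come from the node's mcollective config file, and the new password must be taken from the current master node. Writing the result back and restarting the mcollective service so it picks up the change are left out and assumed to be done separately.

```python
PASSWORD_KEY = "plugin.rabbitmq.pool.1.password"


def replace_setting(cfg_text, key, new_value):
    """Return cfg_text with the value of `key` replaced by new_value.

    Raises KeyError if the setting is absent, so a wrong file is not
    silently left unchanged.
    """
    out = []
    replaced = False
    for line in cfg_text.splitlines():
        name, sep, _old = line.partition("=")
        if sep and name.strip() == key:
            out.append(f"{key} = {new_value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise KeyError(key)
    return "\n".join(out) + "\n"


# Stale config as seen on node-1 in this report.
stale = (
    "plugin.rabbitmq.pool.1.user = mcollective\n"
    "plugin.rabbitmq.pool.1.password= FGDVcrol\n"
)

print(replace_setting(stale, PASSWORD_KEY, "lXZ1QEWf"))
```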

tags: added: tricky
Igor Marnat (imarnat)
tags: added: fuel-to-mos
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/215571

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/215572

Andrey Maximov (maximov)
tags: added: feature
Revision history for this message
Andrey Maximov (maximov) wrote :

We need to detect the case when bootstrap nodes connect to an incompatible master node and warn the user that he/she should reboot the bootstrap nodes. Typically this happens when you upgrade the master node and end up with a combination of old bootstrap nodes and a new master node.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Copy-pasting review comment from Evgeniy L:

"I'm strongly against rebooting users' machines from the agent. The agent's purpose is only to report data; it mustn't reboot the system, it's too dangerous.

I think it would be safer to pass master_node_uid in the node's metadata and have Nailgun reject creation of such nodes at the API level."

I agree with Evgeniy, it seems to be the most appropriate solution for this case.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

And another comment from Evgeniy with more detailed explanation:

"We follow one of these approaches depending on the task:

1. Ask the user to reboot the nodes explicitly: we can notify the user that there are nodes from another master node and ask him/her to reboot those bootstrap nodes.

2. For other cases we reboot from the master node only. Only Nailgun knows everything about the nodes and what state they are in, so we reboot during provisioning, for example. That won't work here, because MCollective is not going to work.

To fix this particular issue, it's better to reject node creation at the API level and create a notification explaining that some nodes, which were booted/installed with another Fuel installation, are trying to perform discovery."
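
The proposal in the two quoted comments can be sketched roughly as follows. This is an illustration only, not Nailgun code: apart from the master_node_uid field named in the review comment, every name here (ForeignMasterError, validate_discovery, register_nodes, the sample UIDs) is hypothetical, and a node that reports no master_node_uid at all is treated as foreign, which is an assumption.

```python
class ForeignMasterError(Exception):
    """Raised when a node was bootstrapped by a different master node."""


def validate_discovery(node_meta, current_master_uid):
    """Accept the node only if it reports the UID of the current master node."""
    reported = node_meta.get("master_node_uid")
    if reported != current_master_uid:
        raise ForeignMasterError(
            "Node {mac} was booted by another Fuel installation "
            "(master_node_uid={uid!r}); please reboot it.".format(
                mac=node_meta.get("mac", "unknown"), uid=reported
            )
        )
    return node_meta


def register_nodes(reports, current_master_uid):
    """Split discovery reports into accepted nodes and user notifications."""
    accepted, notifications = [], []
    for report in reports:
        try:
            accepted.append(validate_discovery(report, current_master_uid))
        except ForeignMasterError as exc:
            # Reject at the API level and tell the user why.
            notifications.append(str(exc))
    return accepted, notifications


# Hypothetical discovery reports; the first MAC is the one shown for node-1 above.
reports = [
    {"mac": "0c:c4:7a:1f:07:e2", "master_node_uid": "fuel-master-1"},
    {"mac": "0c:c4:7a:1f:07:e3", "master_node_uid": "fuel-master-2"},
]

accepted, notifications = register_nodes(reports, "fuel-master-2")
print(len(accepted), notifications)
```

Here the node still carrying the UID of fuel-master-1 is rejected and produces a notification, while the agent itself never reboots anything, in line with the objection quoted above.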

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Dmitry is no longer working on this. Need a new fuel-python dev to pick this up.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Fuel Python Team (fuel-python)
milestone: 7.0 → 8.0
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Dmitry Mescheryakov (dmitryme) (<email address hidden>) on branch: master
Review: https://review.openstack.org/215571
Reason: Not a complete solution, abandoning it so that it does not hang here forever

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (master)

Change abandoned by Dmitry Mescheryakov (dmitryme) (<email address hidden>) on branch: master
Review: https://review.openstack.org/215572
Reason: Not a complete solution, abandoning it so that it does not hang here forever

Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

We passed SCF in 8.0. Moving the bug to 9.0.
