Fuel for OpenStack

receiverd ACKs messages if it was interrupted

Bug #1383511 reported by Aleksandr Shaposhnikov on 2014-10-20

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Committed	Medium	Dima Shulyak	Fuel for OpenStack 6.0

Bug Description

[UPD]: receiverd sends ACKs for all messages even if it was interrupted in the middle of processing one. Because of that some messages do not get processed after receiverd is restarted. One of the consequences which happened in the test lab is the following:

During 100 nodes cluster deployment MOS/Fuel masternode didn't see that last controller was finished his deployment procedure.

Here is last lines of puppet logs from controller itself(about half hour ago):

Mon Oct 20 22:35:57 +0000 2014 Stage[main] (info): Evaluated in 0.36 seconds
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Starting to evaluate the resource
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Evaluated in 0.02 seconds
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Starting to evaluate the resource
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Evaluated in 0.01 seconds
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Finishing transaction 70365384691300
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Storing state
Mon Oct 20 22:35:57 +0000 2014 Puppet (info): Creating state file /var/lib/puppet/state/state.yaml
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Stored state in 0.16 seconds
Mon Oct 20 22:35:57 +0000 2014 Puppet (notice): Finished catalog run in 1270.24 seconds

Please find diagnostic snapshot and screenshot of Fuel UI in attachments.

See original description

Tags:

Revision history for this message

Aleksandr Shaposhnikov (alashai8) wrote on 2014-10-20:

Screen Shot 2014-10-20 at 15.59.13.png Edit (1.0 MiB, image/png)

tags:

added: scale

Revision history for this message

Aleksandr Shaposhnikov (alashai8) wrote on 2014-10-20:

It finally got deployed but after 35-40 minutes (just for 5 minutes which I was filing this bug +)

Revision history for this message

Aleksandr Shaposhnikov (alashai8) wrote on 2014-10-20:

Diagnostic snapshot Edit (132.2 MiB, application/x-tar)

Nastya Urlapova (aurlapova) on 2014-10-21

Changed in fuel:
milestone:	none → 6.0
assignee:	nobody → Fuel Astute Team (fuel-astute)

Revision history for this message

Vladimir Sharshov (vsharshov) wrote on 2014-10-22:

During Astute logs message about correct status for both controllers send in time.

2014-10-20T22:35:52 debug: [432] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"78", "status"=>"ready", "role"=>"controller"}, {"uid"=>"85", "progress"=>56, "status"=>"deploying", "role"=>"controller"}]}
2014-10-20T22:36:00 debug: [432] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"85", "status"=>"ready", "role"=>"controller"}]}

Nailgun got this messages also in time:

2014-10-20 22:35:52.269 INFO [7f34f7f14700] (receiver) RPC method deploy_resp received: {"task_uuid": "24d4f50b-9775-42ca-a436-e0cacfa1e107", "nodes": [{"status": "ready", "progress": 100, "role": "controller", "uid": "78"}]}
2014-10-20 22:36:16.692 INFO [7f34f7f14700] (receiver) RPC method deploy_resp received: {"task_uuid": "24d4f50b-9775-42ca-a436-e0cacfa1e107", "nodes": [{"status": "ready", "progress": 100, "role": "controller", "uid": "85"}]}

Looks like UI problem.

Changed in fuel:
assignee:	Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
assignee:	Vladimir Sharshov (vsharshov) → Fuel UI Team (fuel-ui)

Bogdan Dobrelya (bogdando) on 2014-10-23

Changed in fuel:
status:	New → Confirmed
importance:	Undecided → High

Revision history for this message

Vitaly Kramskikh (vkramskikh) wrote on 2014-10-27:

If/when this happens next time, please also attach output of /api/tasks . It is impossible to determine from diagnostic snapshot what is the root cause. But according to comment #2, most likely it is not an UI issue, as deployment finished successfully.

Revision history for this message

Vitaly Kramskikh (vkramskikh) wrote on 2014-10-27:

Python guys, please analyze diagnostic snapshot. If the bug is in nailgun, you'll find a reason there

Changed in fuel:
assignee:	Fuel UI Team (fuel-ui) → Fuel Python Team (fuel-python)

Tomasz 'Zen' Napierala (tzn) on 2014-10-28

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Łukasz Oleś (loles)

Łukasz Oleś (loles) on 2014-10-28

Changed in fuel:
assignee:	Łukasz Oleś (loles) → nobody
assignee:	nobody → Fuel Python Team (fuel-python)

Revision history for this message

Aleksandr Shaposhnikov (alashai8) wrote on 2014-10-28:

Screen Shot 2014-10-28 at 13.43.34.png Edit (1.3 MiB, image/png)

Happens again on 50 nodes deployment. All nodes are in Ready state.

Revision history for this message

Aleksandr Shaposhnikov (alashai8) wrote on 2014-10-28:

This time it took about 6 minutes to realize that cluster was deployed.

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-10-30:

After analyzing the code and the snapshot I found that it's definitely not a problem with the UI. Something kills the RPC receiver before it can store the update from the node.

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-10-31:

#10

Right now it's clear that it's supervisord that is killing Nailgun's RPC receiver (receiverd). It does not log any particular reason for doing that.

Revision history for this message

Matthew Mosesohn (raytrac3r) wrote on 2014-10-31:

#11

I can confirm that receiverd (and other nailgun processes) are getting periodically killed, but there are definitely no docker requests triggering this behavior. We may need to find out some way to audit processes to see who or what is killing these.

On my local env I don't have any such mysterious SIGINTs, which makes it hard to reproduce.

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-03:

#12

Does it require a different team's attention?

Revision history for this message

Dima Shulyak (dshulyak) wrote on 2014-11-03:

#13

We can play with acknowledge logic for rabbitmq, to prevent any acks before message will be saved to database,
but we need a plan/ideas how to prevent this from happening.

Matthew, do we have any instrument to monitor such behaviour in prod?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-03: Fix proposed to fuel-web (master)

#14

Fix proposed to branch: master
Review: https://review.openstack.org/132560

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Dima Shulyak (dshulyak)
status:	Confirmed → In Progress

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-04: Re: Controller deployment completion wasn't seen by Fuel/MOS masternode

#15

@Dima: This looks like fixing symptoms but the problem itself. As the logs show supervisord kills different services several times during that period of time. No one really can tell what kind of bugs can it actually cause.

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-04:

#16

There is the service, which is randomly killing other processes whenever it wants. That is the problem which has to be fixed.

Revision history for this message

Dima Shulyak (dshulyak) wrote on 2014-11-04:

#17

In my opinion it should be ok for service to be restarted, as long as restart doesnot lead to incosistent state in database and/or
lost messages.

I agree that problem with supervisord should be researched, but fixing acknowledge will increase reliability of receiver.

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-05:

#18

@Dima: It's definitely not ok when there is something that is randomly killing processes without us knowing why and how to control that.

Dima Shulyak (dshulyak) on 2014-11-05

Changed in fuel:
status:	In Progress → Confirmed
assignee:	Dima Shulyak (dshulyak) → Roman Prykhodchenko (romcheg)

Revision history for this message

Matthew Mosesohn (raytrac3r) wrote on 2014-11-13:

#19

Update on this bug: This of course happens if the nailgun container is restarted, but it happened a few times in the MOS Scale Lab that supervisord killed some of the processes in the nailgun container. It's hard to reproduce, so I am waiting for the issue to appear again.

Changed in fuel:
assignee:	Roman Prykhodchenko (romcheg) → Matthew Mosesohn (raytrac3r)

Revision history for this message

Matthew Mosesohn (raytrac3r) wrote on 2014-11-13:

#20

It's going to take some time to reproduce this bug, unfortunately. I believe the fix provided for https://bugs.launchpad.net/fuel/+bug/1377105 will minimize the occurence of this bug, thuogh. I'm pushing for that to fix to get merged ASAP.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-21: Related fix merged to fuel-web (master)

#21

Reviewed: https://review.openstack.org/132560
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=c439874d65e539c65e043578a3b40d218d11e9a1
Submitter: Jenkins
Branch: master

commit c439874d65e539c65e043578a3b40d218d11e9a1
Author: Dima Shulyak <email address hidden>
Date: Mon Nov 3 15:23:44 2014 +0200

Explicitly requeue rabbitmq message in case process interupted

In case if process is interrupted excplit requeue will be
requested

    Dont think that we should consider dead-letter exchanges
    at this point, becauce messaging traffic is low, and only
    thing that we need is reliably process message after
    receiverd restarted

    DocImpact
    Change-Id: If530049b6e519ceb03de0cfd82c3449cdb616f50
    Related-Bug: 1383511

Revision history for this message

Mike Scherbakov (mihgen) wrote on 2014-11-21: Re: Controller deployment completion wasn't seen by Fuel/MOS masternode

#22

Folks,
what is the latest update on the bug? Patch merged has "related" in its commit message...

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-24:

#23

Mike, that patch only fixes one of all possible consequences of the wrong receiverd's behaviour. That particular consequence is described by this bug so maybe we can close it and open a new one, which describes receiverd's misbehaviour. What do you think?

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-24:

#24

The other option is to update title, tags and maybe change priority.

Revision history for this message

Vladimir Kuklin (vkuklin) wrote on 2014-11-24:

#25

Roman, I agree with you. Let's close this ticket and open other receiverd's related

Changed in fuel:
assignee:	Matthew Mosesohn (raytrac3r) → Fuel Astute Team (fuel-astute)

Revision history for this message

Dima Shulyak (dshulyak) wrote on 2014-11-24:

#26

@Vladimir, there nothing can be fixed in astute, regarding problem described in the bug

@Mike, it was Related-Bug because it only prevents nailgun-astute interaction from failing in case supervisord sends uncontrollable
SIGINT

Changed in fuel:
importance:	High → Medium
assignee:	Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
milestone:	6.0 → 6.1
summary:	- Controller deployment completion wasn't seen by Fuel/MOS masternode + Supervisord sends SIGINT signals for random reasons

Revision history for this message

Roman Prykhodchenko (romcheg) wrote on 2014-11-24: Re: Supervisord sends SIGINT signals for random reasons

#27

So I made a conversation with Lukasz and we decided to close this bug changing the original title to the actual reason of wrong progress on the last controller. The bug which is describing wrong superwisord's behaviour could be found here: https://bugs.launchpad.net/fuel/+bug/1395753

summary:	- Supervisord sends SIGINT signals for random reasons + receiverd ACKs messages if it was interrupted
description:	updated
Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Dima Shulyak (dshulyak)

Roman Prykhodchenko (romcheg) on 2014-11-24

Changed in fuel:
status:	Confirmed → Fix Committed
milestone:	6.1 → 6.0

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.