receiverd ACKs messages if it was interrupted

Bug #1383511 reported by Aleksandr Shaposhnikov
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
Medium
Dima Shulyak

Bug Description

[UPD]: receiverd sends ACKs for all messages even if it was interrupted in the middle of processing one. Because of that some messages do not get processed after receiverd is restarted. One of the consequences which happened in the test lab is the following:

During 100 nodes cluster deployment MOS/Fuel masternode didn't see that last controller was finished his deployment procedure.

Here is last lines of puppet logs from controller itself(about half hour ago):

Mon Oct 20 22:35:57 +0000 2014 Stage[main] (info): Evaluated in 0.36 seconds
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Starting to evaluate the resource
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Evaluated in 0.02 seconds
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Starting to evaluate the resource
Mon Oct 20 22:35:57 +0000 2014 Stage[glance-image] (info): Evaluated in 0.01 seconds
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Finishing transaction 70365384691300
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Storing state
Mon Oct 20 22:35:57 +0000 2014 Puppet (info): Creating state file /var/lib/puppet/state/state.yaml
Mon Oct 20 22:35:57 +0000 2014 Puppet (debug): Stored state in 0.16 seconds
Mon Oct 20 22:35:57 +0000 2014 Puppet (notice): Finished catalog run in 1270.24 seconds

Please find diagnostic snapshot and screenshot of Fuel UI in attachments.

Tags: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
tags: added: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

It finally got deployed but after 35-40 minutes (just for 5 minutes which I was filing this bug +)

Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
Changed in fuel:
milestone: none → 6.0
assignee: nobody → Fuel Astute Team (fuel-astute)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

During Astute logs message about correct status for both controllers send in time.

2014-10-20T22:35:52 debug: [432] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"78", "status"=>"ready", "role"=>"controller"}, {"uid"=>"85", "progress"=>56, "status"=>"deploying", "role"=>"controller"}]}
2014-10-20T22:36:00 debug: [432] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"85", "status"=>"ready", "role"=>"controller"}]}

Nailgun got this messages also in time:

2014-10-20 22:35:52.269 INFO [7f34f7f14700] (receiver) RPC method deploy_resp received: {"task_uuid": "24d4f50b-9775-42ca-a436-e0cacfa1e107", "nodes": [{"status": "ready", "progress": 100, "role": "controller", "uid": "78"}]}
2014-10-20 22:36:16.692 INFO [7f34f7f14700] (receiver) RPC method deploy_resp received: {"task_uuid": "24d4f50b-9775-42ca-a436-e0cacfa1e107", "nodes": [{"status": "ready", "progress": 100, "role": "controller", "uid": "85"}]}

Looks like UI problem.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
assignee: Vladimir Sharshov (vsharshov) → Fuel UI Team (fuel-ui)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Vitaly Kramskikh (vkramskikh) wrote :

If/when this happens next time, please also attach output of /api/tasks . It is impossible to determine from diagnostic snapshot what is the root cause. But according to comment #2, most likely it is not an UI issue, as deployment finished successfully.

Revision history for this message
Vitaly Kramskikh (vkramskikh) wrote :

Python guys, please analyze diagnostic snapshot. If the bug is in nailgun, you'll find a reason there

Changed in fuel:
assignee: Fuel UI Team (fuel-ui) → Fuel Python Team (fuel-python)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Łukasz Oleś (loles)
Łukasz Oleś (loles)
Changed in fuel:
assignee: Łukasz Oleś (loles) → nobody
assignee: nobody → Fuel Python Team (fuel-python)
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

Happens again on 50 nodes deployment. All nodes are in Ready state.

Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

This time it took about 6 minutes to realize that cluster was deployed.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

After analyzing the code and the snapshot I found that it's definitely not a problem with the UI. Something kills the RPC receiver before it can store the update from the node.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Right now it's clear that it's supervisord that is killing Nailgun's RPC receiver (receiverd). It does not log any particular reason for doing that.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I can confirm that receiverd (and other nailgun processes) are getting periodically killed, but there are definitely no docker requests triggering this behavior. We may need to find out some way to audit processes to see who or what is killing these.

On my local env I don't have any such mysterious SIGINTs, which makes it hard to reproduce.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Does it require a different team's attention?

Revision history for this message
Dima Shulyak (dshulyak) wrote :

We can play with acknowledge logic for rabbitmq, to prevent any acks before message will be saved to database,
but we need a plan/ideas how to prevent this from happening.

Matthew, do we have any instrument to monitor such behaviour in prod?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/132560

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dima Shulyak (dshulyak)
status: Confirmed → In Progress
Revision history for this message
Roman Prykhodchenko (romcheg) wrote : Re: Controller deployment completion wasn't seen by Fuel/MOS masternode

@Dima: This looks like fixing symptoms but the problem itself. As the logs show supervisord kills different services several times during that period of time. No one really can tell what kind of bugs can it actually cause.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

There is the service, which is randomly killing other processes whenever it wants. That is the problem which has to be fixed.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

In my opinion it should be ok for service to be restarted, as long as restart doesnot lead to incosistent state in database and/or
lost messages.

I agree that problem with supervisord should be researched, but fixing acknowledge will increase reliability of receiver.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Dima: It's definitely not ok when there is something that is randomly killing processes without us knowing why and how to control that.

Dima Shulyak (dshulyak)
Changed in fuel:
status: In Progress → Confirmed
assignee: Dima Shulyak (dshulyak) → Roman Prykhodchenko (romcheg)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Update on this bug: This of course happens if the nailgun container is restarted, but it happened a few times in the MOS Scale Lab that supervisord killed some of the processes in the nailgun container. It's hard to reproduce, so I am waiting for the issue to appear again.

Changed in fuel:
assignee: Roman Prykhodchenko (romcheg) → Matthew Mosesohn (raytrac3r)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

It's going to take some time to reproduce this bug, unfortunately. I believe the fix provided for https://bugs.launchpad.net/fuel/+bug/1377105 will minimize the occurence of this bug, thuogh. I'm pushing for that to fix to get merged ASAP.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/132560
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=c439874d65e539c65e043578a3b40d218d11e9a1
Submitter: Jenkins
Branch: master

commit c439874d65e539c65e043578a3b40d218d11e9a1
Author: Dima Shulyak <email address hidden>
Date: Mon Nov 3 15:23:44 2014 +0200

    Explicitly requeue rabbitmq message in case process interupted

    In case if process is interrupted excplit requeue will be
    requested

    Dont think that we should consider dead-letter exchanges
    at this point, becauce messaging traffic is low, and only
    thing that we need is reliably process message after
    receiverd restarted

    DocImpact
    Change-Id: If530049b6e519ceb03de0cfd82c3449cdb616f50
    Related-Bug: 1383511

Revision history for this message
Mike Scherbakov (mihgen) wrote : Re: Controller deployment completion wasn't seen by Fuel/MOS masternode

Folks,
what is the latest update on the bug? Patch merged has "related" in its commit message...

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Mike, that patch only fixes one of all possible consequences of the wrong receiverd's behaviour. That particular consequence is described by this bug so maybe we can close it and open a new one, which describes receiverd's misbehaviour. What do you think?

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

The other option is to update title, tags and maybe change priority.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Roman, I agree with you. Let's close this ticket and open other receiverd's related

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Astute Team (fuel-astute)
Revision history for this message
Dima Shulyak (dshulyak) wrote :

@Vladimir, there nothing can be fixed in astute, regarding problem described in the bug

@Mike, it was Related-Bug because it only prevents nailgun-astute interaction from failing in case supervisord sends uncontrollable
SIGINT

Changed in fuel:
importance: High → Medium
assignee: Fuel Astute Team (fuel-astute) → Fuel Library Team (fuel-library)
milestone: 6.0 → 6.1
summary: - Controller deployment completion wasn't seen by Fuel/MOS masternode
+ Supervisord sends SIGINT signals for random reasons
Revision history for this message
Roman Prykhodchenko (romcheg) wrote : Re: Supervisord sends SIGINT signals for random reasons

So I made a conversation with Lukasz and we decided to close this bug changing the original title to the actual reason of wrong progress on the last controller. The bug which is describing wrong superwisord's behaviour could be found here: https://bugs.launchpad.net/fuel/+bug/1395753

summary: - Supervisord sends SIGINT signals for random reasons
+ receiverd ACKs messages if it was interrupted
description: updated
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dima Shulyak (dshulyak)
Changed in fuel:
status: Confirmed → Fix Committed
milestone: 6.1 → 6.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.