Deploying ISO triggers an error: Failed tasks: Task[updatedb/3] Stopping the deployment process!

Bug #1580615 reported by Sergii Turivnyi
Affects              Status         Importance  Assigned to    Milestone
Fuel for OpenStack   Fix Committed  High        Alex Schultz
Mitaka               Fix Released   High        Alex Schultz
Newton               Fix Committed  High        Alex Schultz

Bug Description

Detailed bug description:
 Deploying the ISO fails with the error: Failed tasks: Task[updatedb/3] Stopping the deployment process!
 ISO: #278 http://paste.openstack.org/show/496730/

Steps to reproduce:
  Deploy ISO with
    Storage Backends only Ceph RBD for volumes (Cinder) - On
    Nova quotas - On
    OpenStack debug logging - On
    Additional component Murano - On
    3 controllers
    2 compute + ceph

Expected results:
  Deployment finishes successfully.

Actual result:
  Deployment fails with an error.

Reproducibility:
 --

Workaround:
 --

Impact:
 --

Description of the environment:
  VNC   NODE-NAME
  ----- ---------
  5902  admin
  5904  slave-01
  5905  slave-02
  5906  slave-03
  5907  slave-04
  5908  slave-05

Additional information:
  See attachment
  Snapshot: https://drive.google.com/open?id=0B8hkiEm94sEtNmtzbGNEY2pfekU

Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :
Peter Zhurba (pzhurba)
tags: added: area-devops
tags: removed: area-puppet
Changed in fuel:
assignee: nobody → Fuel DevOps (fuel-devops)
status: New → Confirmed
assignee: Fuel DevOps (fuel-devops) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-build
tags: added: area-library
removed: area-build
description: updated
tags: added: murano
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This looks like it is covered by https://review.openstack.org/#/c/310153/9

Passing to Oleksiy Molchanov

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Actually, it was successful on all nodes:

2016-05-11T13:11:38.195896+00:00 notice: (Scope(Class[Osnailyfacter::Ceph::Updatedb])) MODULAR: ceph/updatedb.pp
2016-05-11T13:11:38.256940+00:00 notice: (Scope(Class[Osnailyfacter::Ceph::Updatedb])) MODULAR: ceph/updatedb.pp
2016-05-11T13:11:39.932700+00:00 notice: Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11T13:11:39.932886+00:00 notice: Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11T13:11:56.169040+00:00 notice: (/Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns) executed successfully
2016-05-11T13:11:56.169199+00:00 notice: (/Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns) executed successfully

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-python
removed: area-library
tags: added: area-murano
removed: murano
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

updatedb on node-3 failed with the message: "Process not running but not empty lockfile is present. Trying to remove lockfile...ok."

This looks like it's related to the astute puppet runner, which recently added lockfile support. Moving to area-python.

tags: added: module-astute
removed: area-devops area-murano
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

The undeleted puppet lockfile seems to be a separate problem, not the cause of this one. Investigating this one requires the puppet logs from the third node. Is there an environment to revert, or something similar?

Changed in fuel:
status: In Progress → Incomplete
Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

Unfortunately, we do not have the environment.

Changed in fuel:
status: Incomplete → In Progress
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

The updatedb task actually completed successfully on the node, but astute timed out waiting for it.

Here is astute:
2016-05-11 13:10:55 INFO [2703] Task[updatedb/3]: Run on node: Node[3]
.....
2016-05-11 13:10:55 DEBUG [2703] Waiting for puppet to finish deployment on node 3 (timeout = 60 sec)...

Then puppet:
2016-05-11 13:11:37 +0000 Scope(Class[Osnailyfacter::Ceph::Updatedb]) (notice): MODULAR: ceph/updatedb.pp
2016-05-11 13:11:39 +0000 Puppet (notice): Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11 13:11:56 +0000 /Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns (notice): executed successfully

Then again astute:
2016-05-11 13:11:55 DEBUG [2703] Node[3]: Node 3: task updatedb, task status running
2016-05-11 13:11:55 WARNING [2703] Puppet agent 3 didn't respond within the allotted time
2016-05-11 13:11:55 DEBUG [2703] Task time summary: updatedb with status failed on node 3 took 00:01:00

For now, the question is: what was puppet doing for about 40 seconds before starting the task?

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The bug is not a blocker. Reducing priority to High according to our criteria: https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Triaging_the_bugs_in_the_Fuel_deployment_process

Changed in fuel:
importance: Critical → High
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

On nodes 1 and 2 there is also a ~20-second delay between puppet starting and the moment it begins doing its intended work.

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Fuel Library (Deprecated) (fuel-library)
assignee: Fuel Library (Deprecated) (fuel-library) → Georgy Kibardin (gkibardin)
Revision history for this message
Alex Schultz (alex-schultz) wrote :

A few things here: it appears the host that the task failed on was heavily loaded. The logs show that astute kicked off the puppet task at 13:10:55:

2016-05-11 13:10:55 INFO [2703] Task[updatedb/3]: Run on node: Node[3]

But the first puppet log line for the task did not appear until 13:11:37:

2016-05-11 13:11:37 +0000 Scope(Class[Osnailyfacter::Ceph::Updatedb]) (notice): MODULAR: ceph/updatedb.pp
2016-05-11 13:11:39 +0000 Puppet (notice): Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11 13:11:56 +0000 /Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns (notice): executed successfully
2016-05-11 13:11:59 +0000 Puppet (notice): Finished catalog run in 8.53 seconds

This means it took ~42 seconds for fact loading before puppet actually started running the manifest.

Additionally, if you look at the task that was being run, it is just a single exec. The exec consists of a sed, guarded by an unless that performs a file-existence test and a grep. The catalog compilation took ~3 seconds, and the execution of the simple sed/grep took over 10 seconds. Unfortunately, the combined slowness of this entire process exceeded the 60-second task timeout.

So the only thing we can really do to address this issue is increase the task timeout to allow for heavily loaded environments. This should not be a problem in an actual production deployment, since the hosts being deployed on should not be so loaded, but in CI it can be an issue.
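
For illustration, here is a minimal Puppet sketch of an exec along those lines: a sed guarded by an unless that does a file-existence test plus a grep. The resource title matches the log lines above, but the exact command, regex, and paths are assumptions and may differ from the real ceph/updatedb.pp in fuel-library.

# Hypothetical reconstruction of the single exec in ceph/updatedb.pp;
# the command and guard details are assumptions, not the committed code.
exec { 'Ensure /var/lib/ceph in the updatedb PRUNEPATH':
  path    => ['/bin', '/usr/bin'],
  # Append /var/lib/ceph inside the quoted PRUNEPATHS value of updatedb.conf
  command => "sed -i -Ee 's|(PRUNEPATHS *= *\"[^\"]*)|\\1 /var/lib/ceph|' /etc/updatedb.conf",
  # Skip when the config file is absent or the path is already pruned
  unless  => "test ! -f /etc/updatedb.conf || grep PRUNEPATHS /etc/updatedb.conf | grep -q /var/lib/ceph",
}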

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/318833

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Alex Schultz (alex-schultz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/318833
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63
Submitter: Jenkins
Branch: master

commit 8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63
Author: Alex Schultz <email address hidden>
Date: Thu May 19 12:01:32 2016 -0600

    Increase updatedb timeout

    The current value of 60 seconds can be exceeded if the host that's being
    deployed on is overloaded. We have seen (in CI) that the combined time
    between fact loading and the exec can exceed 60 seconds if the system is
    overloaded on IO or CPU. This change increases the timeout slightly to
    account for possible hardward load issues and the execution of this task.

    Change-Id: I713bedb1a3b1d1aafb5c66f92542e4a57d659e44
    Closes-Bug: #1580615
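
For context, the timeout of a Fuel deployment task is set in the task metadata shipped with fuel-library. A minimal sketch of the kind of change the commit describes is below; the file layout, groups, manifest path, and the new timeout value shown are assumptions, not the exact committed diff.

# Hypothetical tasks.yaml entry for the updatedb task; fields marked as
# assumptions and the new value are illustrative only.
- id: updatedb
  type: puppet
  groups: [primary-controller, controller, ceph-osd]          # assumption
  parameters:
    puppet_manifest: osnailyfacter/modular/ceph/updatedb.pp   # assumption
    puppet_modules: /etc/puppet/modules
    timeout: 180   # previously 60; the exact new value is illustrative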

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/320029

Revision history for this message
Andrey Maximov (maximov) wrote :

@Alex, I think that just increasing the timeout won't fix the problem, since CI tests should have time constraints. If the issue is CI host overbooking, let's move this bug to the infra team.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

If this happened more often with other tasks, I would say yes, we should move it to infra. The issue here is that this task runs specifically on the OSDs and controllers, so any increase in IO, or a run on a slower system, could trigger this failure. It makes sense for this one to get a slight increase; 60 seconds is low.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/320029
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=25550bf5c9ba3ea974de12e4f97c563d43e6b8c5
Submitter: Jenkins
Branch: stable/mitaka

commit 25550bf5c9ba3ea974de12e4f97c563d43e6b8c5
Author: Alex Schultz <email address hidden>
Date: Thu May 19 12:01:32 2016 -0600

    Increase updatedb timeout

    The current value of 60 seconds can be exceeded if the host that's being
    deployed on is overloaded. We have seen (in CI) that the combined time
    between fact loading and the exec can exceed 60 seconds if the system is
    overloaded on IO or CPU. This change increases the timeout slightly to
    account for possible hardward load issues and the execution of this task.

    Change-Id: I713bedb1a3b1d1aafb5c66f92542e4a57d659e44
    Closes-Bug: #1580615
    (cherry picked from commit 8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63)

tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 9.0, build 445.
The issue is fixed.

ISO details:
[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 445
cat /etc/fuel_build_number:
 445
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6347.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8723.noarch
 fuel-bootstrap-cli-9.0.0-1.mos284.noarch
 fuel-migrate-9.0.0-1.mos8430.noarch
 rubygem-astute-9.0.0-1.mos748.noarch
 network-checker-9.0.0-1.mos73.x86_64
 fuel-mirror-9.0.0-1.mos137.noarch
 fuel-openstack-metadata-9.0.0-1.mos8723.noarch
 fuel-notify-9.0.0-1.mos8430.noarch
 nailgun-mcagents-9.0.0-1.mos748.noarch
 python-fuelclient-9.0.0-1.mos319.noarch
 fuelmenu-9.0.0-1.mos271.noarch
 fuel-9.0.0-1.mos6347.noarch
 fuel-utils-9.0.0-1.mos8430.noarch
 fuel-setup-9.0.0-1.mos6347.noarch
 fuel-misc-9.0.0-1.mos8430.noarch
 fuel-library9.0-9.0.0-1.mos8430.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-agent-9.0.0-1.mos284.noarch
 fuel-ui-9.0.0-1.mos2714.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-packetary-9.0.0-1.mos137.noarch
 fuel-nailgun-9.0.0-1.mos8723.noarch

tags: removed: on-verification