Deploying ISO triggers an error: Failed tasks: Task[updatedb/3] Stopping the deployment process!

Bug #1580615 reported by Sergii Turivnyi
Affects              Status         Importance  Assigned to    Milestone
Fuel for OpenStack   Fix Committed  High        Alex Schultz
Mitaka               Fix Released   High        Alex Schultz
Newton               Fix Committed  High        Alex Schultz

Bug Description

Detailed bug description:
 Deploying the ISO fails with the error: Failed tasks: Task[updatedb/3] Stopping the deployment process!
 ISO: #278 http://paste.openstack.org/show/496730/

Steps to reproduce:
  Deploy ISO with
    Storage Backends only Ceph RBD for volumes (Cinder) - On
    Nova quotas - On
    OpenStack debug logging - On
    Additional component Murano - On
    3 controllers
    2 compute + ceph

Expected results:
  Deployment finishes successfully.

Actual result:
  Deployment fails with an error.

Reproducibility:
 --

Workaround:
 --

Impact:
 --

Description of the environment:
  VNC   NODE-NAME
  ----- ---------
  5902  admin
  5904  slave-01
  5905  slave-02
  5906  slave-03
  5907  slave-04
  5908  slave-05

Additional information:
  See attachment
  Snapshot: https://drive.google.com/open?id=0B8hkiEm94sEtNmtzbGNEY2pfekU

Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :
Peter Zhurba (pzhurba)
tags: added: area-devops
tags: removed: area-puppet
Changed in fuel:
assignee: nobody → Fuel DevOps (fuel-devops)
status: New → Confirmed
assignee: Fuel DevOps (fuel-devops) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-build
tags: added: area-library
removed: area-build
description: updated
tags: added: murano
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This looks like it is covered by https://review.openstack.org/#/c/310153/9

Passing to Oleksiy Molchanov

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Actually, it was successful on all nodes:

2016-05-11T13:11:38.195896+00:00 notice: (Scope(Class[Osnailyfacter::Ceph::Updatedb])) MODULAR: ceph/updatedb.pp
2016-05-11T13:11:38.256940+00:00 notice: (Scope(Class[Osnailyfacter::Ceph::Updatedb])) MODULAR: ceph/updatedb.pp
2016-05-11T13:11:39.932700+00:00 notice: Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11T13:11:39.932886+00:00 notice: Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11T13:11:56.169040+00:00 notice: (/Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns) executed successfully
2016-05-11T13:11:56.169199+00:00 notice: (/Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns) executed successfully

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-python
removed: area-library
tags: added: area-murano
removed: murano
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

updatedb on node-3 failed with the message: "Process not running but not empty lockfile is present. Trying to remove lockfile...ok."

This looks like it's related to the astute puppet runner, which recently added lockfile support. Moving to area-python.

tags: added: module-astute
removed: area-devops area-murano
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

The undeleted puppet lockfile seems to be a separate problem, not the cause of this one. Investigating this one requires the puppet logs from the third node. Is there an environment to revert, or something similar?

Changed in fuel:
status: In Progress → Incomplete
Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

Unfortunately, we do not have the environment.

Changed in fuel:
status: Incomplete → In Progress
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

The updatedb task actually completed successfully on the node, but astute timed out waiting for it.

Here is astute:
2016-05-11 13:10:55 INFO [2703] Task[updatedb/3]: Run on node: Node[3]
.....
2016-05-11 13:10:55 DEBUG [2703] Waiting for puppet to finish deployment on node 3 (timeout = 60 sec)...

Then puppet:
2016-05-11 13:11:37 +0000 Scope(Class[Osnailyfacter::Ceph::Updatedb]) (notice): MODULAR: ceph/updatedb.pp
2016-05-11 13:11:39 +0000 Puppet (notice): Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11 13:11:56 +0000 /Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns (notice): executed successfully

Then again astute:
2016-05-11 13:11:55 DEBUG [2703] Node[3]: Node 3: task updatedb, task status running
2016-05-11 13:11:55 WARNING [2703] Puppet agent 3 didn't respond within the allotted time
2016-05-11 13:11:55 DEBUG [2703] Task time summary: updatedb with status failed on node 3 took 00:01:00

For now, the question is: what was puppet doing for about 40 seconds before starting the task?

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The bug is not a blocker. Reducing priority to High according to our criteria: https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Triaging_the_bugs_in_the_Fuel_deployment_process

Changed in fuel:
importance: Critical → High
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

On nodes 1 and 2 there is also a ~20-second delay between puppet starting and the moment it begins doing its intended work.

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Fuel Library (Deprecated) (fuel-library)
assignee: Fuel Library (Deprecated) (fuel-library) → Georgy Kibardin (gkibardin)
Revision history for this message
Alex Schultz (alex-schultz) wrote :

A few things here: it appears the host that the task failed on was heavily loaded. The logs show that astute kicked off the puppet task at 13:10:55:

2016-05-11 13:10:55 INFO [2703] Task[updatedb/3]: Run on node: Node[3]

But the first puppet log line for the task did not appear until 13:11:37:

2016-05-11 13:11:37 +0000 Scope(Class[Osnailyfacter::Ceph::Updatedb]) (notice): MODULAR: ceph/updatedb.pp
2016-05-11 13:11:39 +0000 Puppet (notice): Compiled catalog for node-3.test.domain.local in environment production in 2.97 seconds
2016-05-11 13:11:56 +0000 /Stage[main]/Osnailyfacter::Ceph::Updatedb/Exec[Ensure /var/lib/ceph in the updatedb PRUNEPATH]/returns (notice): executed successfully
2016-05-11 13:11:59 +0000 Puppet (notice): Finished catalog run in 8.53 seconds

This means it took ~42 seconds for fact loading before puppet actually started running the manifest.

Additionally, if you look at the task that was being run, it is just a single exec. The exec consists of a sed, guarded by an unless that performs a file-existence test and a grep. The catalog compilation took ~3 seconds, and the execution of the simple sed/grep took over 10 seconds. Unfortunately, the combined slowness of this entire process exceeded the 60-second task timeout.

So the only thing we can really do to address this issue is increase the task timeout to allow for heavily loaded environments. This should not be a problem in an actual production deployment, since the hosts being deployed on should not be so loaded, but in CI it can be an issue.
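
For illustration, here is a minimal Puppet sketch of an exec along those lines: a sed guarded by an unless that does a file-existence test plus a grep. The resource title matches the log lines above, but the exact command, regex, and paths are assumptions and may differ from the real ceph/updatedb.pp in fuel-library.

# Hypothetical reconstruction of the single exec in ceph/updatedb.pp;
# the command and guard details are assumptions, not the committed code.
exec { 'Ensure /var/lib/ceph in the updatedb PRUNEPATH':
  path    => ['/bin', '/usr/bin'],
  # Append /var/lib/ceph inside the quoted PRUNEPATHS value of updatedb.conf
  command => "sed -i -Ee 's|(PRUNEPATHS *= *\"[^\"]*)|\\1 /var/lib/ceph|' /etc/updatedb.conf",
  # Skip when the config file is absent or the path is already pruned
  unless  => "test ! -f /etc/updatedb.conf || grep PRUNEPATHS /etc/updatedb.conf | grep -q /var/lib/ceph",
}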

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/318833

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Alex Schultz (alex-schultz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/318833
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63
Submitter: Jenkins
Branch: master

commit 8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63
Author: Alex Schultz <email address hidden>
Date: Thu May 19 12:01:32 2016 -0600

    Increase updatedb timeout

    The current value of 60 seconds can be exceeded if the host that's being
    deployed on is overloaded. We have seen (in CI) that the combined time
    between fact loading and the exec can exceed 60 seconds if the system is
    overloaded on IO or CPU. This change increases the timeout slightly to
    account for possible hardward load issues and the execution of this task.

    Change-Id: I713bedb1a3b1d1aafb5c66f92542e4a57d659e44
    Closes-Bug: #1580615
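
For context, the timeout of a Fuel deployment task is set in the task metadata shipped with fuel-library. A minimal sketch of the kind of change the commit describes is below; the file layout, groups, manifest path, and the new timeout value shown are assumptions, not the exact committed diff.

# Hypothetical tasks.yaml entry for the updatedb task; fields marked as
# assumptions and the new value are illustrative only.
- id: updatedb
  type: puppet
  groups: [primary-controller, controller, ceph-osd]          # assumption
  parameters:
    puppet_manifest: osnailyfacter/modular/ceph/updatedb.pp   # assumption
    puppet_modules: /etc/puppet/modules
    timeout: 180   # previously 60; the exact new value is illustrative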

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/320029

Revision history for this message
Andrey Maximov (maximov) wrote :

@Alex, I think that just increasing the timeout won't fix the problem, since CI tests should have time constraints. If the issue is CI host overbooking, let's move this bug to the infra team.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

If this happened more often with other tasks, I would say yes, we should move it to infra. The issue here is that this task runs specifically on the OSDs and controllers, so any increase in IO, or a run on a slower system, could trigger this failure. It makes sense for this one to get a slight increase; 60 seconds is low.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/320029
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=25550bf5c9ba3ea974de12e4f97c563d43e6b8c5
Submitter: Jenkins
Branch: stable/mitaka

commit 25550bf5c9ba3ea974de12e4f97c563d43e6b8c5
Author: Alex Schultz <email address hidden>
Date: Thu May 19 12:01:32 2016 -0600

    Increase updatedb timeout

    The current value of 60 seconds can be exceeded if the host that's being
    deployed on is overloaded. We have seen (in CI) that the combined time
    between fact loading and the exec can exceed 60 seconds if the system is
    overloaded on IO or CPU. This change increases the timeout slightly to
    account for possible hardward load issues and the execution of this task.

    Change-Id: I713bedb1a3b1d1aafb5c66f92542e4a57d659e44
    Closes-Bug: #1580615
    (cherry picked from commit 8bab7e6e0bd5a40e899fe50b8fa1ee5df5d87f63)

tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 9.0, build 445.
The issue is fixed.

ISO details:
[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 445
cat /etc/fuel_build_number:
 445
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6347.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8723.noarch
 fuel-bootstrap-cli-9.0.0-1.mos284.noarch
 fuel-migrate-9.0.0-1.mos8430.noarch
 rubygem-astute-9.0.0-1.mos748.noarch
 network-checker-9.0.0-1.mos73.x86_64
 fuel-mirror-9.0.0-1.mos137.noarch
 fuel-openstack-metadata-9.0.0-1.mos8723.noarch
 fuel-notify-9.0.0-1.mos8430.noarch
 nailgun-mcagents-9.0.0-1.mos748.noarch
 python-fuelclient-9.0.0-1.mos319.noarch
 fuelmenu-9.0.0-1.mos271.noarch
 fuel-9.0.0-1.mos6347.noarch
 fuel-utils-9.0.0-1.mos8430.noarch
 fuel-setup-9.0.0-1.mos6347.noarch
 fuel-misc-9.0.0-1.mos8430.noarch
 fuel-library9.0-9.0.0-1.mos8430.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-agent-9.0.0-1.mos284.noarch
 fuel-ui-9.0.0-1.mos2714.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-packetary-9.0.0-1.mos137.noarch
 fuel-nailgun-9.0.0-1.mos8723.noarch

tags: removed: on-verification