Nailgun doesn't respond during a minute after starting deploy of 200 nodes

Bug #1569859 reported by Leontiy Istomin
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Fuel Toolbox
8.0.x
Won't Fix
High
Sergii Rizvan
Mitaka
Fix Released
High
Fuel Toolbox

Bug Description

Detailed bug description:
After clicking "deploy changes" nailgun service doesn't respond during a minute.
from UI we can see "Nailgun is Unavailable" message (screen is attached). After a minute we car reach UI and can see that deplyment started.
from /var/log/nginx/error_services.log http://paste.openstack.org/show/493927/
atop at the time: http://paste.openstack.org/show/493926/
Steps to reproduce:
1. Configure env with 200 nodes and click "deploy changes"
Expected results:
UI shows that deployment started
Actual result:
UI shows error, but after a minute we can reach UI and see that deplotment is actually started
Reproducibility:
reproduced each time
Workaround:
Hasn't been found yet
Impact:
UI looks bad
Description of the environment:
 Operation system: Ubuntu
 Versions of components: mos9.0 and fuel agent from mos10.0
 Reference architecture: 3 controllers, 20 computes+ceph, 178 computes
 Network model: vxlan+Neutron L2 population+DVR
 Related projects installed: no
Additional information:
fuel-9.0-188

Log folder from fuel node is here: http://mos-scale-share.mirantis.com/reconnect_web_fuel_logs.tar.gz

Revision history for this message
Leontiy Istomin (listomin) wrote :
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 9.0
importance: Undecided → High
status: New → Confirmed
tags: added: area-python
description: updated
description: updated
tags: added: team-enhancements
Revision history for this message
Bulat Gaifullin (bulat.gaifullin) wrote :

the deployment graph serialisation is doing in parallel thread, but request waits until this job will be done.
the request should not wait completion of deployment graph serialisation.

Ilya Kutukov (ikutukov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Ilya Kutukov (ikutukov)
Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/307824

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Leontiy Istomin (listomin) wrote :

Tried this fix, but no changes:
1. deploy fuel 9.0-
2. Add fuel-agent from master due https://bugs.launchpad.net/fuel/+bug/1543233, regen ubuntu repo, built and activate bootstrap image
3. Apply this https://review.openstack.org/#/c/307361/1/nailgun/nailgun/db/sqlalchemy/models/task.py due https://bugs.launchpad.net/fuel/+bug/1570509
4. Apply https://review.openstack.org/#/c/305812/2/deployment/puppet/l23network/manifests/l2/bond.pp due https://bugs.launchpad.net/fuel/+bug/1566974
5. service nailgun restart
6. boot nodes and deploy env: 3 controller, 20 computes+Ceph, 178 computes
7. nailgun was unreachable during few minutes:
"Nailgun is unavailable" error in UI
[root@fuel ~]# fuel node
504 Server Error: Gateway Time-out for url: http://10.20.0.2:8000/api/v1/version (<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
)
atop at the time: http://paste.openstack.org/show/494670/

log folder from Fuel node: http://mos-scale-share.mirantis.com/fuel_logs_2016_19_15_52.tar.gz

Revision history for this message
Ilya Kutukov (ikutukov) wrote :
Download full text (114.2 KiB)

Here is what's going on when Nailgun stuck in app.log:

[pid: 28496|app: 0|req: 763/8891] 10.20.0.20 () {38 vars in 540 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 10 bytes in 151 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)
[pid: 28492|app: 0|req: 926/8892] 10.20.0.105 () {38 vars in 541 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 11 bytes in 147 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)
2016-04-19 15:36:47.367 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (df:30)
[pid: 28494|app: 0|req: 1008/8893] 10.20.0.141 () {38 vars in 541 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 11 bytes in 151 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)
2016-04-19 15:36:47.461 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (64:d4)
[pid: 28490|app: 0|req: 1569/8894] 10.20.0.57 () {38 vars in 540 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 10 bytes in 247 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)
2016-04-19 15:36:47.502 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (64:cc)
[pid: 28495|app: 0|req: 1897/8895] 10.20.0.56 () {38 vars in 540 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 10 bytes in 152 msecs (HTTP/1.1 200) 4 headers in 185 bytes (4 switches on core 0)
2016-04-19 15:36:47.758 INFO [7f830c314880] (task) The cluster transaction is initiated.
2016-04-19 15:36:47.758 INFO [7f830c314880] (task) cluster serialization is started.
2016-04-19 15:36:47.952 INFO [7f830c314880] (manager) Assigning IP for node '94' in network 'private'
2016-04-19 15:36:47.961 INFO [7f830c314880] (manager) Assigning IP for node '94' in network 'management'
2016-04-19 15:36:47.967 INFO [7f830c314880] (manager) Assigning IP for node '94' in network 'storage'
2016-04-19 15:36:47.977 INFO [7f830c314880] (manager) Assigning IP for node '180' in network 'private'
2016-04-19 15:36:47.982 INFO [7f830c314880] (manager) Assigning IP for node '180' in network 'management'
2016-04-19 15:36:47.988 INFO [7f830c314880] (manager) Assigning IP for node '180' in network 'storage'
2016-04-19 15:36:47.995 INFO [7f830c314880] (manager) Assigning IP for node '17' in network 'private'
2016-04-19 15:36:48.001 INFO [7f830c314880] (manager) Assigning IP for node '17' in network 'management'
2016-04-19 15:36:48.006 INFO [7f830c314880] (manager) Assigning IP for node '17' in network 'storage'
2016-04-19 15:36:48.014 INFO [7f830c314880] (manager) Assigning IP for node '27' in network 'private'
2016-04-19 15:36:48.020 INFO [7f830c314880] (manager) Assigning IP for node '27' in network 'management'
2016-04-19 15:36:48.025 INFO [7f830c314880] (manager) Assigning IP for node '27' in network 'storage'
2016-04-19 15:36:48.032 INFO [7f830c314880] (manager) Assigning IP for node '29' in network 'private'
2016-04-19 15:36:48.038 INFO [7f830c314880] (manager) Assigning IP for node '29' in network 'management'
2016-04-19 15:36:48.046 INFO [7f830c314880] (manager) Assigning IP for node '29' in network 'storage'
2016-04-19 15:36:48.057 INFO [7f83...

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

and some tail

2016-04-19 15:44:21.451 INFO [7f830c314880] (task) tasks serialization is finished.
2016-04-19 15:44:21.451 INFO [7f830c314880] (task) tasks history saving is started.
2016-04-19 15:44:23.902 INFO [7f830c314880] (task) tasks history saving is finished.
2016-04-19 15:44:28.492 WARNING [7f830c314880] (manager) Skip VIP 'baremetal' which is mapped to non-existing network 'baremetal'
2016-04-19 15:44:31.908 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (e0:d8)
2016-04-19 15:44:31.943 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (67:30)
[pid: 28492|app: 0|req: 927/8896] 10.20.0.156 () {38 vars in 541 bytes} [Tue Apr 19 15:36:47 2016] PUT /api/nodes/agent/ => generated 11 bytes in 464096 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)
2016-04-19 15:44:31.959 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (62:8c)
2016-04-19 15:44:31.980 INFO [7f830c314880] (manager) Interfaces removed from node Untitled (74:20)

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

The current reason of the problem is that uwsgidecorators package is absent in the environment and Nailgun has falled back to the synchronous tasks execution.

Revision history for this message
Leontiy Istomin (listomin) wrote :

The issue has been reproduced the following way:
1. Install Fuel 9.0-244
2. add fuel-agent into ubuntu repo on fuel and rebuild bootstrap image due https://bugs.launchpad.net/fuel/+bug/1543233
3. Patch nailgun https://review.openstack.org/gitweb?p=openstack/fuel-web.git;a=patch;h=d2ec3c1e49e87958b71eca9f65dc1674f444f907 due https://bugs.launchpad.net/fuel/+bug/1570509
4. apply changes https://review.openstack.org/gitweb?p=openstack/fuel-web.git;a=patch;h=67d97c00632601a2d5d4dc8100906cc1543bd3f4
(manager.py is attached. there was manual changing of the file)
5. install uwsgi uwsgi-plugin-python python-uwsgidecorator:
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
rpm -Uvh epel-release-7*.rpm
sed -i s/^enabled=1/enabled=0/g /etc/yum.repos.d/epel.repo
yum --enablerepo=epel install uwsgi uwsgi-plugin-python python-uwsgidecorator
6. restart nailgun and receiverd

After configuring and clicking "deploy changes" nailgun was unreachable during ~10 minutes

log folder from fuel http://mos-scale-share.mirantis.com/fuel_logs_1569859_25-04-2016.tar.gz

Revision history for this message
Leontiy Istomin (listomin) wrote :

ps aux output

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

My patch has fixed async task run but it seems that locking occurs (besides) in other place (db lock, resources concurrency lock e.t.c.)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/314701

Dmitry Pyzhov (dpyzhov)
tags: added: release-notes
Revision history for this message
Ilya Kutukov (ikutukov) wrote :

Action steps:
* Apply GDB on local test stand (5-10 nodes deployment)
* Make stacktrace sampler script for scale lab using uWSGI Python Tracebacker http://uwsgi-docs.readthedocs.io/en/latest/Tracebacker.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/mitaka)

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/314701
Reason: This review was created only for debug purpose

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

ETA 06/06/2016

Changed in fuel:
assignee: Ilya Kutukov (ikutukov) → Bulat Gaifullin (bgaifullin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/307824
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=2096a25b23aa0f79a9b46fd8695f696c9610c10c
Submitter: Jenkins
Branch: master

commit 2096a25b23aa0f79a9b46fd8695f696c9610c10c
Author: Ilya Kutukov <email address hidden>
Date: Tue Apr 19 16:40:03 2016 +0300

    Deployment and provision Nailgun tasks is now run asynchrously

    Deployment and provisioning preparations could last a long when Fuel is
    working with a hundreds of nodes. But operator should be able to
    observe cluster state and operate other clusters.

    Change-Id: I73802c91f93a46855b006cb05a8b5722109e9e6a
    Partial-Bug: #1569859

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/mitaka)

Reviewed: https://review.openstack.org/314701
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=eb829f1bec9920268338e8c3a76cdeb5d7a1a636
Submitter: Jenkins
Branch: stable/mitaka

commit eb829f1bec9920268338e8c3a76cdeb5d7a1a636
Author: Ilya Kutukov <email address hidden>
Date: Tue Apr 19 16:40:03 2016 +0300

    Deployment and provision Nailgun tasks is now run asynchrously

    Deployment and provisioning preparations could last a long when Fuel is
    working with a hundreds of nodes. But operator should be able to
    observe cluster state and operate other clusters.

    Change-Id: I73802c91f93a46855b006cb05a8b5722109e9e6a
    Partial-Bug: #1569859
    (cherry picked from commit 2096a25b23aa0f79a9b46fd8695f696c9610c10c)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/325877

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

Patch is commited
Packaging bugs:

master https://bugs.launchpad.net/fuel/+bug/1572997 (commited)
stable/mitaka https://bugs.launchpad.net/fuel/+bug/1572998 (released)
8.0 https://bugs.launchpad.net/fuel/+bug/1589488 (new)

Scaling team should verify this fix.

Ilya Kutukov (ikutukov)
Changed in fuel:
assignee: Bulat Gaifullin (bgaifullin) → Ilya Kutukov (ikutukov)
status: In Progress → Fix Committed
Revision history for this message
Leontiy Istomin (listomin) wrote :

Hasn't been reproduced with Fuel 9.0-mos-460

Revision history for this message
Sergii Rizvan (srizvan) wrote :

Won't fix for 8.0 because it's pretty risky to install uwsgidecorators dependancy in stable releases.

tags: added: wonfix-riskyteam-enhancements
removed: team-enhancements
tags: added: team-enhancements wonfix-risky
removed: wonfix-riskyteam-enhancements
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/8.0)

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/325877
Reason: Abandon for 8.0 because it's pretty risky to install uwsgidecorators dependancy in stable releases.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to mos/mos-docs (master)

Related fix proposed to branch: master
Change author: Evgeny Konstantinov <email address hidden>
Review: https://review.fuel-infra.org/22324

Revision history for this message
Bulat Gaifullin (bulat.gaifullin) wrote :

The package 'python-uwsgidecorators' should be installed on master to for proper fix this bug.
because this package provides functional which needs to work asynchronous task execution.
but automatic installation of this package was added only in master branch and it does not back ported to mitaka.
the workaround is manually install this package by 'yum install python-uwsgidecorators'.

Dmitry Pyzhov (dpyzhov)
tags: added: 9.1-proposed
Revision history for this message
Ilya Kutukov (ikutukov) wrote :

The package 'python-uwsgidecorators' should be installed on master to for proper fix this bug.

Please, reverify and re-open if required:
https://bugs.launchpad.net/fuel/+bug/1572998
https://bugs.launchpad.net/fuel/+bug/1572997

Changed in fuel:
assignee: Ilya Kutukov (ikutukov) → Fuel Toolbox (fuel-toolbox)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

verified against fuel-nailgun-8816. uwsgidecorators package is in place and added to nailgun dependencies

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/382378

tags: added: release-notes-done
removed: release-notes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/377791
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=1380bc8ecfc0a61ae81f677078b68e96844a5dfa
Submitter: Jenkins
Branch: master

commit 1380bc8ecfc0a61ae81f677078b68e96844a5dfa
Author: Bulat Gaifullin <email address hidden>
Date: Tue Sep 27 18:14:27 2016 +0300

    Do not use yaql input data conversion

    When root is set to context manually it will not
    be converted into immutable object and task serialization
    will do faster because it does not spent time to this conversion

    Related-Bug: 1569859
    Change-Id: I46e4c43a757b78d96eec3edea0f66361afae3d73

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (stable/mitaka)

Reviewed: https://review.openstack.org/382378
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=945996e4f8d3ef15ccaa9fff4f7a159f603327ac
Submitter: Jenkins
Branch: stable/mitaka

commit 945996e4f8d3ef15ccaa9fff4f7a159f603327ac
Author: Bulat Gaifullin <email address hidden>
Date: Tue Sep 27 18:14:27 2016 +0300

    Do not use yaql input data conversion

    When root is set to context manually it will not
    be converted into immutable object and task serialization
    will do faster because it does not spent time to this conversion

    Related-Bug: 1569859
    Change-Id: I46e4c43a757b78d96eec3edea0f66361afae3d73

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.