"Build timed out" in gate-horizon-selenium

Bug #1317630 reported by Gary W. Smith
This bug affects 5 people
Affects: OpenStack Dashboard (Horizon)
Status: Fix Released
Importance: High
Assigned to: Kieran Spear
Milestone: 2014.2

Bug Description

The gate-horizon-selenium job is timing out. End portion of the log:

2014-05-07 21:00:18.331 | nosetests horizon --nocapture --nologcapture --exclude-dir=horizon/conf/ --exclude-dir=horizon/test/customization --cover-package=horizon --cover-inclusive --all-modules --exclude-dir=openstack_dashboard/test/integration_tests --verbosity=1
2014-05-07 21:00:18.332 | Creating test database for alias 'default'...
2014-05-07 21:00:18.332 | Destroying test database for alias 'default'...
2014-05-07 21:00:18.337 | Running openstack_dashboard tests
2014-05-07 21:00:20.820 | WARNING:root:No local_settings file found.
2014-05-07 21:00:21.043 |
2014-05-07 21:27:42.235 | SSS (abbreviated...) Build timed out (after 30 minutes). Marking the build as failed.
2014-05-07 21:27:44.160 | /tmp/hudson2041926875103250367.sh: line 2: 2092 Terminated /usr/local/jenkins/slave_scripts/run-selenium.sh openstack horizon
2014-05-07 21:27:44.255 | [SCP] Copying console log.
2014-05-07 21:27:45.107 | [SCP] Trying to create /srv/static/logs/03/92703/1/check/gate-horizon-selenium
2014-05-07 21:27:45.120 | [SCP] Trying to create /srv/static/logs/03/92703/1/check/gate-horizon-selenium/dd448f1
2014-05-07 21:27:45.140 | Finished: FAILURE

See http://logs.openstack.org/03/92703/1/check/gate-horizon-selenium/dd448f1/console.html for full log

Revision history for this message
Julie Pichon (jpichon) wrote :

I've seen this in several reviews as well.

Changed in horizon:
status: New → Confirmed
Julie Pichon (jpichon)
Changed in horizon:
importance: Undecided → High
milestone: none → juno-1
Changed in horizon:
assignee: nobody → Gary W. Smith (gary-w-smith)
Revision history for this message
Gary W. Smith (gary-w-smith) wrote :

Summary of findings:

The gate-horizon-selenium job starts Xvfb on display :99, then uses selenium and firefox to run 2 horizon selenium tests, then 7 dashboard tests (after skipping 117 unit tests). During one of the tests, selenium suddenly hangs, apparently on the get(), and the job times out after 30 minutes.
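
For context, these tests drive a live Django test server through the WebDriver API, so the call that appears to hang is a page load of roughly the following shape (a simplified illustration with placeholder URL and setup, not the actual horizon test code):

    # Illustrative only: the shape of the selenium call that hangs.
    from selenium import webdriver

    driver = webdriver.Firefox()            # runs against the Xvfb display (:99)
    driver.get("http://localhost:8000/")    # page load that occasionally never returns
    driver.quit()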

When the job hangs, it is not always on the same test; sometimes it hangs on the first horizon test, sometimes the last dashboard test, and sometimes one in between.

Many executions of this job have succeeded, so it is sporadic.

I found the following instances of failures of this job:
- http://logs.openstack.org/03/92703/1/check/gate-horizon-selenium/dd448f1
- http://logs.openstack.org/05/91905/1/check/gate-horizon-selenium/be82d92
- http://logs.openstack.org/06/92006/1/check/gate-horizon-selenium/80bd2ba
- http://logs.openstack.org/16/90716/4/check/gate-horizon-selenium/f898b0a
- http://logs.openstack.org/42/93142/1/check/gate-horizon-selenium/682963d
- http://logs.openstack.org/58/92958/1/check/gate-horizon-selenium/d4a9170
- http://logs.openstack.org/58/92958/3/check/gate-horizon-selenium/a4761bd

The errors have occurred on a number of different build systems (jenkins01, 03, 05, and 07), so the problem is not isolated to a particular build system.

The previous jobs on any particular slave have succeeded, so it does not appear to be due to some leftover resource.

There have been no cases of multiple instances of this job running on the same system at the same time, so it does not appear to be a resource contention issue.

I have not been able to reproduce the hanging behavior locally. I have attempted (unsuccessfully) to induce it by:
- Running multiple Xvfb instances
- Killing Xvfb before the tests run
- Killing Xvfb while the tests are running
- Using an invalid URL (by modifying the source)
but in all cases the tests fail immediately without hanging.

Google searches show that others have had issues with python+firefox+selenium hanging sporadically on get(), but none with the exact combination of versions that we use. This is the most likely cause of the problem, but without a reproducible test case it is difficult to verify that this is indeed the cause, or that upgrading versions would fix it.

Changed in horizon:
assignee: Gary W. Smith (gary-w-smith) → nobody
Revision history for this message
Julie Pichon (jpichon) wrote :

Patch proposed to add the bug's fingerprint to elastic recheck - https://review.openstack.org/#/c/93840/

Revision history for this message
Julie Pichon (jpichon) wrote :

Gary, thanks a lot for looking into this and posting what you discovered so far, this is very helpful.

Looking at the logstash results it looks like the failures only happened when running on HP cloud nodes, though I'm failing to discern any other pattern so far.

Kieran Spear (kspear)
Changed in horizon:
assignee: nobody → Kieran Spear (kspear)
Revision history for this message
Kieran Spear (kspear) wrote :

I have a fix for this that eliminates the hangs entirely, but it involves overriding certain internal bits in the Firefox WebDriver. Even with that fix, there's still another random failure in the qunit tests in 10% of runs.

What I'm going to propose instead is to switch to use the PhantomJS webdriver. This seems to run without the same flakiness we see in the other drivers that require an X display.
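
For reference, in the selenium Python bindings the driver choice is a one-line swap, which is why no changes to the existing tests would be needed. A minimal sketch, not the actual horizon wiring:

    from selenium import webdriver

    # Both drivers expose the same WebDriver API, so the tests stay the
    # same; only the construction differs.
    # driver = webdriver.Firefox()    # needs an X display (Xvfb in the gate)
    driver = webdriver.PhantomJS()    # headless, no X display required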

Revision history for this message
Julie Pichon (jpichon) wrote :

How were you able to reproduce the hangs? Would the switch to that driver require the tests to be rewritten in Javascript?

Revision history for this message
Kieran Spear (kspear) wrote :

I brought up an XS instance on hpcloud and ran the tests there. I can reproduce this locally as well, but it's somewhat less frequent. Basically the problem is a race condition that leaves hanging firefox processes (created with the "firefox -silent" command, which the WebDriver uses to reset the Firefox profile). If I kill those processes manually, the tests continue and succeed.

I applied a variation of the fix used in SST and the hangs stopped:
http://code.ohloh.net/file?fid=OruYKyN-ihGMgE1hoxb7Lkg1cqI&cid=bSNDWAKLkUw&s=&fp=299267&mp&projSelected=true#L108
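
The gist of that workaround is to subclass selenium's Firefox binary wrapper so the preliminary "firefox -silent" profile-initialisation launch never happens. A rough sketch against the selenium 2.x Python bindings; the underscore-prefixed attributes are selenium internals (assumed here) and may differ between versions:

    import platform
    from subprocess import Popen, STDOUT

    from selenium.webdriver.firefox import firefox_binary

    class PatchedFirefoxBinary(firefox_binary.FirefoxBinary):
        # Skip the stock "-silent" pre-launch that can leave a hung
        # firefox process behind, and start the real browser directly.
        # _firefox_env, _start_cmd and _log_file are selenium 2.x
        # internals and are assumptions in this sketch.
        def _start_from_profile_path(self, path):
            self._firefox_env["XRE_PROFILE_PATH"] = path
            if platform.system().lower() == 'linux':
                self._modify_link_library_path()
            self.process = Popen(
                [self._start_cmd, "-foreground"],
                stdout=self._log_file, stderr=STDOUT,
                env=self._firefox_env)

The patched binary can then be handed to the Firefox driver through its firefox_binary constructor argument.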

Selenium has a WebDriver for PhantomJS, so it plugs into the initial driver creation we have - no changes to our existing tests necessary. I think I spoke too soon regarding the flakiness though… I see qunit failures with the PhantomJS driver as well. I've increased the socket timeouts and wait times and am currently running the tests 100 times to see if that has helped.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to horizon (master)

Fix proposed to branch: master
Review: https://review.openstack.org/95392

Changed in horizon:
status: Confirmed → In Progress
Revision history for this message
Kieran Spear (kspear) wrote :

So I've managed to fix the failures in qunit/jasmine when using the Firefox driver, just waiting for CI to prove me wrong...

The same fixes don't fix the (same?) failures in the PhantomJS runs, so I've abandoned that work for now. It might be worth revisiting it in the future, as PhantomJS is about 2x faster.

If this fixes things I'll have to bug dklyle about a hpcloud credit. :-)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to horizon (master)

Reviewed: https://review.openstack.org/95392
Committed: https://git.openstack.org/cgit/openstack/horizon/commit/?id=1cea1cc9a098eabb5b79f7db4338db032c16fad8
Submitter: Jenkins
Branch: master

commit 1cea1cc9a098eabb5b79f7db4338db032c16fad8
Author: Kieran Spear <email address hidden>
Date: Mon May 26 09:24:50 2014 +1000

    Make selenium tests more robust

    The selenium tests are often hanging in the gate due to rogue
    'firefox -silent' processes. Borrow a fix from SST that overrides
    the default Firefox WebDriver so it doesn't spawn the processes in the
    first place.

    There's also a ~10% timeout failure in the qunit/jasmine tests.
    Make sure the default socket timeout is set early enough so
    the initial selenium/testserver setup doesn't occasionally fail on the
    first test of a test case. This gets it down to around 5% failure rate.

    Also wrap each test case in a sleep() sandwich to give services time to
    settle before the tests start. This isn't ideal as it makes the tests
    take longer, but with this change I finally get no false negatives at all
    over 100+ runs.

    Change-Id: I24963f575fd6eeb81dbff883404c7cff45e97397
    Closes-bug: 1317630
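
The two timing-related changes described in the commit message translate to roughly this pattern (an illustrative sketch with placeholder names and values, not the merged horizon code):

    import socket
    import time

    from django import test

    # Set the default socket timeout as early as possible (module import
    # time) so the first selenium/test-server exchange of a test case
    # does not fail. The value 60 is a placeholder.
    socket.setdefaulttimeout(60)

    class ExampleSeleniumTests(test.LiveServerTestCase):
        # Placeholder test case showing the sleep() "sandwich" that gives
        # the services time to settle around each test.
        def setUp(self):
            super(ExampleSeleniumTests, self).setUp()
            time.sleep(1)

        def tearDown(self):
            time.sleep(1)
            super(ExampleSeleniumTests, self).tearDown()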

Changed in horizon:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in horizon:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in horizon:
milestone: juno-1 → 2014.2