The lng CI loop cannot alter the yaml timeout for the whole job

Bug #1238685 reported by Mike Holmes
This bug affects 1 person
Affects: LAVA Dispatcher
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

When generating a CI-spawned LAVA job via post-build-lava.py, the single-test timeout can be altered and is set correctly for cyclictest -> 'cyclictest': 90000,
However, the total job timeout is hardwired to 18000:
"
  config = json.dumps({'timeout': 18000,
"

It should probably track the length of the longest test + some slack for LAVA to finish uploading to the dashboard etc.

This appears to be the reason that when spawned from CI cyclictest never completes.
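
A minimal sketch of the proposal above, assuming a hypothetical dict of per-test timeouts like the 'cyclictest': 90000 entry. The names TEST_TIMEOUTS, UPLOAD_SLACK and job_timeout are illustrative and not taken from post-build-lava.py, and the 600 second slack is an arbitrary placeholder:

```python
import json

# Hypothetical per-test timeouts in seconds; only the cyclictest value
# comes from this report.
TEST_TIMEOUTS = {'cyclictest': 90000}

# Assumed allowance for LAVA to finish uploading results (placeholder value).
UPLOAD_SLACK = 600


def job_timeout(test_timeouts, slack=UPLOAD_SLACK, floor=18000):
    """Return the longest per-test timeout plus slack, never below floor."""
    longest = max(test_timeouts.values(), default=0)
    return max(floor, longest + slack)


# Build the top-level config with a derived value instead of a constant.
config = json.dumps({'timeout': job_timeout(TEST_TIMEOUTS)})
print(config)
```

The floor keeps the current 18000 value for jobs whose tests are all short, so only long-running tests such as cyclictest would change the result.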

Milo Casagrande (milo)
affects: linaro-ci-dashboard → linaro-ci
Fathi Boudra (fboudra)
Changed in linaro-ci:
assignee: nobody → Fathi Boudra (fboudra)
milestone: none → 2013.10
importance: Undecided → High
status: New → Triaged
Fathi Boudra (fboudra) wrote :

jtreg tests are running longer than 5 hours and we don't observe the issue that you describe.
Though, it has the same timeout of 18000s (5h):
http://validation.linaro.org/scheduler/job/78821

Looking at a cyclictest job:
http://validation.linaro.org/scheduler/job/77654
http://validation.linaro.org/scheduler/job/77654/log_file#L_31_1

Lava failed at action lava_test_shell with error:None

Changed in linaro-ci:
assignee: Fathi Boudra (fboudra) → nobody
Mike Holmes (mike-holmes) wrote : Re: [Bug 1238685] Re: The lng CI loop cannot alter the yaml timeout for the whole job

http://validation.linaro.org/scheduler/job/74859/log_file also has the
same "error none", which Tyler thought indicated a timeout.

<LAVA_DISPATCHER>2013-09-28 12:55:15 AM WARNING: [ACTION-E]
lava_test_shell is finished with error (None).

So I will open this as a LAVA bug, if we think the time out is ok.

Fathi Boudra (fboudra) wrote :

Let's get feedback from the LAVA team first. We can re-assign the bug as needed
once we have a clear picture of the root cause.

Neil Williams (codehelp) wrote :

The timeout value is action-specific; there is no timeout for the entire job. Each action uses the default timeout unless it has a specific timeout supplied.

If every action has a timeout, then a singlenode job will not use the "default" timeout at all.

This issue is particularly important for MultiNode, so there is an explicit section in the documentation for this issue. See http://validation.linaro.org/static/docs/lava-dispatcher/multinode.html#lava-multi-node-timeout-behaviour

It is particularly important to not make the default timeout very long as this causes unexpected delays. If there is one action which is known to take a long time, that individual action needs a specific timeout.

There is generally no reason to have a default timeout larger than 900 seconds. All actions which need longer than that should set their own timeout in the JSON.
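
The rule described above can be sketched as follows. This is an illustration only, not the dispatcher's code: it assumes a job is a dict with a top-level 'timeout' and an 'actions' list whose entries may carry a 'timeout' under 'parameters'. The helper name effective_timeouts and the deploy_linaro_image entry are hypothetical.

```python
def effective_timeouts(job):
    """Pair each action's command with the timeout it would run with."""
    # Assumed fallback of 900 seconds when the job sets no default.
    default = job.get('timeout', 900)
    result = []
    for action in job.get('actions', []):
        params = action.get('parameters', {})
        # An action-specific timeout wins; otherwise the default applies.
        result.append((action['command'], params.get('timeout', default)))
    return result


job = {
    'timeout': 900,
    'actions': [
        {'command': 'deploy_linaro_image'},
        {'command': 'lava_test_shell', 'parameters': {'timeout': 90000}},
    ],
}

for command, timeout in effective_timeouts(job):
    print(command, timeout)
```

With this layout, only the long-running action waits for a long time; every other action still fails fast on the short default.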

Paul Sokolovsky (pfalcon) wrote :

AFAIK, a timeout specified globally, outside any action definition, is the timeout for internal LAVA operations, like waiting for a response from the board, possibly booting, etc. So 18000 is a ridiculously high timeout for that. Most people don't have issues with it, because it never happens to them. But it did for me when trying to do a gcc build in LAVA (using cbuild): after 5 hours of work, the board could go thermal and hang, then sit waiting for the timeout for another 5 hours before being rebooted. So, for gcc builds I used a global timeout of 900 and a compile timeout of 72000: http://validation.linaro.org/scheduler/job/62867/definition . Yes, when a board didn't go thermal, the build went fine with such settings (the linked job shows a run of close to 7 hours).

Neil Williams (codehelp) wrote :

There is also no need to have boot_linaro_image in the JSON when you are using lava_test_shell. The logs show that the device is being rebooted unnecessarily.
http://validation.linaro.org/scheduler/job/77654/log_file#L_16_0
http://validation.linaro.org/scheduler/job/77654/log_file#L_28_3
http://validation.linaro.org/scheduler/job/77654/log_file#L_29_406

Compare with:
https://staging.validation.linaro.org/scheduler/job/1353/log_file
and
http://validation.linaro.org/scheduler/job/73432/log_file
http://validation.linaro.org/scheduler/job/73432/definition

boot_linaro_image is fine if all you want to do is see if the supplied kernel/image/hwpack actually boots. When you want to run tests after booting the test image/kernel/hwpack, deploy_* and lava_test_shell are all you need.
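
One way a job generator could apply this advice is sketched below. It assumes the job's actions are a list of dicts with a 'command' key, as in the JSON snippets in this report; the function name drop_redundant_boot is hypothetical and not part of any existing script.

```python
def drop_redundant_boot(actions):
    """Remove boot_linaro_image entries when lava_test_shell is present."""
    commands = [a.get('command') for a in actions]
    if 'lava_test_shell' not in commands:
        # Boot-only jobs keep their explicit boot action.
        return list(actions)
    return [a for a in actions if a.get('command') != 'boot_linaro_image']


actions = [
    {'command': 'deploy_linaro_image'},
    {'command': 'boot_linaro_image'},
    {'command': 'lava_test_shell'},
]

print([a['command'] for a in drop_redundant_boot(actions)])
```

Jobs that only check whether an image boots are returned unchanged, matching the boot-only case described above.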

Fathi Boudra (fboudra)
affects: linaro-ci → lava-dispatcher
Changed in lava-dispatcher:
milestone: 2013.10 → none
Mike Holmes (mike-holmes) wrote :

The ciadmin jobs add:
        {
            "command": "boot_linaro_image"
        },

Should we change that?
