Autopilot Log Analyser

_wait_all_computers gives up too soon

Bug #1661378 reported by Francis Ginther on 2017-02-02

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Autopilot Log Analyser	Fix Committed	Undecided	Francis Ginther
	Landscape Server	Fix Released	Medium	Chad Smith	Landscape Server 17.02

Bug Description

This was seen in CI: https://ci.lscape.net/job/landscape-system-tests/5175
Landscape server: 17.01~bzr10865+jenkins3423-2

This is very similar to lp:1660000 in which a slowly progress rabbitmq-server charm prevents landscape-client from making forward progress. But in this case, rabbitmqctl was not impacted by the known bug and the charm was progressing as fast as possible, it just has a lot to do.

Charm hooks are executing in serial, so as long as rabbitmq-server is busy processing hooks, the landscape-client suboridinate hooks don't get a chance to run and it doesn't make it far enough to register the client until it's too late.

Here's what happened:

09:48:30 first retry of _wait_all_computers (job-handler.log)
09:49:28 start amqp-relation-changed hooks (unit-rabbitmq-server-0.log)
10:39:36 amqp-relation-changed hooks finish (unit-rabbitmq-server-0.log)
10:39:39 landscape-client/21's install hook starts running (unit-landscape-client-21.log)
10:40:00 another amqp-relation-changed hook runs (unit-rabbitmq-server-0.log)
10:48:02 Last retry of _wait_all_computers (job-handler.log)
10:49:02 amqp-relation-changed hook finishes (unit-rabbitmq-server-0.log)
10:49:05 landscape-client/21 runs container-relation-joined (unit-landscape-client-21.log)
10:49:06 landscape-client/21 registers computer (unit-landscape-client-21.log)

The last retry of _wait_all_computers occurred just over a minute before the client was actually registered.

Logs from the rabbitmq-server/0 unit and the job-handler are attached.

Revision history for this message

Francis Ginther (fginther) wrote on 2017-02-02:

landscape-wait-gives-up-too-soon.tar.gz Edit (376.4 KiB, application/x-tar)

Francis Ginther (fginther) on 2017-02-07

information type:	Proprietary → Public
Changed in autopilot-log-analyser:
assignee:	nobody → Francis Ginther (fginther)
status:	New → In Progress

Revision history for this message

Adam Collard (adam-collard) wrote on 2017-02-08:

It waits an hour, how long are you suggesting it waits for?

Revision history for this message

Francis Ginther (fginther) wrote on 2017-02-08:

We modified the sequence of events in https://bugs.launchpad.net/landscape/+bug/1654399, maybe we should have bumped up this retry time too.

Do we have an upper bound on the autopilot deployment time? Would doubling the _wait_all_computers retry window cause us to hit that?

There is now a second example of this:
https://ci.lscape.net/job/landscape-system-tests/5215/

In 5215, the "Wait for computers to register" activity is:
    {
        "children": [],
        "schedule_after_time": null,
        "region_id": 1,
        "creator": {
            "id": 1,
            "name": "John Doe",
            "email": "<email address hidden>"
        },
        "result_text": "1 machines not yet registered",
        "activity_status": "failed",
        "creation_time": "2017-02-07T19:53:50Z",
        "schedule_before_time": null,
        "summary": "Wait for computers to register",
        "completion_time": "2017-02-07T22:06:13Z",
        "modification_time": "2017-02-07T22:06:13Z",
        "parent_id": 1,
        "type": "WaitForRegisteredComputersRequest",
        "id": 250,
        "result_code": null
    },

[from landscape-0-inner-logs/rabbitmq-server-0/var/log/juju/unit-landscape-client-20.log]
2017-02-07 22:06:26 INFO juju-log container:93: In container-relation-joined for landscape-client/20
2017-02-07 22:06:26 INFO juju-log container:93: Updating client configuration for landscape-client/20
2017-02-07 22:06:26 INFO juju-log container:93: landscape-client/20: landscape-client is not configured to run as per sysvconfig
2017-02-07 22:06:27 INFO container-relation-joined Restarting landscape-client (via systemctl): landscape-client.service.
2017-02-07 22:06:27 INFO juju-log container:93: Registering landscape-client/20

The missing client registered just seconds after the activity failed. I have not studied successful tests to determine if we are generally running close to the retry cap or if these are just outliers.

🤖 Landscape Builder (landscape-builder) on 2017-02-10

Changed in autopilot-log-analyser:
status:	In Progress → Fix Committed

Chad Smith (chad.smith) on 2017-02-10

Changed in landscape:
milestone:	17.01 → 17.02

David Britton (dpb) on 2017-02-22

Changed in landscape:
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → Chad Smith (chad.smith)

🤖 Landscape Builder (landscape-builder) on 2017-02-22

Changed in landscape:
status:	In Progress → Fix Committed

Andreas Hasenack (ahasenack) on 2017-04-06

Changed in landscape:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

landscape-wait-gives-up-too-soon.tar.gz Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.