Bug #1655783 “2.1beta4: LXDs getting stuck pending in cloud-init...” : Bugs : Autopilot Log Analyser

Revision history for this message

Francis Ginther (fginther) wrote on 2017-01-11:

#1

ps-fauxww.txt Edit (67.2 KiB, text/plain)

Francis Ginther (fginther) on 2017-01-12

information type:

Proprietary → Public

Revision history for this message

Simon Poirier (simpoir) wrote on 2017-01-12:

#2

cloud-init.log Edit (38.3 KiB, text/plain)

I've hit this a few times also. In all cases it was only the first LXD container to spawn (first time-wise not necessarily container number). All containers spawning seconds later started ok.

It looks like containers are launched before the network is fully setup.
I'm attaching cloud-init logs for the last failure I've hit.

Revision history for this message

Simon Poirier (simpoir) wrote on 2017-01-12:

#3

cloud-init-output.log Edit (845.1 KiB, text/plain)

Revision history for this message

Simon Poirier (simpoir) wrote on 2017-01-12: Re: LXDs getting stuck pending in cloud-init during Landscape Autopilot deployment

#4

host lxd log. Edit (6.0 KiB, text/plain)

summary:

- LXCs getting stuck in cloud-init during Landscape Autopilot deployment
+ LXDs getting stuck pending in cloud-init during Landscape Autopilot
+ deployment

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-13:

#5

Just had another case of this: http://ci.lscape.net/job/landscape-system-tests/4996/

The first container on a new machine is stuck, the rest started just fine:
...
  '2':
    containers:
      2/lxd/0:
        instance-id: juju-bbf56e-2-lxd-0
        juju-status: {current: pending, since: '13 Jan 2017 03:01:33Z'}
        machine-status: {current: running, message: Container started, since: '13
            Jan 2017 03:03:05Z'}
        series: xenial
      2/lxd/1:
        dns-name: 10.96.109.192
        instance-id: juju-bbf56e-2-lxd-1
        ip-addresses: [10.96.109.192]
        juju-status: {current: started, since: '13 Jan 2017 03:05:07Z', version: 2.1-beta4}
        machine-status: {current: running, message: Container started, since: '13
            Jan 2017 03:03:48Z'}
        series: xenial
...

I have seen this many months ago in juju, probably even juju1. If I find a reference to that, I'll attach it here.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-13:

#6

I don't have logs from the host because of bug https://bugs.launchpad.net/juju/+bug/1646329. Our log collector script failed because it uses juju ssh/scp and juju gave it the wrong ip to connect to.

Anastasia (anastasia-macmood) on 2017-01-16

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.2.0

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-01-17:

#7

Interestingly, cloud-init-output.log, attached in comment # 3, is reporting these failures:

Something wicked happened resolving '10.5.0.10:3142' (-9 - Address family for hostname not supported)
Err:12 http://security.ubuntu.com/ubuntu xenial-security/main amd64 libnspr4 amd64 2:4.12-0ubuntu0.16.04.1

Andreas Hasenack (ahasenack) on 2017-01-18

Changed in landscape:
milestone:	none → 17.01

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#8

@anastasia-macmood, this is because the container didn't get networking setup. For some reason, this container has no static IP entry in /etc/network/interfaces or interfaces.d/*. I'm triaging another failed run now.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#9

We have another fresh run where this happened:
http://ci.lscape.net/job/landscape-system-tests/5040/

I'm going to add some log files, and to not make this confusing, all logs related to this run (5040) will have that number in their filenames.

First, here is an interesting lxc list output on the hendel node, showing only one container without an IP. Note also it's not the first one:

http://pastebin.ubuntu.com/23823808/

(width trimmed for better readability)
+---------------------+---------+--------------------+
| NAME | STATE | IPV4 |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-0 | RUNNING | 10.96.24.66 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-1 | RUNNING | |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-2 | RUNNING | 10.96.24.58 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-3 | RUNNING | 10.96.24.53 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-4 | RUNNING | 10.96.24.68 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-5 | RUNNING | 10.96.24.63 (eth0) |
+---------------------+---------+--------------------+

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#10

5040-maas-syslog Edit (4.0 MiB, text/plain)

I think this 1-lxd-1 container became node gweyff in MAAS:

Jan 18 12:49:01 atlas maas.api: [info] juju-0f41e5-1-lxd-1: Added new device (parent: hendel)
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "GET /MAAS/api/2.0/version/ HTTP/1.1" 20
0 180 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6611]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "POST /MAAS/api/2.0/devices/?op= HTTP/1.
1" 200 506 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"

And I can only see these two API calls made from juju to MAAS about it:
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"
Jan 18 12:49:02 atlas sh[6611]: 2017-01-18 12:49:02 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:02 +0000] "GET /MAAS/api/2.0/devices/?id=gweyff HTTP/1.1" 200 519 "-" "Go-http-client/1.1"

What is missing, and it exists for the other containers, is a op=link_subnet POST call, like for this other container called qesn6m for example:
Jan 18 12:49:08 atlas sh[6616]: 2017-01-18 12:49:08 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:08 +0000] "PUT /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/ HTTP/1.1" 200 243 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "POST /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/?op=link_subnet HTTP/1.1" 200 561 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "GET /MAAS/api/2.0/devices/?id=qesn6m HTTP/1.1" 200 860 "-" "Go-http-client/1.1"

I don't know why juju wouldn't make this link_subnet call. Maybe it didn't like the response to the first PUT call?

I'm attaching this syslog log.

I think this 1-lxd-1 container became node gweyff in MAAS:

Jan 18 12:49:01 atlas maas.api: [info] juju-0f41e5-1-lxd-1: Added new device (parent: hendel)
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "GET /MAAS/api/2.0/version/ HTTP/1.1" 20
0 180 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6611]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "POST /MAAS/api/2.0/devices/?op= HTTP/1.
1" 200 506 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"

And I can only see these two API calls made from juju to MAAS about it:
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"
Jan 18 12:49:02 atlas sh[6611]: 2017-01-18 12:49:02 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:02 +0000] "GET /MAAS/api/2.0/devices/?id=gweyff HTTP/1.1" 200 519 "-" "Go-http-client/1.1"

What is missing, and it exists for the other containers, is a op=link_subnet POST call, like for this other container called qesn6m for example:
Jan 18 12:49:08 atlas sh[6616]: 2017-01-18 12:49:08 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:08 +0000] "PUT /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/ HTTP/1.1" 200 243 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "POST /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/?op=link_subnet HTTP/1.1" 200 561 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "GET /MAAS/api/2.0/devices/?id=qesn6m HTTP/1.1" 200 860 "-" "Go-http-client/1.1"

I don't know why juju wouldn't make this link_subnet call. Maybe it didn't like the response to the first PUT call?

I'm attaching this syslog log.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#11

5040-1-lxd-1.tar.gz Edit (676.3 KiB, application/x-tar)

/etc/* and /var/log/* from inside of the stuck container 1/lxd/1

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#12

5040-host-of-1-lxd-1.tar.bz2 Edit (332.3 KiB, application/x-tar)

/etc/* and /var/log/* from the host (hendel) of the stuck 1/lxd/1 container. Note there is a base-machine-4/var/log/ps-fauxww.txt file with the process listing generated at log collection time.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#13

5040-controller.tar.bz2 Edit (10.4 MiB, application/x-tar)

/etc/* and /var/log/* from the controller node. This also has a ps file in the same location with the process listing.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#14

5040-maas-syslog Edit (4.0 MiB, text/plain)

/var/log/syslog on the MAAS node, with just the entries that cover this failed #5040 deployment. Note this is a shared maas server and it will have log entries about other things happening to this node, not just this failed deployment.

I also have regiond.log and maas.log, but I thought this syslog bit was better because it has all messages intertwined.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-01-18:

#15

5040-juju-status.txt Edit (52.0 KiB, text/plain)

juju status, for completeness.

Chad Smith (chad.smith) on 2017-01-18

summary:

- LXDs getting stuck pending in cloud-init during Landscape Autopilot
- deployment
+ 2.1beta4: LXDs getting stuck pending in cloud-init during Landscape
+ Autopilot deployment

Chad Smith (chad.smith) on 2017-01-20

Changed in landscape:
status:	New → Confirmed

Francis Ginther (fginther) on 2017-01-20

Changed in landscape:
importance:	Undecided → High
assignee:	nobody → Francis Ginther (fginther)

Francis Ginther (fginther) on 2017-01-20

Changed in autopilot-log-analyser:
status:	New → In Progress
importance:	Undecided → High

🤖 Landscape Builder (landscape-builder) on 2017-01-20

Changed in autopilot-log-analyser:
status:	In Progress → Fix Committed

Chris Gregan (cgregan) on 2017-01-23

tags:

added: cdo-qa-blocker

John A Meinel (jameinel) on 2017-01-24

Changed in juju:
milestone:	2.2.0 → 2.1.0

Revision history for this message

John A Meinel (jameinel) wrote on 2017-01-31:

#16

Juju <=2.1b4 all have an issue where if there is an error we'll end up setting it in a place (agent status) that ends up getting overridden with "agent is not running" message. I have a patch in the 2.1 branch already that at least moves the error message into machine-status instead of agent status. These aren't in tabular status by default, but you can do "juju status --format=yaml".

Revision history for this message

John A Meinel (jameinel) wrote on 2017-01-31:

#17

"I don't know why juju wouldn't make this link_subnet call. Maybe it didn't like the response to the first PUT call?" is likely that Juju ran into some other error before it got to that point, which leaves us trying to infer it from the log files.

Is it possible to do a run at "DEBUG" level instead of WARNING? I see this:
2017-01-18 12:48:12 DEBUG juju.worker.logger logger.go:50 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"

And unfortunately, the only warnings we get are:
2017-01-18 12:48:59 WARNING juju.provisioner broker.go:97 incomplete DNS config found, discovering host's DNS config

Which are indications that it is trying to provision the instances (it would report it 1 time for each container), but clearly whatever actual failure is going on isn't being done at WARNING level.

or if full DEBUG is too verbose, we could probably do:
logging-config="<root>=WARNING;unit=DEBUG;juju.provisioner=DEBUG"

would at least help.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2017-02-02:

#18

This bug isn't happening frequently enough that we can have all our deployments done with that extra debugging. The extra debugging increases load, makes the deployment take longer and also makes mongo use up more memory. I was hoping that code inspection around the area that makes the link_subnet call to MAAS could yield some results.

Chad Smith (chad.smith) on 2017-02-10

Changed in landscape:
milestone:	17.01 → 17.02

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-02-17:

#19

Could you please re-try with Juju 2.1-rc2?

Changed in juju:
status:	Triaged → Incomplete
milestone:	2.1-rc2 → none

Revision history for this message

Chad Smith (chad.smith) wrote on 2017-03-16:

#20

Have upgraded juju2.1.1 for deployments and haven't seen this in a while. Will reopen if we see this again.

Changed in landscape:
status:	Confirmed → Incomplete

Chad Smith (chad.smith) on 2017-03-16

Changed in landscape:
milestone:	17.02 → 17.03

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-03-17:

#21

Marking this as Fix Committed in 2.1.1 based on comment # 20.

Any further failures in this area should have a separate report with newer logs.

Changed in juju:
status:	Incomplete → Fix Released
milestone:	none → 2.1.1

David Britton (dpb) on 2017-03-28

no longer affects:

landscape

Affects		Status	Importance	Assigned to	Milestone
	Autopilot Log Analyser	Fix Committed	High	Francis Ginther
	Canonical Juju	Fix Released	High	Unassigned	Canonical Juju 2.1.1

Autopilot Log Analyser

2.1beta4: LXDs getting stuck pending in cloud-init during Landscape Autopilot deployment

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches