LXC containers in pending state due to juju-br0 misconfiguration

Bug #1395908 reported by Larry Michel
This bug affects 2 people
Affects     Status        Importance  Assigned to       Milestone
juju-core   Fix Released  High        Dimiter Naydenov
1.20        Fix Released  High        Dimiter Naydenov
1.21        Fix Released  High        Dimiter Naydenov

Bug Description

We're seeing more containers than usual stuck in the pending state since upgrading from 1.20.11 to 1.20.12. This is causing deployment timeouts.

Looking at the logs, it's not clear why the containers are stuck in the pending state, since there are no error messages. However, this has been recreated multiple times, and we saw as many as 5 systems in the same state within a relatively short period of time.

  '1':
    agent-state: started
    agent-version: 1.20.12
    containers:
      1/lxc/0:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-1-lxc-0
        series: trusty
      1/lxc/1:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-1-lxc-1
        series: trusty
    dns-name: skookum.oil
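
A minimal sketch for spotting the affected containers in status output like the excerpt above. It assumes the status has already been loaded into a Python dict (for example from JSON-formatted status output) and that the machines sit under a top-level "machines" key; that key and the helper name pending_containers are assumptions, not something shown in this report.

```python
def pending_containers(status):
    """Return the sorted ids of containers whose agent-state is 'pending'."""
    pending = []
    # "machines" as the top-level key is an assumption about the status layout.
    for machine in status.get("machines", {}).values():
        for container_id, container in machine.get("containers", {}).items():
            if container.get("agent-state") == "pending":
                pending.append(container_id)
    return sorted(pending)


# Sample data mirroring the machine '1' excerpt in this report.
SAMPLE_STATUS = {
    "machines": {
        "1": {
            "agent-state": "started",
            "containers": {
                "1/lxc/0": {"agent-state": "pending", "series": "trusty"},
                "1/lxc/1": {"agent-state": "pending", "series": "trusty"},
            },
        }
    }
}

stuck = pending_containers(SAMPLE_STATUS)
```

Polling a helper like this during a deployment would flag stuck containers before the deployment timeout is reached.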

$ grep "machine-1-lxc-0" juju_debug_log.txt
machine-0: 2014-11-24 17:45:30 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":55,"Type":"Provisioner","Request":"Life","Params":{"Entities":[{"Tag":"machine-1-lxc-0"}]}}
machine-0: 2014-11-24 17:45:30 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":57,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-1-lxc-0"}]}}
machine-0: 2014-11-24 17:45:30 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":58,"Type":"Provisioner","Request":"Status","Params":{"Entities":[{"Tag":"machine-1-lxc-0"}]}}
machine-0: 2014-11-24 17:45:30 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":61,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-1-lxc-0"}]}}
machine-0: 2014-11-24 17:45:30 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":63,"Type":"Provisioner","Request":"SetPasswords","Params":{"Changes":[{"Tag":"machine-1-lxc-0","Password":"NG6pZUlqkgbl4G1mrGISwXsR"}]}}
machine-0: 2014-11-24 17:45:31 INFO juju.state.apiserver.common password.go:98 setting password for "machine-1-lxc-0"
machine-0: 2014-11-24 17:45:31 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":64,"Type":"Provisioner","Request":"ProvisioningInfo","Params":{"Entities":[{"Tag":"machine-1-lxc-0"}]}}
machine-1: 2014-11-24 17:50:05 INFO juju.provisioner.lxc lxc-broker.go:100 started lxc container for machineId: 1/lxc/0, juju-machine-1-lxc-0, arch=amd64
machine-1: 2014-11-24 17:50:05 INFO juju.provisioner provisioner_task.go:482 started machine 1/lxc/0 as instance juju-machine-1-lxc-0 with hardware "arch=amd64", networks [], interfaces []
machine-0: 2014-11-24 17:50:05 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-1 {"RequestId":71,"Type":"Provisioner","Request":"SetInstanceInfo","Params":{"Machines":[{"Tag":"machine-1-lxc-0","InstanceId":"juju-machine-1-lxc-0","Nonce":"machine-1:af1c3a7f-1e79-4d1b-8985-31d24cf2e9e4","Characteristics":{"Arch":"amd64"},"Networks":null,"Interfaces":null}]}}
machine-0: 2014-11-24 17:50:09 DEBUG juju.state.apiserver apiserver.go:157 -> [37] user-admin 19.922566583s {"RequestId":61,"Response":{"Deltas":[["machine","change",{"Id":"1/lxc/0","InstanceId":"juju-machine-1-lxc-0","Status":"pending","StatusInfo":"","StatusData":null,"Life":"alive","Series":"trusty","SupportedContainers":null,"SupportedContainersKnown":false,"HardwareCharacteristics":{"Arch":"amd64"},"Jobs":["JobHostUnits"],"Addresses":[]}]]}} AllWatcher["1"].Next
machine-0: 2014-11-24 18:48:50 DEBUG juju.state.apiserver apiserver.go:157 -> [84] user-admin 36.538273ms {"RequestId":3,"Response":{"EnvironmentName":"maas","Machines":{"0":{"Agent":{"Status":"started","Info":"","Data":{},"Version":"1.20.12","Life":"","Err":null},"AgentState":"started","AgentStateInfo":"","AgentVersion":"1.20.12","Life":"","Err":null,"DNSName":"reading.oil","InstanceId":"/MAAS/api/1.0/nodes/node-f64f188e-ae16-11e3-b194-00163efc5068/","InstanceState":"","Series":"precise","Id":"0","Containers":{},"Hardware":"arch=amd64 cpu-cores=8 mem=32768M tags=hw-ok,oil-slave-4,hardware-hp-proliant-DL320E","Jobs":["JobManageEnviron","JobHostUnits"],"HasVote":true,"WantsVote":true},"1":{"Agent":{"Status":"started","Info":"","Data":{},"Version":"1.20.12","Life":"","Err":null},"AgentState":"started","AgentStateInfo":"","AgentVersion":"1.20.12","Life":"","Err":null,"DNSName":"skookum.oil","InstanceId":"/MAAS/api/1.0/nodes/node-600e169a-ae98-11e3-b194-00163efc5068/","InstanceState":"","Series":"trusty","Id":"1","Containers":{"1/lxc/0":{"Agent":{"Status":"pending","Info":"","Data":{},"Version":"","Life":"","Err":null},"AgentState":"pending","AgentStateInfo":"","AgentVersion":"","Life":"","Err":null,"DNSName":"","InstanceId":"juju-machine-1-lxc-0","InstanceState":"","Series":"trusty","Id":"1/lxc/0","Containers":{},"Hardware":"arch=amd64","Jobs":["JobHostUnits"],"HasVote":false,"WantsVote":false},"1/lxc/1":{"Agent":{"Status":"pending","Info":"","Data":{},"Version":"","Life":"","Err":null},"AgentState":"pending","AgentStateInfo":"","AgentVersion":"","Life":"","Err":null,"DNSName":"","InstanceId":"juju-machine-1-lxc-1","InstanceState":"","Series":"trusty","Id":"1/lxc/1","Containers":{},"Hardware":"arch=amd64","Jobs":["JobHostUnits"],"HasVote":false,"WantsVote":false}},"Hardware":"arch=amd64 cpu-cores=4 mem=32768M tags=hw-ok,oil,hardware-dell-poweredge-R210,transient-error-1","Jobs":["JobHostUnits"],"HasVote":false,"WantsVote":false},"2":{"Agent":{"Status":"started","Inf

Revision history for this message
Larry Michel (lmic) wrote :

Juju log files attached.

Abel Deuring (adeuring)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: lxc
Revision history for this message
Ian Booth (wallyworld) wrote :

This sounds like an error has occurred running cloud-init in the container. Can we please get the cloud-init-output.log files attached to the bug?

Revision history for this message
Larry Michel (lmic) wrote :

Attaching var/log content with the cloud-init* log files included for:

'2':
    agent-state: started
    agent-version: 1.20.12
    containers:
      2/lxc/0:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-2-lxc-0
        series: trusty
      2/lxc/1:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-2-lxc-1
        series: trusty
    dns-name: hayward-34.oil
    hardware: arch=amd64 cpu-cores=8 mem=16384M
    instance-id: /MAAS/api/1.0/nodes/node-a0cc9b34-c4cd-11e3-8102-00163efc5068/
    series: trusty

tags: added: oil
Revision history for this message
Ian Booth (wallyworld) wrote :

The attached logs look like they were taken from machine 2, the host machine for the containers. What we need are the cloud-init output logs from the containers themselves, not the host machine. For the local provider, these are in /var/lib/juju/containers on the host, and I think that's the same for cloud machines. Or you can ssh into the container and get the logs that way.

Revision history for this message
Larry Michel (lmic) wrote :

This is for:

'3':
    agent-state: started
    agent-version: 1.20.12
    containers:
      3/lxc/0:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-3-lxc-0
        series: trusty
      3/lxc/1:
        agent-state: pending
        hardware: arch=amd64
        instance-id: juju-machine-3-lxc-1
        series: trusty
    dns-name: pomeroy.oil
    hardware: arch=amd64 cpu-cores=16 mem=12288M
    instance-id: /MAAS/api/1.0/nodes/node-a66d0b4a-24b4-11e4-8a6a-00163eca07b6/
    series: trusty

I am seeing that the containers did not get an IP address:

ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | 00:16:3e:6f:ba:8e |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
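
A rough sketch of how the table above could be checked automatically. It assumes the five-column "Net device info" layout shown in this log and treats "." as "no value"; the helper name devices_without_address is hypothetical.

```python
def devices_without_address(ci_info):
    """Return devices that are up but have no address in a ci-info table."""
    missing = []
    for line in ci_info.splitlines():
        if "|" not in line:
            continue  # border rows and unrelated log lines
        # Drop the "ci-info: " prefix and the empty field after the last bar.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if len(cells) != 5 or cells[0] == "Device":
            continue  # not a data row, or the header row
        device, up, address = cells[0], cells[1], cells[2]
        if up == "True" and address == ".":
            missing.append(device)
    return missing


# Table copied from the cloud-init output quoted in this report.
SAMPLE_CI_INFO = """\
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | 00:16:3e:6f:ba:8e |
ci-info: +--------+------+-----------+-----------+-------------------+
"""

no_address = devices_without_address(SAMPLE_CI_INFO)
```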

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.22
Revision history for this message
Ian Booth (wallyworld) wrote :

Containers not having an IP address would explain why they remain pending, as the machine agent running in the container can't call back to the state server. I seem to recall there was a MAAS and/or cloud-init bug recently which dealt with this issue, and there may have been an upstream cloud-init fix. But I can't recall the bug details.

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a link to the other bug that may be relevant:

https://bugs.launchpad.net/cloud-init/+bug/1345433

Larry Michel (lmic)
summary: - LXC containers in pending state but no error message
+ LXC containers in pending state due to juju-br0 misconfiguration
Revision history for this message
Larry Michel (lmic) wrote :

On a system that's showing the problem, I see that the wrong NIC is being used for juju-br0. This is why the container's NIC is not getting a DHCP IP. On the host, only eth0 is connected. This is from the config for one of the containers and the /etc/network/interfaces file on the host:

lxc.mount = /var/lib/lxc/juju-machine-1-lxc-0/fstab
lxc.mount.entry = proc proc proc nodev,noexec,nosuid 0 0
lxc.mount.entry = sysfs sys sysfs defaults 0 0
lxc.mount.entry = /sys/fs/fuse/connections sys/fs/fuse/connections none bind,optional 0 0
lxc.mount.entry = /sys/kernel/debug sys/kernel/debug none bind,optional 0 0
lxc.mount.entry = /sys/kernel/security sys/kernel/security none bind,optional 0 0
lxc.mount.entry = /sys/fs/pstore sys/fs/pstore none bind,optional 0 0
lxc.tty = 4
lxc.pts = 1024
lxc.devttydir = lxc
lxc.arch = x86_64
lxc.seccomp = /usr/share/lxc/config/common.seccomp
lxc.cgroup.devices.deny = a
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm
lxc.cgroup.devices.allow = c 254:0 rm
lxc.cgroup.devices.allow = c 10:229 rwm
lxc.cgroup.devices.allow = c 10:200 rwm
lxc.cgroup.devices.allow = c 1:7 rwm
lxc.cgroup.devices.allow = c 10:228 rwm
lxc.cgroup.devices.allow = c 10:232 rwm
lxc.utsname = juju-machine-1-lxc-0
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = juju-br0
lxc.network.hwaddr = 00:16:3e:8d:49:0b
lxc.cap.drop = sys_module
lxc.cap.drop = mac_admin
lxc.cap.drop = mac_override
lxc.cap.drop = sys_time
lxc.hook.clone = /usr/share/lxc/hooks/ubuntu-cloud-prep
lxc.rootfs = /var/lib/lxc/juju-machine-1-lxc-0/rootfs
lxc.pivotdir = lxc_putold
lxc.start.auto = 1
lxc.mount.entry=/var/log/juju var/log/juju none defaults,bind 0 0

auto lo

iface eth3 inet dhcp

iface eth4 inet dhcp

iface eth5 inet dhcp

auto eth0

iface eth0 inet dhcp

iface eth1 inet dhcp

iface eth2 inet manual

auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth2
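
A minimal sketch of checking an interfaces file like the one above for a bridge attached to an unconnected NIC. The set of connected interfaces is supplied by the caller (here just eth0, as stated for this host); the parser only handles the simple stanza layout shown, and both helper names are hypothetical.

```python
def bridge_ports(interfaces_text):
    """Map each bridge interface to the ports listed on its bridge_ports line."""
    bridges = {}
    current = None  # interface whose "iface" stanza we are inside
    for raw in interfaces_text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "iface" and len(words) >= 2:
            current = words[1]
        elif words[0] == "auto":
            current = None  # an "auto" line ends the previous stanza
        elif words[0] == "bridge_ports" and current is not None:
            bridges[current] = words[1:]
    return bridges


def unconnected_ports(bridges, connected):
    """Return, per bridge, the ports that are not in the connected set."""
    result = {}
    for bridge, ports in bridges.items():
        bad = [port for port in ports if port not in connected]
        if bad:
            result[bridge] = bad
    return result


# Abridged from the /etc/network/interfaces file quoted in this report.
SAMPLE_INTERFACES = """\
auto lo

auto eth0
iface eth0 inet dhcp

iface eth2 inet manual

auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth2
"""

bridges = bridge_ports(SAMPLE_INTERFACES)
problems = unconnected_ports(bridges, connected={"eth0"})
```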

Revision history for this message
Larry Michel (lmic) wrote :

I think I have figured out how to recreate this. It happens on systems with 2 different types of card. I think this really needs to be critical. I have a system which does not even have an eth2/eth3 interface, yet it keeps picking eth2 for juju-br0.

$ sudo ifconfig
eth0 Link encap:Ethernet HWaddr d4:ae:52:cb:c0:fa
          inet addr:10.245.0.242 Bcast:10.245.63.255 Mask:255.255.192.0
          inet6 addr: fe80::d6ae:52ff:fecb:c0fa/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:162462 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34639 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:233671235 (233.6 MB) TX bytes:2994444 (2.9 MB)

juju-br0 Link encap:Ethernet HWaddr fe:8b:6f:87:1e:61
          inet6 addr: fe80::786a:19ff:fef7:9a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:274 errors:0 dropped:0 overruns:0 frame:0
          TX packets:134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:77656 (77.6 KB) TX bytes:43740 (43.7 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:198 errors:0 dropped:0 overruns:0 frame:0
          TX packets:198 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9900 (9.9 KB) TX bytes:9900 (9.9 KB)

lxcbr0 Link encap:Ethernet HWaddr 06:6b:1b:b9:20:f6
          inet addr:10.0.3.1 Bcast:10.0.3.255 Mask:255.255.255.0
          inet6 addr: fe80::46b:1bff:feb9:20f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:648 (648.0 B)

veth2L1AU3 Link encap:Ethernet HWaddr fe:8c:29:79:be:03
          inet6 addr: fe80::fc8c:29ff:fe79:be03/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:110 errors:0 dropped:0 overruns:0 frame:0
          TX packets:213 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:32676 (32.6 KB) TX bytes:68126 (68.1 KB)

veth7RKGAH Link encap:Ethernet HWaddr fe:8b:6f:87:1e:61
          inet6 addr: fe80::fc8b:6fff:fe87:1e61/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:131 errors:0 dropped:0 overruns:0 frame:0
          TX packets:202 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:40122 (40.1 KB) TX bytes:62052 (62.0 KB)

virbr0 Link encap:Ethernet HWaddr 9e:c7:cb:65:79:17
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txq...
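
A small sketch of confirming from output like this that the chosen bridge port does not exist on the host. It assumes the classic ifconfig layout, where each interface stanza starts in column 0 and continuation lines are indented; the helper name interface_names is hypothetical, and the sample is abridged from the output above.

```python
def interface_names(ifconfig_output):
    """Return interface names from classic ifconfig output."""
    names = []
    for line in ifconfig_output.splitlines():
        # A stanza header is the only kind of line that is not indented.
        if line and not line[0].isspace():
            names.append(line.split()[0])
    return names


SAMPLE_IFCONFIG = """\
eth0 Link encap:Ethernet HWaddr d4:ae:52:cb:c0:fa
          inet addr:10.245.0.242 Bcast:10.245.63.255 Mask:255.255.192.0

juju-br0 Link encap:Ethernet HWaddr fe:8b:6f:87:1e:61
          inet6 addr: fe80::786a:19ff:fef7:9a/64 Scope:Link

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
"""

names = interface_names(SAMPLE_IFCONFIG)
bridge_port = "eth2"  # the port reported as being picked for juju-br0
port_present = bridge_port in names
```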


Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Please attach the lshw XML output from MAAS for one of the affected machines. It should explain how eth2 was picked for juju-br0.

Revision history for this message
Larry Michel (lmic) wrote :

Here's lshw.xml for 2 of the systems.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I found the issue - we're ignoring the disabled="true" attribute on <network /> elements in the parsed lshw XML dump for a given node. I'm working on a fix, which will be backported to 1.20.14 and 1.21.
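
To illustrate the kind of check described above, here is a minimal Python sketch (not the actual juju-core fix) that skips network entries marked disabled="true" when listing interface names from an lshw XML dump. The sample XML is a simplified, hypothetical stand-in for real lshw output, and the element and attribute layout it uses (node elements with class="network" and a logicalname child) is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified lshw dump; real output has many more fields.
SAMPLE_LSHW = """\
<list>
  <node id="machine" class="system">
    <node id="network:0" class="network" disabled="true">
      <logicalname>eth2</logicalname>
    </node>
    <node id="network:1" class="network">
      <logicalname>eth0</logicalname>
    </node>
  </node>
</list>
"""


def enabled_interfaces(lshw_xml):
    """Return logical names of network nodes not marked disabled="true"."""
    root = ET.fromstring(lshw_xml)
    names = []
    for node in root.iter("node"):
        if node.get("class") != "network":
            continue
        if node.get("disabled") == "true":
            continue  # the attribute that was being ignored
        name = node.findtext("logicalname")
        if name:
            names.append(name)
    return names


candidates = enabled_interfaces(SAMPLE_LSHW)
```

With the disabled check removed, eth2 would come first in document order in this sample, which is the sort of mis-selection reported in this bug.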

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Dimiter Naydenov (dimitern)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Dimiter Naydenov (dimitern) wrote :
JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released