Bundle deploys fail at lxc-start when bridge br-eth1 is created

Bug #1549545 reported by John George
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Andrew McDermott

Bug Description

As seen in
     http://reports.vapour.ws/releases/issue/56ce4986749a5661b7a35aeb

Juju CI MAAS 1.9 provider tests provision to machines that are managed by MAAS over their second NIC.
Native Juju bundle deployment fails to start the LXC template, with the message 'The container failed to start.'
See http://reports.vapour.ws/releases/3666/job/maas-1_9-deployer/attempt/258 for more detail. Note although this link is to a test performed for the maas-spaces2 branch we also see this fail on master.

Manually logging into the machine and capturing lxc-start debug output shows that br-eth0 is not found.
If the machine is reconfigured, so MAAS manages the machine using the first NIC, lxc-start succeeds and the bundle deployment completes.

lxc-start --name juju-trusty-lxc-template --logpriority DEBUG --logfile /tmp/juju-trusty-lxc-template.debug

lxc-start 1456342861.218 INFO lxc_start_ui - lxc_start.c:main:264 - using rcfile /var/lib/lxc/juju-trusty-lxc-template/config
      lxc-start 1456342861.220 WARN lxc_log - log.c:lxc_log_init:316 - lxc_log_init called with log already initialized
      lxc-start 1456342861.221 WARN lxc_cgmanager - cgmanager.c:cgm_get:985 - do_cgm_get exited with error
      lxc-start 1456342861.221 INFO lxc_lsm - lsm/lsm.c:lsm_init:48 - LSM security driver AppArmor
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .reject_force_umount # comment this to allow umount -f; not recommended.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for reject_force_umount action 0
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:do_resolve_add_rule:216 - Setting seccomp rule to reject force umounts

      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for reject_force_umount action 0
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:do_resolve_add_rule:216 - Setting seccomp rule to reject force umounts

      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .[all].
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .kexec_load errno 1.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for kexec_load action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for kexec_load action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .open_by_handle_at errno 1.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for open_by_handle_at action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for open_by_handle_at action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .init_module errno 1.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for init_module action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for init_module action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .finit_module errno 1.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for finit_module action 327681
      lxc-start 1456342861.221 WARN lxc_seccomp - seccomp.c:do_resolve_add_rule:233 - Seccomp: got negative # for syscall: finit_module
      lxc-start 1456342861.221 WARN lxc_seccomp - seccomp.c:do_resolve_add_rule:234 - This syscall will NOT be blacklisted
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for finit_module action 327681
      lxc-start 1456342861.221 WARN lxc_seccomp - seccomp.c:do_resolve_add_rule:233 - Seccomp: got negative # for syscall: finit_module
      lxc-start 1456342861.221 WARN lxc_seccomp - seccomp.c:do_resolve_add_rule:234 - This syscall will NOT be blacklisted
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:324 - processing: .delete_module errno 1.
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:426 - Adding native rule for delete_module action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:429 - Adding compat rule for delete_module action 327681
      lxc-start 1456342861.221 INFO lxc_seccomp - seccomp.c:parse_config_v2:436 - Merging in the compat seccomp ctx into the main one
      lxc-start 1456342861.222 DEBUG lxc_conf - conf.c:lxc_create_tty:3791 - allocated pty '/dev/pts/2' (5/6)
      lxc-start 1456342861.222 DEBUG lxc_conf - conf.c:lxc_create_tty:3791 - allocated pty '/dev/pts/4' (7/8)
      lxc-start 1456342861.222 DEBUG lxc_conf - conf.c:lxc_create_tty:3791 - allocated pty '/dev/pts/5' (9/10)
      lxc-start 1456342861.222 DEBUG lxc_conf - conf.c:lxc_create_tty:3791 - allocated pty '/dev/pts/6' (11/12)
      lxc-start 1456342861.222 INFO lxc_conf - conf.c:lxc_create_tty:3802 - tty's configured
      lxc-start 1456342861.222 DEBUG lxc_start - start.c:setup_signal_fd:263 - sigchild handler set
      lxc-start 1456342861.222 DEBUG lxc_console - console.c:lxc_console_peer_default:500 - opening /dev/tty for console peer
      lxc-start 1456342861.222 DEBUG lxc_console - console.c:lxc_console_peer_default:506 - using '/dev/tty' as console
      lxc-start 1456342861.222 DEBUG lxc_console - console.c:lxc_console_sigwinch_init:179 - 6274 got SIGWINCH fd 17
      lxc-start 1456342861.222 DEBUG lxc_console - console.c:lxc_console_winsz:88 - set winsz dstfd:14 cols:204 rows:57
      lxc-start 1456342861.222 INFO lxc_start - start.c:lxc_init:463 - 'juju-trusty-lxc-template' is initialized
      lxc-start 1456342861.223 DEBUG lxc_start - start.c:__lxc_start:1099 - Not dropping cap_sys_boot or watching utmp
      lxc-start 1456342861.248 ERROR lxc_conf - conf.c:instantiate_veth:3105 - failed to attach 'vethLMYA0B' to the bridge 'br-eth0': No such device
      lxc-start 1456342861.255 ERROR lxc_conf - conf.c:lxc_create_network:3388 - failed to create netdev
      lxc-start 1456342861.255 ERROR lxc_start - start.c:lxc_spawn:856 - failed to create the network
      lxc-start 1456342861.255 ERROR lxc_start - start.c:__lxc_start:1121 - failed to spawn 'juju-trusty-lxc-template'
      lxc-start 1456342861.255 ERROR lxc_start_ui - lxc_start.c:main:341 - The container failed to start.
      lxc-start 1456342861.255 ERROR lxc_start_ui - lxc_start.c:main:345 - Additional information can be obtained by setting the --logfile and --logpriority options.

Just switching the NIC order at the machine is not enough, MAAS 1.9 needs to re-commission the node so it re-associates the device name. The following /sys/class/net entries are from after switching the NICs but before re-commissioning the node; note eth1 is still associated with the device in pci slot 3. After re-commissioning, eth0 is associated with the device at slot 3 and eth1 with the device at slot 7.

root@maas-210-node-21:/sys/class/net# ls -l /sys/class/net
total 0
lrwxrwxrwx 1 root root 0 Feb 24 20:39 br-eth1 -> ../../devices/virtual/net/br-eth1
lrwxrwxrwx 1 root root 0 Feb 24 20:39 eth0 -> ../../devices/pci0000:00/0000:00:07.0/virtio4/net/eth0
lrwxrwxrwx 1 root root 0 Feb 24 20:38 eth1 -> ../../devices/pci0000:00/0000:00:03.0/virtio0/net/eth1
lrwxrwxrwx 1 root root 0 Feb 24 20:38 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx 1 root root 0 Feb 24 20:44 lxcbr0 -> ../../devices/virtual/net/lxcbr0

John George (jog)
Changed in juju-core:
importance: Undecided → High
tags: added: ci deploy maas-provider test-failure
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I think I know why this issue happens.
Since we introduced the multi-bridge script for the MAAS provider, we shouldn't just use the "br-eth0" (instancecfg.DefaultBridgeName - that value should be removed) as a bridge device for all container NICs in container.BridgeNetworkConfig(), but instead use the instancecfg.DefaultBridgePrefix to build correct lxc.network.link setting of lxc.conf.

Changed in juju-core:
status: New → In Progress
assignee: nobody → Dimiter Naydenov (dimitern)
milestone: none → 2.0-beta2
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I've analyzed the was to fix this and couldn't find a good way to fix this without having the machine interfaces and addresses model in state. So I'll put this on hold until we have this model, which will be some time next week.

Changed in juju-core:
status: In Progress → Triaged
assignee: Dimiter Naydenov (dimitern) → nobody
Revision history for this message
Andrew McDermott (frobware) wrote :

If these CI tests will make master fail can we temporarily make these particular CI tests non-voting?

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Revision history for this message
John A Meinel (jameinel) wrote :

It sounds like an option would be to switch MAAS to provision on the first NIC until support for second NIC has been improved. But we will want to make sure not to regress the latter once it's available.

Curtis Hovey (sinzui)
description: updated
Changed in juju-core:
assignee: nobody → Dimiter Naydenov (dimitern)
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Unfortunately, fixing this depends on some ongoing work around handling multi-NIC containers (and multi-bridge network setup now used for them). It will be addressed very soon though. Until then I suggest making this CI job non-voting.

Revision history for this message
Andrew McDermott (frobware) wrote :

Going to investigate reverting the behaviour of bridging all interfaces until we land the full complement of multi-NIC container support into master.

Changed in juju-core:
assignee: Dimiter Naydenov (dimitern) → Andrew McDermott (frobware)
Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Andrew McDermott (frobware) wrote :
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I was able to test the branch above and it resolved the issue in my environment.

I first verified that I could recreate the issue on master using a trusty node with two NICs, and deploying a trusty container.

Deploying with the above branch, I was able to deploy to a trusty container on the trusty node with two NICs and the container and juju agents started properly.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta2 → none
milestone: none → 2.0-beta2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.