node with unconfigured bonded nic can't be reached

Bug #1578333 reported by Francis Ginther
This bug affects 1 person
Affects   Status      Importance   Assigned to   Milestone
MAAS      Invalid     Undecided    Unassigned
1.9       Won't Fix   Undecided    Unassigned

Bug Description

I'm using maas 1.9.2+bzr4568-0ubuntu1 (trusty1) and testing with bonded nics.

I have a VMware node set up with 4 eth NICs, all connected to the same network. I've bonded the first two into bond0, configured the subnet and set it to "Auto assign". The remaining two eth NICs are configured to the same subnet, but their IP remains unconfigured (see working-with-eth-nics.png). I then deploy trusty and, once it's deployed, am able to SSH in to the IP assigned to bond0.

Now, I take the same node and bond eth2 and eth3 together into bond1. This is then configured to the same subnet, but again there is no IP configured. bond0 remains unmodified and is still set to the same subnet and "Auto assign". When I deploy this with trusty, I'm never able to SSH in to the IP assigned to bond0 (see failed-with-bonded-nics.png).

As bond0 has an IP in both cases, I would expect it to be reachable in either of the above two configurations, ignoring the other unconfigured nics.

[dpkg -l '*maas*'|cat]
$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================================-================================-============-===============================================================================
ii maas 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.2+bzr4568-0ubuntu1~trusty1 all MAAS server provisioning libraries

[/var/log/maas/*]
https://private-fileshare.canonical.com/~fginther/maas/maas.tgz

Tags: landscape
Francis Ginther (fginther) wrote :
Andres Rodriguez (andreserl) wrote :

Hi There,

Can you please share with us your /etc/network/interfaces (/e/n/i) and the output of: maas <user> node get-curtin-config <system-id>

Changed in maas:
status: New → Incomplete
milestone: none → 2.0.0
Francis Ginther (fginther) wrote :

/etc/network/interfaces, part 1 of 2.

Francis Ginther (fginther) wrote :

/etc/network/interfaces, part 2 of 2.

Francis Ginther (fginther) wrote :

output of: maas <user> node get-curtin-config <system-id>

David Britton (dpb)
Changed in maas:
status: Incomplete → New
Mike Pontillo (mpontillo) wrote :

We need to narrow this down to either an L2 issue or an L3 issue.

Is the machine running the SSH client on the same L2 network (and L3 subnet) as the deployed node? (if not, can you try it from the same subnet and let me know if it works?) If it works on-subnet but not off-subnet, that means it could be an issue with the reachability of the default gateway. But I noticed the gateway isn't mentioned in the curtin config either; can you verify that the subnet you're deploying to has a default gateway defined?

Once you've double-checked the gateway, if it still doesn't work, it would be helpful to take a step back and try pinging the node, preferably from a host on the same subnet. If you want, SSH to a deployed node on the same subnet and run something like:

tcpdump -s 0 'port not 22' -n -w debug.pcap

Then try pinging the broken node and SSHing to it; debug.pcap should then contain some interesting information, hopefully.
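
As a rough illustration of the on-subnet versus off-subnet triage described above, here is a minimal Python sketch. It is not part of MAAS; the function names and the sample addresses mentioned below (10.0.0.0/24 and friends) are made up for illustration, and only the standard-library `ipaddress` module is used.

```python
import ipaddress


def same_subnet(client_ip, node_ip, subnet_cidr):
    """Return True if both addresses fall inside the given subnet."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return (ipaddress.ip_address(client_ip) in subnet
            and ipaddress.ip_address(node_ip) in subnet)


def gateway_is_usable(gateway_ip, subnet_cidr):
    """Return True if a gateway is defined and lies inside the subnet."""
    if not gateway_ip:
        return False
    return ipaddress.ip_address(gateway_ip) in ipaddress.ip_network(subnet_cidr)


def triage(client_ip, node_ip, subnet_cidr, gateway_ip=None):
    """Suggest which layer to look at first (hypothetical helper)."""
    if same_subnet(client_ip, node_ip, subnet_cidr):
        # No router is involved, so a failure here points at L2
        # (ARP, bonding, vSwitch filtering) rather than routing.
        return "on-subnet: check L2 reachability (ping/ARP) first"
    if not gateway_is_usable(gateway_ip, subnet_cidr):
        return "off-subnet: no usable default gateway on the node's subnet"
    return "off-subnet: gateway defined, retry from a host on the same subnet"
```

For example, `triage("10.0.0.5", "10.0.0.20", "10.0.0.0/24")` reports the on-subnet case, which is the situation where a packet capture on a neighbouring host is most informative.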

David Britton (dpb) wrote : Re: [Bug 1578333] Re: node with unconfigured bonded nic can't be reached

All nics are on the same layer 2.

Mike Pontillo (mpontillo) wrote :

Can you try two things for me:

(1) Configure the bonds in balance-rr mode
(2) Make sure each bond interface is configured with the MAC from one of its parent interfaces

If the bonds "work" in this configuration (in fact, they may only *appear* to work while dropping roughly 50% of the traffic), that's a clue that VMware is configured to drop packets directed to MACs the NIC does not own. This is an optimization I've heard of causing problems in the past when trying to use containers on top of a VMware vSwitch; it could be the same issue here.
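
One way to verify check (2) on a deployed node is sketched below in Python. It assumes the text comes from the Linux bonding driver's status file (for example /proc/net/bonding/bond1), which lists each slave under "Slave Interface:" with a "Permanent HW addr:" line, and that the bond's current MAC is read separately (for example from /sys/class/net/bond1/address). The function names and the MAC addresses in `SAMPLE` are made up for illustration.

```python
# Hypothetical excerpt in the style of /proc/net/bonding/bond1.
SAMPLE = """\
Bonding Mode: load balancing (round-robin)
MII Status: up

Slave Interface: eth2
MII Status: up
Permanent HW addr: 00:0c:29:aa:bb:02

Slave Interface: eth3
MII Status: up
Permanent HW addr: 00:0c:29:aa:bb:03
"""


def slave_permanent_macs(proc_bonding_text):
    """Map each slave interface name to its permanent hardware address."""
    macs = {}
    current = None
    for raw in proc_bonding_text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Permanent HW addr:") and current:
            # Split only on the first colon so the MAC itself stays intact.
            macs[current] = line.split(":", 1)[1].strip().lower()
    return macs


def bond_uses_parent_mac(bond_mac, proc_bonding_text):
    """Return True if the bond's MAC matches one slave's permanent MAC."""
    parents = slave_permanent_macs(proc_bonding_text)
    return bond_mac.strip().lower() in parents.values()
```

If `bond_uses_parent_mac` returns False, the bond is presenting a MAC that none of its underlying NICs own, which is the case a MAC-filtering vSwitch would be expected to drop.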

Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

We believe this is no longer an issue in the latest releases of MAAS. Please upgrade to the latest version of MAAS, and if you believe this issue is still present, please re-open this bug report or file a new one.

Changed in maas:
status: Incomplete → Invalid