Default gateway for LXD containers cannot be influenced/changed

Bug #1781856 reported by Trent Lloyd
This bug affects 9 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Sometimes you may wish to use a different interface as your default gateway; this is not currently possible for LXD containers. In addition, the default gateway for the "default space" is not used: it appears to use the first space in the database or some other arbitrary ordering.

Currently, with the MAAS provider, you can specify a different interface as the default gateway interface using the following command (the option is not available in the Web UI):
$ maas skc interface set-default-gateway wftkqx 914

You specify the machine ID (wftkqx) and the ID of the link that you want to be the default gateway (914). For testing purposes, the following command makes it easy to identify the link IDs for a given machine's interfaces:
maas skc nodes read | jq '.[] | . as $parent | .interface_set[] | "system_id=\($parent.system_id) hostname=\($parent.hostname) if_id=\(.id), if_name=\(.name), if_fabric=\(.vlan.fabric), if_space=\(.vlan.space)"'

Unfortunately, this logic does not extend to LXD containers deployed by juju. juju appears to always use the "first" space as the default gateway; I am not sure what "first" means here, but I am guessing it is based on the order in which spaces appear in the database or similar.

In my setup I have spaces "vsw0" and "vsw1" (in that order in the MAAS interface and in "juju spaces" output). When deploying a container with both spaces on a host for which MAAS has the default gateway set to vsw1, vsw0 always appears to be chosen as the container gateway. This holds even when you set a different "default space":

juju deploy percona-cluster --bind "vsw1 shared-db=vsw0" -n2 --to lxd,lxd

The main reason this causes real problems is that containers currently do not attempt any kind of source routing. So when a container is contacted on one interface, the default gateway for another interface is used to reply. This leads to broken networking in some setups.

This could potentially also be solved by Bug #1737428, which allows for different default routes based on the traffic's source IP. However, without that support it makes sense simply to be able to specify the default route for the container: at the very least by using the default space or, if that is a problem for some reason, through some other option.

Tags: sts
Trent Lloyd (lathiat)
tags: added: sts
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Since you linked the bug I created about multi-homing you might know the workarounds already but I will summarize them just in case.

One way to work around the problem is to use source-based policy routing (as you mentioned) for receiving TCP traffic. For sending traffic, static routes have to be used, as the destination host has to be known to direct traffic to the right outbound hop.

I suspect that using a charm to set policy rules can be a problem if the "first space" is not the one that needs to be used to contact the Juju controller from a machine/unit agent and the controller is not on the same L2 (in a different subnet) - this case could be quite relevant with L3 leaf-spine deployments with Juju HA enabled.

Regardless of how this is applied (cloud-init or charm), the following could be used:

1) With TCP (even when using an unbound listening socket, i.e. 0.0.0.0/INADDR_ANY; see man 2 listen and man 7 ip), you can rely on the fact that a connected socket of your TCP server will use the source address that the client specified as its destination address. When the client creates its own socket (5-tuple) to establish a TCP connection, it does not expect the source address of a response packet to magically change. Unless there is a broken NAT configuration, the receiving host with the TCP server uses received_packet.destination_addr as connected_socket.source_addr.

This allows you to avoid static routes and handle "unknown sender" scenarios correctly for receiving traffic with the following rules:

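CIDR=192.168.1.0/24 # e.g. if you have eth1 <-> 192.168.1.10
# GATEWAY, TABLE and PRIORITY must also be set for your environment (gateway on $CIDR, routing table ID, rule priority)
ip route add default via "$GATEWAY" table "$TABLE" # add a default route to a separate routing table
# add a policy rule that selects the per-interface-subnet table by source address, so replies leave via the interface they arrived on and do not trip rp_filter through asymmetric routing
ip rule add from "$CIDR" table "$TABLE" priority "$PRIORITY"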

The trick is that a request will come to 192.168.1.10 from, say, 1.1.1.1, and 192.168.1.10 will be selected as the response source address. The TCP server's kernel will then inspect the response packet's source address and forward it using $GATEWAY in $TABLE. This might be counter-intuitive, as it is the locally generated response that is subject to the policy rule, not the request.
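When a container has several secondary interfaces, the same pair of commands has to be repeated with a separate table per subnet. The following is a minimal sketch of that repetition, not the charm's actual implementation: the function name policy_route_cmds is hypothetical, the subnets, gateways, table numbers and priorities are made-up example values, and the function only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: print (do not execute) the policy-routing commands for one subnet.
# policy_route_cmds is a hypothetical helper; all values below are examples.
policy_route_cmds() {
    cidr=$1      # subnet of the secondary interface, e.g. 192.168.1.0/24
    gateway=$2   # gateway reachable on that subnet (assumed value)
    table=$3     # routing table number reserved for this subnet (assumed value)
    priority=$4  # priority of the policy rule (assumed value)
    echo "ip route add default via $gateway table $table"
    echo "ip rule add from $cidr table $table priority $priority"
}

# One table per interface subnet.
policy_route_cmds 192.168.1.0/24 192.168.1.1 101 1001
policy_route_cmds 192.168.2.0/24 192.168.2.1 102 1002
```

Printing the commands first makes it possible to review them before applying them as root inside the container.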

A simple charm that could be used for that lives here (it can be improved to avoid hard-coding the interface):
https://git.launchpad.net/~canonical-bootstack/charm-policy-routing/tree/hooks/config-changed
https://jujucharms.com/u/canonical-bootstack/policy-routing

2) For UDP and unbound sockets (INADDR_ANY) the problem is that you only have one receiving (listening) socket and no connected socket. Your UDP server kernel figures out a source address to use during sendto(2) execution (getsockname would get the result). This is nicely summarized here: http://laforge.gnumonks.org/blog/20171020-local_ip_unbound_udp/

Fortunately, most of our workloads are TCP and we do not hit that problem that often. For OpenStack deployments designate-bind might be problematic in case multiple interfaces are used for its container.

3) For sending traffic either static routes or VRF + SO_BINDTODEVICE have to be used as you either have to know exactly how to route to a given end h...


Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1781856] Re: Default gateway for LXD containers cannot be influenced/changed

Changing the default gateway seems very much a bandaid solution, as if you
have >1 interface, sort of by definition you want some traffic to go out
one interface for a reason different to other traffic. And changing the
'default' is going to be wrong for some portion of your traffic.


Revision history for this message
Anastasia (anastasia-macmood) wrote :

Since there is a workaround, described in comment #1, and jameinel does not think that changing the default gateway is a good solution to the problem, I am marking this report as Invalid for juju.

Changed in juju:
status: New → Invalid
Revision history for this message
Trent Lloyd (lathiat) wrote :

Anastasia: There is no workaround for LXD containers. There is a workaround for MAAS machines, but not for the relevant LXD containers. This is very much still an issue for LXD containers, whose networking is controlled entirely by juju, so I think this Invalid state is incorrect.

While there are perhaps "better" solutions (using VNF etc.), it is still very relevant to want to change the default gateway; depending on the environment, it is not always just a bandaid solution.

Revision history for this message
Trent Lloyd (lathiat) wrote :

For clarity of example, we had a customer who required this. The only fix they had was to manually log in to the machines and change the interfaces files after deployment.

Changed in juju:
status: Invalid → New
Revision history for this message
Trent Lloyd (lathiat) wrote :

Obvious fix to me would be to use the default space as the gateway, if it has one, otherwise fall back to finding a space that has a gateway.
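The selection order proposed above can be sketched as follows. This is only an illustration of the rule, not Juju code: the function name pick_gateway and the input format (one "space gateway" pair per line, with "-" meaning the space has no gateway) are assumptions made for the example, and the addresses are made up.

```shell
#!/bin/sh
# Sketch only: choose a gateway using the proposed order.
# Input on stdin (hypothetical format): "<space> <gateway>", "-" = no gateway.
pick_gateway() {
    default_space=$1
    awk -v want="$default_space" '
        # Gateway of the default space, if that space has one.
        $2 != "-" && $1 == want { preferred = $2 }
        # Otherwise remember the first space that has any gateway.
        $2 != "-" && fallback == "" { fallback = $2 }
        END {
            if (preferred != "") print preferred
            else if (fallback != "") print fallback
        }'
}

# The default space vsw1 has a gateway, so it is chosen over vsw0.
printf '%s\n' 'vsw0 10.0.0.1' 'vsw1 10.0.1.1' | pick_gateway vsw1
```

If the default space has no gateway, the same function falls back to the first space that does have one.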

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Changed in juju:
milestone: 2.5.6 → 2.5.8
Changed in juju:
milestone: 2.5.8 → 2.5.9
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing from a milestone as this work will not be done in 2.5 series.

Changed in juju:
milestone: 2.5.9 → none
Revision history for this message
Diko Parvanov (dparv) wrote :

Trent Lloyd (lathiat) wrote on 2018-10-11:
> Obvious fix to me would be to use the default space as the gateway, if it has one, otherwise fall back to finding a space that has a gateway.

+1 for this. I faced the same issue, where an application (designate-bind) has bindings to two MAAS subnets, both defined with default gateways. First, the LXD container ends up with multiple default gateways in the output of ip r l. Second, the active one went through the dns-frontend space binding instead of the default space binding.
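A quick way to spot the "multiple default gateways" symptom is to count the default entries in the routing table. The sketch below assumes the usual `ip route list` text format, where such lines begin with the word "default"; the function name count_default_routes is hypothetical and the sample routes are made-up input, not output from the affected container.

```shell
#!/bin/sh
# Sketch only: count default routes in `ip route list` style output on stdin.
count_default_routes() {
    awk '$1 == "default" { n++ } END { print n + 0 }'
}

# Made-up sample with two default routes; on a real container use:
#   ip route list | count_default_routes
printf '%s\n' \
    'default via 10.0.0.1 dev eth0' \
    'default via 10.0.1.1 dev eth1' \
    '10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5' |
    count_default_routes
```

A result greater than 1 means the container has more than one default route installed.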
