Canonical Juju

Bug #1737428
Activity log

Activity log for bug #1737428

Date	Who	What changed	Old value	New value	Message
2017-12-10 20:09:02	Dmitrii Shcherbakov	bug			added bug
2017-12-10 20:09:14	Dmitrii Shcherbakov	bug task added		maas
2017-12-10 20:13:02	Dmitrii Shcherbakov	bug task added		linux (Ubuntu)
2017-12-10 20:18:35	Dmitrii Shcherbakov	bug			added subscriber John A Meinel
2017-12-10 20:18:44	Dmitrii Shcherbakov	bug			added subscriber Witold Krecicki
2017-12-10 20:19:02	Dmitrii Shcherbakov	bug			added subscriber Ante Karamatić
2017-12-10 20:30:07	Ubuntu Kernel Bot	linux (Ubuntu): status	New	Incomplete
2017-12-10 20:54:35	Dmitrii Shcherbakov	description	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 2 && ip link set dev pub up # ip link set mgmtbr0 master management # ip link set pubbr0 master public # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 2 && ip link set dev pub up # ip link add storacc type vrf table 1 && ip link set dev mgmt up # ip link add storrepl type vrf table 2 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)
2017-12-10 20:57:45	Dmitrii Shcherbakov	description	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 2 && ip link set dev pub up # ip link add storacc type vrf table 1 && ip link set dev mgmt up # ip link add storrepl type vrf table 2 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 2 && ip link set dev pub up # ip link add storacc type vrf table 1 && ip link set dev mgmt up # ip link add storrepl type vrf table 2 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)
2017-12-10 21:01:20	Dmitrii Shcherbakov	description	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 2 && ip link set dev pub up # ip link add storacc type vrf table 1 && ip link set dev mgmt up # ip link add storrepl type vrf table 2 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 10 && ip link set dev pub up # ip link add storacc type vrf table 20 && ip link set dev mgmt up # ip link add storrepl type vrf table 30 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)
2017-12-10 21:02:35	Dmitrii Shcherbakov	description	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 10 && ip link set dev pub up # ip link add storacc type vrf table 20 && ip link set dev mgmt up # ip link add storrepl type vrf table 30 && ip link set dev pub up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 10 && ip link set dev pub up # ip link add storacc type vrf table 20 && ip link set dev storacc up # ip link add storrepl type vrf table 30 && ip link set dev storrepl up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)
2017-12-11 09:47:25	Sandor Zeestraten	bug			added subscriber Sandor Zeestraten
2017-12-11 15:02:53	Andres Rodriguez	maas: status	New	Incomplete
2017-12-11 15:02:56	Andres Rodriguez	maas: importance	Undecided	Wishlist
2017-12-11 15:02:57	Andres Rodriguez	maas: milestone		next
2017-12-11 20:21:01	Joseph Salisbury	tags	cpe-onsite	cpe-onsite kernel-da-key
2017-12-11 20:21:13	Joseph Salisbury	linux (Ubuntu): importance	Undecided	Wishlist
2017-12-11 20:48:56	Dmitrii Shcherbakov	description	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 10 && ip link set dev pub up # ip link add storacc type vrf table 20 && ip link set dev storacc up # ip link add storrepl type vrf table 30 && ip link set dev storrepl up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)	Problem description: * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces, natural to OpenStack regardless of IPv4/IPv6 and IPv6 in general); (see 3.3.4 Local Multihoming https://tools.ietf.org/html/rfc1122#page-60 and 3.3.4.2 Multihoming Requirements) * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough; * multi-homing with hosts located on different L2 networks requires more intelligent routing: - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space; - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds); - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR); - there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space; - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed; * existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host); * using static routes is rigid and requires network planning (i.e. working with network engineers which may have varying degrees of experience, doing VLSM planning etc.); * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure. Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway and make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimum or no modifications to the applications themselves. Goals: * avoid turning individual hosts into routers; * avoid complex static rules; * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure); * reduce operational complexity (custom L3 infrastructure integration for each deployment); * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies); * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons. NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend to read this post to understand suggestions below. How to solve it? What does it mean for Juju to support VRF devices? * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similar to bridges but work differently (on L3, not L2); * the above is per network namespace so it will work equally well in a LXD container; Conceptually: # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf # sysctl -p # # create additional routing tables # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF 1 mgmt 10 pub 20 storacc 30 storrepl EOF # # populate per-routing table default gateways # ip route add mgmt default via 192.168.0.1 # ip route add pub default via 172.16.0.1 # ip route add storacc default via 10.10.4.1 # ip route add storrepl default via 10.10.5.1 # # add and bring up VRF devices # ip link add mgmt type vrf table 1 && ip link set dev mgmt up # ip link add pub type vrf table 10 && ip link set dev pub up # ip link add storacc type vrf table 20 && ip link set dev storacc up # ip link add storrepl type vrf table 30 && ip link set dev storrepl up # # enslave actual devices to VRF devices # ip link set mgmtbr0 master mgmt # ip link set pubbr0 master pub # ip link set storaccbr0 master storacc # ip link set storreplbr0 master storrepl # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0) charm-related: * (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems; * (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications. Notes: * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS; * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc. * Linux kernel functionality was mostly upstreamed in 4.4; * Linux kernel only while a unit agent can run on Windows too (nothing we can do here). Implementation description: 1. Kernel 4.4 (GA xenial) * CONFIG_NET_VRF=m - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172 * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109 backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY: 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept) 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept) only `ip vrf exec` related - NOT required for baseline functionality: * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge) 2. User space (iproute2) iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0 git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0 v4.10.0 v4.11.0 ... 3. MAAS - already hands over per-subnet default gateways https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360 https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378 4. Juju and/or MAAS: * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways) * create VRF devices relevant to network spaces; * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers). 5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets. (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application): "Applications that are to work within a VRF need to bind their socket to the VRF device: setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); or to specify the output device using cmsg and IP_PKTINFO. TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across *all VRF domains* by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1" http://man7.org/linux/man-pages/man8/ip-vrf.8.html "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association *inherited parent to child*." References: https://en.wikipedia.org/wiki/Multihoming http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html http://blog.ipspace.net/2010/09/ribs-and-fibs.html https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF http://netdevconf.org/1.2/session.html?david-ahern-talk https://www.kernel.org/doc/Documentation/networking/vrf.txt https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29 http://blog.ipspace.net/2016/02/running-bgp-on-servers.html https://tools.ietf.org/html/rfc7938 http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)
2017-12-21 15:25:29	Dmitrii Shcherbakov	bug watch added		https://github.com/CanonicalLtd/maas-docs/issues/737
2018-01-08 02:57:29	Anastasia	juju: status	New	Incomplete
2018-01-08 02:57:32	Anastasia	juju: importance	Undecided	Wishlist
2018-02-01 13:00:46	Peter Sabaini	bug			added subscriber Peter Sabaini
2018-09-12 12:30:38	Dominique Poulain	bug			added subscriber Dominique Poulain
2020-01-17 15:29:31	Ante Karamatić	tags	cpe-onsite kernel-da-key	cpe-onsite kernel-da-key sts
2020-01-20 10:47:08	Amad Ali	bug			added subscriber Amad Ali
2020-01-23 14:50:37	Dan Streetman	bug			added subscriber Dan Streetman
2020-05-05 07:46:55	Björn Tillenius	maas: status	Incomplete	Invalid
2020-05-07 16:06:38	Mateusz Pawlowski	bug			added subscriber Mateusz Pawlowski
2020-10-02 15:34:57	Dimitri John Ledkov	bug task added		netplan.io (Ubuntu)
2020-10-02 15:35:28	Dimitri John Ledkov	netplan.io (Ubuntu): status	New	Confirmed
2020-10-02 15:35:29	Dimitri John Ledkov	netplan.io (Ubuntu): importance	Undecided	Medium
2021-08-24 09:48:48	Björn Tillenius	maas: milestone	next
2022-07-25 05:21:20	Brett Milford	bug			added subscriber Brett Milford
2022-10-13 09:54:18	Lukas Märdian	netplan.io (Ubuntu): status	Confirmed	Fix Released