Problem description:
* a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces; this is natural for OpenStack deployments regardless of IPv4/IPv6, and for IPv6 in general);
* if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough;
* multi-homing with hosts located on different L2 networks requires more intelligent routing:
- "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space;
- a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, while other traffic goes to an L3 ToR switch, possibly via different bonds);
- even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic, even though both interfaces are connected to the same ToR);
- there is no longer a single "default gateway" as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space;
- while network spaces implicitly require L3 reachability between all hosts that have a NIC associated with a given network space, the current definition does not mention the routing infrastructure required for that. For a single L2 network this problem is hidden by directly connected routes; for multiple L2 networks, no solution is provided or discussed;
* existing solutions to multi-homing require routing table management on a given host: complex static routing rules (see the sketch after this list) or dynamic routing (e.g. running an OSPF or BGP daemon on a host);
* using static routes is rigid and requires network planning (i.e. working with network engineers who may have varying degrees of experience, doing VLSM planning, etc.);
* using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure.
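For illustration, a sketch of the per-host static policy routing referred to above (interface name, table number and addresses are hypothetical); rules like these have to be planned up front and kept consistent on every host:
# ip route add default via 192.0.2.1 dev eno2 table 100
# ip rule add from 192.0.2.10/32 table 100
# ip rule add to 198.51.100.0/24 table 100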
Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway, and to make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing and with minimal or no modifications to the applications themselves.
Goals:
* avoid turning individual hosts into routers;
* avoid complex static rules;
* better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure);
* reduce operational complexity (custom L3 infrastructure integration for each deployment);
* reduce delivery risks (L3 infrastructure, L3 department responsiveness varies);
* avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons.
NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend reading this post to understand the suggestions below.
How to solve it?
What does it mean for Juju to support VRF devices?
* enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave other devices similarly to bridges, but operate at L3, not L2;
* the above is per network namespace so it will work equally well in a LXD container;
Conceptually:
# echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
# echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
# sysctl -p
# ip link add mgmt type vrf table 1 && ip link set dev mgmt up
# ip link add pub type vrf table 2 && ip link set dev pub up
# ip link set mgmtbr0 master mgmt
# ip link set pubbr0 master pub
# make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0)
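The resulting state can be sanity-checked with standard iproute2 commands (a minimal sketch, assuming the mgmt/pub devices from the example above exist and iproute2 has VRF support):
# ip link show type vrf
# ip link show master mgmt
# ip route show vrf mgmt
# ip route show table 2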
charm-related:
* (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems;
* (later) more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications.
Notes:
* Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. Juju deploy "router" is a different scenario which should reside on a model separate from IAAS;
* We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop, which is reachable via a directly connected route. The problem we are solving here is having N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc.
* Linux kernel functionality was mostly upstreamed in 4.4;
* this is Linux kernel functionality only, while a unit agent can also run on Windows (nothing we can do here).
Implementation description:
1. Kernel
4.4 (GA xenial)
* CONFIG_NET_VRF=m - present in xenial GA kernels: http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
* CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels: http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109
backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY:
6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)
only `ip vrf exec` related - NOT required for baseline functionality:
* http://man7.org/linux/man-pages/man8/ip-vrf.8.html - requires CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge)
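A quick way to check whether a given kernel carries the required options and the backported l3mdev sysctls (a sketch; paths assume a stock Ubuntu kernel):
# grep -E 'CONFIG_NET_VRF|CONFIG_NET_L3_MASTER_DEV' /boot/config-$(uname -r)
# sysctl net.ipv4.tcp_l3mdev_accept net.ipv4.udp_l3mdev_accept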
2. User space (iproute2)
iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04.
More specific functionality like `ip vrf exec <vrf-name>` is available in later versions:
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
v4.10.0
v4.11.0
...
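To confirm that the packaged iproute2 understands the vrf keyword (and, on newer versions, `ip vrf exec`), something like the following can be used (a sketch, assuming the mgmt VRF from the earlier example exists):
# ip -V
# ip vrf show
# ip vrf exec mgmt ip route show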
3. MAAS - already hands over per-subnet default gateways
https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378
4. Juju and/or MAAS:
* create VRF devices relevant to network spaces;
* enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers).
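As a sketch of what that could look like on a provisioned machine (the VRF name, table number and bridge name below are hypothetical, not something Juju or MAAS emits today), per network space:
# ip link add vrf-mgmt-space type vrf table 10 && ip link set dev vrf-mgmt-space up
# ip link set dev br-eth0-mgmt master vrf-mgmt-space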
5. Charms/applications: nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.
(future work) when INADDR_ANY is not used, configure software to run under `ip vrf exec` even if it does not support VRFs directly.
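A hedged sketch of that future direction (the daemon name and address are hypothetical): the application is launched inside the VRF context, e.g. from its systemd unit or init script, so its sockets inherit the VRF association without code changes:
# ip vrf exec mgmt /usr/sbin/exampled --bind 10.10.0.2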
See https://www.kernel.org/doc/Documentation/networking/vrf.txt; note that the setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application):
"Applications that are to work within a VRF need to bind their socket to the VRF device:
setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
or to specify the output device using cmsg and IP_PKTINFO.
TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across ***all VRF domains*** by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1"
http://man7.org/linux/man-pages/man8/ip-vrf.8.html
"This ip-vrf command is a helper to run a command against a specific VRF with the VRF association ***inherited parent to child***."
References:
https://en.wikipedia.org/wiki/Multihoming
http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
http://blog.ipspace.net/2010/09/ribs-and-fibs.html
https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
http://netdevconf.org/1.2/session.html?david-ahern-talk
https://www.kernel.org/doc/Documentation/networking/vrf.txt
https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29
http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
https://tools.ietf.org/html/rfc7938
http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)