I have deployed a 3-controller / 3-compute HA environment with ML2/OVS and observed dataplane downtime when restarting/stopping the neutron-l3 container on the controllers. This is what I did:
1. Created a network, a subnet, a router and a VM, and attached a FIP to the VM (commands sketched after the ping output below)
2. Left a ping to the FIP running on the undercloud
3. Stopped the l3 container on controller-0.
Result: Observed some packet loss while the router failed over to controller-1
4. Stopped the l3 container on controller-1
Result: Observed some packet loss while the router failed over to controller-2
5. Stopped the l3 container on controller-2
Result: No traffic to/from the FIP at all.
(overcloud) [stack@undercloud ~]$ ping 10.0.0.131
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=1.83 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=1.56 ms
<---- Last l3 container was stopped here (step 5 above)---->
From 10.0.0.1 icmp_seq=10 Destination Host Unreachable
From 10.0.0.1 icmp_seq=11 Destination Host Unreachable
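For reference, this is roughly how the step 1 resources were created (a sketch only; the resource names, image/flavor and the external network name "public" are illustrative, not the exact ones used):

# Sketch of step 1; names, image/flavor and external network are illustrative
openstack network create net1
openstack subnet create --network net1 --subnet-range 192.168.100.0/24 subnet1
openstack router create router1
openstack router set --external-gateway public router1
openstack router add subnet router1 subnet1
openstack server create --flavor m1.tiny --image cirros --network net1 vm1
openstack floating ip create public
openstack server add floating ip vm1 10.0.0.131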
When the containers are stopped, I guess that the qrouter namespace is no longer usable from the host:
[heat-admin@overcloud-controller-2 ~]$ sudo ip netns e qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument
This means that we're getting not only controlplane downtime but also dataplane downtime, which could be seen as a regression compared to non-containerized environments.
The same would happen with DHCP: I expect instances won't be able to fetch IP addresses from dnsmasq when the dhcp containers are stopped.
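A quick way to check the DHCP case would be something like the following (a sketch only; the container name neutron_dhcp and the instance interface name are assumptions, not verified here):

# On each controller, stop the dhcp agent container (name is an assumption):
sudo docker stop neutron_dhcp
# Inside the instance, release and re-request the lease; with all dnsmasq
# containers stopped I'd expect this to time out:
sudo dhclient -r eth0 && sudo dhclient -v eth0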
Further details:
This happens because the containers mount the host's /run on their own /run, and namespaces are left behind after stopping/restarting the containers, as these bugs show [0][1]. I applied [2] and now stopping the container still causes dataplane downtime, but also restarting containers simply won't work (we may need an additional bug for this).
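The bind mount can be checked on the l3 agent container (container ID taken from the docker exec output further below; the --format template is just one way to print the mounts):

# List the container's bind mounts; /run from the host should show up here:
sudo docker inspect 9f8a322c4a3c \
    --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}'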
Namespaces can no longer be seen from outside the containers:
[heat-admin@overcloud-controller-2 ~]$ sudo ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
[heat-admin@overcloud-controller-2 ~]$
But from inside the container, they can be seen:
[heat-admin@overcloud-controller-2 ~]$ sudo docker exec --user root -it 9f8a322c4a3c bash
()[root@overcloud-controller-2 /]# ip netns | grep qrouter
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
qrouter-5244e91c-f533-4128-9289-f37c9656792c
However, the l3 agent fails to initialize because it can't access them after a restart:
()[root@overcloud-controller-2 /]# ip netns exec qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument
If I manually delete the namespace from inside the container and restart the container, it'll work again:
()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
RTNETLINK answers: Invalid argument
()[root@overcloud-controller-2 /]# ip netns del qrouter-5244e91c-f533-4128-9289-f37c9656792c
Cannot remove namespace file "/var/run/netns/qrouter-5244e91c-f533-4128-9289-f37c9656792c": No such file or directory
[heat-admin@overcloud-controller-2 ~]$ sudo docker restart 9f8a322c4a3c
And now the ping to the FIP works again:
(overcloud) [stack@undercloud ~]$ sudo ping 10.0.0.131 -i 0.2
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=38.5 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=6.58 ms
64 bytes from 10.0.0.131: icmp_seq=3 ttl=63 time=5.28 ms
64 bytes from 10.0.0.131: icmp_seq=4 ttl=63 time=2.71 ms
64 bytes from 10.0.0.131: icmp_seq=5 ttl=63 time=0.980 ms
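For reference, the manual workaround generalized to all routers on a controller (a sketch only; it assumes the l3 agent container ID shown above and that the agent recreates every qrouter namespace on restart):

# Delete every stale qrouter namespace from inside the l3 agent container,
# then restart the container so the agent recreates them:
for ns in $(sudo docker exec --user root 9f8a322c4a3c ip netns | awk '/^qrouter-/ {print $1}'); do
    sudo docker exec --user root 9f8a322c4a3c ip netns del "$ns"
done
sudo docker restart 9f8a322c4a3c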