DVR router takes too long to learn octavia LB VIP
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Confirmed | Medium | Unassigned |
Bug Description
Summary
Hi,
We are facing a connectivity problem when communicating with an Octavia load balancer through a DVR router. On the initial request to the LB, the DVR router takes too long to learn the MAC address of the LB VIP, and the client receives a "No route to host" error.
The DVR router eventually learns the MAC, and communication works on the second and third requests, but the problem reappears if there are no requests to the LB for over a minute. At that point the ARP entry disappears from the router's table and the router must learn the MAC again.
I expect the DVR router to learn the MAC in milliseconds, not seconds.
I currently see this problem in Yoga, but it is not new: I saw it in Ussuri as well and was expecting improvements in Yoga.
OpenStack version: Yoga
Octavia topology: ACTIVE_STANDBY
Step by step
Create NetworkA
Create two instances with apache (web server) on NetworkA. These will be our LB members.
Create an LB on NetworkA. Create an HTTP listener. Create a pool for that listener. Create two LB members in the pool, using the IP addresses of the two instances created previously.
Create NetworkB
Create an instance on NetworkB. This will be used to curl http://<LB-VIP>.
Create a DVR router. Connect NetworkA and NetworkB to this router.
At this point the ARP table of the router will have permanent ARP entries for all instances on NetworkA and NetworkB including the amphora instances.
It will not have the ARP entry for the LB VIP. I assume that is normal.
Now on the instance on networkB, curl the LB VIP. In my case curl http://
I usually receive the following error.
Failed to connect to 10.86.86.196 port 80: No route to host.
If I try again right after the failed attempt, it works! I see the output of my web server.
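The fail-then-succeed pattern can be measured from the client side without curl. The following is a minimal sketch, not part of the original report: the helper names `time_connect` and `probe` are hypothetical, and it assumes plain TCP connects to the listener port are enough to trigger the router's neighbour resolution.

```python
import socket
import time


def time_connect(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # "No route to host" surfaces here as an OSError (EHOSTUNREACH).
        return False, time.monotonic() - start, str(exc)


def probe(host, port, attempts=2, timeout=5.0):
    """Run back-to-back connects and collect one result tuple per attempt."""
    return [time_connect(host, port, timeout) for _ in range(attempts)]
```

Calling `probe("10.86.86.196", 80)` from the instance on NetworkB after the VIP has been idle for more than a minute should, if the behaviour described here holds, show a failed first attempt taking about three seconds followed by a fast successful one.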
I did some packet captures on the dvr router on the compute server and I also watched its ARP table.
At first there is no ARP entry for the LB VIP.
Then I made the request. Here is the tcpdump output of the router interface connected to NetworkB.
ip netns exec qrouter-
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on qr-053b19e7-67, link-type EN10MB (Ethernet), capture size 262144 bytes
18:07:36.888353 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768131196 ecr 0,nop,wscale 7], length 0
18:07:37.913880 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768132221 ecr 0,nop,wscale 7], length 0
18:07:39.929891 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768134237 ecr 0,nop,wscale 7], length 0
18:07:39.946622 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68
18:07:39.946775 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68
18:07:39.946869 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68
We can see three SYN packets from my instance to the LB VIP: the original and two retransmissions, sent roughly 1 and 2 seconds apart.
The router then responds with "ICMP host 10.86.86.196 unreachable", which is why the instance reports "No route to host".
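To make the timing explicit, the gaps can be computed directly from the capture. This is a small sketch that only uses the timestamps shown in the NetworkB capture above; the helper name `seconds_between` is hypothetical.

```python
from datetime import datetime

# SYN timestamps copied from the NetworkB capture above.
SYN_TIMES = ["18:07:36.888353", "18:07:37.913880", "18:07:39.929891"]
ICMP_TIME = "18:07:39.946622"  # first "host unreachable" sent by the router


def seconds_between(earlier, later):
    """Difference in seconds between two tcpdump HH:MM:SS.ffffff stamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


# Gaps between consecutive SYNs (original plus retransmissions).
syn_gaps = [seconds_between(a, b) for a, b in zip(SYN_TIMES, SYN_TIMES[1:])]
# Time the client waited before the router reported the host unreachable.
time_to_error = seconds_between(SYN_TIMES[0], ICMP_TIME)

print("gaps between SYNs:", [round(g, 3) for g in syn_gaps])
print("first SYN to ICMP unreachable:", round(time_to_error, 3))
```

The gaps come out at about 1.03 s and 2.02 s, and the client waits just over 3 seconds before the error arrives.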
Here is the tcpdump output of the router interface connected to NetworkA for the same request.
ip netns exec qrouter-
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on qr-1622f027-67, link-type EN10MB (Ethernet), capture size 262144 bytes
18:07:36.888401 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:37.898590 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:38.926591 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:41.345337 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28
18:07:41.345399 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28
We see that the router sends three ARP requests for the MAC, about one second apart, and gets no reply.
The last two packets in the capture are the LB itself checking that no one else is using the IP 10.86.86.196.
The router does eventually learn the MAC, but too late.
What is strange is that when it does learn it, I do not see an ARP reply.
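Lining up the two captures shows how the failure unfolds. The sketch below parses the NetworkA lines shown above, separates the router's probes from the VIP's self-addressed ARPs, and compares them with the first ICMP unreachable from the NetworkB capture. The shape (three probes one second apart, failure one second after the last) is what Linux's default neighbour parameters (`mcast_solicit` = 3, `retrans_time_ms` = 1000) would produce, but the report does not show the values set in the qrouter namespace, so that link is an assumption.

```python
import re
from datetime import datetime

# Lines copied from the NetworkA capture above.
CAPTURE = """\
18:07:36.888401 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:37.898590 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:38.926591 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:41.345337 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28
18:07:41.345399 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28
"""
ICMP_UNREACHABLE = "18:07:39.946622"  # from the NetworkB capture
FMT = "%H:%M:%S.%f"
ARP_RE = re.compile(r"^(\S+) ARP, Request who-has (\S+) .*?tell ([\d.]+),")


def parse(capture):
    """Return (timestamp, target_ip, sender_ip) for each ARP request line."""
    rows = []
    for line in capture.splitlines():
        m = ARP_RE.match(line)
        if m:
            rows.append((datetime.strptime(m.group(1), FMT), m.group(2), m.group(3)))
    return rows


rows = parse(CAPTURE)
# Requests whose sender equals the target come from the VIP itself.
router_probes = [ts for ts, target, sender in rows if sender != target]
self_arps = [ts for ts, target, sender in rows if sender == target]

icmp = datetime.strptime(ICMP_UNREACHABLE, FMT)
probe_gaps = [(b - a).total_seconds() for a, b in zip(router_probes, router_probes[1:])]
giveup_after_last_probe = (icmp - router_probes[-1]).total_seconds()
vip_arp_after_giveup = (self_arps[0] - icmp).total_seconds()

print("router probe gaps:", [round(g, 3) for g in probe_gaps])
print("last probe to ICMP unreachable:", round(giveup_after_last_probe, 3))
print("ICMP unreachable to first VIP ARP:", round(vip_arp_after_giveup, 3))
```

In this capture the router gives up about one second after its third unanswered probe, and the VIP's own ARP only appears roughly 1.4 seconds after that.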
Since the ARP entry disappears after a minute or so, this problem happens often. It occasionally works on the first try, but that is rare. Even when it works, the router still takes about 2 seconds to learn the MAC, which is only slightly faster than the 3 seconds it takes when it fails.
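One way to track when the VIP entry appears and disappears is to poll the router namespace's neighbour table and extract the state for the VIP. The sketch below only covers the parsing step. It assumes the usual `ip neigh show` layout, where the address is the first field and the state (for example PERMANENT, REACHABLE, STALE, FAILED) is the last. The sample text is made up for illustration and is not output from the affected system: the member address and MAC in it are hypothetical, and `neigh_state` is a hypothetical helper name.

```python
# Illustrative sample only, NOT captured from the affected router.
# 10.86.86.10 and its MAC are hypothetical; 10.86.86.196 is the VIP from this report.
SAMPLE = """\
10.86.86.10 dev qr-1622f027-67 lladdr fa:16:3e:00:00:0a PERMANENT
10.86.86.196 dev qr-1622f027-67 FAILED
"""


def neigh_state(ip_neigh_output, address):
    """Return the neighbour state for `address`, or None if there is no entry."""
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        if fields and fields[0] == address:
            return fields[-1]  # the state is the last field on the line
    return None


print(neigh_state(SAMPLE, "10.86.86.196"))
print(neigh_state(SAMPLE, "10.86.86.99"))
```

Feeding it the real table at one-second intervals while the VIP is idle would show how long the entry survives before it is dropped.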
Note: the LB, the LB members and the instance are not on the same compute server.
Lastly, I do not see any problems when the instance communicating with the LB is on the same network as the LB.
Furthermore, if I assign a FIP to the LB and communicate with the LB from the internet, I do not see any problems. The SNAT router namespace is able to learn the MAC quickly, every time.
This is very specific to the DVR router (qrouter namespace) on the compute servers.
We have quite a few users on different installations with similar architecture complaining that they are facing random communication problems because of the no route to host error explained above.
tags: added: l3-dvr-backlog
Have you reported this to Octavia as well? We might need their help to fix this.