OVS drops RARP packets sent by QEMU on live migration, causing up to a 40s ping pause in Rocky
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| OpenStack Compute (nova) | Fix Released | Medium | sean mooney | |
| nova/Train | New | Undecided | Unassigned | |
| nova/Ussuri | New | Undecided | Unassigned | |
| nova/Victoria | New | Undecided | Unassigned | |
| nova/Wallaby | New | Undecided | Unassigned | |
| neutron | Status tracked in Ussuri | | | |
| neutron/Train | In Progress | Undecided | Unassigned | |
| neutron/Ussuri | In Progress | Undecided | Rodolfo Alonso | |
| neutron/Victoria | In Progress | Undecided | Unassigned | |
| neutron/Wallaby | Fix Committed | Undecided | Unassigned | |
| os-vif | Invalid | Undecided | Unassigned | |
Bug Description
This issue is well known, and there were previous attempts to fix it, like this one: https:/
This issue still exists in Rocky and has gotten worse: in Rocky, nova-compute, nova-libvirt and the neutron OVS agent all run inside containers.
So far the only simple fix I have is to increase the number of RARP packets QEMU sends after live migration from 5 to 10. For completeness: the nova change (not merged) proposed in the attempt mentioned above does not work.
I am creating this ticket hoping to get up-to-date (for Rocky and onwards) expert advice on how to fix this in nova/neutron.
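(For reference: QEMU 4.0 and later expose the self-announce behaviour as tunable migration parameters, so the rebuild should not be needed on newer QEMU. A minimal QMP sketch, assuming a QEMU new enough to support AnnounceParameters; the Rocky-era QEMU used here predates this:

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "announce-rounds": 10 } }
```

This only stretches the announce train; it does not fix the underlying ordering problem.)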
For the record, below are the timestamps from my test for the neutron OVS agent "activating" the VM port and the RARP packets seen by tcpdump on the compute node. 10 RARP packets were sent by the (recompiled) QEMU; 7 were seen by tcpdump, and the second-to-last packet barely made it through.
openvswitch-
2019-02-14 19:00:13.568 73453 INFO neutron.
{'subnet_id': 'b7c09e83-
], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:
2019-02-14 19:00:13.568 73453 INFO neutron.
tcpdump for RARP packets:
[root@overcloud
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
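The capture lines up with QEMU's self-announce schedule: the port went active about 2.8 s after the first RARP, so only the tail of a long-enough announce train can land after the flows are installed. Below is a back-of-the-envelope sketch of the announce window, assuming the timings QEMU documents for its announce parameters (50 ms initial delay, +100 ms per round, capped at 550 ms between packets); the recompiled Rocky-era QEMU hardcodes its own timing, so the exact numbers here are an assumption:

```python
# Sketch: total time spanned by QEMU's post-migration RARP announcements,
# assuming the documented AnnounceParameters defaults. The recompiled QEMU
# in this report may use different hardcoded timing.
def announce_window_ms(rounds, initial=50, step=100, maximum=550):
    delay = initial
    total = 0
    for _ in range(rounds):
        total += delay                      # wait before sending this RARP
        delay = min(delay + step, maximum)  # back off, capped at `maximum`
    return total

print(announce_window_ms(5))   # 1250 ms: default 5 rounds, ~1.25 s window
print(announce_window_ms(10))  # 4000 ms: 10 rounds, ~4 s window
```

With the default 5 rounds the announcements span only about 1.25 s, entirely inside the window where OVS still drops them; with 10 rounds the train stretches to roughly 4 s, which is consistent with only the last packets making it through in the capture above.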
tags: | added: ovs |
Changed in neutron: | |
assignee: | Yang Li (yang-li) → sean mooney (sean-k-mooney) |
Changed in os-vif: | |
status: | New → Invalid |
Changed in nova: | |
status: | New → In Progress |
assignee: | nobody → sean mooney (sean-k-mooney) |
importance: | Undecided → Medium |
Changed in neutron: | |
assignee: | sean mooney (sean-k-mooney) → Rodolfo Alonso (rodolfo-alonso-hernandez) |
Changed in neutron: | |
assignee: | Rodolfo Alonso (rodolfo-alonso-hernandez) → Oleg Bondarev (obondarev) |
Changed in neutron: | |
assignee: | Oleg Bondarev (obondarev) → Rodolfo Alonso (rodolfo-alonso-hernandez) |
Changed in nova: | |
assignee: | sean mooney (sean-k-mooney) → nobody |
Changed in neutron: | |
assignee: | Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody |
Changed in neutron: | |
assignee: | nobody → Rodolfo Alonso (rodolfo-alonso-hernandez) |
Changed in nova: | |
assignee: | nobody → sean mooney (sean-k-mooney) |
tags: | added: neutron-proactive-backport-potential |
no longer affects: | nova/xena |
Just want to make sure I understand this correctly after reading the bug you referenced.
1. The instance is live-migrated.
2. The neutron-ovs-agent on the target node configures the port, but only after libvirt has already started sending RARPs out.
3. A couple of RARPs make it out if you set the retry count high enough, for example, 10.
I would have figured neutron would have notified nova, and then nova would have completed things, triggering the RARPs to happen after that event, but this is not my area of expertise. I'll ask someone who's more familiar with live migration operations to take a look and give their perspective.
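For what it's worth, nova has had a knob since Rocky aimed at exactly this ordering: with it enabled, the source compute waits for neutron's network-vif-plugged event from the destination host before starting the migration. It evidently isn't the whole story for OVS (this bug stayed open regardless), but it is worth ruling out. A sketch, assuming a Rocky or newer nova.conf on the compute nodes:

```ini
[compute]
# Wait for neutron's network-vif-plugged event from the destination
# host before libvirt starts the migration (and hence the RARPs).
live_migration_wait_for_vif_plug = True
```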