neutron l2 to dhcp lost when migrating in stable/stein 14.0.2
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Medium | Slawek Kaplonski |
Bug Description
Info about the environment:
3x controller nodes
50+ compute nodes
all on stable/stein; neutron is 14.0.2, using OVS 2.11.0
neutron settings:
- max_l3_
- dhcp_agents_
- router_distributed = true
- interface_driver = openvswitch
- l3_ha = true
l3 agent:
- agent_mode = dvr
ml2:
- type_drivers = flat,vlan,vxlan
- tenant_
- mechanism_drivers = openvswitch,
- extension_drivers = port_security,dns
- external_
tenants may have multiple external networks
instances may have multiple interfaces
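For cross-checking, the relevant agent configuration on a compute node can be verified with something like the following; the host name is illustrative and this is my own suggestion, not part of the original report:

# list the OVS and L3 agents registered for one compute node
openstack network agent list --host <compute_host> --agent-type open-vswitch
openstack network agent list --host <compute_host> --agent-type l3
# confirm the DVR agent mode on that node
grep agent_mode /etc/neutron/l3_agent.ini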
tests have been performed on 10 instances launched in a tenant network connected to a router with a gateway on an external network. all instances have floating IPs assigned. these instances had only 1 interface. this particular testing tenant has RBACs for 4 external networks, of which only 1 is used.
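For reference, a roughly equivalent test topology can be recreated with something like the following sketch; the names, image/flavor and CIDR are illustrative, and the external network must already exist:

# hypothetical reproduction of the test setup described above
openstack network create test-net
openstack subnet create --network test-net --subnet-range 192.0.2.0/24 test-subnet
openstack router create test-router
openstack router set --external-gateway <external_net> test-router
openstack router add subnet test-router test-subnet
openstack server create --image <image> --flavor <flavor> --network test-net --min 10 --max 10 test-vm
# one floating IP per instance
openstack floating ip create <external_net>
openstack server add floating ip <instance_uuid> <floating_ip>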
migrations have been done via cli with admin:
openstack server migrate --live <new_host> <instance_uuid>
I have also tested using evacuate, with the same results.
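The evacuate path can be exercised with something like the following; this exact invocation is an assumption on my part, the report does not show the command used:

nova evacuate <instance_uuid> <new_host>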
expected behavior:
when _multiple_ (in the range of 10+) instances are migrated simultaneously from one compute host to another, they should come up with only a minor network service drop. all L2 connectivity should resume.
what actually happens:
instances are migrated, some errors pop up in neutron/nova, and then the instances come up with a minor network service drop. However, L2 toward the DHCP servers is totally severed in OVS. The migrated instances will, as expected, start trying to renew their lease halfway through the current lease and at the end of it drop the IP. An easy test is to try a lease renewal on an instance, or ICMP to any DHCP server in that VXLAN L2.
current workaround:
once the instance is migrated, the L2 to the DHCP servers can be re-established by restarting neutron-
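A minimal sketch of the workaround, assuming the truncated service name above refers to the OVS agent on the affected compute node (the exact unit name depends on the packaging):

# assumption: restarting the OVS agent on the compute node rebuilds the flows
sudo systemctl restart neutron-openvswitch-agent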
how to test:
create instances (10+), migrate them, and then try to ping the neutron DHCP server in the VXLAN (tenant-created network), or simply renew the DHCP leases.
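For example, the check can be done roughly as follows; the network name, interface and addresses are illustrative:

# find the DHCP server IPs in the tenant network
openstack port list --network test-net --device-owner network:dhcp -c "Fixed IP Addresses"
# from inside a migrated instance: ping a DHCP server IP, or force a lease renewal
ping -c 3 <dhcp_server_ip>
sudo dhclient -r eth0 && sudo dhclient eth0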
error messages:
Exception during message handling: TooManyExternal
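The full traceback can be located with something like the following; the log path is an assumption and depends on the deployment:

grep -R "TooManyExternal" /var/log/neutron/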
other oddities:
when performing a migration of a small number of instances, i.e. 1-4, the migrations are successful and L2 to the DHCP servers is not lost.
when looking through the debug logs I can't really find anything of relevance. no other large errors/warnings occur other than the one above.
I will perform more tests when migrations are successful and/or neutron-
This matches a 14.0.0 regression bug which should be fixed in 14.0.2 (this bug report is for 14.0.2), but possibly the fix does not work with this combination of settings(?).
Please let me know if any API/service versions, configurations, or other info are needed for this.
tags: added: l3-dvr-backlog
Here is a full dump-flows from the host the 10 instances are on (freshly migrated, with the issue at hand). In this dump-flows output they can't communicate with DHCP.
cookie=0xfc433b151081cc9d, duration=11448.791s, table=0, n_packets=0, n_bytes=0, priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
cookie=0xfc433b151081cc9d, duration=511.432s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo34809f80-73",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=497.129s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo4a20f841-b3",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=477.756s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvoe855f573-8f",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=469.122s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo1fa8144f-c6",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=456.811s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvoba71a357-ea",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=440.239s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvoc711c404-f0",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=425.874s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo697908ee-00",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=413.444s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo3efbc4cf-ad",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=400.885s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo0634d6e7-a1",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=388.523s, table=0, n_packets=0, n_bytes=0, priority=10,icmp6,in_port="qvo46770478-be",icmp_type=136 actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=511.429s, table=0, n_packets=7, n_bytes=294, priority=10,arp,in_port="qvo34809f80-73" actions=resubmit(,24)
cookie=0xfc433b151081cc9d, duration=497.126s, table=0, n_packets=7, n_bytes=294, priority=10,arp,in_port="qvo4a20f841-b3" actions=resubmit(,24)
...
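For reference, the dump above appears to be from br-int; it can be regenerated on the compute node with something like the following (bridge name assumed):

ovs-ofctl dump-flows br-int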