So, I have not seen this issue in production since implementing that small patch in https://bugs.launchpad.net/neutron/+bug/1864822/comments/4.
However, I can sorta simulate what happens if/when the connection to :6640 is lost, which we did experience in production and Acoss69 referenced in the opening comment. This may help with developing a patch to the OVS agent that could help recover from this condition.
What we see is this: a normal set of flows on the provider bridge (br-ex or br-vlan, in this example):
Every 1.0s: ovs-ofctl dump-flows br-vlan compute1: Tue Mar 3 07:42:42 2020
NXST_FLOW reply (xid=0x4):
 cookie=0xbe35f1e76f2f0e27, duration=468.374s, table=0, n_packets=0, n_bytes=0, idle_age=532, priority=2,in_port=1 actions=resubmit(,1)
 cookie=0xbe35f1e76f2f0e27, duration=469.071s, table=0, n_packets=0, n_bytes=0, idle_age=532, priority=0 actions=NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=468.373s, table=0, n_packets=2, n_bytes=140, idle_age=184, priority=1 actions=resubmit(,3)
 cookie=0xbe35f1e76f2f0e27, duration=468.371s, table=1, n_packets=0, n_bytes=0, idle_age=532, priority=0 actions=resubmit(,2)
 cookie=0xbe35f1e76f2f0e27, duration=467.008s, table=2, n_packets=0, n_bytes=0, idle_age=532, priority=4,in_port=1,dl_vlan=1 actions=mod_vlan_vid:1111,NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=468.370s, table=2, n_packets=0, n_bytes=0, idle_age=532, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=468.339s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:01:ad:70 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.329s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:15:73:1b actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.322s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:49:67:3e actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.312s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:b8:7d:b0 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.368s, table=3, n_packets=2, n_bytes=140, idle_age=184, priority=1 actions=NORMAL
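(For reference, that output is presumably just ovs-ofctl run under watch on the compute node; the bridge name will differ per deployment:

watch -n1 "ovs-ofctl dump-flows br-vlan"
)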
When we see "tcp:127. 0.0.1:6640: send error: Broken pipe" in the neutron- openvswitch- agent.log file, it is followed up with something like this:
...
2020-03-03 07:33:50.061 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Mapping physical network vlan to bridge br-vlan
2020-03-03 07:33:50.065 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan datapath-id = 0x000086ce24d0d14a
2020-03-03 07:33:50.153 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan has datapath-ID 000086ce24d0d14a
2020-03-03 07:33:50.271 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] L2 Agent operating in DVR Mode with MAC fa:16:3f:8e:8f:ed
2020-03-03 07:33:50.382 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Physical bridge br-vlan was just re-created.
2020-03-03 07:33:50.383 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Mapping physical network vlan to bridge br-vlan
2020-03-03 07:33:50.385 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan datapath-id = 0x000086ce24d0d14a
2020-03-03 07:33:50.463 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan has datapath-ID 000086ce24d0d14a
2020-03-03 07:33:50.581 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Agent out of sync with plugin!
...
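If anyone wants to spot this quickly across compute nodes, grepping the agent log for the relevant messages should be enough (the log path assumes the stock Ubuntu/Debian packaging; adjust for your deployment):

grep -E "Broken pipe|was just re-created|out of sync" /var/log/neutron/neutron-openvswitch-agent.log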
Most importantly: Physical bridge br-vlan was just re-created.
You will see the flows change: some carry a new cookie and others the old cookie, but the drop flow in table 0 causes traffic to be dropped:
Every 1.0s: ovs-ofctl dump-flows br-vlan compute1: Tue Mar 3 07:46:16 2020
NXST_FLOW reply (xid=0x4):
 cookie=0xfc7afcb358f7936e, duration=2.522s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=2.665s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=resubmit(,3)
 cookie=0xfc7afcb358f7936e, duration=2.525s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=0 actions=NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=2.664s, table=1, n_packets=0, n_bytes=0, idle_age=2, priority=0 actions=resubmit(,2)
 cookie=0xfc7afcb358f7936e, duration=2.451s, table=2, n_packets=0, n_bytes=0, idle_age=2, priority=4,in_port=1,dl_vlan=1 actions=mod_vlan_vid:1111,NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=2.663s, table=2, n_packets=0, n_bytes=0, idle_age=2, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=2.636s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:01:ad:70 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.626s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:15:73:1b actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.621s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:49:67:3e actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.613s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:b8:7d:b0 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.661s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=NORMAL
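To confirm which flows still carry the stale cookie versus the cookie from the new agent run, ovs-ofctl can filter on the cookie field; the values here are just the ones from this example:

ovs-ofctl dump-flows br-vlan cookie=0xbe35f1e76f2f0e27/-1   # flows left over from the previous agent run
ovs-ofctl dump-flows br-vlan cookie=0xfc7afcb358f7936e/-1   # flows installed after the bridge was "re-created"
ovs-ofctl dump-flows br-vlan table=0                        # the offending priority=2,in_port=1 drop flow lives here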
The agent does not appear to re-implement the proper flows unless you restart the agent.
The only way I have been able to simulate this behavior is by killing the ovsdb-server or, better yet, restarting the openvswitch-switch service without subsequently restarting the neutron OVS agent. In the production environment I mentioned, the connection to :6640 was lost a couple of minutes after the neutron agent was restarted, which caused this 'drop' rule to be implemented until the agent was restarted. This behavior continued ad nauseam on all compute nodes in the environment until I patched the agent.
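For anyone trying to reproduce this in a lab, the sequence below is a rough sketch of that simulation (assuming a systemd-based Ubuntu/Debian install where the services are named openvswitch-switch and neutron-openvswitch-agent):

systemctl restart openvswitch-switch          # breaks the agent's connections; bridges are seen as "re-created"
ovs-ofctl dump-flows br-vlan table=0          # the new priority=2,in_port=1 drop flow appears and traffic stops
systemctl restart neutron-openvswitch-agent   # only restarting the agent restores the correct flows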
OVS Version: 2.11.0
Neutron Version: neutron-openvswitch-agent version 14.0.5.dev19