Hello guys.
As I found, in 2021 a commit landed in OVS [present since OVS v2.16, or v2.15 with the backport] that changed the behavior of OVN DB snapshotting.
Now, before the leader writes a snapshot, it transfers leadership to another node.
We've run tests with rally and tempest, and it looks like there is now a problem in the interaction between nova and neutron.
For example, a simple rally test like 'create {network,router,subnet} -> add interface to router' looks fine even with 256 concurrent threads running the same test. But something like 'neutron.create_subnet -> nova.boot_server -> nova.attach_interface' fails whenever a leadership transfer happens (a rough sketch of that sequence follows).
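For reference, outside of rally the failing flow boils down to roughly the sequence below (an illustrative openstacksdk sketch, not the exact rally scenario; the cloud name, image and flavor IDs are placeholders):

import openstack

# Rough equivalent of the failing scenario; cloud/image/flavor are placeholders.
conn = openstack.connect(cloud="mycloud")

# neutron.create_subnet
net = conn.network.create_network(name="repro-net")
subnet = conn.network.create_subnet(network_id=net.id, ip_version=4,
                                    cidr="10.0.0.0/24", name="repro-subnet")

# nova.boot_server -- this is the step that dies with
# VirtualInterfaceCreateException when the network-vif-plugged event is
# lost around a leadership transfer.
server = conn.compute.create_server(name="repro-vm",
                                    image_id="<image-uuid>",
                                    flavor_id="<flavor-id>",
                                    networks=[{"uuid": net.id}])
server = conn.compute.wait_for_server(server)

# nova.attach_interface
conn.compute.create_server_interface(server, net_id=net.id)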
Since the transfer happens so often [every 10m + rand(10)], we get a lot of errors.
This problem can be observed on all versions that include OVS 2.16 or higher [or the backport] :)
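To check the correlation at scale, a rough helper like the one below can be used (just a sketch: it assumes local copies of the SB DB and nova-compute logs with the timestamp formats shown in the excerpts below, both in the same timezone; file names are placeholders):

import re
from datetime import datetime, timedelta

SB_LOG = "ovn-sb-db.log"          # placeholder path
NOVA_LOG = "nova-compute.log"     # placeholder path
WINDOW = timedelta(seconds=330)   # vif_plugging_timeout (300s) plus some slack

def timestamps(path, needle, ts_regex, ts_fmt):
    # Return the timestamps of every line in `path` that contains `needle`.
    # (Traceback lines will repeat per failure; good enough for eyeballing.)
    rx = re.compile(ts_regex)
    found = []
    with open(path) as fh:
        for line in fh:
            if needle in line:
                m = rx.match(line)
                if m:
                    found.append(datetime.strptime(m.group(1), ts_fmt))
    return found

transfers = timestamps(SB_LOG, "Transferring leadership to write a snapshot",
                       r"^([^|]+)\|", "%Y-%m-%dT%H:%M:%S.%fZ")
failures = timestamps(NOVA_LOG, "VirtualInterfaceCreateException",
                      r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)",
                      "%Y-%m-%d %H:%M:%S.%f")

for t in transfers:
    hits = [f.isoformat() for f in failures if t <= f <= t + WINDOW]
    print(t.isoformat(), "->", hits or "no VIF failures in window")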
Some tracing from logs [neutron, nova, ovn-sb-db]:
CONTROL-NODES:
ctl01-ovn-sb-db.log:2022-04-19T12:30:03.089Z|01002|raft|INFO|Transferring leadership to write a snapshot.
ctl01-ovn-sb-db.log:2022-04-19T12:30:03.099Z|01003|raft|INFO|server 1c5f is leader for term 42
ctl03-ovn-sb-db.log:2022-04-19T12:30:03.090Z|00938|raft|INFO|received leadership transfer from 1f46 in term 41
ctl03-ovn-sb-db.log:2022-04-19T12:30:03.092Z|00940|raft|INFO|term 42: elected leader by 2+ of 3 servers
ctl03-ovn-sb-db.log:2022-04-19T12:30:10.941Z|00941|jsonrpc|WARN|tcp:xx.yy.zz.26:41882: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.324Z|00943|jsonrpc|WARN|tcp:xx.yy.zz.26:41896: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00945|jsonrpc|WARN|tcp:xx.yy.zz.26:41880: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00947|jsonrpc|WARN|tcp:xx.yy.zz.26:41892: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.327Z|00949|jsonrpc|WARN|tcp:xx.yy.zz.26:41884: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00951|jsonrpc|WARN|tcp:xx.yy.zz.25:40260: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00953|jsonrpc|WARN|tcp:xx.yy.zz.25:40264: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00955|jsonrpc|WARN|tcp:xx.yy.zz.24:37440: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00957|jsonrpc|WARN|tcp:xx.yy.zz.24:37442: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.245Z|00959|jsonrpc|WARN|tcp:xx.yy.zz.24:37446: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:32:01.533Z|01001|jsonrpc|WARN|tcp:xx.yy.zz.67:57586: send error: Connection timed out
2022-04-19 12:30:08.898 27 INFO neutron.db.ovn_revision_numbers_db [req-7fcfdd74-482d-46b2-9f76-07190669d76d ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 1
2022-04-19 12:30:09.644 27 INFO neutron.db.ovn_revision_numbers_db [req-a8278418-3ad9-450c-89bb-e7a5c1c0a06d a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 2
2022-04-19 12:30:10.235 27 INFO neutron.wsgi [req-571b53cc-ca04-46f7-89f9-fdf8e5931f4c a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] xx.yy.zz.68,xx.yy.zz.26 "GET /v2.0/ports?tenant_id=9d3ae9a7b121488285203b0fdeabc3a3&device_id=7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 HTTP/1.1" status: 200 len: 1081 time: 0.0363808
2022-04-19 12:30:10.395 25 INFO nova.api.openstack.compute.server_external_events [req-d144458a-ca21-4e92-b5be-4f589e118d8c 4688fc44551b45c6a761c9c0351e76a8 72db34087b9b401d842b66643b647e16 - default default] Creating event network-changed:be178a9a-26d7-4bf0-a4e8-d206a6965205 for instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 on cmp-ucs-01-05.poc.example.com
2022-04-19 12:30:10.402 27 INFO neutron.notifiers.nova [-] Nova event response: {'name': 'network-changed', 'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'status': 'completed', 'code': 200}
2022-04-19 12:33:01.497 34 INFO neutron.db.ovn_revision_numbers_db [req-82f3d4fe-1276-4918-86d3-55cc42ca2816 - - - - -] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 3
COMPUTE-NODES:
ovs-vswitchd.log:2022-04-19T12:30:10.743Z|08415|bridge|INFO|bridge br-int: added interface tapbe178a9a-26 on port 874
ovn-controller.log:2022-04-19T12:30:11.251Z|02356|binding|INFO|Claiming lport be178a9a-26d7-4bf0-a4e8-d206a6965205 for this chassis.
ovn-controller.log:2022-04-19T12:30:11.251Z|02357|binding|INFO|be178a9a-26d7-4bf0-a4e8-d206a6965205: Claiming fa:16:3e:8f:3b:47 1.80.1.254
nova-compute.log:2022-04-19 12:30:10.672 8 INFO os_vif [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:8f:3b:47,bridge_name='br-int',has_traffic_filtering=True,id=be178a9a-26d7-4bf0-a4e8-d206a6965205,network=Network(5dd2512b-56f8-4e04-8623-971a633a76aa),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapbe178a9a-26')
ovs-vswitchd.log:2022-04-19T12:35:11.302Z|08448|bridge|INFO|bridge br-int: deleted interface tapbe178a9a-26 on port 874
ovn-controller.log:2022-04-19T12:35:11.307Z|02380|binding|INFO|Releasing lport be178a9a-26d7-4bf0-a4e8-d206a6965205 from this chassis.
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Traceback (most recent call last):
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7260, in _create_guest_with_network
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] post_xml_callback=post_xml_callback)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] next(self.gen)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] actual_event = event.wait()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] result = hub.switch()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] return self.greenlet.switch()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] eventlet.timeout.Timeout: 300 seconds
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] During handling of the above exception, another exception occurred:
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Traceback (most recent call last):
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2402, in _build_and_run_instance
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] accel_info=accel_info)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 4225, in spawn
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] cleanup_instance_disks=created_disks)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7283, in _create_guest_with_network
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] raise exception.VirtualInterfaceCreateException()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.844 8 ERROR nova.compute.manager [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Build of instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 aborted: Failed to allocate the network(s), not rescheduling.: nova.exception.BuildAbortException: Build of instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 aborted: Failed to allocate the network(s), not rescheduling.
CONTROL:
2022-04-19 12:35:11.330 25 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-2b4d10b1-3a7a-4683-97df-990259705b46 - - - - -] OVN reports status down for port: be178a9a-26d7-4bf0-a4e8-d206a6965205
2022-04-19 12:35:12.524 23 WARNING neutron.notifiers.nova [-] Nova event: {'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'name': 'network-vif-deleted', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'status': 'failed', 'code': 422} returned with failed status
/var/log/kolla/neutron/neutron-server.log:2022-04-19 12:35:13.016 25 WARNING neutron.notifiers.nova [-] Nova event: {'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'name': 'network-vif-plugged', 'status': 'failed', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'code': 422} returned with failed status
Nova compute waited 300 seconds for a network-vif-plugged event and then timed out:
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] eventlet.timeout.Timeout: 300 seconds
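For context, the wait that times out there is essentially the pattern below (a simplified sketch, not the actual nova code; the real logic is wait_for_instance_event() in nova/compute/manager.py, and the 300 seconds comes from nova's vif_plugging_timeout option, which defaults to 300):

from eventlet import event, timeout

def wait_for_vif_plugged(vif_event, deadline=300):
    # Simplified illustration of how nova-compute parks the build while it
    # waits for neutron's network-vif-plugged notification.
    try:
        with timeout.Timeout(deadline):
            return vif_event.wait()   # unblocked when the event callback calls .send()
    except timeout.Timeout:
        # This is what surfaces as "eventlet.timeout.Timeout: 300 seconds" in the
        # traceback above; nova then raises VirtualInterfaceCreateException and
        # aborts the build.
        raise

if __name__ == "__main__":
    ev = event.Event()
    try:
        # Nobody ever sends the event (as when neutron drops it), so this times out.
        wait_for_vif_plugged(ev, deadline=1)
    except timeout.Timeout as exc:
        print("timed out waiting for network-vif-plugged:", exc)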
Do you have a bit more logs from the nova-compute service? If you have debug logging enabled, then from the nova-compute logs we can puzzle together for which ports nova got the plugged event and for which port such an event was missing.
I'm not sure but this might mean that neutron never sent the plugged event as the port was down in the OVN backend:
2022-04-19 12:35:11.330 25 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-2b4d10b1-3a7a-4683-97df-990259705b46 - - - - -] OVN reports status down for port: be178a9a-26d7-4bf0-a4e8-d206a6965205
Somebody with neutron + OVN knowledge should look at this.