Chassis name confusion after restart?

Bug #1925793 reported by Jake Hill
This bug affects 1 person
Affects: charm-ovn-chassis
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

There seems to be some randomness as to whether we end up as "hostname" or "hostname.maas" following a forced restart.

The attached log is from an octavia unit, but I have seen the same on compute nodes. This one was freshly installed and then rebooted at around 2021-04-23T11:39.

I can resolve this by doing:

juju run --unit=octavia/0 -- sudo systemctl restart ovn-host
juju run --unit=octavia/0 -- sudo systemctl restart ovn-controller
juju run --unit=octavia/0 -- sudo systemctl restart ovsdb-server

but it seems to be something of a race condition.

Revision history for this message
Jake Hill (routergod) wrote :
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Hello, Jake,

Thank you for the bug report.

I believe this should be taken care of by the fix to Open vSwitch bug 1915829.

Would you be able to check if you're at the version with that fix and/or try the -proposed package and see if the problem goes away?
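
(For reference, one way to check the installed version and pull the fix from the -proposed pocket; this is a sketch only, with the series and unit names taken from the focal deployment shown in this report:)

$ juju ssh octavia/0 -- apt-cache policy openvswitch-switch
$ juju ssh octavia/0 -- 'echo "deb http://archive.ubuntu.com/ubuntu focal-proposed main universe" | sudo tee /etc/apt/sources.list.d/focal-proposed.list'
$ juju ssh octavia/0 -- 'sudo apt-get update && sudo apt-get install --only-upgrade openvswitch-switch openvswitch-common'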

Changed in charm-ovn-chassis:
status: New → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

Many thanks. I added the -proposed repo and upgraded. I think this probably fixed a bunch of things, so it's a bit hard to disentangle. Now running openvswitch 2.13.3-0ubuntu0.20.04.1.

With some repeated reboots I sometimes still end up with a log full of these messages:

2021-04-23T13:14:42.868Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-controller.log
2021-04-23T13:14:42.872Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2021-04-23T13:14:42.872Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2021-04-23T13:14:42.875Z|00004|main|INFO|OVS IDL reconnected, force recompute.
2021-04-23T13:14:42.877Z|00005|reconnect|INFO|ssl:10.23.129.156:6642: connecting...
2021-04-23T13:14:42.877Z|00006|main|INFO|OVNSB IDL reconnected, force recompute.
2021-04-23T13:14:42.883Z|00007|reconnect|INFO|ssl:10.23.129.156:6642: connected
2021-04-23T13:14:42.886Z|00008|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2021-04-23T13:14:42.886Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-23T13:14:42.887Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-23T13:14:42.887Z|00011|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-3eccf9-2-lxd-3\" role \"ovn-controller\" prohibit row insertion into table \"Chassis\".","error":"permission error"}
2021-04-23T13:14:42.888Z|00012|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T13:14:42.888Z|00001|pinctrl(ovn_pinctrl2)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2021-04-23T13:14:42.888Z|00002|rconn(ovn_pinctrl2)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-23T13:14:42.890Z|00013|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-3eccf9-2-lxd-3\" role \"ovn-controller\" prohibit row insertion into table \"Chassis\".","error":"permission error"}
2021-04-23T13:14:42.890Z|00014|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T13:14:42.890Z|00003|rconn(ovn_pinctrl2)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-23T13:14:42.891Z|00015|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-3eccf9-2-lxd-3\" role \"ovn-controller\" prohibit row insertion into table \"Chassis\".","error":"permission error"}
2021-04-23T13:14:42.892Z|00016|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T13:14:42.892Z|00017|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this chassis.
2021-04-23T13:14:42.892Z|00018|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming fa:16:3e:4d:10:97 fc00:69f7:40ec:ade7:f816:3eff:fe4d:1097
2021-04-23T13:14:42.892Z|00019|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T13:14:42.892Z|00020|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-3eccf9-2-lxd-3\" role \"ovn-controller\" prohibit row insertion into table \"Chassis\".","error":"permission error"}
2021-04-23T13:14:42.892Z|00021|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T13:14:42.896Z|00022|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this...


Revision history for this message
Frode Nordahl (fnordahl) wrote :

The point release update to Open vSwitch contains two changes, one which changes the init script to not overwrite an existing external-ids:hostname record, and one that makes sure the initial recording happens after the network is online.

So with both of these changes in, I'm a bit surprised by the outcome. Do you know what the value was set to prior to the upgrade and reboot?

You can check this with `sudo ovs-vsctl get open-vswitch . external-ids:hostname`

Would you be able to try to reset it and perform a new reboot and see what happens? This can be accomplished with `sudo ovs-vsctl remove open-vswitch . external-ids hostname`

Revision history for this message
Jake Hill (routergod) wrote :

The setting seems consistent:

$ juju ssh octavia/0 -- sudo ovs-vsctl get open-vswitch . external-ids:hostname
juju-3eccf9-2-lxd-3.maas

$ juju ssh octavia/0 -- sudo ovs-vsctl remove open-vswitch . external-ids hostname
Connection to 10.23.129.155 closed.

$ juju ssh octavia/0 -- sudo reboot

$ juju ssh octavia/0 -- sudo ovs-vsctl get open-vswitch . external-ids:hostname
juju-3eccf9-2-lxd-3.maas

The controller log output is subtly different from before the upgrade. I no longer appear to be getting messages like:

"Changing chassis for lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a from juju-3eccf9-2-lxd-3.maas to juju-3eccf9-2-lxd-3."

But for some reason the DB access control appears to fail randomly at startup (in the output below, first working, then not working after a reboot):

$ juju ssh octavia/0 -- sudo tail /var/log/ovn/ovn-controller.log
2021-04-23T15:18:54.601Z|00007|reconnect|INFO|ssl:10.23.129.139:6642: connected
2021-04-23T15:18:54.613Z|00008|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2021-04-23T15:18:54.613Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-23T15:18:54.614Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-23T15:18:54.617Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2021-04-23T15:18:54.617Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-23T15:18:54.618Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2021-04-23T15:18:54.624Z|00011|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this chassis.
2021-04-23T15:18:54.624Z|00012|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming fa:16:3e:4d:10:97 fc00:69f7:40ec:ade7:f816:3eff:fe4d:1097
2021-04-23T15:18:54.624Z|00013|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown

$ juju ssh octavia/0 -- sudo reboot
Connection to 10.23.129.155 closed by remote host.

$ juju ssh octavia/0 -- sudo tail /var/log/ovn/ovn-controller.log
2021-04-23T15:19:43.188Z|00165|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T15:19:43.193Z|00166|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T15:19:43.193Z|00167|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this chassis.
2021-04-23T15:19:43.193Z|00168|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming fa:16:3e:4d:10:97 fc00:69f7:40ec:ade7:f816:3eff:fe4d:1097
2021-04-23T15:19:43.193Z|00169|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T15:19:43.198Z|00170|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T15:19:43.198Z|00171|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this chassis.
2021-04-23T15:19:43.198Z|00172|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming fa:16:3e:4d:10:97 fc00:69f7:40ec:ade7:f816:3eff:fe4d:1097
2021-04-23T15:19:43.198Z|00173|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T15:19:43.203Z|00174|main|INFO|OVNSB commit failed, force recompute next time.
Connection to 10.23.129.155 closed.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Great, so that looks like it fixed the issue you originally reported.

Since this involves OVN and Octavia, I'm inclined to think it might be related to bug 1917475; that issue and several other RBAC-related issues will be fixed by the point release update tracked in bug 1924981.

The first bug includes a reference to a PPA with fixed OVN packages; would you be able to try adding that PPA to the deployment and see if it helps with the issue? You would have to add it to the ovn-central units and possibly the ovn-chassis units.
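
(A hypothetical sketch of adding such a PPA; the actual PPA name is the one referenced in bug 1917475 and is not reproduced here, and the application names are taken from this deployment:)

$ juju run --application=ovn-central -- 'sudo add-apt-repository -y ppa:PPA-FROM-BUG-1917475 && sudo apt-get install -y ovn-central ovn-common'
$ juju run --application=octavia -- 'sudo add-apt-repository -y ppa:PPA-FROM-BUG-1917475 && sudo apt-get install -y ovn-host ovn-common'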

Revision history for this message
Jake Hill (routergod) wrote :

Ah sorry I didn't appreciate there might be two bugs :-)

Not sure if this is right, but here is what I have after adding the PPA:

$ juju ssh ovn-central/0 -- sudo apt list | grep -e ovn-central -e ovn-common
ovn-central/focal,now 20.03.1-0ubuntu1.2.0 amd64 [installed]
ovn-common/focal,now 20.03.1-0ubuntu1.2.0 amd64 [installed,automatic]

$ juju ssh octavia/0 -- sudo apt list | grep -e ovn-common -e ovn-host
ovn-common/focal,now 20.03.1-0ubuntu1.2.0 amd64 [installed,automatic]
ovn-host/focal,now 20.03.1-0ubuntu1.2.0 amd64 [installed]

Unfortunately I still appear to have the issue; at least, a growing log of:

2021-04-23T17:35:55.708Z|324798|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T17:35:55.715Z|324799|main|INFO|OVNSB commit failed, force recompute next time.
2021-04-23T17:35:55.715Z|324800|binding|INFO|Claiming lport bdc6da30-c675-4c3f-a0ee-870c8074fb0a for this chassis.
2021-04-23T17:35:55.715Z|324801|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming fa:16:3e:4d:10:97 fc00:69f7:40ec:ade7:f816:3eff:fe4d:1097
2021-04-23T17:35:55.715Z|324802|binding|INFO|bdc6da30-c675-4c3f-a0ee-870c8074fb0a: Claiming unknown
2021-04-23T17:35:55.720Z|324803|main|INFO|OVNSB commit failed, force recompute next time.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Ha, peeling layers everywhere you know ;)

One more thing to check is whether external-ids:system-id matches external-ids:hostname; also check whether this is consistent with the CN in the certificate in /etc/ovn, and whether there is a mismatch between chassis name and hostname in the SB DB.

If it is none of the above, I'm a bit lost as to what this is. Make sure the SB and NB DB components restarted on all units, and also make sure the ovn-northd process restarted.

The next step would be to increase debug logging on ovn-controller to figure out exactly what it's choking on.

Looking at the OVN SB DB log and/or increasing debug levels on that side may also be helpful.

This can be accomplished with ovn-appctl and the vlog/set command.
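
(A sketch of the checks and the debug-logging step suggested above; unit names are taken from this thread, and the ovn-sbctl/ovn-appctl invocations may need adjusting to this deployment's DB sockets and target names:)

$ juju ssh octavia/0 -- sudo ovs-vsctl get open-vswitch . external-ids:system-id
$ juju ssh octavia/0 -- sudo ovs-vsctl get open-vswitch . external-ids:hostname
$ juju ssh octavia/0 -- sudo openssl x509 -in /etc/ovn/cert_host -noout -subject
$ juju ssh ovn-central/0 -- sudo ovn-sbctl --columns=name,hostname list Chassis
$ juju ssh octavia/0 -- sudo ovn-appctl -t ovn-controller vlog/set dbg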

Revision history for this message
Jake Hill (routergod) wrote :

Thank you Frode. It seems that /etc/ovn/cert_host on the offending unit had the wrong (non-.maas) name, presumably from when I deployed. I ran reissue-certificates in vault and restarted OVN, and this has cleared the error.

Many thanks for your help and suggestions.
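
(A sketch of the recovery steps described above, assuming the OpenStack vault charm's reissue-certificates action and Juju 2.x action syntax:)

$ juju run-action --wait vault/leader reissue-certificates
$ juju run --unit=octavia/0 -- sudo systemctl restart ovn-controller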

Revision history for this message
Jake Hill (routergod) wrote :

Further to this, I have redeployed the model using openstack-origin=proposed and find that the OVN certificates seem to get the correct names this way. This appears to resolve all the issues I was having previously. Problem solved, it seems.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Thank you for providing feedback on, and documenting your success with, the in-flight fixes.

I'll go ahead and de-triage this bug and mark it as a duplicate of the Open vSwitch one, as that was the problem you originally filed it for.

Changed in charm-ovn-chassis:
status: Incomplete → New