Comment 0 for bug 1896630

Drew Freiberger (afreiberger) wrote : ovn-chassis subordinate to octavia registered with shortname shows down

On a juju 2.7.8 deployment with the latest charms (20.08), I have a dead ovn-controller agent on one of the octavia units.

$ openstack network agent list|grep lxd
| juju-a9d6f4-21-lxd-9.maas | OVN Controller agent | juju-a9d6f4-21-lxd-9 | | XXX | UP | ovn-controller |
| juju-a9d6f4-25-lxd-10.maas | OVN Controller agent | juju-a9d6f4-25-lxd-10.maas | | :-) | UP | ovn-controller |
| juju-a9d6f4-23-lxd-10.maas | OVN Controller agent | juju-a9d6f4-23-lxd-10.maas | | :-) | UP | ovn-controller |

Two of the three ovn-controller agents on octavia units are registered with host=$fqdn; the down agent is registered with its short hostname.
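
To spot this kind of mismatch across a larger deployment, the agent list can be filtered for hosts with no domain part. This is only a sketch: the function name `short_host_agents` is made up for illustration, and it assumes the pipe-separated table layout shown above, with Host as the third column and Alive as the fifth.

```python
def short_host_agents(table_text):
    """Return (host, alive) pairs for agent rows whose Host has no dot.

    Assumes `openstack network agent list` table rows of the form
    | ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
    """
    flagged = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip borders ("+----+") and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 7:
            continue
        host, alive = cells[2], cells[4]
        # Skip the header row if it was not filtered out by grep.
        if host == "Host":
            continue
        if "." not in host:
            flagged.append((host, alive))
    return flagged
```

Run against the three rows above, this would flag only juju-a9d6f4-21-lxd-9, the agent reported as XXX.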

`hostname -f` shows the FQDN on the down unit
/etc/openvswitch/system-id.conf lists the short hostname only
`ovs-vsctl list open_vswitch` lists both the hostname and the system-id as the short hostname

I'm seeing a lot of errors in /var/log/ovn/ovn-controller.log along the lines of:
2020-09-22T14:22:39.500Z|04678|binding|INFO|Changing chassis for lport 529233fc-f9c4-40b1-8c6a-f2e906a2498d from juju-a9d6f4-21-lxd-9.maas to juju-a9d6f4-21-lxd-9.
2020-09-22T06:25:01.829Z|857112|main|INFO|OVNSB commit failed, force recompute next time.

Restarting ovn-controller shows the following in the log:
2020-09-22T14:22:30.498Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-controller.log
2020-09-22T14:22:30.500Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-22T14:22:30.500Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-22T14:22:30.502Z|00004|main|INFO|OVS IDL reconnected, force recompute.
2020-09-22T14:22:30.504Z|00005|reconnect|INFO|ssl:10.35.61.157:6642: connecting...
2020-09-22T14:22:30.504Z|00006|main|INFO|OVNSB IDL reconnected, force recompute.
2020-09-22T14:22:30.508Z|00007|reconnect|INFO|ssl:10.35.61.157:6642: connected
2020-09-22T14:22:30.514Z|00008|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-22T14:22:30.514Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-22T14:22:30.514Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-22T14:22:30.515Z|00011|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-a9d6f4-21-lxd-9\" role \"ovn-controller\" prohibit modification of table \"Chassis\".","error":"permission error"}
2020-09-22T14:22:30.515Z|00012|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.515Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-22T14:22:30.515Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-22T14:22:30.516Z|00013|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 86556077-6325-4cb6-9bbd-c5979ae15d2c, was inserted by this transaction. Second row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
2020-09-22T14:22:30.516Z|00014|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.516Z|00015|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 916635aa-e98c-4f23-8ac8-1e3f381151c6, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.516Z|00016|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.516Z|00017|binding|INFO|Changing chassis for lport 529233fc-f9c4-40b1-8c6a-f2e906a2498d from juju-a9d6f4-21-lxd-9.maas to juju-a9d6f4-21-lxd-9.
2020-09-22T14:22:30.516Z|00018|binding|INFO|529233fc-f9c4-40b1-8c6a-f2e906a2498d: Claiming fa:16:3e:e4:70:66 fc00:2d33:a2bc:84d4:f816:3eff:fee4:7066
2020-09-22T14:22:30.517Z|00019|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6219b9c9-fc57-4caa-8f75-46ead7584901, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.517Z|00020|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.518Z|00021|binding|INFO|Changing chassis for lport 529233fc-f9c4-40b1-8c6a-f2e906a2498d from juju-a9d6f4-21-lxd-9.maas to juju-a9d6f4-21-lxd-9.
2020-09-22T14:22:30.518Z|00022|binding|INFO|529233fc-f9c4-40b1-8c6a-f2e906a2498d: Claiming fa:16:3e:e4:70:66 fc00:2d33:a2bc:84d4:f816:3eff:fee4:7066
2020-09-22T14:22:30.521Z|00023|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 5f2ca07b-859f-4013-9e49-5fd00a1909e9, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.521Z|00024|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.521Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
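
The repeated "Changing chassis for lport" lines can be pulled out of the log to see which chassis names are competing for a port. A minimal sketch, assuming the message format seen in the log above; `chassis_changes` and the regular expression are illustrative and not part of OVN:

```python
import re

# Matches e.g. "...|binding|INFO|Changing chassis for lport <uuid> from <old> to <new>."
CHASSIS_CHANGE = re.compile(
    r"Changing chassis for lport (?P<lport>\S+) "
    r"from (?P<old>\S+) to (?P<new>\S+?)\.$"
)


def chassis_changes(log_text):
    """Return (lport, old_chassis, new_chassis) for each matching log line."""
    changes = []
    for line in log_text.splitlines():
        match = CHASSIS_CHANGE.search(line.strip())
        if match:
            changes.append((match["lport"], match["old"], match["new"]))
    return changes
```

An entry whose new chassis is just the old chassis name without its domain suffix, as in the log above, suggests the same host appearing under two names rather than a real port migration.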

The relation data that octavia-ovn-chassis provides to octavia on that unit shows chassis-name as the short hostname, whereas on the other octavia units the chassis-name provided is the FQDN.

$ sudo juju-run octavia/0 -r 139 --remote-unit octavia-ovn-chassis/1 'relation-get'
chassis-name: '"juju-a9d6f4-21-lxd-9"'
egress-subnets: 10.35.61.179/32
ingress-address: 10.35.61.179
ovn-configured: "true"
private-address: 10.35.61.179

$ sudo juju-run octavia/1 -r 139 --remote-unit octavia-ovn-chassis/2 'relation-get'
chassis-name: '"juju-a9d6f4-23-lxd-10.maas"'
egress-subnets: 10.35.61.191/32
ingress-address: 10.35.61.191
ovn-configured: "true"
private-address: 10.35.61.191

It appears from a brief read-through of the ovn-chassis charm that the hostname is queried from the ovsdb and the system-id is then set from that hostname. Is it possible that there's a race during deployment between the system being able to resolve its FQDN from DNS and the hostname that OVS records when it initializes the database on install?

Some potentially relevant code snippets:
        # The local ``ovn-controller`` process will retrieve information about
        # how to connect to OVN from the local Open vSwitch database.
        self.run('ovs-vsctl',
                 'set', 'open', '.',
                 'external-ids:ovn-encap-type=geneve', '--',
                 'set', 'open', '.',
                 'external-ids:ovn-encap-ip={}'
                 .format(self.get_data_ip()), '--',
                 'set', 'open', '.',
                 'external-ids:system-id={}'
                 .format(self.get_ovs_hostname()))
*snip*
    def get_ovs_hostname():
        for row in ch_ovsdb.SimpleOVSDB('ovs-vsctl').open_vswitch:
            return row['external_ids']['hostname']
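
If the race described above is real, one possible guard would be to compare the hostname recorded in the ovsdb with the FQDN the system resolves before using it as the system-id. The following is only a sketch of that idea, not the charm's code: `hostname_mismatch` and `checked_system_id` are hypothetical names, and whether raising an error (rather than deferring or re-reading) is the right reaction is an assumption.

```python
import socket


def hostname_mismatch(ovs_hostname, system_fqdn):
    """True when the OVS-recorded hostname is only the short form of the FQDN."""
    return (
        ovs_hostname != system_fqdn
        and system_fqdn.split(".", 1)[0] == ovs_hostname
    )


def checked_system_id(ovs_hostname, fqdn_lookup=socket.getfqdn):
    """Return ovs_hostname if it is safe to use as system-id, else raise.

    fqdn_lookup is injectable so the check can be exercised without DNS.
    """
    fqdn = fqdn_lookup()
    if hostname_mismatch(ovs_hostname, fqdn):
        raise RuntimeError(
            "OVS hostname {!r} is the short form of FQDN {!r}".format(
                ovs_hostname, fqdn
            )
        )
    return ovs_hostname
```

For the down unit in this report, the check would be called with "juju-a9d6f4-21-lxd-9" while the system resolves "juju-a9d6f4-21-lxd-9.maas", so it would raise instead of registering the short name.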