OpenStack Nova Compute Charm

Need for managing /etc/hosts for containers

Bug #1896630 reported by Drew Freiberger on 2020-09-22

This bug affects 14 people

Affects	Status	Importance	Assigned to
Canonical Juju	Triaged	Low	Unassigned
OpenStack Charm Guide	Fix Released	Undecided	Felipe Reyes
OpenStack Nova Compute Charm	Fix Committed	Undecided	Felipe Reyes
charm-layer-ovn	Invalid	High	Unassigned

Bug Description

When deploying on metal with MAAS, MAAS will add the FQDN to the localhost record in /etc/hosts so that issuing the `hostname -f` command will always succeed regardless of availability of the network.

When deploying on the other provider combinations it is Juju that does the host initialization and Juju does not add the FQDN to the localhost record in /etc/hosts.
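
The difference between the two cases can be checked from the contents of /etc/hosts. The following is a minimal sketch, not something Juju or MAAS ships: the helper name `loopback_fqdns` is hypothetical, and treating any name that contains a dot as an FQDN is a simplifying assumption. The sample inputs are modelled on the /etc/hosts contents quoted in this report.

```python
import ipaddress


def loopback_fqdns(hosts_text):
    """Return dotted names mapped to a loopback address in /etc/hosts text.

    Assumption: a name containing a dot is treated as an FQDN.
    """
    found = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            if not ipaddress.ip_address(address).is_loopback:
                continue
        except ValueError:
            # Not a valid IP address in the first column; skip the line.
            continue
        found.extend(name for name in names if "." in name)
    return found


# Host initialized the MAAS / cloud-init way: FQDN on a loopback record.
maas_style = (
    "127.0.0.1 localhost\n"
    "127.0.1.1 juju-fb8c1c-0-lxd-7.maas juju-fb8c1c-0-lxd-7\n"
)
# Container initialized by Juju: no FQDN on any loopback record.
juju_style = (
    "127.0.0.1 localhost\n"
    "::1 ip6-localhost ip6-loopback\n"
)

print(loopback_fqdns(maas_style))
print(loopback_fqdns(juju_style))
```

An empty result means `hostname -f` has to fall back to DNS, which is where the availability of the network starts to matter.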

[Original description]
On a juju 2.7.8, latest charms (20.08), I have a dead ovn-controller agent on one of the octavia units.

Two of the three ovn-controller agents on octavia units are registered with host=$fqdn, and the down controller is registered with a shortname.

`hostname -f` shows the full fqdn on the down unit
/etc/openvswitch/system-id.conf lists the short hostname only
`ovs-vsctl list open_vswitch` lists both the hostname and the system-id as shortname

restart of ovn-controller shows the following in the log:
2020-09-22T14:22:30.498Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-controller.log
2020-09-22T14:22:30.500Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-22T14:22:30.500Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-22T14:22:30.502Z|00004|main|INFO|OVS IDL reconnected, force recompute.
2020-09-22T14:22:30.504Z|00005|reconnect|INFO|ssl:10.35.61.157:6642: connecting...
2020-09-22T14:22:30.504Z|00006|main|INFO|OVNSB IDL reconnected, force recompute.
2020-09-22T14:22:30.508Z|00007|reconnect|INFO|ssl:10.35.61.157:6642: connected
2020-09-22T14:22:30.514Z|00008|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-22T14:22:30.514Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-22T14:22:30.514Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-22T14:22:30.515Z|00011|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-a9d6f4-21-lxd-9\" role \"ovn-controller\" prohibit modification of table \"Chassis\".","error":"permission error"}
2020-09-22T14:22:30.515Z|00012|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.515Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-22T14:22:30.515Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-22T14:22:30.516Z|00013|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 86556077-6325-4cb6-9bbd-c5979ae15d2c, was inserted by this transaction. Second row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
2020-09-22T14:22:30.516Z|00014|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.516Z|00015|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 916635aa-e98c-4f23-8ac8-1e3f381151c6, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.516Z|00016|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.516Z|00017|binding|INFO|Changing chassis for lport 529233fc-f9c4-40b1-8c6a-f2e906a2498d from juju-a9d6f4-21-lxd-9.maas to juju-a9d6f4-21-lxd-9.
2020-09-22T14:22:30.516Z|00018|binding|INFO|529233fc-f9c4-40b1-8c6a-f2e906a2498d: Claiming fa:16:3e:e4:70:66 fc00:2d33:a2bc:84d4:f816:3eff:fee4:7066
2020-09-22T14:22:30.517Z|00019|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6219b9c9-fc57-4caa-8f75-46ead7584901, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.517Z|00020|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.518Z|00021|binding|INFO|Changing chassis for lport 529233fc-f9c4-40b1-8c6a-f2e906a2498d from juju-a9d6f4-21-lxd-9.maas to juju-a9d6f4-21-lxd-9.
2020-09-22T14:22:30.518Z|00022|binding|INFO|529233fc-f9c4-40b1-8c6a-f2e906a2498d: Claiming fa:16:3e:e4:70:66 fc00:2d33:a2bc:84d4:f816:3eff:fee4:7066
2020-09-22T14:22:30.521Z|00023|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.35.82.18\") for index on columns \"type\" and \"ip\". First row, with UUID 3345a08e-534b-4ccf-a7b6-2d6d00706422, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 5f2ca07b-859f-4013-9e49-5fd00a1909e9, was inserted by this transaction.","error":"constraint violation"}
2020-09-22T14:22:30.521Z|00024|main|INFO|OVNSB commit failed, force recompute next time.
2020-09-22T14:22:30.521Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
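
To pick the failed transactions out of a log like the one above, something along these lines may help. It is only a sketch: it assumes the `timestamp|sequence|module|level|message` layout seen in these lines and that the text after `transaction error: ` is valid JSON, and the function name `transaction_errors` is made up for illustration.

```python
import json

PREFIX = "transaction error: "


def transaction_errors(log_text):
    """Return (timestamp, error, details) for each WARN transaction error."""
    errors = []
    for line in log_text.splitlines():
        # The message itself may contain '|', so split at most four times.
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue
        timestamp, _seq, _module, level, message = parts
        if level != "WARN" or not message.startswith(PREFIX):
            continue
        try:
            detail = json.loads(message[len(PREFIX):])
        except json.JSONDecodeError:
            # Truncated or non-JSON payload; ignore it in this sketch.
            continue
        errors.append((timestamp, detail.get("error"), detail.get("details")))
    return errors


# Three lines copied from the log above (raw string keeps the backslashes).
sample = r'''2020-09-22T14:22:30.514Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-22T14:22:30.515Z|00011|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"juju-a9d6f4-21-lxd-9\" role \"ovn-controller\" prohibit modification of table \"Chassis\".","error":"permission error"}
2020-09-22T14:22:30.515Z|00012|main|INFO|OVNSB commit failed, force recompute next time.'''

for timestamp, error, details in transaction_errors(sample):
    print(timestamp, error, details)
```

The client name in the RBAC message is the short hostname, which is the detail that points at the chassis name mismatch.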

Relation info being provided from octavia-ovn-chassis to octavia on that unit shows chassis-name as the short hostname, but on other octavia units, the chassis-name provided from ovn-chassis to octavia is the fqdn.

$ sudo juju-run octavia/0 -r 139 --remote-unit octavia-ovn-chassis/1 'relation-get'
chassis-name: '"juju-a9d6f4-21-lxd-9"'
egress-subnets: 10.35.61.179/32
ingress-address: 10.35.61.179
ovn-configured: "true"
private-address: 10.35.61.179

$ sudo juju-run octavia/1 -r 139 --remote-unit octavia-ovn-chassis/2 'relation-get'
chassis-name: '"juju-a9d6f4-23-lxd-10.maas"'
egress-subnets: 10.35.61.191/32
ingress-address: 10.35.61.191
ovn-configured: "true"
private-address: 10.35.61.191
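
A quick way to spot the odd unit out is to flag every chassis-name that has no domain part. This is an illustrative sketch only: the function name is hypothetical, and it assumes the relation value is a JSON-encoded string (as the nested quoting in the output above suggests) that has already been collected per unit.

```python
import json


def short_named_chassis(chassis_names):
    """Map unit name to decoded chassis name for names without a domain."""
    flagged = {}
    for unit, raw in chassis_names.items():
        # Assumption: the relation value is a JSON-encoded string.
        name = json.loads(raw)
        if "." not in name:
            flagged[unit] = name
    return flagged


# Values as shown by relation-get above, after YAML unquoting.
observed = {
    "octavia-ovn-chassis/1": '"juju-a9d6f4-21-lxd-9"',
    "octavia-ovn-chassis/2": '"juju-a9d6f4-23-lxd-10.maas"',
}

print(short_named_chassis(observed))
```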

It appears from a brief read-through of the ovn-chassis charm that the hostname is queried from the ovsdb and then system-id is set from that hostname. Is it possible that there's a race between the system being able to query its FQDN from DNS during deployment and the hostname OVS sees when it initializes the database on install?

Some potentially relevant code snippets:
        # The local ``ovn-controller`` process will retrieve information about
        # how to connect to OVN from the local Open vSwitch database.
        self.run('ovs-vsctl',
                 'set', 'open', '.',
                 'external-ids:ovn-encap-type=geneve', '--',
                 'set', 'open', '.',
                 'external-ids:ovn-encap-ip={}'
                 .format(self.get_data_ip()), '--',
                 'set', 'open', '.',
                 'external-ids:system-id={}'
                 .format(self.get_ovs_hostname()))
*snip*
    def get_ovs_hostname():
        for row in ch_ovsdb.SimpleOVSDB('ovs-vsctl').open_vswitch:
            return row['external_ids']['hostname']

Revision history for this message

Drew Freiberger (afreiberger) wrote on 2020-09-22:

assigning field-critical. This is blocking go-live for a Bootstack customer.

Thinking about re-deploying the octavia node and seeing if that clears the issue.

Frode confirmed that the hostname originally comes from hostname -f run by the ovs startup script, and this is likely a race condition at unit deployment time.

Frode Nordahl (fnordahl) on 2020-09-22

no longer affects:

charm-neutron-openvswitch

Revision history for this message

Drew Freiberger (afreiberger) wrote on 2020-09-22:

Reduced severity to field-high. Workaround:

deploy new octavia unit (assumes an HA setup)
remove octavia unit hosting the failing ovn-controller
login to ovn-central hosting the leader for the southbound database
ovn-sbctl show |grep 21-lxd-9
ovn-sbctl chassis-del <chassis name>
logout
juju run -a neutron-api 'service neutron-server restart'

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2020-11-20:

An alternative approach to working around the issue:

juju run --application octavia 'ovs-vsctl set open_vswitch . external_ids:hostname=$(hostname -f)'
juju remove-relation octavia-ovn-chassis vault
[ wait for relation to be removed ]
juju add-relation octavia-ovn-chassis vault
juju run --application octavia-ovn-chassis hooks/config-changed
juju run --application octavia-ovn-chassis 'systemctl restart ovn-controller'
juju run --application octavia hooks/config-changed
for port in $(openstack port list|awk '/octavia-health-manager-octavia-.-listen-port/{print$2}'); do openstack port set --disable $port;done
for port in $(openstack port list|awk '/octavia-health-manager-octavia-.-listen-port/{print$2}'); do openstack port set --enable $port;done

Confirm that chassis has registered itself and claimed the port in /var/log/ovn/ovn-controller.log on the octavia units.

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2020-12-02:

Looking a bit at how this behaves on different providers:
- OpenStack: we do not see the issue
- OpenStack+LXD (fan networking): unknown
- MAAS: we do not see the issue
- MAAS+LXD: we consistently see the issue

When deploying on the other provider combinations it is Juju that does the host initialization and Juju does not add the FQDN to the localhost record in /etc/hosts.

Possible causes for the change of behaviour:
A change in LXD and/or Linux bridge configuration pertaining to the availability of the network early in the boot cycle of the instance?

A change in MAAS pertaining to how/when the DNS records for the LXD container are populated?

Possible long term solutions:
Should Juju populate /etc/hosts with a FQDN for the localhost record the same way MAAS does for on-metal deployments?

Frode Nordahl (fnordahl) on 2020-12-10

tags:

added: ps5

Billy Olsen (billy-olsen) on 2021-01-13

Changed in charm-layer-ovn:
status:	New → Triaged
importance:	Undecided → High

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-14:

Curiously this does not occur when deploying with MAAS+LXD when the underlying machine is a virtual machine as opposed to a physical bare metal one.

That could suggest that the issue indeed is changes in how things are wired/bridged out of the host and/or suggest environmental network issues such as STP.

I see that cloud-init has the capability of managing /etc/hosts already through the `manage_etc_hosts` config key.

I wonder if we could influence this from the charm LXD profile through the use of the `user.vendor-data` knob while awaiting a way to have juju put this into the user-data for us.

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-14:

Adding this to charm-ovn-chassis:
$ git diff
diff --git a/src/lxd-profile.yaml b/src/lxd-profile.yaml
index 044e653..7114b0f 100644
--- a/src/lxd-profile.yaml
+++ b/src/lxd-profile.yaml
@@ -1,2 +1,5 @@
 config:
   linux.kernel_modules: openvswitch
+  user.vendor-data: |
+    #cloud-config
+    manage_etc_hosts: true

Does create a profile with the correct values. However, the profile is applied too late by Juju so it does not have any effect. I guess they have made the kernel module loading work ad-hoc for subordinate profiles without taking other keys into account.

Adding it to the charm-octavia lxd-profile.yaml does get it applied, but unfortunately does also not solve the problem. LXD only provides the container name in the cloud-init NoCloud seed meta-data.

So we're back to square one in Juju needing to provide the fqdn it knows from MAAS when creating the container.

We do not have access to these knobs dynamically from charms.

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-14:

As a PoC I added this to octavia lxd-profile.yaml:
$ cat lxd-profile.yaml
config:
  user.vendor-data: |
    #cloud-config
    manage_etc_hosts: true
    fqdn: juju-fb8c1c-0-lxd-7.maas

And deployed the augmented charm to a new application with placement so that I knew the hard coded fqdn would match what Juju would actually set up.

This resulted in the following on the unit:
ubuntu@juju-fb8c1c-0-lxd-7:~$ cat /var/lib/cloud/seed/nocloud-net/vendor-data
#cloud-config
manage_etc_hosts: true
fqdn: juju-fb8c1c-0-lxd-7.maas
ubuntu@juju-fb8c1c-0-lxd-7:~$ grep juju /etc/hosts
127.0.1.1 juju-fb8c1c-0-lxd-7.maas juju-fb8c1c-0-lxd-7
ubuntu@juju-fb8c1c-0-lxd-7:~$ hostname -f
juju-fb8c1c-0-lxd-7.maas
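
For illustration, the profile used in this PoC could be rendered per container once the FQDN is known. The sketch below is hypothetical: neither Juju nor the charms expose helpers named `render_vendor_data` or `render_lxd_profile`, and it only reproduces the hard-coded YAML shown above for an arbitrary FQDN.

```python
def render_vendor_data(fqdn):
    """Return cloud-config vendor-data asking cloud-init to manage /etc/hosts."""
    return "\n".join([
        "#cloud-config",
        "manage_etc_hosts: true",
        f"fqdn: {fqdn}",
        "",
    ])


def render_lxd_profile(fqdn):
    """Return an lxd-profile.yaml body embedding the vendor-data (sketch)."""
    body = render_vendor_data(fqdn)
    # Indent the cloud-config under the YAML block scalar.
    indented = "".join(f"    {line}\n" for line in body.splitlines())
    return "config:\n  user.vendor-data: |\n" + indented


print(render_lxd_profile("juju-fb8c1c-0-lxd-7.maas"))
```

The missing piece remains the same as described in this comment: something has to know the FQDN before the container is created.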

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-14:

Added Juju to this bug report, as we need help from them to resolve this issue.

The executive summary is that when Juju deploys to LXD containers on MAAS using physical machines, the LXD container is not informed of its FQDN.

This results in services running in the container not being able to determine their FQDN on initial deploy nor on reboot.

For reference, MAAS does inform cloud-init of a machine's FQDN and instructs cloud-init to manage /etc/hosts on the physical machine.

We would like Juju to do this for the LXD containers, comment #7 contains information about how this could be accomplished (although I gather Juju would add this to the user-data and not use vendor-data).

It is not possible to do this right from a charm.

Revision history for this message

Hybrid512 (walid-moghrabi) wrote on 2021-01-15:

Hi,

I don't have the same behavior but it might be related.

------------------------------------------------------------------
openstack port list | grep octavia
| 6fd3e411-0f55-4f57-a49b-8978ac7045be | octavia-health-manager-octavia-0-listen-port | fa:16:3e:4b:c6:48 | ip_address='fc00:bee5:427a:2b79:f816:3eff:fe4b:c648', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
| 8878130c-c8a6-44f1-a668-d669b00a8e0d | octavia-health-manager-octavia-2-listen-port | fa:16:3e:bd:2c:83 | ip_address='fc00:bee5:427a:2b79:f816:3eff:febd:2c83', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
| fdce70ed-b861-4473-a483-2024b2733c75 | octavia-health-manager-octavia-1-listen-port | fa:16:3e:a0:1d:a2 | ip_address='fc00:bee5:427a:2b79:f816:3eff:fea0:1da2', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
------------------------------------------------------------------

------------------------------------------------------------------
hostname -f
juju-37c2ba-2-lxd-16.maas
------------------------------------------------------------------

------------------------------------------------------------------
cat /etc/hosts
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
------------------------------------------------------------------

Take a look at the Neutron ports created by the Octavia charm, for example `octavia-health-manager-octavia-0-listen-port`:
- Does the `binding_host_id` match the FQDN of the Octavia container?

As you can see, this is not the case, FQDN is "juju-37c2ba-2-lxd-16.maas" while binding_host_id is  "juju-37c2ba-2-lxd-16" (shortname)

- Does the `binding_vif_type` field say 'ovs' or does it say 'binding_failed'?

As you can see, it says "binding_failed"

Here is a snippet of my /var/log/ovn/ovn-controller.log :

and it continues to the end of the file with the last 2 lines ("not claiming ...", "Dropped xx log messages ..."), nothing else.

So, to summarize, you're saying that the issue is due to the fact that binding_host_id is not using the FQDN?
Would adding an entry in /etc/hosts fix the issue?

Thanks for your help.

Best regards,

Walid

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-15:

#10

Thank you for providing more information, as you can see in the messages logged from ovn-controller and the port details you are having exactly the same issue:
| binding_host_id | juju-37c2ba-2-lxd-16

The ovn-controller system-id and port binding_host_id should have been set to the FQDN of the octavia unit, but as explained above the container is not being able to establish its FQDN at initial deploy or subsequent reboots.

You may be able to work around the issue at run time by applying the commands in comment #3, and could apply the workaround in comment #7 to the octavia LXD container config/profile to retain operation after restart of the containers.

Revision history for this message

Pen Gale (pengale) wrote on 2021-01-28:

#11

After some discussion with the core team, I have a question: is the fqdn something that we are going to be able to rely on outside of MAAS? If not, are we sure that we want to be relying on it?

I understand the convenience here. But we generally want Juju and charms to work seamlessly across cloud substrates, and it feels like any fixes here would fix things for OpenStack, but not for other clouds.

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-01-29:

#12

Would you want to put every instance in your large scale infrastructure in the same namespace/subdomain? Probably not. Do the public clouds provide your instance with a FQDN? They do. Do you want to rely on IP addresses only to have to change everything in case of IP renumbering? Absolutely not.

This is a generic on-metal problem, and it won't go away regardless of what Juju is putting on metal.

If you want to detect and tackle situations where LXD containers come up without networking in a different way that would also work for us. But providing the instance with the data it needs is a solid and durable fix, and there is a reason for MAAS putting the FQDN in /etc/hosts on the physical machine.

While the desire to treat all the provider drivers equally is noble, they are quite different already, and do you really want to dilute Juju's on-metal capabilities with a least common denominator approach?

Would forwarding FQDN to cloud-init user-data when we know the provider manages it be possible?

Making cloud-init actually use it could be left to the charm author, set through an lxd-profile, so that you don't risk changing the default behavior.

Revision history for this message

Frode Nordahl (fnordahl) wrote on 2021-02-01:

#13

Hum, this may be a catch-22 in the service ordering, looking closer at the console while an affected container boots:
[ OK ] Reached target Network (Pre).
         Starting Open vSwitch Database Unit...
         Starting Network Service...
[ OK ] Started Open vSwitch Database Unit.
         Starting OpenVSwitch configuration for cleanup...
         Starting Open vSwitch Forwarding Unit...
[ OK ] Started Open vSwitch Forwarding Unit.
[ OK ] Finished OpenVSwitch configuration for cleanup.
[ OK ] Started Network Service.
         Starting Wait for Network to be Configured...
         Starting Network Name Resolution...
[ OK ] Finished Wait for Network to be Configured.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.

Looking at output from journalctl -u:
Feb 01 15:16:59 juju-b10aeb-6-lxd-7 ovs-ctl[419]: hostname: Temporary failure in name resolution

root@juju-b10aeb-6-lxd-7:/home/ubuntu# stat /run/systemd/network/10-netplan-eth2.*
  File: /run/systemd/network/10-netplan-eth2.link
  Size: 73 Blocks: 8 IO Block: 4096 regular file
Device: c0h/192d Inode: 34 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-02-01 15:26:34.294834047 +0000
Modify: 2021-02-01 15:22:29.621531946 +0000
Change: 2021-02-01 15:22:29.621531946 +0000
Birth: -
  File: /run/systemd/network/10-netplan-eth2.network
  Size: 191 Blocks: 8 IO Block: 4096 regular file
Device: c0h/192d Inode: 35 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-02-01 15:26:39.582862189 +0000
Modify: 2021-02-01 15:22:29.621531946 +0000
Change: 2021-02-01 15:22:29.621531946 +0000
Birth: -

With the addition of support for OVS in netplan I assume we also need OVS to run prior to netplan running, which kind of creates a cyclic dependency wrt. Open vSwitch's init script updating its database with FQDN from call to `hostname -f` each boot.

This is hidden on the physical host as MAAS / cloud-init manages /etc/hosts

Revision history for this message

Jeff Hillman (jhillman) wrote on 2021-02-02:

#14

I'm seeing something similar, but not identical. Not sure if it is caused by the same issue or if a new bug needs to be created.

In my scenario, on a new deploy, when I run 'openstack network agent list | grep lxd', they all show the proper long names.

Also, running 'juju run -a octavia 'ovs-vsctl get . external_ids:hostname'' also all show the proper short names.

However, running 'openstack port list | grep octavia', shows one as being DOWN.

Performing a port show on the health manager has it as binding_failed AND also shows the shortname.

Performing Frode's steps in comment #3 above, fixes the shortname, and the binding, however, the status is still DOWN, and I cannot ping any of the amphora images, or the other octavia ports from this bad octavia container.

Going into the container and running 'ip link show' has the link at UNKNOWN. Running 'ip link set down dev o-hm0; ip link set up dev o-hm0' puts the port back into an UNKNOWN state, but pings work. And Octavia can now create loadbalancer on demand.

I'm gathering a juju crash-dump at this moment.

Revision history for this message

Jeff Hillman (jhillman) wrote on 2021-02-02:

#15

juju crashdump for my env https://drive.google.com/file/d/1U28QH6wQhJJl5UDvq_HeL9DktQfA3ZiR/view

Revision history for this message

Jeff Hillman (jhillman) wrote on 2021-02-03:

#16

FWIW, I had the exact same scenario on a new install, had to up/down the interface again.

Revision history for this message

Hybrid512 (walid-moghrabi) wrote on 2021-02-08:

#17

On my side, I did deployments from scratch many times with different scenarios (HA vs non-HA, 1 subnet for all, subnet separation for internal/admin/public spaces, ...) and the result is still the same: Octavia just doesn't work.
I also tried the different workarounds and it still doesn't work, so if there is a way to make it work as it should, please, I beg you to tell me, because this is driving me crazy.

Revision history for this message

John A Meinel (jameinel) wrote on 2021-02-20:

#18

triaged as medium to be clear that we aren't actively working on this and would have to bump something we are actively working on to tackle this.
I am very concerned about charms that can only be deployed on a single provider that happens to provide an exact feature, but I do feel that containers that come up on MAAS nodes should be equivalent to hosts that come up on MAAS nodes.

Changed in juju:
importance:	Undecided → Medium
status:	New → Triaged

Revision history for this message

James Page (james-page) wrote on 2021-03-02:

#19

Frode has patches up for review for OVS upstream to avoid the automatic misconfiguration of the hostname in OVS during early boot.

Revision history for this message

James Page (james-page) wrote on 2021-03-02:

#20

This work is being tracked under bug 1915829 for the openvswitch package in Ubuntu so I'll just make that link here for traceability.

Revision history for this message

Hybrid512 (walid-moghrabi) wrote on 2021-03-02:

#21

I see this is part of OpenVSwitch 2.15 but Focal is using 2.13.
Will this be backported to Focal since this is a LTS release ?

Revision history for this message

Dmitrii Shcherbakov (dmitriis) wrote on 2021-04-05:

#22

As James mentioned, the packaging work is tracked in: https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1915829

The fix has been merged into the ubuntu/focal branch of the openvswitch package:

https://code.launchpad.net/~fnordahl/ubuntu/+source/openvswitch/+git/openvswitch/+merge/399764

However, it has not been released yet and sits in the upload queue:

https://launchpad.net/ubuntu/focal/+queue?queue_state=1&queue_text=openvswitch

In short: the plan is to make an updated version available for Focal.

Revision history for this message

Billy Olsen (billy-olsen) wrote on 2021-05-01:

#23

With the overall fix being tracked against https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1915829, which is in -proposed pockets, the charm task should be marked as invalid at this point in time, so marking it as such. It would have been nice to have openvswitch added to this bug and track it in a single bug but unfortunately that isn't the case.

Changed in charm-layer-ovn:
status:	Triaged → Invalid

Frode Nordahl (fnordahl) on 2021-07-06

summary:	- ovn-chassis subordinate to octavia registered with shortname shows down + Need for managing /etc/hosts for containers
description:	updated

Revision history for this message

Felipe Reyes (freyes) wrote on 2022-07-28:

#24

I'm adding a task for the nova-compute charm, because it is affected by this same behavior. For example, this is how the list of hypervisors looks when a machine changed its FQDN to include the domain.

This produced a mismatch between OVN and nova-compute, because ovs-record-hostname.service made OVN stick to the first hostname configured:

With this no new instances could be launched on the hypervisor since the port binding was failing since there was no chassis for "juju-6f998c-zaza-546dd56956f7-26.project.serverstack"

Revision history for this message

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

#25

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Medium → Low
tags:	added: expirebugs-bot

Felipe Reyes (freyes) on 2023-02-15

Changed in charm-nova-compute:
status:	New → Confirmed

Felipe Reyes (freyes) on 2023-02-15

Changed in charm-nova-compute:
assignee:	nobody → Felipe Reyes (freyes)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-02-15: Fix proposed to charm-nova-compute (master)

#26

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/873944

Changed in charm-nova-compute:
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-02-21: Fix merged to charm-nova-compute (master)

#27

Reviewed: https://review.opendev.org/c/openstack/charm-nova-compute/+/873944
Committed: https://opendev.org/openstack/charm-nova-compute/commit/2bad8a0522622e9da621a28912faa42efa27d033
Submitter: "Zuul (22348)"
Branch: master

commit 2bad8a0522622e9da621a28912faa42efa27d033
Author: Felipe Reyes <email address hidden>
Date: Wed Feb 15 11:43:40 2023 -0300

Use a stable hostname to render nova.conf

    OVS introduced a new service called ovs-record-hostname.service which
    records the hostname on the first start in the ovs database to identify
    the ovn chassis, this is how it achieved a stable hostname and be
    resilient to the changes in the FQDN when the DNS gets available.

    This change introduces the same approach for nova-compute charm. In the
    first run of the NovaComputeHostInfoContext the value passed in the
    context as host_fqdn is stored in the unit's kv db, and re-used on every
    subsequent call.

    This change affects only new installs since the hint to store (or not)
    the host fqdn is set in the install hook.

Change-Id: I2aa74442ec25b21201a47070077df27899465814
Closes-Bug: #1896630
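
The record-once approach described in this commit message can be sketched as follows. This is not the charm's code: the class name `StableHostname`, the dict-like store and the injectable resolver are assumptions made for illustration, and the install-hook hint mentioned in the commit message is left out.

```python
import socket


class StableHostname:
    """Resolve the host FQDN once, then keep returning the stored value."""

    KEY = "host_fqdn"

    def __init__(self, store, resolver=socket.getfqdn):
        # Assumption: 'store' is any persistent dict-like object standing in
        # for the unit's kv db mentioned in the commit message.
        self._store = store
        self._resolver = resolver

    def get(self):
        if self.KEY not in self._store:
            # First call: record whatever the resolver reports right now.
            self._store[self.KEY] = self._resolver()
        return self._store[self.KEY]


# Simulate a host whose FQDN gains a domain once DNS becomes available.
answers = iter([
    "juju-6f998c-zaza-546dd56956f7-26",
    "juju-6f998c-zaza-546dd56956f7-26.project.serverstack",
])
host = StableHostname({}, resolver=lambda: next(answers))

first = host.get()
second = host.get()
print(first, second)
```

Both calls return the first recorded name, so a later change in what DNS reports no longer alters the value used for rendering configuration.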

Changed in charm-nova-compute:
status:	In Progress → Fix Committed

Felipe Reyes (freyes) on 2023-02-24

Changed in charm-guide:
assignee:	nobody → Felipe Reyes (freyes)

OpenStack Infra (hudson-openstack) on 2023-03-17

Changed in charm-guide:
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-03-18: Fix merged to charm-guide (master)

#28

Reviewed: https://review.opendev.org/c/openstack/charm-guide/+/876854
Committed: https://opendev.org/openstack/charm-guide/commit/78fe51678587ddffd3d2a2083dd4dfe3fb6e6f90
Submitter: "Zuul (22348)"
Branch: master

commit 78fe51678587ddffd3d2a2083dd4dfe3fb6e6f90
Author: Felipe Reyes <email address hidden>
Date: Wed Mar 8 09:25:16 2023 -0300

Stable hostname for nova-compute service

    Closes-Bug: #1896630
    Change-Id: I8372d6556ee55a230e39aac479644e162d95be4f
    Depends-On: https://review.opendev.org/c/openstack/charm-nova-compute/+/873944

Changed in charm-guide:
status:	In Progress → Fix Released
