Compute node deletes itself if rebooted without DNS
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Nova Compute Charm | New | Undecided | Unassigned |
Bug Description
Reproduced on: bionic-queens, focal-wallaby
A normally running nova-compute service with instances can suffer drastic DB damage when its FQDN changes due to external factors that may be beyond the operator's control and always have some chance of happening, such as a network outage or a DNS server issue.
What happens is that the code at [0] deletes the compute node entry in nova.compute_nodes table because the FQDN is "different" when such an external problem happens. In fact, it changes from:
"juju-b93c20-
Because the FQDN is different, the nova-compute service believes it is a different service and that the previously registered one is an orphan, and then a cascading series of mistakes follows:
1) Deletes itself from the nova.compute_nodes table
*2) Deletes the allocations from the old resource provider in nova_api/
*3) Deletes the resource provider in nova_api/
4) Registers a new compute node in nova.compute_nodes
5) Registers a new empty resource provider in nova_api/
* In queens my compute service was successfully able to perform those steps, but in wallaby I got the following errors, under the same circumstances.
2021-08-13 19:37:08.636 3300 DEBUG nova.scheduler.
2021-08-13 19:37:08.685 3300 ERROR nova.scheduler.
The cascade continues: after step (5) above the node behaves "normally", so the customer creates more instances. When the node is later restarted, it reverts to its old FQDN and the problem repeats, though somewhat differently in queens and wallaby:
wallaby: It fails to re-create the resource provider, as it had not successfully deleted the old one. nova.exception.
queens: It repeats steps 1-5, so new VMs get their allocations deleted as well, and the node is functional after another restart with its FQDN restored.
So in queens it is usable after FQDN is restored, while in wallaby it is not, and in both cases DB surgery is needed to fix all inconsistencies.
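The queens sequence described above can be illustrated with a toy model. This is only a sketch of the reported behavior, not nova's actual code: the `ToyDeployment` class, its method names, and the host names `node1.example.test` and `node1` are all hypothetical, and the real tables and placement records are reduced to a set and a dict.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeployment:
    """Toy stand-in for nova.compute_nodes and the placement records."""
    compute_nodes: set = field(default_factory=set)   # registered node names
    allocations: dict = field(default_factory=dict)   # provider name -> instance ids

    def restart_compute(self, reported_fqdn: str) -> None:
        """Mimic steps 1-5 for a service that starts up as reported_fqdn."""
        for name in list(self.compute_nodes):
            if name != reported_fqdn:
                # Steps 1-3: the old entry looks like an orphan, so it is
                # removed together with its provider and its allocations.
                self.compute_nodes.discard(name)
                self.allocations.pop(name, None)
        if reported_fqdn not in self.compute_nodes:
            # Steps 4-5: a new node and a new empty provider are registered.
            self.compute_nodes.add(reported_fqdn)
            self.allocations[reported_fqdn] = []

    def boot_instance(self, node: str, instance: str) -> None:
        """Record an allocation for a new instance on the given node."""
        self.allocations[node].append(instance)


def simulate() -> ToyDeployment:
    d = ToyDeployment()
    d.restart_compute("node1.example.test")   # healthy first start
    d.boot_instance("node1.example.test", "vm-a")
    d.restart_compute("node1")                # restart while DNS is unavailable
    d.boot_instance("node1", "vm-b")          # node looks normal, new VM created
    d.restart_compute("node1.example.test")   # DNS is back, FQDN restored
    return d


if __name__ == "__main__":
    print(simulate().allocations)
```

In this model both `vm-a` and `vm-b` would still be running on the hypervisor, yet neither has an allocation left after the final restart, which matches the "instances running without allocations" symptom.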
In the end, this issue is very annoying and it causes a lot of inconsistencies in the DB that need to be repaired through DB surgery, for such an external problem that is sometimes beyond control and has some chance of happening.
I've seen this happen many times with customers but hadn't been able to pinpoint the root cause. I would only notice a lot of allocation issues (more specifically, instances running without allocations) long after the FQDN problem had happened. By then the customer had already made many changes to restore functionality, unaware that allocations were inconsistent, and later hit other problems, such as not being able to create instances properly, as a consequence of the missing allocation entries in nova_api/
Steps to reproduce:
===================
Variation 1
~~~~~~~~~~~
- edit /etc/hosts
- add your IP, FQDN and hostname similar to example below
10.5.0.134 juju-b93c20-
- edit the FQDN to make it slightly different (in this example the correct one was maas; I changed it to maas5)
- restart nova-compute service
Variation 2
~~~~~~~~~~~
- edit your network configuration to change from DHCP to a static IP; make sure not to include DNS or a gateway, just the IP and subnet mask
- reboot node
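Before restarting the service in either variation, it may help to confirm that the name the system resolves really has changed. The sketch below uses Python's `socket.getfqdn()` as a rough proxy. That is an assumption: nova-compute may obtain its name through a different path, so this check is only indicative. The expected name `node1.example.test` and the helper names are hypothetical.

```python
import socket


def normalize(name: str) -> str:
    """Lower-case a host name and drop surrounding whitespace and a trailing dot."""
    return name.strip().rstrip(".").lower()


def fqdn_drifted(expected: str, observed: str) -> bool:
    """Return True when the observed name differs from the expected FQDN."""
    return normalize(expected) != normalize(observed)


if __name__ == "__main__":
    expected = "node1.example.test"   # hypothetical: the FQDN recorded while healthy
    observed = socket.getfqdn()       # what the resolver reports right now
    if fqdn_drifted(expected, observed):
        print(f"FQDN changed: expected {expected!r}, got {observed!r}")
    else:
        print("FQDN unchanged")
```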
I discussed this in the nova meeting [0]. The conclusion was that the code logic should not change to address this: the current behavior is a design decision, and the node should be configured in a way that prevents the problem. The following approaches were suggested to prevent it:
1) use "host" config option in nova.conf
2) set up the hostname in /etc/hosts
2a) a different canonical hostname that is not FQDN so it isn't prone to this problem
2b) set up the FQDN there to prevent the hostname from changing if there is a DNS outage
3) set up a fixed domain in /etc/domainname
I have tried option (1) above, but it does not solve the problem. The value set in the config gets overridden by the system FQDN. As mentioned by Sean Mooney in the meeting, that value comes from libvirt, which apparently reads it from the system, not from the config file.
I am yet to explore #2 and #3 above.
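For option (2b), one way to check that a hosts file pins the FQDN is to look at the first name listed for the node's IP, which hosts-file format treats as the canonical name. The following is a minimal sketch under that assumption. The function name and the `node1.example.test` entry are hypothetical, and whether this setup actually prevents the problem is untested here.

```python
from typing import Optional


def canonical_name(hosts_text: str, ip: str) -> Optional[str]:
    """Return the first name listed for ip in hosts-file text, or None."""
    for raw in hosts_text.splitlines():
        # Drop comments, then split into the address and its names.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            return fields[1]
    return None


if __name__ == "__main__":
    sample = (
        "127.0.0.1 localhost\n"
        "10.5.0.134 node1.example.test node1  # hypothetical pinned entry\n"
    )
    print(canonical_name(sample, "10.5.0.134"))
```

To check a real node, the same function could be given the contents of /etc/hosts and the node's own IP.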
[0] https://meetings.opendev.org/meetings/nova/2021/nova.2021-08-17-16.01.log.html#l-12