Ironic node rebalance race can lead to missing compute nodes in DB
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | High | Mark Goddard | |
| Ocata | New | Undecided | Unassigned | |
| Pike | New | Undecided | Unassigned | |
| Queens | New | Undecided | Unassigned | |
| Rocky | New | Undecided | Unassigned | |
| Stein | New | Undecided | Unassigned | |
| Train | In Progress | Undecided | Unassigned | |
| Ussuri | In Progress | High | Mark Goddard | |
Bug Description
===============
There is a race condition in nova-compute with the ironic virt driver as nodes get rebalanced between compute services. It can lead to compute node records being removed from the DB and not repopulated. Ultimately this prevents instances from being scheduled to those nodes.
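The rebalance behind this race comes from the ironic driver dividing nodes among nova-compute services with a hash ring: when services join or leave, some nodes map to a different service. A toy sketch of that effect follows; the real driver uses ironic's hash ring implementation, and `ring_owner` here is a made-up stand-in, not nova's API:

```python
# Illustrative only: a toy "closest hash wins" ring showing why node
# ownership moves between compute services when membership changes.
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def ring_owner(node, hosts):
    # Assign the node to the live host "closest" to it on the ring.
    return min(hosts, key=lambda host: (_h(host) - _h(node)) % 2**128)

nodes = ["node-%d" % i for i in range(6)]
before = {n: ring_owner(n, ["c1", "c2", "c3"]) for n in nodes}
after = {n: ring_owner(n, ["c1", "c3"]) for n in nodes}  # c2 goes down
moved = sorted(n for n in nodes if before[n] != after[n])
# Only nodes that c2 owned change hands; the surviving services pick
# them up, which is the "rebalance" this bug report refers to.
```

With this scheme only the departed service's nodes move, but in nova the reassignment is what triggers each service to "move" the compute node record to itself, setting up the race below.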
Steps to reproduce
==================
* Deploy nova with multiple nova-compute services managing ironic.
* Create some bare metal nodes in ironic and make them 'available' (the issue does not reproduce if they are 'active')
* Stop all nova-compute services
* Wait for all nova-compute services to be DOWN in 'openstack compute service list'
* Simultaneously start all nova-compute services
Expected results
================
All ironic nodes appear as hypervisors in 'openstack hypervisor list'
Actual results
==============
One or more nodes may be missing from 'openstack hypervisor list'. This is most easily checked via 'openstack hypervisor list | wc -l'
Environment
===========
OS: CentOS 7.6
Hypervisor: ironic
Nova: 18.2.0, plus a handful of backported patches
Logs
====
I grabbed some relevant logs from one incident of this issue. They are split between two compute services (c1 and c3), and I have tried to make that clear, including a summary of what happened at each point.
http://
tl;dr
c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_
c1: 19:14:56 Finds no compute record in RT, 'moves' existing node from c3
c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
c3: 19:15:54 Finds no compute record in RT, 'moves' existing node from c1
c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
c3: 19:16:56 Creates resource provider
c3: 19:17:56 Uses existing resource provider
There are two major problems here:
* c1 deletes the orphan compute node record after c3 has taken ownership of it
* c3 assumes that another compute service will not delete its compute node records. Once a node is in rt.compute_nodes, it is not removed from the cache unless the node is orphaned, so c3 never notices the record is gone and never recreates it
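The interaction of the two problems can be sketched with a toy model. Everything here (`FakeResourceTracker`, `claim`, the `db` dict) is an illustrative stand-in for nova's resource tracker and compute_nodes table, not the real API:

```python
db = {}  # node name -> owning service; stands in for the compute_nodes table

class FakeResourceTracker:
    """Toy model of a compute service's resource tracker cache."""
    def __init__(self, host):
        self.host = host
        self.compute_nodes = {}  # local cache, mirrors rt.compute_nodes

    def claim(self, node):
        # No record in the local cache: create or "move" the DB record to us.
        db[node] = self.host
        self.compute_nodes[node] = True

# c3 starts up, finds no record, creates the compute node record.
c3 = FakeResourceTracker("c3")
c3.claim("node-1")

# Rebalance: c1 now thinks it owns node-1 and "moves" the record.
c1 = FakeResourceTracker("c1")
c1.claim("node-1")

# c1's periodic task snapshots the records it currently owns...
c1_snapshot = [n for n, owner in db.items() if owner == "c1"]

# ...but before the cleanup runs, the ring moves node-1 back to c3.
c3.compute_nodes.pop("node-1", None)  # c3 dropped its stale cache entry
c3.claim("node-1")                    # and reclaimed the node

# c1 deletes its "orphans" from the stale snapshot without rechecking
# ownership, destroying the record that now belongs to c3 (problem 1).
for node in c1_snapshot:
    db.pop(node, None)

# node-1 is still in c3's cache, so c3 never recreates the record
# (problem 2): the node is missing from the DB until c3 restarts.
assert "node-1" not in db
assert "node-1" in c3.compute_nodes
```

The sequence compresses the timestamps from the log above; the key point is that c1's orphan deletion uses a stale view of ownership while c3's cache hides the loss.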
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
tags: added: ironic resource-tracker
Fix proposed to branch: master
Review: https://review.opendev.org/694802