Defunct nodes are reported as happy in network agent list

Bug #1999677 reported by Giuseppe Petralia
This bug affects 1 person
Affects                      Status      Importance  Assigned to  Milestone
OpenStack Neutron API Charm  New         Undecided   Unassigned
networking-ovn               Invalid     Undecided   Unassigned
neutron                      Incomplete  Undecided   Unassigned

Bug Description

When decommissioning a node from a cloud using Neutron and OVN, the Chassis is not removed from the OVN SB DB, and its agent always shows as happy in "openstack network agent list".
This is unexpected: the operator would expect it to show as XXX in the agent list.
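
One way to spot this condition is to compare the agent list against the hosts known to be decommissioned. The sketch below is illustrative only. It assumes that the output of "openstack network agent list -f json" exposes "ID", "Agent Type", "Host" and "Alive" fields, with "Alive" as a boolean. The function name and the hostnames are hypothetical.

```python
import json


def stale_alive_agents(agent_list_json, decommissioned_hosts):
    """Return agents still reported alive on hosts known to be gone.

    Assumes each entry has "Host" and "Alive" keys (an assumption about
    the JSON output format, not something confirmed in this report).
    """
    gone = set(decommissioned_hosts)
    agents = json.loads(agent_list_json)
    return [a for a in agents if a.get("Host") in gone and a.get("Alive") is True]


# Hypothetical sample data standing in for real CLI output.
SAMPLE = json.dumps([
    {"ID": "agent-1", "Agent Type": "OVN Controller Gateway agent",
     "Host": "node-dead.example", "Alive": True},
    {"ID": "agent-2", "Agent Type": "OVN Controller agent",
     "Host": "node-ok.example", "Alive": True},
])

for agent in stale_alive_agents(SAMPLE, ["node-dead.example"]):
    print(agent["ID"], agent["Host"], agent["Agent Type"])
```

Any agent returned by such a check is a candidate for the manual cleanup discussed in the comments.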

This is more of an upstream Neutron issue, but I am adding the charm for visibility.

Tags: bspostmortem
description: updated
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Giuseppe:

Some questions:
* What version of OpenStack are you using? Do you have [1] in your env?
* Did you wait for the agent heartbeat time to elapse?
* Can you print the output of the "agent list" command?
* How did you decommission the nodes? In other words, did you stop the ovn-controller gracefully?

When you remove a compute node (that is, you stop the ovn-controller and the OVN metadata agent), you need to manually remove those entries from the Neutron database using the CLI (openstack neutron agent delete {}).

Regards.

[1] https://review.opendev.org/q/I17aa53cea6aba8ea83187c99102a6f25fd33cfff

Giuseppe Petralia (peppepetra) wrote :

Hi Rodolfo

* we are using openstack focal-ussuri with ovn 22.03
  neutron 2:16.4.2-0ubuntu4
  ovn: 22.03.0-0ubuntu1~cloud0

* We decommissioned the node by removing it entirely from the cloud after the machine
  powered off due to hardware issues

* I can't share the agent list as it contains hostnames of running nodes in the production cloud

* ovn-controller was not stopped gracefully as the machine failed unexpectedly

After the machine died, it remained in the agent list, and the OVN Controller Gateway agent remained with State up and Alive :-) until we removed the machine manually from the OVN SB DB with:

ovn-sbctl chassis-del <hostname>

and then restarted neutron-servers.

Also, openstack neutron agent delete is not supported for OVN controller agents.

Changed in networking-ovn:
status: New → Invalid
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I'm removing the "networking-ovn" project. Neutron Ussuri contains ML2/OVN in-tree.

If the ovn-controller was not gracefully stopped, then you need to manually delete the agent from the database (that is supported since [1] in Ussuri).

Before deleting the agent, as you commented, you should manually delete the chassis from the OVN database (the ovn-controller did not perform this deletion because it was not gracefully stopped). Then you can remove the OVN agent. Because you are on Ussuri, you do not have [2], which automatically detects that the chassis record has been deleted and updates the OVN agent record. So in your case you need to:
* Remove the OVN chassis from the OVN database.
* Remove the OVN agent, or if you don't have [1][2], restart the Neutron server, as you did.
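
The two steps above can be sketched as a small helper that only builds the commands and does not run them. "ovn-sbctl chassis-del" comes from this thread. "openstack network agent delete" is the unified-CLI form of the agent deletion, and the "systemctl restart neutron-server" line (including the service name) is an assumption about the deployment; a charm-managed cloud may restart the service differently. The function name, hostname and agent ID are hypothetical.

```python
import shlex


def cleanup_commands(chassis, agent_id=None):
    """Build the command sequence for cleaning up a dead chassis.

    chassis:  chassis name/hostname to delete from the OVN SB DB.
    agent_id: Neutron agent ID to delete, if agent deletion is supported;
              if None, fall back to restarting the Neutron server.
    """
    # Step 1: remove the chassis from the OVN Southbound database.
    commands = [["ovn-sbctl", "chassis-del", chassis]]
    if agent_id is not None:
        # Step 2a: delete the OVN agent through the API.
        commands.append(["openstack", "network", "agent", "delete", agent_id])
    else:
        # Step 2b (assumed service name): restart the Neutron server instead.
        commands.append(["systemctl", "restart", "neutron-server"])
    return commands


# Print the commands for review instead of executing them.
for argv in cleanup_commands("node-dead.example"):
    print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; an operator would review the output and run each step on the appropriate host.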

The reported issue (apart from the procedure for fixing an incorrectly removed chassis) is documented and fixed in newer versions.

Please update your Ussuri version to the latest version, 16.4.2 [3]. This is the last released Ussuri tag. Please also consider bumping your OpenStack version to a newer one.

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/860247
[2] https://review.opendev.org/q/I17aa53cea6aba8ea83187c99102a6f25fd33cfff
[3] https://opendev.org/openstack/neutron/src/tag/16.4.2

Changed in neutron:
status: New → Incomplete
tags: added: bspostmortem