[OVN] Hash Ring nodes removed when "periodic worker" is killed

Bug #2024205 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects: neutron
Status: In Progress
Importance: High
Assigned to: Lucas Alvares Gomes
Milestone: (none listed)

Bug Description

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2213910

In the ML2/OVN driver we set a signal handler for SIGTERM to remove the hash ring nodes upon service exit [0]. However, while investigating a customer bug, we found that if an unrelated Neutron worker is killed (such as the "periodic worker" in this case), that process may remove the entries from the ovn_hash_ring table for that hostname.

If this happens on all controllers, the ovn_hash_ring table is rendered empty and OVSDB events are no longer processed by ML2/OVN.
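The failure mode can be illustrated with a minimal sketch. This is not the Neutron code: `HashRingTable`, `add_node`, `remove_nodes_from_host` and `on_sigterm` are hypothetical names, and the table is modelled as an in-memory dict. The only assumption taken from the description is that the exit handler removes every node registered for the hostname.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class HashRingTable:
    """Hypothetical in-memory stand-in for the ovn_hash_ring table."""
    nodes: dict = field(default_factory=dict)  # node_uuid -> hostname

    def add_node(self, hostname):
        node_uuid = str(uuid.uuid4())
        self.nodes[node_uuid] = hostname
        return node_uuid

    def remove_nodes_from_host(self, hostname):
        # Drops every node for the hostname, not only the caller's own node.
        self.nodes = {u: h for u, h in self.nodes.items() if h != hostname}


def on_sigterm(table, hostname):
    # Assumed behaviour of the exit handler in the scenario described above.
    table.remove_nodes_from_host(hostname)


table = HashRingTable()
for host in ("controller-0", "controller-1"):
    for _ in range(2):  # two API workers per controller (illustrative)
        table.add_node(host)

# An unrelated worker (e.g. the periodic worker) is killed on each controller.
on_sigterm(table, "controller-0")
on_sigterm(table, "controller-1")

remaining = len(table.nodes)  # no node is left to receive OVSDB events
```

Even though no API worker was stopped, the table ends up empty because the handler works per hostname rather than per worker.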

Proposed solution:

This LP proposes making this more reliable: instead of removing the nodes from the ovn_hash_ring table on exit, we would mark them as offline. That way, if a worker dies, the nodes remain registered in the table and the heartbeat thread sets them as online again on the next beat. If the service is properly stopped, the heartbeat won't be running and the nodes will be seen as offline by the Hash Ring manager.
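A rough sketch of the proposed behaviour follows. It assumes, purely for illustration, that "offline" is modelled as a stale `updated_at` timestamp and that a node counts as online if it was touched within `HEARTBEAT_TIMEOUT` seconds. All class, method and constant names here are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value, not a Neutron default


@dataclass
class Node:
    hostname: str
    updated_at: float


class HashRingTable:
    """Hypothetical in-memory stand-in for the ovn_hash_ring table."""

    def __init__(self):
        self.nodes = {}  # node_id -> Node

    def add_node(self, node_id, hostname, now):
        self.nodes[node_id] = Node(hostname, now)

    def set_nodes_offline(self, hostname):
        # Keep the rows, but make them look stale to the ring manager.
        for node in self.nodes.values():
            if node.hostname == hostname:
                node.updated_at = 0.0

    def heartbeat(self, hostname, now):
        # The heartbeat thread refreshes every node of the host.
        for node in self.nodes.values():
            if node.hostname == hostname:
                node.updated_at = now

    def online_nodes(self, now):
        return sorted(
            node_id for node_id, node in self.nodes.items()
            if now - node.updated_at <= HEARTBEAT_TIMEOUT)


table = HashRingTable()
table.add_node("api-0", "controller-0", now=100.0)
table.add_node("api-1", "controller-0", now=100.0)

# An unrelated worker is killed: nodes are marked offline, not deleted.
table.set_nodes_offline("controller-0")
after_kill = table.online_nodes(now=110.0)

# The heartbeat thread is still running, so the next beat brings them back.
table.heartbeat("controller-0", now=120.0)
after_beat = table.online_nodes(now=125.0)

# Clean service stop: nodes go offline and no heartbeat revives them.
table.set_nodes_offline("controller-0")
after_stop = table.online_nodes(now=230.0)
rows_after_stop = len(table.nodes)
```

The rows survive in both cases; only the presence or absence of a later heartbeat decides whether the nodes are considered online again.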

As a note, upon the next startup of the service, the nodes matching the server hostname will be removed from the ovn_hash_ring table and added again as Neutron workers are spawned [1].

[0] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L295-L296
[1] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L316

Changed in neutron:
status: Fix Committed → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/886279

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.4.0

This issue was fixed in the openstack/neutron 20.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/890392
Committed: https://opendev.org/openstack/neutron/commit/fbaf313bab76949ae84fde42c62398ee2c3380b7
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit fbaf313bab76949ae84fde42c62398ee2c3380b7
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jun 16 13:44:07 2023 +0100

    [OVN] Hash Ring: Better handle Neutron worker failures

    This patch implements a more resilient approach to handle the case
    where Neutron API workers are killed and restarted. Instead of marking
    all nodes for that host as offline, this patch tries to remove the
    worker that was killed from the Hash Ring leaving all other nodes for
    that host online.

    In case we fail to remove the node and another entry is added upon the
    restart of the worker this patch also logs a clear critical log message to
    alert the operator that there are more Hash Ring nodes than API workers
    (it's expected to be the same) and that OVSDB events could go missing if
    they are routed to the previous node that failed to be removed from the
    ring.

    Closes-Bug: #2024205
    Change-Id: I4b7376cf7df45fcc6e487970b068d06b4e74e319
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 9e8e3a7867b689ca1bd462ddff294db030032350)
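
The approach described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the merged patch: `HashRing`, `add_node`, `remove_node` and `is_consistent` are hypothetical names, node IDs are plain strings, and a failed removal is simulated by simply not calling `remove_node`.

```python
import logging

LOG = logging.getLogger("hash_ring_sketch")


class HashRing:
    """Hypothetical per-host view of the Hash Ring nodes."""

    def __init__(self, hostname, api_workers):
        self.hostname = hostname
        self.api_workers = api_workers  # configured number of API workers
        self.nodes = set()

    def add_node(self, node_id):
        self.nodes.add(node_id)

    def remove_node(self, node_id):
        # Remove only the node of the worker that was killed.
        self.nodes.discard(node_id)

    def is_consistent(self):
        # The node count is expected to equal the number of API workers.
        if len(self.nodes) > self.api_workers:
            LOG.critical(
                "Host %s has %d Hash Ring nodes but only %d API workers; "
                "OVSDB events routed to a stale node could go missing.",
                self.hostname, len(self.nodes), self.api_workers)
            return False
        return True


ring = HashRing("controller-0", api_workers=2)
ring.add_node("worker-a")
ring.add_node("worker-b")

# worker-a is killed, its node is removed, and the restarted worker registers.
ring.remove_node("worker-a")
ring.add_node("worker-a2")
ok_after_clean_restart = ring.is_consistent()

# worker-b is killed but the removal fails (simulated by skipping it),
# and the restarted worker still registers a new node.
ring.add_node("worker-b2")
ok_after_failed_removal = ring.is_consistent()
```

In the second case the stale `worker-b` entry stays in the ring, which is the situation the critical log message is meant to surface to the operator.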

tags: added: in-stable-zed
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/890393
Committed: https://opendev.org/openstack/neutron/commit/ebd19805b840f591fd47be452d541407efc428b2
Submitter: "Zuul (22348)"
Branch: stable/zed

commit ebd19805b840f591fd47be452d541407efc428b2
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jun 16 13:44:07 2023 +0100

    [OVN] Hash Ring: Better handle Neutron worker failures

    This patch implements a more resilient approach to handle the case
    where Neutron API workers are killed and restarted. Instead of marking
    all nodes for that host as offline, this patch tries to remove the
    worker that was killed from the Hash Ring leaving all other nodes for
    that host online.

    In case we fail to remove the node and another entry is added upon the
    restart of the worker this patch also logs a clear critical log message to
    alert the operator that there are more Hash Ring nodes than API workers
    (it's expected to be the same) and that OVSDB events could go missing if
    they are routed to the previous node that failed to be removed from the
    ring.

    Closes-Bug: #2024205
    Change-Id: I4b7376cf7df45fcc6e487970b068d06b4e74e319
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 9e8e3a7867b689ca1bd462ddff294db030032350)

tags: added: in-stable-yoga
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/890394
Committed: https://opendev.org/openstack/neutron/commit/24a063959a543b9f21afe6e6eed9e5f27d8fa887
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 24a063959a543b9f21afe6e6eed9e5f27d8fa887
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jun 16 13:44:07 2023 +0100

    [OVN] Hash Ring: Better handle Neutron worker failures

    This patch implements a more resilient approach to handle the case
    where Neutron API workers are killed and restarted. Instead of marking
    all nodes for that host as offline, this patch tries to remove the
    worker that was killed from the Hash Ring leaving all other nodes for
    that host online.

    In case we fail to remove the node and another entry is added upon the
    restart of the worker this patch also logs a clear critical log message to
    alert the operator that there are more Hash Ring nodes than API workers
    (it's expected to be the same) and that OVSDB events could go missing if
    they are routed to the previous node that failed to be removed from the
    ring.

    Closes-Bug: #2024205
    Change-Id: I4b7376cf7df45fcc6e487970b068d06b4e74e319
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 9e8e3a7867b689ca1bd462ddff294db030032350)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/890395
Committed: https://opendev.org/openstack/neutron/commit/3c8e1b4e812272e64dddcee9ec37cb14ccc1c805
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3c8e1b4e812272e64dddcee9ec37cb14ccc1c805
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jun 16 13:44:07 2023 +0100

    [OVN] Hash Ring: Better handle Neutron worker failures

    This patch implements a more resilient approach to handle the case
    where Neutron API workers are killed and restarted. Instead of marking
    all nodes for that host as offline, this patch tries to remove the
    worker that was killed from the Hash Ring leaving all other nodes for
    that host online.

    In case we fail to remove the node and another entry is added upon the
    restart of the worker this patch also logs a clear critical log message to
    alert the operator that there are more Hash Ring nodes than API workers
    (it's expected to be the same) and that OVSDB events could go missing if
    they are routed to the previous node that failed to be removed from the
    ring.

    Closes-Bug: #2024205
    Change-Id: I4b7376cf7df45fcc6e487970b068d06b4e74e319
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 9e8e3a7867b689ca1bd462ddff294db030032350)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/890396
Committed: https://opendev.org/openstack/neutron/commit/25500a6849e3576d8e4b38a85aeb6f7f2b2bdcf7
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 25500a6849e3576d8e4b38a85aeb6f7f2b2bdcf7
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jun 16 13:44:07 2023 +0100

    [OVN] Hash Ring: Better handle Neutron worker failures

    This patch implements a more resilient approach to handle the case
    where Neutron API workers are killed and restarted. Instead of marking
    all nodes for that host as offline, this patch tries to remove the
    worker that was killed from the Hash Ring leaving all others nodes for
    that host online.

    In case the we fail to remove the node and another entry is added upon the
    restart of the worker this patch also logs a clear critical log message to
    alert the operator that there are more Hash Ring nodes than API workers
    (it's expect to be the same) and that OVSDB events could go missing if
    they are routed to the previous node that failed to be removed from the
    ring.

    Closes-Bug: #2024205
    Change-Id: I4b7376cf7df45fcc6e487970b068d06b4e74e319
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 9e8e3a7867b689ca1bd462ddff294db030032350)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b3

This issue was fixed in the openstack/neutron 23.0.0.0b3 development milestone.
