Conductor shutdown always triggers deregistration

Bug #1418474 reported by Mark Goddard
Affects: Ironic
Status: Fix Released
Importance: Medium
Assigned to: Mark Goddard

Bug Description

When a conductor process is shut down, it deregisters itself from the conductor database. In a multi-conductor configuration, this causes a hash ring rebalance, and other conductor processes take over ownership of any nodes previously assigned to the lost conductor. This takeover can carry a fair amount of overhead; with the PXE driver, for example, the PXE state must be configured on the new conductor. Worse yet, if the conductor restarts, another ring rebalance will occur, reverting to the initial state via another takeover.
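To illustrate the rebalance described above, here is a minimal consistent-hash ring sketch. It is not Ironic's hash ring code: the `HashRing` class, the `moved_nodes` helper, the replica count, and the choice of SHA-256 are all assumptions made for illustration. It shows only the general effect, namely that removing one conductor reassigns exactly the nodes that conductor owned, and that bringing it back moves them again.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring; not Ironic's implementation."""

    def __init__(self, conductors, replicas=16):
        # Each conductor is placed on the ring at several points.
        self._ring = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in conductors
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    def owner(self, node):
        """Return the conductor at the first ring point after the node."""
        idx = bisect.bisect(self._keys, _hash(node)) % len(self._keys)
        return self._ring[idx][1]


def moved_nodes(before, after, nodes):
    """List the nodes whose owning conductor differs between two rings."""
    return [n for n in nodes if before.owner(n) != after.owner(n)]


nodes = [f"node-{i}" for i in range(100)]
full = HashRing(["cond-a", "cond-b", "cond-c"])
degraded = HashRing(["cond-a", "cond-b"])  # cond-c has deregistered

# First rebalance: every node owned by cond-c is taken over.
taken_over = moved_nodes(full, degraded, nodes)
# Second rebalance: the same nodes move back when cond-c re-registers.
handed_back = moved_nodes(degraded, full, nodes)
```

In this sketch `taken_over` and `handed_back` contain the same nodes, which is the double takeover the report wants to avoid for short outages.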

If the shutdown period is known in advance to be short, e.g. for an upgrade, it would be advantageous for the conductor to avoid a ring rebalance. This could be done by signalling to the conductor via some mechanism that it should not deregister itself from the conductor database, but should instead allow the registration to time out. If the conductor is restarted before the registration times out, no ring rebalance will occur.
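The timeout idea can be sketched as a simple liveness check on the last recorded heartbeat. The function name and the way timestamps are passed in are hypothetical; the 60-second value mirrors the default quoted in the fix's commit message further down this report.

```python
from datetime import datetime, timedelta, timezone

# Assumed to match the 60-second default mentioned in the commit message.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def is_registration_live(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return True while the conductor's registration has not timed out."""
    return now - last_heartbeat <= timeout


last_seen = datetime(2015, 2, 5, 2, 0, 0, tzinfo=timezone.utc)

# Restarted after 30 seconds: still registered, so no rebalance is needed.
quick_restart = is_registration_live(last_seen, last_seen + timedelta(seconds=30))
# Down for 90 seconds: the registration has lapsed and others take over.
long_outage = is_registration_live(last_seen, last_seen + timedelta(seconds=90))
```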

The proposed trigger is to send SIGHUP to the conductor process.

Mark Goddard (mgoddard)
Changed in ironic:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Dmitry Tantsur (divius)
Changed in ironic:
importance: Undecided → Medium
Mark Goddard (mgoddard) wrote :

Devananda rightly pointed out that SIGHUP is not the right tool for the job.

The way I see it there are two main options:

1. A trigger that causes the process to shut down without deregistering itself.
2. A trigger that causes the process to avoid deregistering itself when it is shut down.

I favour the second approach, as it avoids giving a new purpose to an existing signal.

The mechanism for the trigger could be:

- A signal e.g. SIGUSR1/2.
- The existence of a file, possibly with some particular contents or name to ensure it is intended for that process.
- An API call.

The simplest option is the first, and I think it has some merit. Its main drawback is the lack of available signals, which might be needed for other purposes in future.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/155785

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/155785
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=ddc8d312e10342faa2518415d00ed9cbf60b372d
Submitter: Jenkins
Branch: master

commit ddc8d312e10342faa2518415d00ed9cbf60b372d
Author: Mark Goddard <email address hidden>
Date: Thu Feb 5 02:07:42 2015 +0000

    Avoid deregistering conductor following SIGUSR1

    Allow the conductor to avoid deregistering itself on shutdown, after
    receiving a SIGUSR1 signal. The registration will time out after a
    period defined by the conductor.heartbeat_timeout configuration setting
    (defaults to 60 seconds). If the conductor is restarted within this
    period, the unnecessary thrash caused by two ring rebalances will be
    avoided. This is useful in situations where the downtime is negligible,
    such as an upgrade.

    DocImpact
    Closes-bug: #1418474
    Change-Id: Ie40a7f878c2845dc9cb8fc8082df5d88adb28d0b
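
The behaviour described in the commit message can be sketched roughly as follows. This is not the merged Ironic code: `ConductorService`, its attributes, and `install_signal_handler` are hypothetical names, and the database update is replaced by a boolean flag. The sketch assumes a POSIX platform where `signal.SIGUSR1` exists.

```python
import signal


class ConductorService:
    """Hypothetical stand-in for a conductor service; not Ironic's class."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.registered = True  # stands in for the conductor database row
        self.deregister_on_shutdown = True

    def install_signal_handler(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGUSR1, self.handle_sigusr1)

    def handle_sigusr1(self, signum, frame):
        """Remember that the next shutdown should skip deregistration."""
        self.deregister_on_shutdown = False

    def stop(self):
        """Shut down, deregistering unless SIGUSR1 was received earlier."""
        if self.deregister_on_shutdown:
            self.registered = False  # a real service would update the DB here
        # Otherwise the registration is left to expire via heartbeat timeout.


# Normal shutdown: the conductor deregisters and a rebalance follows.
normal = ConductorService("cond-a")
normal.stop()

# Shutdown after SIGUSR1: the registration is left in place.
upgrade = ConductorService("cond-b")
upgrade.handle_sigusr1(signal.SIGUSR1, None)  # simulates signal delivery
upgrade.stop()
```

An operator would first send the signal to the running process (for example with `kill -USR1 <pid>`), then stop and restart the service within the heartbeat timeout.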

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-3 → 2015.1.0