it is not appropriate for pacemaker_remote to check host status
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
masakari-monitors | In Progress | Wishlist | suzhengwei | | |
Bug Description
To allow for scalability to dozens or even hundreds of nodes,
pacemaker-remote was introduced[1]. A physical host running the
pacemaker-remote service is called a remote node, and
a node running the full high-availability stack of corosync
and all pacemaker components is called a cluster node.
Hostmonitor distinguishes remote nodes from cluster nodes
by setting restrict_
for pacemaker_remote to check host status since the pacemaker_remote
service can establish only one network link between a cluster node
and a remote node, on port 3121 by default.
A production environment usually has multiple networks,
such as the management, tenant and public networks.
Evacuation should be triggered when communication breaks down on
multiple networks rather than on just one. For
example, live migration might be the better action when only the tenant
network breaks down. A cluster node can establish
multiple network links by using corosync: corosync 2
supports two interfaces and corosync 3 supports more.[2]
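To illustrate the kind of decision described above, here is a minimal sketch. It is not taken from masakari-monitors; the function name, the network names and the exact policy are assumptions made only for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    LIVE_MIGRATE = "live_migrate"
    EVACUATE = "evacuate"


def choose_action(link_up):
    """Pick a recovery action from per-network link state.

    link_up maps a network name to True (reachable) or False (down).
    The names ("management", "tenant", "public") and the policy are
    hypothetical; a real deployment would define its own.
    """
    down = {name for name, up in link_up.items() if not up}
    if len(down) >= 2:
        # Several networks failed at once: treat the host as failed.
        return Action.EVACUATE
    if down == {"tenant"}:
        # The host is still manageable, so instances can be moved while running.
        return Action.LIVE_MIGRATE
    # Nothing is down, or a single non-tenant network is down: no automatic action.
    return Action.NONE
```

With a single link per remote node, a monitor only ever sees one entry of such a map, so it cannot tell these cases apart.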
In addition, it is dangerous to use pacemaker-remote in a production
environment. Specifically, the remote node is marked offline
if the pacemaker_remote service goes from active to down, and an evacuation
is triggered. This is confusing since the node itself
may still be healthy.
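The ambiguity can be made concrete with a small sketch. It assumes a hypothetical second signal, `host_reachable`, obtained independently of pacemaker_remote (for example a probe over another network); neither the function nor the state names come from masakari-monitors.

```python
def classify_remote_node(service_active, host_reachable):
    """Separate 'service stopped' from 'host failed'.

    service_active: whether the pacemaker_remote service is reported active.
    host_reachable: result of a hypothetical independent reachability probe.
    """
    if service_active:
        return "online"
    if host_reachable:
        # Only the service is down; the host and its instances may be fine.
        return "service-down"
    return "host-down"


def should_evacuate(service_active, host_reachable):
    # Evacuate only when both signals agree that the host is gone.
    return classify_remote_node(service_active, host_reachable) == "host-down"
```

A monitor that relies on the pacemaker_remote state alone cannot distinguish "service-down" from "host-down", which is the scenario described above.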
[1] https:/
[2] https:/
description: | updated |
Changed in masakari-monitors: | |
status: | Incomplete → In Progress |
importance: | Undecided → Wishlist |
assignee: | nobody → suzhengwei (sue.sam) |
In practice, however, Pacemaker Remote is used for this and other purposes. We (as in Masakari team) are aware of the limitations of the Pacemaker stack (and the needless burden of extra features it brings with it) and are actively working on introducing an alternative in the form of Consul monitoring: https://blueprints.launchpad.net/masakari/+spec/host-monitor-by-consul
I hope this answers your report because there is nothing else I can offer here (other than discussing other alternatives but none were provided in the report - corosync has the original limitation of 16 nodes, it's the very reason why Pacemaker Remote is used instead as a quick hack).