I'm re-opening this bug as it appears that the previous fix is not sufficient to resolve the problem.
When nova-compute is related to ceph, it requests resources such as keys and pools through the ceph broker API. When a request is answered, the unit that sent it gets a response that echoes the request id it sent. The problem is that these responses go to all nova-compute units: each time a new nova-compute unit relates to ceph and sends a request, its response is sent to every unit. The units currently have no way to know whether they have already processed their own response, so they all blindly restart nova-compute (see code at [1]).
So e.g. a request looks like:
~$ juju run --unit nova-compute/0 'relation-get -r `relation-ids ceph` - nova-compute/0'
broker_req: '{"api-version": 1, "request-id": "f1c63e45-4b5e-11e7-8a92-fa163e37c682", "ops": []}'
private-address: 10.5.59.164
and a response looks like:
~$ juju run --unit nova-compute/0 'relation-get -r `relation-ids ceph` - ceph/2'
auth: cephx
broker-rsp-nova-compute-0: '{"request-id": "f1c63e45-4b5e-11e7-8a92-fa163e37c682", "exit-code": 0}'
broker-rsp-nova-compute-1: '{"request-id": "457b8757-4b5f-11e7-a324-fa163e2db5b7", "exit-code": 0}'
broker-rsp-nova-compute-2: '{"request-id": "31c08f4b-4b5f-11e7-9756-fa163e7e11e1", "exit-code": 0}'
broker-rsp-nova-compute-3: '{"request-id": "cfc17751-4b63-11e7-976d-fa163e6cc737", "exit-code": 0}'
broker-rsp-nova-compute-4: '{"request-id": "0dab0cb5-4b6b-11e7-8e4b-fa163e053776", "exit-code": 0}'
broker-rsp-nova-compute-4: '{"request-id": "3ed7d4d9-4b7b-11e7-a2b9-fa163e70844e", "exit-code": 0}'
broker_rsp: '{"request-id": "3ed7d4d9-4b7b-11e7-a2b9-fa163e70844e", "exit-code": 0}'
ceph-public-address: 10.5.59.150
key: AQAmvzdZ4ekkFhAAn1KrnC5qT9MA7HAl0Ymvnw==
private-address: 10.5.59.150
In this case the response to the request from the new unit nova-compute-4 has been delivered to all compute units, which results in all of them restarting the nova-compute service.
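To make the missing check concrete, here is a minimal sketch of how a unit could decide whether a relation change carries a new response to its own request. It is not the charm's code: the function names and the handled_ids set are hypothetical, and the only thing taken from the output above is the per-unit key naming (broker-rsp-nova-compute-0 for nova-compute/0). A real charm would also need to persist the handled ids between hook runs.

```python
import json


def unit_response_key(unit_name):
    # Hypothetical helper: maps "nova-compute/0" to "broker-rsp-nova-compute-0",
    # following the key naming seen in the relation data above.
    return "broker-rsp-" + unit_name.replace("/", "-")


def should_restart(unit_name, own_request, remote_settings, handled_ids):
    """Return True only for a new, successful response to this unit's request.

    own_request     -- JSON string this unit published as broker_req
    remote_settings -- dict of relation settings read from the ceph unit
    handled_ids     -- set of request ids already acted on (assumed to be
                       stored somewhere that survives hook invocations)
    """
    raw = remote_settings.get(unit_response_key(unit_name))
    if raw is None:
        return False  # no response addressed to this unit yet

    response = json.loads(raw)
    request_id = json.loads(own_request)["request-id"]

    if response.get("request-id") != request_id:
        return False  # response belongs to an older or different request
    if response.get("exit-code") != 0:
        return False  # request failed, nothing to restart for
    if request_id in handled_ids:
        return False  # already acted on this response

    handled_ids.add(request_id)
    return True


if __name__ == "__main__":
    request = json.dumps({
        "api-version": 1,
        "request-id": "f1c63e45-4b5e-11e7-8a92-fa163e37c682",
        "ops": [],
    })
    settings = {
        "broker-rsp-nova-compute-0": json.dumps({
            "request-id": "f1c63e45-4b5e-11e7-8a92-fa163e37c682",
            "exit-code": 0,
        }),
    }
    handled = set()
    print(should_restart("nova-compute/0", request, settings, handled))
    # A later relation change caused by another unit's request:
    print(should_restart("nova-compute/0", request, settings, handled))
```

With a check like this, a relation change triggered by nova-compute-4's request would leave nova-compute/0 alone, because its own request id is already in the handled set.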
[1] https://github.com/openstack/charm-nova-compute/blob/master/hooks/nova_compute_hooks.py#L402