@Deins, it is definitely OK to return OCF_NOT_RUNNING from monitor, as we do right now. If Pacemaker considers a resource to be active and the OCF script returns OCF_NOT_RUNNING, then Pacemaker must start the resource again. For instance, kill RabbitMQ while no monitor op is running: the next monitor operation will return OCF_NOT_RUNNING and Pacemaker will restart RabbitMQ.
The problem here is that the lrmd daemon sends the return code of a monitor operation back to crmd (or pengine?) _only_ when it changes. If the first reported error is lost by Pacemaker, the resource is doomed to stay stuck in a broken state until the return code happens to change.
For example, in that case the following would help as well:
* change the OCF script to return OCF_SUCCESS instead of OCF_NOT_RUNNING
* wait for several monitor runs to succeed, then revert the change
lrmd would then report OCF_NOT_RUNNING again, and this time Pacemaker would most probably restart the resource.
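
To make the reasoning concrete, here is a toy model of "report only on change" in Python. It is not Pacemaker or lrmd code: the function `reported_changes` and the `lost_indexes` parameter are made up for illustration, and it assumes the behaviour described above (a notification is emitted only when the monitor return code differs from the previous one). The only real values are the OCF return codes, 0 for OCF_SUCCESS and 7 for OCF_NOT_RUNNING.

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def reported_changes(monitor_results, lost_indexes=frozenset()):
    """Return the (run_index, rc) notifications that reach the consumer.

    Hypothetical model: a notification is generated only when rc differs
    from the previous rc. Notifications whose run index is listed in
    lost_indexes are dropped in transit and never delivered.
    """
    delivered = []
    last_rc = None
    for i, rc in enumerate(monitor_results):
        if rc != last_rc:
            if i not in lost_indexes:
                delivered.append((i, rc))
            # The sender remembers rc even if the notification was lost.
            last_rc = rc
    return delivered


ok, down = OCF_SUCCESS, OCF_NOT_RUNNING

# The resource dies before run 2 and the single notification is lost:
# nothing is ever reported again, however long monitor keeps failing.
stuck = reported_changes([ok, ok, down, down, down, down], lost_indexes={2})
print(stuck)

# Workaround: fake success for a few runs (runs 4-5), then revert.
# The return code changes twice, so the failure is reported again at run 6.
workaround = reported_changes(
    [ok, ok, down, down, ok, ok, down, down], lost_indexes={2}
)
print(workaround)
```

In the first run only the initial success is delivered, which is the stuck state. In the second, the forced flip to OCF_SUCCESS and back produces a fresh OCF_NOT_RUNNING notification that Pacemaker gets a second chance to act on.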