Monasca-notification stays stuck in a failed state once systemd exhausts its restart attempts after a Kafka failure.
Our Kafka hosts ran out of disk space. The monasca-notification daemon quit, and systemd retried a few times and then stopped trying. We noticed the Kafka issue and lowered Kafka's retention to fit the available disk space.
A control host went down for a RAM upgrade, and no notifications were received until monasca-notification was restarted manually. After the restart we received a flood of alarms from the host going down.
Some messages did expire, though:
2017-03-14 18:41:45,126 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:45,127 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,495 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,628 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:47,749 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:12:32 2017
Monasca-notification is failing with:
LeaderNotAvailableError: TopicMetadata(topic='alarm-state-transitions', error=5, partitions=[])
Systemd giving up on retries:
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Main process exited, code=exited, status=17/n/a
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Unit entered failed state.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Failed with result 'exit-code'.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Service hold-off time over, scheduling restart.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Start request repeated too quickly.
https://gist.github.com/rmeleromira/9e99f080c0e803b48cfef5eac18d85ac
I see two possible fixes:
1. Make monasca-notification keep retrying on a Kafka disconnection instead of failing fast
2. Remove the systemd retry limit
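
To illustrate fix 1, here is a minimal sketch of a retry loop with exponential backoff. It is not taken from the monasca-notification code base: `retry_with_backoff`, its parameters, and the `connect` callable are hypothetical names used only for illustration. It assumes the daemon wraps whatever call currently raises LeaderNotAvailableError in a callable and passes the matching exception class in `retriable`.

```python
import time


def retry_with_backoff(connect, retriable=(Exception,), initial_delay=1.0,
                       max_delay=60.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect      -- hypothetical zero-argument callable that opens the
                    Kafka connection and returns the client/consumer
    retriable    -- exception classes that should trigger another attempt
    max_attempts -- None means retry forever (the behaviour proposed in fix 1)
    sleep        -- injectable so the loop can be tested without waiting
    """
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except retriable:
            # Give up only when a finite attempt budget has been used up.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With `max_attempts=None` the process never exits because of a Kafka outage, so systemd's restart limit is never reached. The capped backoff keeps the daemon from hammering Kafka while the brokers recover.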