Monasca-notification stuck in failed state after disconnecting from Kafka

Bug #1672864 reported by Ramon Melero
This bug affects 2 people
Affects: Monasca
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Monasca-notification stays stuck in a failed state once systemd gives up restarting it after a Kafka failure.

Our Kafka hosts ran out of disk space. The monasca-notification daemon exited, systemd retried the restart a few times, and then stopped trying. We noticed the Kafka issue and lowered the retention on Kafka to fit the disk usage.

Later, a control host went down for a RAM upgrade, and no notifications were received until monasca-notification was restarted manually. After the restart we received a flood of alarms from the host going down.

Some messages did expire though:

2017-03-14 18:41:45,126 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:45,127 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,495 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,628 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:47,749 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:12:32 2017
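
The warnings above suggest a simple age check: an alarm is dropped when it is older than a configured TTL at the time it is processed. A minimal sketch of that check, using the timestamps from the first log line. The 4-hour TTL and the name `is_expired` are assumptions for illustration, not taken from the monasca-notification source or from this deployment's configuration.

```python
from datetime import datetime, timedelta

# Assumed TTL for illustration only; the real value comes from the
# monasca-notification configuration, which is not shown in this report.
ASSUMED_TTL = timedelta(hours=4)


def is_expired(alarm_time, now, ttl=ASSUMED_TTL):
    """Return True when the alarm is older than the TTL at processing time."""
    return now - alarm_time > ttl


# Timestamps taken from the first warning in the log excerpt.
alarm_time = datetime(2017, 3, 14, 2, 10, 16)
processed_at = datetime(2017, 3, 14, 18, 41, 45)

# The alarm waited in Kafka for roughly 16.5 hours before being read.
age = processed_at - alarm_time
print(age, is_expired(alarm_time, processed_at))
```

Under this assumption, any alarm that queued up while the daemon was down for longer than the TTL is silently skipped after the manual restart, which matches the mix of expired and flooded alarms described here.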

Monasca-notification is failing with:

LeaderNotAvailableError: TopicMetadata(topic='alarm-state-transitions', error=5, partitions=[])

Systemd giving up on retries:

Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Main process exited, code=exited, status=17/n/a
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Unit entered failed state.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Failed with result 'exit-code'.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Service hold-off time over, scheduling restart.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Start request repeated too quickly.
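
"Start request repeated too quickly" is systemd's start rate limit: once more than a set number of starts happen inside a time window, further starts are refused and the unit stays failed. A simplified model of that behaviour is sketched below. The window of 10 seconds and burst of 5 are assumed here as the usual systemd defaults (`StartLimitIntervalSec`, `StartLimitBurst`); the values in effect on this host are not shown in the report, and `simulate_start_limit` is a hypothetical helper, not systemd code.

```python
def simulate_start_limit(start_times, interval=10.0, burst=5):
    """Simplified fixed-window model of systemd's start rate limit.

    start_times: ascending timestamps (seconds) of start attempts.
    Returns one bool per attempt: True if allowed, False if refused.
    interval and burst are assumed defaults, not values read from this host.
    """
    window_begin = None
    count = 0
    allowed = []
    for t in start_times:
        # Open a new window on the first attempt or once the old one has passed.
        if window_begin is None or t > window_begin + interval:
            window_begin = t
            count = 0
        if count < burst:
            count += 1
            allowed.append(True)
        else:
            allowed.append(False)
    return allowed


# Six attempts within the same second: the sixth is refused.
print(simulate_start_limit([0, 0, 0, 0, 0, 0]))

# The same six attempts spaced 3 seconds apart never exceed the burst.
print(simulate_start_limit([0, 3, 6, 9, 12, 15]))
```

In this model a daemon that exits immediately on every start exhausts the burst within a second, which is consistent with all five journal lines sharing the 16:03:05 timestamp.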

https://gist.github.com/rmeleromira/9e99f080c0e803b48cfef5eac18d85ac

I see two possible fixes:

1. Make monasca-notification stop failing fast on a Kafka disconnection and keep retrying instead

2. Remove the systemd retry limit
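
The first option could look roughly like the sketch below: retry the Kafka connection with exponential backoff instead of exiting on the first error. This is only an illustration of the idea, not the monasca-notification code. `connect_with_backoff` and its parameters are hypothetical, and `LeaderNotAvailableError` is a local stand-in for the Kafka client exception of the same name so that the example runs without a Kafka library installed.

```python
import time


class LeaderNotAvailableError(Exception):
    """Local stand-in for the Kafka client's LeaderNotAvailableError."""


def connect_with_backoff(connect, initial_delay=1.0, max_delay=60.0,
                         sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts.

    connect: hypothetical zero-argument callable that returns a consumer
             or raises LeaderNotAvailableError.
    The delay doubles after each failure and is capped at max_delay.
    """
    delay = initial_delay
    while True:
        try:
            return connect()
        except LeaderNotAvailableError:
            # Kafka has no leader for the topic yet; wait and try again
            # rather than exiting and relying on systemd to restart us.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With this approach the process stays alive through a Kafka outage, so systemd's restart limit is never reached. The trade-off is that a permanently broken Kafka configuration no longer shows up as a failed unit, so the retry loop would need to log each failure.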
