Monasca

Monasca-notification stuck in failed state after disconnecting from kafka

Bug #1672864 reported by Ramon Melero on 2017-03-14

This bug affects 2 people

Affects: Monasca
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Monasca-notification stays stuck in the failed state once systemd gives up restarting it after a Kafka failure.

Our Kafka hosts ran out of disk space, the monasca-notification daemon quit, and systemd retried a few times and then stopped trying. The Kafka issue was noticed, and retention on Kafka was lowered to fit disk usage.

A control host went down for a RAM upgrade, and no notifications were received until monasca-notification was restarted manually. After the restart we received a flood of alarms from the host going down.

Some messages did expire though:

2017-03-14 18:41:45,126 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:45,127 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,495 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:46,628 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:10:16 2017
2017-03-14 18:41:47,749 WARNING monasca_notification.processors.alarm_processor Received alarm older than the ttl, skipping. Alarm from Tue Mar 14 02:12:32 2017
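
To illustrate why these alarms were skipped, here is a minimal sketch of an age-versus-TTL check. It is not the actual monasca-notification code; the function name `is_expired` and the TTL value of 14400 seconds are assumptions made for the example. The timestamps are taken from the first log line above.

```python
from datetime import datetime, timedelta

# Hypothetical TTL; the value configured in a given deployment may differ.
ALARM_TTL = timedelta(seconds=14400)


def is_expired(alarm_time, now, ttl=ALARM_TTL):
    """Return True when the alarm is older than the TTL at processing time."""
    return (now - alarm_time) > ttl


# Timestamps from the first WARNING line in the log excerpt.
alarm_time = datetime(2017, 3, 14, 2, 10, 16)
processed_at = datetime(2017, 3, 14, 18, 41, 45)

age = processed_at - alarm_time          # 16 h 31 min 29 s
expired = is_expired(alarm_time, processed_at)
```

With the daemon down for more than 16 hours, any TTL shorter than that outage would cause the alarm to be dropped once consumption resumed.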

Monasca-notification is failing with:

LeaderNotAvailableError: TopicMetadata(topic='alarm-state-transitions', error=5, partitions=[])

Systemd giving up on retries:

Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Main process exited, code=exited, status=17/n/a
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Unit entered failed state.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Failed with result 'exit-code'.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Service hold-off time over, scheduling restart.
Mar 13 16:03:05 monasca-api-container-08d216ae systemd[1]: monasca-notification.service: Start request repeated too quickly.

https://gist.github.com/rmeleromira/9e99f080c0e803b48cfef5eac18d85ac

I see two possible fixes:

1. Monasca-notification should not fail fast on a Kafka disconnection; it should keep retrying instead

2. Removing the systemd retry limit (the start rate limit behind "Start request repeated too quickly")
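
The first option could look roughly like the sketch below: retry the Kafka operation with exponential backoff instead of exiting. This is only an illustration, not a patch for monasca-notification. `TransientKafkaError` is a hypothetical stand-in for errors such as `LeaderNotAvailableError`, and `retry_with_backoff` and its parameters are names invented for the example.

```python
import time


class TransientKafkaError(Exception):
    """Hypothetical stand-in for errors such as LeaderNotAvailableError."""


def retry_with_backoff(operation, retryable=(TransientKafkaError,),
                       initial_delay=1.0, max_delay=60.0, factor=2.0,
                       max_attempts=None, sleep=time.sleep):
    """Call operation() until it succeeds, sleeping between failed attempts.

    The delay grows by `factor` after each failure and is capped at
    `max_delay`. With max_attempts=None the loop retries indefinitely,
    so the process never exits just because Kafka is unavailable.
    """
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except retryable:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up only when an explicit limit was requested
            sleep(delay)
            delay = min(delay * factor, max_delay)


# Usage example: an operation that fails three times, then succeeds.
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 4:
        raise TransientKafkaError("leader not available")
    return "connected"


recorded_delays = []
result = retry_with_backoff(flaky_connect, sleep=recorded_delays.append)
```

Capping the delay keeps recovery reasonably quick after a long outage, and because the process stays alive, systemd's start rate limit is never reached.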
