Canonical Juju

juju agent needs to be more resilient

Bug #1810714 reported by Joel Sing on 2019-01-07

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Triaged	Low	Unassigned

Bug Description

Multiple bugs (most recently lp: #1810712) have left juju agents wedged and unable to operate. In particular, agents either wedge while establishing a connection to the API server or fail to reconnect to it. In any real deployment this is a significant problem and annoyance.

The current workaround is to manually restart hundreds or thousands of juju agents across hundreds of machines. In other situations the operators of the controller cannot force agents to restart, and instead have to wait for the owners of those machines to notice problems and force restarts.

The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds).
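
The external-sitter variant of this proposal could be sketched roughly as follows. This is only an illustration under stated assumptions, not how jujud works: the names `ApiWatchdog` and `stop_process` are hypothetical, the outage threshold and grace period are the example values from the description, and how the sitter learns that the API server was reached is left to the caller.

```python
import subprocess
import time


class ApiWatchdog:
    """Tracks how long the API server has been unreachable (hypothetical helper)."""

    def __init__(self, max_outage_seconds, clock=time.monotonic):
        self._max_outage = max_outage_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_ok = clock()

    def record_success(self):
        """Call whenever the agent has successfully talked to the API server."""
        self._last_ok = self._clock()

    def should_restart(self):
        """True once the outage has lasted at least max_outage_seconds."""
        return self._clock() - self._last_ok >= self._max_outage


def stop_process(proc, grace_seconds=30.0):
    """Send SIGTERM, then SIGKILL if the process has not exited in time.

    proc is a subprocess.Popen; on POSIX, terminate() sends SIGTERM and
    kill() sends SIGKILL.
    """
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

A sitter would poll `should_restart()` periodically (for example with `max_outage_seconds=2 * 3600`), call `stop_process` when it returns true, and then start a fresh process. A monotonic clock is used so that wall-clock adjustments do not trigger or suppress a restart.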

Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded, so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting jujud).
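
The "best effort, but never block the restart" requirement might look something like the sketch below. The function name `restart_with_diagnostics` and the shape of its arguments are assumptions made for illustration; each diagnostic step (storing the goroutine dump, logging, recording a metric) is represented by a plain callable supplied by the caller.

```python
def restart_with_diagnostics(diagnostic_steps, restart):
    """Run each diagnostic step, then restart regardless of failures.

    diagnostic_steps is a list of (name, callable) pairs; restart is a
    callable. Returns the (name, error) pairs for steps that failed.
    """
    failures = []
    for name, step in diagnostic_steps:
        try:
            step()
        except Exception as exc:  # deliberately broad: e.g. OSError on a full disk
            failures.append((name, repr(exc)))
    # The restart is unconditional: a failed diagnostic must not prevent it.
    restart()
    return failures
```

Note that this only guards against steps that fail by raising; a step that hangs would still delay the restart, so a real implementation would presumably also want a timeout around each step.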

Tags:

Joel Sing (jsing) on 2019-01-07

description:	updated
description:	updated

Richard Harding (rharding) wrote on 2019-01-08:

Thanks, I can't disagree with what you've got here. It's something we should discuss so that we can plot out a proper path toward some self-healing around this.

Changed in juju:
status:	New → Triaged
importance:	Undecided → Medium
importance:	Medium → Wishlist

Haw Loeung (hloeung) on 2019-01-30

description:	updated

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Wishlist → Low
tags:	added: expirebugs-bot
