Juju does not surface errors polling CMR offers on other controllers

Bug #1958372 reported by Michele Mancioppi
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth

Bug Description

When a Juju model is consuming SAAS offers from a controller, and the latter is unceremoniously nuked and recreated, the controller consuming those offers is (a) inconsolable (its grief spams the logs like there's no tomorrow, see [1]) and (b) in utter denial (see juju status at [2]). I'd expect the SAAS offers to go to status "error" when [1] occurs. I'd also expect something more informative than "server misbehaving" in the logs, specifically something to the effect of "this is not the controller we used to talk to, they know nothing of these offers!"

[1]

controller-0: 00:10:20 ERROR juju.worker.remoterelations error in remote application worker for cos-prometheus: opening facade to remote model: cannot resolve "controller-service.controller-development.svc.cluster.local": lookup controller-service.controller-development.svc.cluster.local on 127.0.0.53:53: server misbehaving

[2]

michele@boombox:~$ juju status
Model        Controller  Cloud/Region         Version  SLA          Timestamp
borkedornot  lxd         localhost/localhost  2.9.19   unsupported  11:03:46+01:00

SAAS            Status  Store        URL
cos-grafana     active  development  admin/lma.grafana-dashboards
cos-prometheus  active  development  admin/lma.prometheus-scrape

Michele Mancioppi (michele-mancioppi) wrote :

In other words, to repro:

(1) create a LXD controller
(2) create a microk8s controller
(3) deploy something in microk8s that is offered to and consumed by the LXD controller
(4) nuke the microk8s controller and spin up another in its place
(5) see the LXD controller get really, really confused

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.24
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld) wrote :

I've tried to reproduce this on 2.9.19 (the version suggested by the bug description) and also 2.9.22 (a cmr worker shutdown bug 1952796 was fixed in 2.9.22).

On 2.9.22, on microk8s, I deployed and offered charmed-osm-mariadb-k8s.
On a LXD controller, I deployed mediawiki and related to the k8s mariadb offer.

I then ran juju kill-controller on the microk8s controller. juju status on the consuming model showed the mariadb SAAS as "terminated", and shortly afterwards it was removed from status. The logs showed the network-related "server misbehaving" error (not repeated).

I bootstrapped a new microk8s controller with the same name as before and recreated the offer and added the cross model relation again and things worked as expected.

On 2.9.19, killing the controller resulted in the mariadb SAAS showing as "terminated" but it was not removed. I recreated the microk8s controller and this time needed to remove the terminated SAAS, after which I could relate mediawiki to the k8s mariadb offer and things worked as expected.

I am doing this on a single host so the replacement microk8s controller does have the same IP address as the original one. Perhaps that's why it works for me.

Can you confirm whether the replacement microk8s controller in your scenario was created on a different host?

Michele Mancioppi (michele-mancioppi) wrote :

It might detect the SAAS offer being terminated because you do kill the controller. I don’t: I uninstall the juju and microk8s snaps directly. I realize this is kinda a “saboteur” test, but we should not assume Juju controllers are around to be killed in case of infrastructure failure.

Michele Mancioppi (michele-mancioppi) wrote :

And, no, I always create the controllers on the same host. However, the IP of MicroK8s changes because I flip-flop a bunch between networks at home and at the office.

Ian Booth (wallyworld) wrote :

I can reproduce the behaviour using 2 LXD controllers, killing the one with the offer.

When the consuming side connects to an offering controller and an error occurs, the worker which hosts that functionality will restart, because it's hard to tell whether a given error is transient; the restart causes a retry. Especially when the root cause of an error is not an internal juju issue but something network related, the best juju can do is log the fact that the API connection to the target controller could not be opened.

What juju does not do, but should, is mark the SAAS application as being in error. That needs to be fixed.
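As a rough illustration of the behaviour described above (retry on connection errors, log them, and also surface the failure on the SAAS status), here is a minimal Python sketch. It is not Juju's code: `SaasApp`, `run_remote_app_worker`, `open_facade` and the bounded `max_restarts` are all hypothetical names made up for the example, and the real worker is restarted by its runner rather than looping.

```python
from dataclasses import dataclass


@dataclass
class SaasApp:
    """Hypothetical stand-in for a consumed SAAS application."""
    name: str
    status: str = "active"
    message: str = ""


def run_remote_app_worker(app, open_facade, max_restarts=3):
    """Try to reach the offering controller, retrying as a restarted worker would.

    open_facade is an assumed callable that raises ConnectionError on failure.
    Returns the log lines produced along the way.
    """
    logs = []
    for _ in range(max_restarts):
        try:
            open_facade()
        except ConnectionError as err:
            # Existing behaviour: log the failure and let the retry happen.
            logs.append(f"error in remote application worker for {app.name}: {err}")
            # Missing behaviour: also surface the failure in status.
            app.status, app.message = "error", str(err)
            continue
        # The connection works again, so the status recovers.
        app.status, app.message = "active", ""
        break
    return logs
```

With an `open_facade` that always raises, the application ends up in "error" with the underlying message attached; once a later attempt succeeds, the status goes back to "active".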

In response, what the user can do is "juju remove-saas foo --force". The log spam should then stop, since there is no longer any need to contact the target controller once the SAAS application is gone. I think, though, there's a bug in that doing this might not stop the polling of the controller, so that should be fixed.

Ian Booth (wallyworld) wrote :

This PR will show the SAAS error if the offering controller is stopped / removed.
With LXD, I stopped the controller instance and saw the error. Then I started the instance again and the SAAS status went back to active.

https://github.com/juju/juju/pull/13637

There's a fair bit more work to do to add the ref counting needed to shut down polling of the offering controller if it goes away. That will need to be a separate PR.

Ian Booth (wallyworld) wrote :

I'll split this bug in 2 as separate PRs will be done for each issue.
This bug: as per the above PR, status will show any error polling the target controller.
Additional bug: juju does not stop polling a controller even when all offers on that controller are no longer being consumed.
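For the additional bug, the ref counting idea can be sketched roughly as follows. This is an illustrative Python sketch only, not the eventual Juju implementation; `OfferingControllerPollers` and its methods are hypothetical names invented for the example. Each consumed offer takes a reference on its offering controller, and polling of that controller stops when the last reference is released.

```python
class OfferingControllerPollers:
    """Hypothetical registry counting consumed offers per offering controller."""

    def __init__(self):
        self._refs = {}

    def add_consumer(self, controller):
        """Record one more consumed offer; return True if polling should start."""
        count = self._refs.get(controller, 0) + 1
        self._refs[controller] = count
        return count == 1

    def remove_consumer(self, controller):
        """Drop one consumed offer; return True if polling should stop."""
        count = self._refs.get(controller, 0)
        if count == 0:
            # Nothing was being consumed from this controller.
            return False
        if count == 1:
            # Last offer gone: no reason to keep polling this controller.
            del self._refs[controller]
            return True
        self._refs[controller] = count - 1
        return False

    def polled(self):
        """Return the controllers that still need polling."""
        return sorted(self._refs)
```

In the scenario from the bug description, removing both cos-grafana and cos-prometheus would release the last reference on the offering controller, and only then would polling stop.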

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
summary: - Juju does not recover from SAAS offers from controllers that do not
- exist anymore
+ Juju does not surface errors polling CMR offers on other controllers
Ian Booth (wallyworld) wrote :

This is the second bug raised: https://bugs.launchpad.net/juju/+bug/1958446

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.24 → 2.9.25
Changed in juju:
status: Fix Committed → Fix Released