canceling destroy-model prevents further calls to destroy-model to succeed

Bug #1676279 reported by Thomas Cuthbert
38
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Anastasia

Bug Description

Trying to remove a model like:

juju destroy-model x and cancelling out is now making it impossible to remove juju models.

This is a HA'd controller environment.

summary: - juju2.1 controller not removing models correctly
+ juju2.1.1 controller not removing models correctly
information type: Public → Private
no longer affects: juju-core
information type: Private → Public
Revision history for this message
Thomas Cuthbert (tcuthbert) wrote : Re: juju2.1.1 controller not removing models correctly

Trying to list models whilst in this broken state produces:

ERROR cannot get model details: getting model xx-xxxx-xxx ("xxxx") info: settings not found

Revision history for this message
Junien Fridrick (axino) wrote :

FWIW, the model is still in the database :

juju:PRIMARY> db.models.find({"name": "foo"})
{ "_id" : "deb718c6-2a6d-4df9-880f-7c8665a3570a", "name" : "foo", "life" : 2, "owner" : "admin", "controller-uuid" : "4955af7b-a72f-4076-88e4-99854e92d50f", "migration-mode" : "", "cloud" : "REDACTED", "cloud-region" : "REDACTED", "cloud-credential" : "REDACTED", "txn-revno" : NumberLong(4), "txn-queue" : [ REDACTED ], "time-of-dying" : ISODate("2017-03-27T04:26:04Z"), "time-of-death" : ISODate("2017-03-27T04:50:57Z") }

Important fields :
"time-of-dying" : ISODate("2017-03-27T04:26:04Z")
"time-of-death" : ISODate("2017-03-27T04:50:57Z")

That's 24 minutes. Thomas said he interrupted the destroy after ~5 min, so I presume jujud finished in the background. However, as you can see, the model is still in the juju.models collection, presumably because destroying a model is a 2 steps process : first destroy all the associated ops (i.e. txns ?), then remove the model from the database (as I understand it from https://github.com/juju/juju/blob/staging/cmd/juju/model/destroy.go#L166, with DestroyModel / RemoveModel)

Indeed, if I look for units, machines, applications, settings or statuseshistory for the removed model, I can't find anything :

juju:PRIMARY> db.settings.find({"model-uuid" : "deb718c6-2a6d-4df9-880f-7c8665a3570a"}).count()
0
juju:PRIMARY> db.units.find({"model-uuid" : "deb718c6-2a6d-4df9-880f-7c8665a3570a"}).count()
0
juju:PRIMARY> db.machines.find({"model-uuid" : "deb718c6-2a6d-4df9-880f-7c8665a3570a"}).count()
0
juju:PRIMARY> db.applications.find({"model-uuid" : "deb718c6-2a6d-4df9-880f-7c8665a3570a"}).count()
0
juju:PRIMARY> db.statuseshistory.find({"model-uuid" : "deb718c6-2a6d-4df9-880f-7c8665a3570a"}).count()
0

Also, all the transactions in the txn-queue of the model are in state 6 ("applied").

I think juju should prevent this from happening, because this is breaking the "juju list-models" command. This is why this bug should at least be Important, if not Critical.

Also, very important question : for controllers impacted by this bug, is it safe to remove such models from the juju.models collection ? I can't think of another way to fix this. If you have another solution/workaround, please let us know.

Thanks !

Junien Fridrick (axino)
summary: - juju2.1.1 controller not removing models correctly
+ juju 2.1.1 controller not removing models correctly (list-models showing
+ "info: settings not found")
Ian Booth (wallyworld)
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2-beta3
Revision history for this message
Tim Penhey (thumper) wrote : Re: juju 2.1.1 controller not removing models correctly (list-models showing "info: settings not found")

Can we have access to the controller machine logs?

We are pretty sure that something is erroring out internally.

Changed in juju:
status: Triaged → Incomplete
milestone: 2.2-beta3 → none
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Whilst these symptoms and reproducible scenarios are different to the ones in the bug # 1674805, I believe that the cause is the same. The fix for one will fix the other.

I believe that the problem is that we are gathering models to list in a separate API call from the one where we retrieve the details for the gathered models. This introduces potential discrepancies in the situations where models may change their life cycle state between these calls. If we have one API call that both builds the list and populates model details, the problem will no longer manifest.

tags: added: list-models
Revision history for this message
Junien Fridrick (axino) wrote :

@anastasia-macmood : perhaps model listing/gathering could use some improvements, but if anything, I'd say this is a bug in model removal, not in model listing.

I believe this model should have been removed entirely from the juju.models collection. In fact, there are only 2 models with "life" set to 2 in this collection : this model, and another one that's causing the exact same error message.

As requested by @thumper, you will find the controller logs at : https://private-fileshare.canonical.com/~axino/lp1676279

I took the logs from all 3 controllers, grepping the model-uuid and the model name, and sorted them in time into a single file.

All the "juju.worker.dependency engine.go:547 "firewaller" manifold worker returned unexpected error: security groups matching" logs are LP#1675799 and can probably be ignored.

If you need anything else, please let us know :)

Cheers

Changed in juju:
status: Incomplete → New
tags: added: destroy-model
Changed in juju:
status: New → Triaged
milestone: none → 2.2-beta3
description: updated
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien Fridrick (axino),

All collections that may have model related data are:
* toolsmetadata;
* modelusers;
* blocks;
* cleanups;
* sequence;
* leases;
* charms;
* applications;
* units;
* minunits;
* assignUnits;
* meterStatus;
* refcounts;
* relations;
* relationscopes;
* containerRefs;
* instanceData;
* machines;
* reboot;
* sshhostkeys;
* machineremovals;
* blockdevices;
* filesystems;
* filesystemAttachments;
* storageinstances;
* storageattachments;
* volumes;
* volumeattachments;
* providerIDs;
* spaces;
* subnets;
* linklayerdevices;
* linklayerdevicesrefs;
* ip.addresses;
* endpointbindings;
* openedPorts;
* actions;
* actionnotifications;
* payloads;
* resources;
* annotations;
* settings;
* constraints;
* storageconstraints;
* statuses;
* statuseshistory.

If you are, however, sure that you want to delete anything related to a particular model from the database, Menno's advice is to run from the shell:

 db.<collection>.remove({_id: $regex: "^<model uuid>:"})

Model document can also be removed.

Once these removals are done, you should restart the controller agents since these updates are being done outside the transaction system and there may be some things running that haven't yet noticed that the model is gone.

If you have a testing environment, please trial there first :D

I am looking into the cause of the failure to destroy model in the first place.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

As a further update, trying to reproduce with Juju 2.2 tip, even if I interrupt destroy-model command, model gets eventually removed successfully without any further interactions from me.

I believe we have fixed whatever was causing destruction to fail. I am trying to figure out how else to reproduce 'list models' failure in our current codebase.

Revision history for this message
Junien Fridrick (axino) wrote :

So I just did that, there were a few leases to be removed. I also removed the model documents (2 models were impacted).

When restarting the controller agents, they started complaining though :

2017-04-05 12:59:59 ERROR juju.worker.dependency engine.go:547 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {leases deb718c6-2a6d-4df9-880f-7c8665a3570a:clock#singular-controller#} for applying transaction 58e4e56d5f5ce81549698396_4b15c248

so I re-stopped all the controllers, and ran mgopurge, which _also_ failed :

2017-04-05 13:20:10 failed stage resume: cannot find document {leases deb718c6-2a6d-4df9-880f-7c8665a3570a:clock#singular-controller#} for applying transaction 58e4e56d5f5ce81549698396_4b15c248

isn't it exactly supposed to take care of such txns ?

Anyway, the txn was only touching a removed document, so I set its state to "6" (applied), and then everything was happy.

juju models now work !

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1676279] Re: juju 2.1.1 controller not removing models correctly (list-models showing "info: settings not found")

mgopurge goes the other way. We've had cases where documents refer to a
transaction that doesn't exist, but we haven't seen transactions that refer
to documents that don't exist, so we've never written the logic to handle
that case.
It sounds like those lease documents might have been what prevented us from
deleting the model (likely a pending transaction on the model that we were
failing to run to completion).

I'm glad you got it working, though understanding how we can make sure to
let the next one work would be good. At least knowing that we might want to
look at pending transactions is a bit of a lead.

John
=:->

On Wed, Apr 5, 2017 at 6:17 PM, Junien Fridrick <email address hidden>
wrote:

> So I just did that, there were a few leases to be removed. I also
> removed the model documents (2 models were impacted).
>
> When restarting the controller agents, they started complaining though :
>
> 2017-04-05 12:59:59 ERROR juju.worker.dependency engine.go:547 "mgo-txn-
> resumer" manifold worker returned unexpected error: cannot resume
> transactions: cannot find document {leases deb718c6-2a6d-4df9-880f-
> 7c8665a3570a:clock#singular-controller#} for applying transaction
> 58e4e56d5f5ce81549698396_4b15c248
>
> so I re-stopped all the controllers, and ran mgopurge, which _also_
> failed :
>
> 2017-04-05 13:20:10 failed stage resume: cannot find document {leases
> deb718c6-2a6d-4df9-880f-7c8665a3570a:clock#singular-controller#} for
> applying transaction 58e4e56d5f5ce81549698396_4b15c248
>
> isn't it exactly supposed to take care of such txns ?
>
> Anyway, the txn was only touching a removed document, so I set its state
> to "6" (applied), and then everything was happy.
>
> juju models now work !
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1676279
>
> Title:
> juju 2.1.1 controller not removing models correctly (list-models
> showing "info: settings not found")
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1676279/+subscriptions
>

Revision history for this message
Anastasia (anastasia-macmood) wrote : Re: juju 2.1.1 controller not removing models correctly (list-models showing "info: settings not found")

List models failures described in comment # 1 are addressed in PR against develop (2.2): https://github.com/juju/juju/pull/7265

This bug now should only focus on failure to destroy models when one of the destroy-model commands is canceled, as reported originally.

summary: - juju 2.1.1 controller not removing models correctly (list-models showing
- "info: settings not found")
+ canceling destroy-model prevents further calls to destroy-model to
+ succeed
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I can no longer reproduce this: depending on when I cancel model destruction, I either get my model destroyed successfully or not. If model destruction is canceled, I can re-run the command to destroy the model again and the model gets destroyed.

I believe that the originally reported failure has been fixed as part of the fix for bug # 1661930.

If you are still seeing this failure once 2.2-beta3 is out, please clarify reproductions steps for me.

Changed in juju:
status: Triaged → Fix Committed
assignee: nobody → Anastasia (anastasia-macmood)
Revision history for this message
Paul Gear (paulgear) wrote :

Did this fix get released with 2.2.0?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Paul Gear (paulgear),
Yes :) since the fix was committed into the codebase in time for 2.2.-beta3.

Revision history for this message
Ian Booth (wallyworld) wrote :

You can tell what release a bug is fixed in by looking at the milestone to which the bug is assigned.

Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.