Canonical Juju

juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction ObjectIdHex("stuff")

Bug #1643795 reported by Junien Fridrick on 2016-11-22

This bug affects 9 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Triaged	Low	Unassigned

Bug Description

Hi,

Today a controller agent jujud starting going crazy, using a whole lot of CPU, taking mongodb along with it.

The first sign of madness in machine-0 logs are :

2016-11-22 01:36:34 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: failed to reload transaction: not found
2016-11-22 01:36:37 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction ObjectIdHex("5833a0fb9e26051a7b966e52")

with the last line repeating quite a bit.

We restarted jujud-machine-0 and juju-db on the controller, and apparently they're back up and working, even though jujud is still consuming between 150 and 200% CPU, and complaining about the missing transaction.

This is not an HA-enabled controller, so there is a single, non-replicated mongodb database.

We haven't run mgopurge.

I tried looking for the transaction, and indeed it's gone. The following displays nothing :
juju:PRIMARY> db.txns.find(ObjectId("5833a0fb9e26051a7b966e52"))

While looking at syslog, I couldn't find where this transaction got removed. I found references to it in the "leases" collection, and the first entry for this collection is quite MASSIVE :

{ "_id" : "3076959e-34a7-4400-8a66-d8e00c070486:clock#application-leadership#", "type" : "clock", "namespace" : "application-leadership", "writers" : { "machine-0" : NumberLong("1479778593916131304") }, "model-uuid" : "3076959e-34a7-4400-8a66-d8e00c070486", "txn-revno" : NumberLong(582949), "txn-queue" : [ "580774929e26053ec3d5565b_dd10f0cb", "5808d0749e26053ec3d69e15_882d8cac", <= and more that 55000 (that's right, 55k) txn ids follow. Is that normal ?

Note that this bug appears to be similar to https://bugs.launchpad.net/juju-core/+bug/1528261

This is with juju version 2.0.2

Tags:

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-22:

Relevant logs at : https://private-fileshare.canonical.com/~axino/lp1643795

If you need anything, please let me know, thanks !

tags:	added: canonical-is
Changed in juju:
importance:	Undecided → High

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2016-11-23:

This means that MongoDB has managed to lose data and some of the lost data is mgo/txn's transaction bookkeeping data. This means that no transaction can run.

The reason for the 55000 txn-queue entries on that document is that a transaction affects that document every 30s or so but because no transactions can currently complete so pending transactions are building up.

The only solution is to shut down the controller machine agent and run mgopurge. Please make sure you use the latest version from https://github.com/juju/mgopurge. Some important fixes were made to mgopurge last week.

If you need someone to build mgopurge for you and send you the binary let me know.

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2016-11-23:

@Junien,
There is a detailed step-by-step guide on how to run mgopurge on https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge.

Please reach out if you have further questions!

Changed in juju:
status:	New → Won't Fix

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-23:

Hi,

I'll take a look at mgopurge. In the meantime, shouldn't juju be able to handle these errors more gracefully ? If you have a wiki page describing how to recover from a lost transaction, I guess this is not a rare problem.

Having an unreponsive controller that is overloading the CPU is not a good experience IMO, even if the root cause is not juju itself.

Junien Fridrick (axino) on 2016-11-23

Changed in juju:
status:	Won't Fix → New

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-23:

So I ran mgopurge (which I built from github), and it worked. Then I started the machine agent, and it's still using ~150-200% CPU with no good reason. I can't get an answer to "juju status". I can see the other agents attempting to connect, only to be met with :

{"request-id":1,"error":"login for \"unit-nrpe-0\" blocked because upgrade in progress","response":"'body redacted'"} Admin[""].Login

The first document in the "leases" collection now has around 155k IDs in txn-queue. I stopped jujud-machine-0 as I don't want it to grow more.

Full logs are at https://private-fileshare.canonical.com/~axino/lp1643795/2/

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2016-11-23:

Did mgopurge indicate that it fixed some issues?

The high CPU and lack of responsiveness is possibly because the machine agent is busy processing all those backed up transactions. If left for a while things might return to normal.

Can you start up the machine agent and monitor the counts of documents in txns collection with a "s" field (state) of 5 or 6 (aborted or completed) vs those with other state values? You should see transactions steadily moving to aborted or completed (mostly completed).

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2016-11-23:

Never mind my question above. I see from the mgopurge output you linked to that mgopurge did indeed correct some transaction data for leases collection. That's good.

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2016-11-23:

Based on the tail of the mongodb and machine agent logs you've provided, the machine agent is looking reasonably healthy and the it appears to be churning through all those backed up transactions. Giving the machine agent some time to process through them seems like the right approach.

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-23:

OK, it's back up. I'll let you know how it goes.

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-23:

#10

Another argument for stopping the machine agent (or "soft stopping" it, i.e. put it in frozen mode or something) is that you then don't have transactions getting queued and needing to be processed whenever the DB is back to OK.

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-24:

#11

Is there any way I can measure the progress of the recovery ? It's been nearly a day...

Revision history for this message

Junien Fridrick (axino) wrote on 2016-11-24:

#12

So after "just" maxxing out the CPUs for a while, juju starts taking up all the memory, and the machine becomes basically useless.

I can see the classic "2016-11-24 10:27:18 ERROR juju.worker.dependency engine.go:539 "disk-manager" manifold worker returned unexpected error: cannot list block devices: lsblk failed: fork/exec /bin/lsblk: cannot allocate memory" in the logs.

I have TRACE level logs of this happening in https://private-fileshare.canonical.com/~axino/lp1643795/3/

I'm going to have to tear down the environment tomorrow though. I guess replicating the disappearance of a mongo document is easy enough to replicate if you need further testing.

Anastasia (anastasia-macmood) on 2016-11-29

Changed in juju:
status:	New → Triaged
milestone:	none → 2.1.1

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-02-24:

#13

This work, whilst desirable, will not happen until later stage, post-2.3. I am triaging this back into wishlist.

Current workaround, when these log messages appear is to run mgopurge.

Changed in juju:
importance:	High → Wishlist
milestone:	2.1.1 → none

Revision history for this message

James Troup (elmo) wrote on 2017-03-10:

#14

Hey, so this bug is regularly causing outages for us and wakes people up at weekends. Wishlist is a very hostile status for people trying to use Juju in production.

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2017-04-12:

#15

We have changes in mgopurge v1.4 (released today) and Juju 2.2 (out soon) which make significant improvement to the way txns are pruned. This makes massively reduces the amount of txn data that Juju has to track, reducing the scope for txn corruption.

José Pekkarinen (koalinux) on 2018-01-22

tags:

added: cpe-onsite

José Pekkarinen (koalinux) on 2018-02-05

tags:

added: 4010

Revision history for this message

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

#16

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Wishlist → Low
tags:	added: expirebugs-bot

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1644396

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.