1.23.2.1, mongo: document is larger than capped size

Bug #1454891 reported by David Britton
Affects      Status         Importance   Assigned to   Milestone
juju-core    Fix Released   High         Ian Booth
1.23         Fix Released   High         Ian Booth
1.24         Fix Released   High         Ian Booth

Bug Description

Nothing too special here, just a local environment running a few ubuntu services where I was testing the landscape client charm.

machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576
machine-0: 2015-05-14 00:51:03 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659092 > 1048576
machine-0: 2015-05-14 00:52:12 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659192 > 1048576
machine-0: 2015-05-14 00:53:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659240 > 1048576
machine-0: 2015-05-14 00:54:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659320 > 1048576
machine-0: 2015-05-14 00:55:30 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659372 > 1048576
machine-0: 2015-05-14 00:56:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659452 > 1048576
machine-0: 2015-05-14 00:57:40 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659552 > 1048576
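
These errors come from juju's resumer worker (juju.worker.resumer in the log), which periodically retries pending mgo/txn transactions. A hedged sketch of roughly what that retry amounts to, assuming it bottoms out in mgo/txn's Runner.ResumeAll (juju's actual wiring is not shown in this report):

    // Sketch only (not juju's actual wiring): the "cannot resume
    // transactions" errors above are reported when retrying pending
    // mgo/txn transactions fails, roughly like this:
    package resumesketch

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    // resumeAll retries every pending transaction in juju's txns collection.
    // The mongo "document is larger than capped size" errors above were
    // reported from this code path.
    func resumeAll(session *mgo.Session) error {
        runner := txn.NewRunner(session.DB("juju").C("txns"))
        return runner.ResumeAll()
    }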

Dump of environment attached. Here is the stats dump requested by menn0 as well, which didn't yield much:

MongoDB shell version: 2.6.3
connecting to: 127.0.0.1:37017/juju
{
        "ns" : "juju.txns.log",
        "count" : 38600,
        "size" : 4742664,
        "avgObjSize" : 122.86694300518134,
        "storageSize" : 10002432,
        "numExtents" : 1,
        "nindexes" : 1,
        "lastExtentSize" : 10002432,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 1324512,
        "indexSizes" : {
                "_id_" : 1324512
        },
        "capped" : true,
        "max" : NumberLong("9223372036854775807"),
        "ok" : 1
}
bye
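
The stats above look like db.txns.log.stats() output from the mongo shell. For reference, a minimal mgo sketch that pulls the same collStats document from Go (ignoring the auth and SSL options a real juju mongod would need) might look like this:

    // Sketch only: fetching the collection stats shown above via mgo.
    // Connection details are taken from the shell banner; auth/SSL omitted.
    package statsketch

    import (
        "fmt"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func printTxnsLogStats() error {
        session, err := mgo.Dial("127.0.0.1:37017")
        if err != nil {
            return err
        }
        defer session.Close()

        var stats bson.M
        // collStats is the server command behind db.txns.log.stats().
        if err := session.DB("juju").Run(bson.D{{"collStats", "txns.log"}}, &stats); err != nil {
            return err
        }
        fmt.Println("count:", stats["count"], "size:", stats["size"], "capped:", stats["capped"])
        return nil
    }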

Revision history for this message
David Britton (dpb) wrote :
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
David Britton (dpb) wrote :
Curtis Hovey (sinzui)
tags: added: mongodb
Changed in juju-core:
status: New → Triaged
milestone: none → 1.25.0
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote :

In the initial paste (private link): https://pastebin.canonical.com/131827/
log line attempted (677k) over max size(10k), printing beginning and end ... update juju.txns.stash query: { _id: { c: "lease", id: "trusty-server-leadership" }, n: "d5b7100b" } update: { $set: { txn-queue: [ "55352def91a7b10f460007b4_127104b4", "55352e1091a7b10f46000802_01edfd48",...

That looks like something having severe problems with conflicting transactions on the trusty-server-leadership document, enough that it ends up with hundreds (thousands?) of txn-queue ids (these are transactions that want to be applied to the document but cannot be cleaned up for some reason).

Also on the document:
txn-revno: -37441
I'm pretty sure txn-revno is supposed to stay positive, and that the negative sign is used to toggle some sort of boolean state:
https://github.com/go-mgo/mgo/blob/v2/txn/flusher.go#L481

Remove operations try to invert the revno, as do Insert operations (Insert because it should start at -1), and Update operations are supposed to just increment it.

So maybe one of the txns is trying to Remove the document, and a bunch of others are trying to Update it?
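
To make that scenario concrete, here is a hedged sketch (not code from juju) of the kind of conflicting mgo/txn operations being described: one transaction removing the lease document while others try to update it. The document id comes from the paste above; the update payload is illustrative.

    // Sketch only: a Remove op and an Update op racing on the same lease
    // document. Every transaction that cannot complete leaves its id in the
    // document's txn-queue, which is how the queue grows into the hundreds
    // or thousands described above.
    package leasesketch

    import (
        "time"

        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func removeLeaseOps() []txn.Op {
        return []txn.Op{{
            C:      "lease",
            Id:     "trusty-server-leadership",
            Remove: true,
        }}
    }

    func updateLeaseOps(newExpiry time.Time) []txn.Op {
        return []txn.Op{{
            C:  "lease",
            Id: "trusty-server-leadership",
            // Field path is illustrative, based on the token layout shown later.
            Update: bson.M{"$set": bson.M{"token.expiration": newExpiry}},
        }}
    }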

Revision history for this message
John A Meinel (jameinel) wrote :

Note that I think the symptom:
machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576

is just that things got out of hand internally and we ended up with so many txn ids on the document that we ran out of document size.

I have the feeling we might be running into a lost transaction, causing a bunch of pending transactions to queue up, and it is just happening on the leadership document because that is a document we try to touch very often (every 30s?).

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I restored the DB locally and ran the (fixed) PurgeMissing function over all collections. It found no problems - so it doesn't appear that we have a txn missing from the txns collection.

Still looking.
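
For context, PurgeMissing here presumably refers to mgo/txn's cleanup helper; how juju wraps it, and the collection list below, are assumptions. Against a restored dump a run looks roughly like:

    // Sketch only: running mgo/txn's PurgeMissing over a restored dump.
    // The collection list is an example, not the set juju actually checks.
    package purgesketch

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    // purgeMissing drops references to txn ids that no longer exist in the
    // txns collection. In this case it found nothing to clean up.
    func purgeMissing(session *mgo.Session) error {
        db := session.DB("juju")
        runner := txn.NewRunner(db.C("txns"))
        return runner.PurgeMissing("lease", "txns.stash")
    }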

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

There are 4953 txns in the "preparing" state, 2 in the "prepared" state, and 1 (stuck?) in the "applying" state. They are all leadership update attempts for the "ubuntu" charm.

The code that generates these txns looks potentially problematic to me:

    ops := []txn.Op{
        // First remove anything that's there.
        {
            C:      p.collectionName,
            Id:     id,
            Remove: true,
        },
        // Then insert the token.
        {
            Assert: txn.DocMissing,
            C:      p.collectionName,
            Id:     id,
            Insert: entity,
        },
    }

Removing and inserting the same id in the one txn could be triggering a problem in mgo/txn.

Could a simulated upsert be used instead?
http://stackoverflow.com/questions/24455478/simulating-an-upsert-with-mgo-txn
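
For reference, a hedged sketch of the simulated-upsert pattern from that Stack Overflow answer, adapted to the lease document; the helper name, parameters, and $set payload are illustrative, not what juju ended up shipping:

    // Sketch only: build either an insert-if-missing or update-if-present op
    // instead of the remove-then-insert pair above. The caller runs one form
    // and, on txn.ErrAborted, re-checks existence and tries the other.
    package upsertsketch

    import (
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func upsertOps(collection string, id, insertDoc interface{}, updateFields bson.M, exists bool) []txn.Op {
        if exists {
            return []txn.Op{{
                C:      collection,
                Id:     id,
                Assert: txn.DocExists,
                Update: bson.M{"$set": updateFields},
            }}
        }
        return []txn.Op{{
            C:      collection,
            Id:     id,
            Assert: txn.DocMissing,
            Insert: insertDoc,
        }}
    }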

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

For completeness here's how the stuck txns look:

{
 "_id" : ObjectId("554bb019c3557c04cc001705"),
 "s" : 1,
 "o" : [
  {
   "c" : "lease",
   "d" : "ubuntu-leadership",
   "r" : true
  },
  {
   "c" : "lease",
   "d" : "ubuntu-leadership",
   "a" : "d-",
   "i" : {
    "lastupdate" : ISODate("2015-05-07T18:34:01.647Z"),
    "token" : {
     "namespace" : "ubuntu-leadership",
     "id" : "ubuntu/0",
     "expiration" : ISODate("2015-05-07T18:35:01.647Z")
    }
   }
  }
 ]
}
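
For readers not fluent in mgo/txn's wire format, a rough decoding of those short keys; the struct tags below are assumptions based on mgo.v2/txn ("d-" being the serialized form of txn.DocMissing, and state 1 matching the "preparing" count above):

    // Sketch only: the shape the txn documents above appear to follow.
    package txndocsketch

    import "gopkg.in/mgo.v2/bson"

    type serializedTxn struct {
        Id    bson.ObjectId  `bson:"_id"`
        State int            `bson:"s"` // 1 appears to be "preparing"
        Ops   []serializedOp `bson:"o"`
    }

    type serializedOp struct {
        C      string      `bson:"c"`           // collection ("lease")
        D      interface{} `bson:"d"`           // document id ("ubuntu-leadership")
        Assert interface{} `bson:"a,omitempty"` // "d-" == txn.DocMissing
        Remove bool        `bson:"r,omitempty"` // true for the remove op
        Insert interface{} `bson:"i,omitempty"` // insert payload
    }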

Only the lastupdate and expiration values vary. Judging by the lastupdate values, the problem started around 2015-05-07 18:34, with txns continuing right up to the time the DB dump was taken.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Actually, it looks like things went bad a little before that time but certainly on 7 May.

Strangely, it seems the logs just stop from 2015-05-05 19:04:06 until 2015-05-07 14:30:51, when the resumer starts complaining. This is a local provider deployment, so perhaps the host was off or suspended?

When the resumer starts reporting problems the initial messages are:

2015-05-07 14:30:51 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:31:56 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:32:57 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1048580 > 1048576

I'm not sure what those first 2 messages mean.

Revision history for this message
William Reade (fwereade) wrote :

Hmm, yes, ISTM that that lease code will only work if the document's not already there. I'd better go take a look at that.

Revision history for this message
William Reade (fwereade) wrote :

1) Those `!loc.isNull()` messages seem to be coming out of mongo, but I'm not sure why.
2) Those transactions are (somewhat to my surprise) fine (at least in that the assert doesn't fail as I would have expected it to -- it seems that mgo/txn does handle that case).
3) Those transactions are very much *not* fine because they're unconditional writes that can be arbitrarily delayed and run in any order, rendering them... er, unhelpful for tracking lease ownership.

Can't immediately see a path from (3) to the bug, but that's what leaps out at me.

Revision history for this message
David Britton (dpb) wrote :

Note, I just filed this issue, which is happening on a non-local provider (openstack), 1.23.0. Is it the same thing? Should it be duped out here?

  bug #1457645

If someone wants to log into that machine and take a look, they certainly can. Let me know.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The symptoms for bug 1457645 are somewhat different but the problems could be related. I've added some more detail to the other ticket.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Ian Booth (wallyworld) wrote :

I'm marking this as Fix Committed on the basis that the fix for bug 1457645 may have fixed this also. Please re-open if the issue re-appears.

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released