1.23.2.1, mongo: document is larger than capped size

Bug #1454891 reported by David Britton
Affects      Status         Importance   Assigned to   Milestone
juju-core    Fix Released   High         Ian Booth
1.23         Fix Released   High         Ian Booth
1.24         Fix Released   High         Ian Booth

Bug Description

Nothing too special here, just a local environment running a few ubuntu services where I was testing the landscape client charm.

machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576
machine-0: 2015-05-14 00:51:03 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659092 > 1048576
machine-0: 2015-05-14 00:52:12 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659192 > 1048576
machine-0: 2015-05-14 00:53:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659240 > 1048576
machine-0: 2015-05-14 00:54:21 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659320 > 1048576
machine-0: 2015-05-14 00:55:30 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659372 > 1048576
machine-0: 2015-05-14 00:56:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659452 > 1048576
machine-0: 2015-05-14 00:57:40 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659552 > 1048576
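
These errors come from juju's resumer worker (juju.worker.resumer in the log), which periodically retries pending mgo/txn transactions. A hedged sketch of roughly what that retry amounts to, assuming it bottoms out in mgo/txn's Runner.ResumeAll (juju's actual wiring is not shown in this report):

    // Sketch only (not juju's actual wiring): the "cannot resume
    // transactions" errors above are reported when retrying pending
    // mgo/txn transactions fails, roughly like this:
    package resumesketch

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    // resumeAll retries every pending transaction in juju's txns collection.
    // The mongo "document is larger than capped size" errors above were
    // reported from this code path.
    func resumeAll(session *mgo.Session) error {
        runner := txn.NewRunner(session.DB("juju").C("txns"))
        return runner.ResumeAll()
    }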

Dump of environment attached. Here is the stats dump requested by menn0 as well, which didn't yield much:

MongoDB shell version: 2.6.3
connecting to: 127.0.0.1:37017/juju
{
        "ns" : "juju.txns.log",
        "count" : 38600,
        "size" : 4742664,
        "avgObjSize" : 122.86694300518134,
        "storageSize" : 10002432,
        "numExtents" : 1,
        "nindexes" : 1,
        "lastExtentSize" : 10002432,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 1324512,
        "indexSizes" : {
                "_id_" : 1324512
        },
        "capped" : true,
        "max" : NumberLong("9223372036854775807"),
        "ok" : 1
}
bye
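
The stats above look like db.txns.log.stats() output from the mongo shell. For reference, a minimal mgo sketch that pulls the same collStats document from Go (ignoring the auth and SSL options a real juju mongod would need) might look like this:

    // Sketch only: fetching the collection stats shown above via mgo.
    // Connection details are taken from the shell banner; auth/SSL omitted.
    package statsketch

    import (
        "fmt"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func printTxnsLogStats() error {
        session, err := mgo.Dial("127.0.0.1:37017")
        if err != nil {
            return err
        }
        defer session.Close()

        var stats bson.M
        // collStats is the server command behind db.txns.log.stats().
        if err := session.DB("juju").Run(bson.D{{"collStats", "txns.log"}}, &stats); err != nil {
            return err
        }
        fmt.Println("count:", stats["count"], "size:", stats["size"], "capped:", stats["capped"])
        return nil
    }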

Revision history for this message
David Britton (dpb) wrote :
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
David Britton (dpb) wrote :
Curtis Hovey (sinzui)
tags: added: mongodb
Changed in juju-core:
status: New → Triaged
milestone: none → 1.25.0
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote :

In the initial paste (private link): https://pastebin.canonical.com/131827/
log line attempted (677k) over max size(10k), printing beginning and end ... update juju.txns.stash query: { _id: { c: "lease", id: "trusty-server-leadership" }, n: "d5b7100b" } update: { $set: { txn-queue: [ "55352def91a7b10f460007b4_127104b4", "55352e1091a7b10f46000802_01edfd48",...

That looks like something having severe problems with conflicting transactions on the trusty-server-leadership document, enough that it ends up with hundreds (thousands?) of txn-queue ids (these are transactions that want to be applied to the document but cannot be cleaned up for some reason).

Also on the document:
txn-revno: -37441
I'm pretty sure txn-revno is supposed to stay positive, and that the negative sign is used to toggle some sort of boolean state:
https://github.com/go-mgo/mgo/blob/v2/txn/flusher.go#L481

Remove operations try to invert the revno, as do Insert operations (Insert because it should start at -1), and Update operations are supposed to just increment it.

So maybe one of the txns is trying to Remove the document, and a bunch of others are trying to Update it?
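
To make that scenario concrete, here is a hedged sketch (not code from juju) of the kind of conflicting mgo/txn operations being described: one transaction removing the lease document while others try to update it. The document id comes from the paste above; the update payload is illustrative.

    // Sketch only: a Remove op and an Update op racing on the same lease
    // document. Every transaction that cannot complete leaves its id in the
    // document's txn-queue, which is how the queue grows into the hundreds
    // or thousands described above.
    package leasesketch

    import (
        "time"

        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func removeLeaseOps() []txn.Op {
        return []txn.Op{{
            C:      "lease",
            Id:     "trusty-server-leadership",
            Remove: true,
        }}
    }

    func updateLeaseOps(newExpiry time.Time) []txn.Op {
        return []txn.Op{{
            C:  "lease",
            Id: "trusty-server-leadership",
            // Field path is illustrative, based on the token layout shown later.
            Update: bson.M{"$set": bson.M{"token.expiration": newExpiry}},
        }}
    }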

Revision history for this message
John A Meinel (jameinel) wrote :

Note that I think the symptom:
machine-0: 2015-05-14 00:50:02 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1659012 > 1048576

is just that things got out of hand internally and we ended up with so many txn ids on the document that we ran out of document size.

I have the feeling we might be running into a lost transaction, causing a bunch of pending transactions to queue up, and it is just happening on the leadership document because that is a document we try to touch very often (every 30s?).

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I restored the DB locally and ran the (fixed) PurgeMissing function over all collections. It found no problems - so it doesn't appear that we have a txn missing from the txns collection.

Still looking.
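
For context, PurgeMissing here presumably refers to mgo/txn's cleanup helper; how juju wraps it, and the collection list below, are assumptions. Against a restored dump a run looks roughly like:

    // Sketch only: running mgo/txn's PurgeMissing over a restored dump.
    // The collection list is an example, not the set juju actually checks.
    package purgesketch

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    // purgeMissing drops references to txn ids that no longer exist in the
    // txns collection. In this case it found nothing to clean up.
    func purgeMissing(session *mgo.Session) error {
        db := session.DB("juju")
        runner := txn.NewRunner(db.C("txns"))
        return runner.PurgeMissing("lease", "txns.stash")
    }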

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

There are 4953 txns in the "preparing" state, 2 in the "prepared" state, and 1 (stuck?) in the "applying" state. They are all leadership update attempts for the "ubuntu" charm.

The code that generates these txns looks potentially problematic to me:

    ops := []txn.Op{
        // First remove anything that's there.
        {
            C:      p.collectionName,
            Id:     id,
            Remove: true,
        },
        // Then insert the token.
        {
            Assert: txn.DocMissing,
            C:      p.collectionName,
            Id:     id,
            Insert: entity,
        },
    }

Removing and inserting the same id in the one txn could be triggering a problem in mgo/txn.

Could a simulated upsert be used instead?
http://stackoverflow.com/questions/24455478/simulating-an-upsert-with-mgo-txn
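
For reference, a hedged sketch of the simulated-upsert pattern from that Stack Overflow answer, adapted to the lease document; the helper name, parameters, and $set payload are illustrative, not what juju ended up shipping:

    // Sketch only: build either an insert-if-missing or update-if-present op
    // instead of the remove-then-insert pair above. The caller runs one form
    // and, on txn.ErrAborted, re-checks existence and tries the other.
    package upsertsketch

    import (
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    func upsertOps(collection string, id, insertDoc interface{}, updateFields bson.M, exists bool) []txn.Op {
        if exists {
            return []txn.Op{{
                C:      collection,
                Id:     id,
                Assert: txn.DocExists,
                Update: bson.M{"$set": updateFields},
            }}
        }
        return []txn.Op{{
            C:      collection,
            Id:     id,
            Assert: txn.DocMissing,
            Insert: insertDoc,
        }}
    }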

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

For completeness here's how the stuck txns look:

{
 "_id" : ObjectId("554bb019c3557c04cc001705"),
 "s" : 1,
 "o" : [
  {
   "c" : "lease",
   "d" : "ubuntu-leadership",
   "r" : true
  },
  {
   "c" : "lease",
   "d" : "ubuntu-leadership",
   "a" : "d-",
   "i" : {
    "lastupdate" : ISODate("2015-05-07T18:34:01.647Z"),
    "token" : {
     "namespace" : "ubuntu-leadership",
     "id" : "ubuntu/0",
     "expiration" : ISODate("2015-05-07T18:35:01.647Z")
    }
   }
  }
 ]
}
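
For readers not fluent in mgo/txn's wire format, a rough decoding of those short keys; the struct tags below are assumptions based on mgo.v2/txn ("d-" being the serialized form of txn.DocMissing, and state 1 matching the "preparing" count above):

    // Sketch only: the shape the txn documents above appear to follow.
    package txndocsketch

    import "gopkg.in/mgo.v2/bson"

    type serializedTxn struct {
        Id    bson.ObjectId  `bson:"_id"`
        State int            `bson:"s"` // 1 appears to be "preparing"
        Ops   []serializedOp `bson:"o"`
    }

    type serializedOp struct {
        C      string      `bson:"c"`           // collection ("lease")
        D      interface{} `bson:"d"`           // document id ("ubuntu-leadership")
        Assert interface{} `bson:"a,omitempty"` // "d-" == txn.DocMissing
        Remove bool        `bson:"r,omitempty"` // true for the remove op
        Insert interface{} `bson:"i,omitempty"` // insert payload
    }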

Only the lastupdate and expiration values vary. Judging by the lastupdate values, the problem started around 2015-05-07 18:34, with txns continuing right up to the time the DB dump was taken.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Actually, it looks like things went bad a little before that time but certainly on 7 May.

Strangely, it seems the logs just stop from 2015-05-05 19:04:06 until 2015-05-07 14:30:51, when the resumer starts complaining. This is a local provider deployment, so perhaps the host was off or suspended?

When the resumer starts reporting problems the initial messages are:

2015-05-07 14:30:51 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:31:56 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: !loc.isNull()
2015-05-07 14:32:57 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: document is larger than capped size 1048580 > 1048576

I'm not sure what those first 2 messages mean.

Revision history for this message
William Reade (fwereade) wrote :

Hmm, yes, ISTM that that lease code will only work if the document's not already there. I'd better go take a look at that.

Revision history for this message
William Reade (fwereade) wrote :

1) Those `!loc.isNull()` messages seem to be coming out of mongo, but I'm not sure why.
2) Those transactions are (somewhat to my surprise) fine (at least in that the assert doesn't fail as I would have expected it to -- it seems that mgo/txn does handle that case).
3) Those transactions are very much *not* fine because they're unconditional writes that can be arbitrarily delayed and run in any order, rendering them... er, unhelpful for tracking lease ownership.

Can't immediately see a path from (3) to the bug, but that's what leaps out at me.

Revision history for this message
David Britton (dpb) wrote :

Note, I just filed this issue, which is happening on a non-local provider (openstack), 1.23.0. Is it the same thing? Should it be duped out here?

  bug #1457645

If someone wants to log into that machine and take a look, they certainly can. Let me know.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The symptoms for bug 1457645 are somewhat different but the problems could be related. I've added some more detail to the other ticket.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Ian Booth (wallyworld) wrote :

I'm marking this as Fix Committed on the basis that the fix for bug 1457645 may have fixed this also. Please re-open if the issue re-appears.

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released