juju is trying to reuse dead connections to mongodb
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Fix Released | High | John A Meinel | |
| 2.2 | Fix Released | High | John A Meinel | |
| 2.3 | Fix Released | High | John A Meinel | |
Bug Description
Hi,
This is all on a juju 2.2.6 ha-enabled controller (3 units).
We discovered this morning that there were 29M documents in the "juju.txns" collection. On 2 out of 3 controllers, which hadn't been restarted for a while, juju was apparently trying, every second, to connect to mongo using a dead connection ("mongodb ip" is the IP of the mongodb primary):
2017-11-28 11:18:48 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:49 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:49 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:50 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:50 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
As you can see, it's trying to re-use the same srcIP/srcPort/dstIP/dstPort quadruplet.
However, a "sudo ss -antp|grep 59783" returned 0 result, meaning, as I understand it, that the connection was dead.
This in turn prevented juju from purging txns, which led to the 29M records we now have (we're purging them with mgopurge as I write this).
juju should probably be more clever when re-using connections. Note that this situation could be due to a mongodb primary switch that happened a few days ago.
Thanks
Changed in juju:
milestone: 2.3.1 → none

Changed in juju:
milestone: none → 2.3.2

Changed in juju:
status: Fix Committed → Fix Released
To put some specifics on it, the internal TXN log watcher is started via state/workers.go:

    ws.StartWorker(txnLogWorker, func() (worker.Worker, error) {
        return watcher.New(st.getTxnLogCollection()), nil
    })
But if you look at getTxnLogCollection:

    func (st *State) getTxnLogCollection() *mgo.Collection {
        return st.session.DB(jujuDB).C(txnLogC)
    }
That uses a static "State.session" member, and I cannot find anywhere that calls .session.Refresh().
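For illustration, a refresh-on-error hook could look roughly like the sketch below. This is only a sketch: where it would be wired into the worker restart path, and the isNetworkError helper, are assumptions, not existing juju code.

    package sketch

    import (
        "strings"

        mgo "gopkg.in/mgo.v2"
    )

    // maybeRefresh refreshes the shared session when the watcher loop dies
    // with a network error, so the restarted "txnlog" worker dials a fresh
    // socket instead of reusing the dead one.
    func maybeRefresh(session *mgo.Session, err error) {
        if err != nil && isNetworkError(err) {
            // Refresh discards the session's cached sockets; the next
            // operation on the session establishes a new connection.
            session.Refresh()
        }
    }

    // isNetworkError is a naive stand-in that matches the "i/o timeout"
    // errors in the logs above; a real fix would classify errors properly.
    func isNetworkError(err error) bool {
        return strings.Contains(err.Error(), "i/o timeout")
    }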
If we wanted to do something like:

    func (st *State) getTxnLogCollection() *mgo.Collection {
        session := st.session.Copy()
        return session.DB(jujuDB).C(txnLogC)
    }
We would still need to figure out the lifecycle of that connection, figure out who needs to close it, etc.
So it needs a bit of internal rework, probably to have the worker itself copy the session and/or refresh it when there is an error.
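A minimal sketch of that shape, assuming the worker owns a private copy of the session and is the one that closes it (the type, function names, and constant values below are illustrative, not juju's actual worker API):

    package sketch

    import (
        mgo "gopkg.in/mgo.v2"
    )

    // Assumed values for the jujuDB / txnLogC constants mentioned above.
    const (
        jujuDB  = "juju"
        txnLogC = "txns.log"
    )

    // txnLogWatcher is an illustrative worker that owns a private copy of
    // the mongo session, so a dead socket on the shared State session
    // cannot poison it across restarts.
    type txnLogWatcher struct {
        session *mgo.Session
        coll    *mgo.Collection
    }

    // newTxnLogWatcher copies the shared session; because the worker made
    // the copy, the worker is responsible for closing it.
    func newTxnLogWatcher(shared *mgo.Session) *txnLogWatcher {
        session := shared.Copy()
        return &txnLogWatcher{
            session: session,
            coll:    session.DB(jujuDB).C(txnLogC),
        }
    }

    // stop closes the worker's private session copy, which settles the
    // lifecycle question: whoever starts the worker only has to stop it.
    func (w *txnLogWatcher) stop() {
        w.session.Close()
    }

With that shape, the runner's existing restart-after-1s behaviour would pick up a fresh socket on each restart instead of inheriting the dead one.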
We're probably actually making this worse in 2.3 with the StatePool, which ensures that we'll be using the same State objects for longer, and for more purposes. That is good in the sense that we don't redo lots of work, but bad in the sense that any fatal error ends up being fatal forever.