juju is trying to reuse dead connections to mongodb
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Fix Released | High | John A Meinel | |
| 2.2 | Fix Released | High | John A Meinel | |
| 2.3 | Fix Released | High | John A Meinel | |
Bug Description
Hi,
This is all on a juju 2.2.6 ha-enabled controller (3 units).
We discovered this morning that there were 29M documents in the "juju.txns" collection. On 2 out of 3 controllers, which hadn't been restarted for a while, juju was apparently trying, every second, to connect to mongo using a dead connection ("mongodb ip" is the IP of the mongodb primary):
2017-11-28 11:18:48 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:49 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:49 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:50 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:50 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
As you can see, it's trying to re-use the same srcIP/srcPort/dstIP/dstPort quadruplet.
However, a "sudo ss -antp|grep 59783" returned 0 result, meaning, as I understand it, that the connection was dead.
This in turn prevented juju from purging txns, which led to the 29M records we now have (we're purging them with mgopurge as I write this).
juju should probably be more clever when re-using connections. Note that this situation could be due to a mongodb primary switch that happened a few days ago.
Thanks
Changed in juju:
milestone: 2.3.1 → none

Changed in juju:
milestone: none → 2.3.2

Changed in juju:
status: Fix Committed → Fix Released
To put some specifics on it, the internal TXN log watcher is started via state/workers.go:

    ws.StartWorker(txnLogWorker, func() (worker.Worker, error) {
        return watcher.New(st.getTxnLogCollection()), nil
    })
But if you look at getTxnLogCollection:

    func (st *State) getTxnLogCollection() *mgo.Collection {
        return st.session.DB(jujuDB).C(txnLogC)
    }
That uses a static "State.session" member, and I cannot find anywhere that calls .session.Refresh().
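For illustration, a refresh-on-error hook could look roughly like the sketch below. This is only a sketch: where it would be wired into the worker restart path, and the isNetworkError helper, are assumptions, not existing juju code.

    package sketch

    import (
        "strings"

        mgo "gopkg.in/mgo.v2"
    )

    // maybeRefresh refreshes the shared session when the watcher loop dies
    // with a network error, so the restarted "txnlog" worker dials a fresh
    // socket instead of reusing the dead one.
    func maybeRefresh(session *mgo.Session, err error) {
        if err != nil && isNetworkError(err) {
            // Refresh discards the session's cached sockets; the next
            // operation on the session establishes a new connection.
            session.Refresh()
        }
    }

    // isNetworkError is a naive stand-in that matches the "i/o timeout"
    // errors in the logs above; a real fix would classify errors properly.
    func isNetworkError(err error) bool {
        return strings.Contains(err.Error(), "i/o timeout")
    }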
If we wanted to do something like:

    func (st *State) getTxnLogCollection() *mgo.Collection {
        session := st.session.Copy()
        return session.DB(jujuDB).C(txnLogC)
    }
We would still need to figure out the lifecycle of that connection, figure out who needs to close it, etc.
So it needs a bit of internal rework, probably to have the worker itself copy the session and/or refresh it when there is an error.
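A minimal sketch of that shape, assuming the worker owns a private copy of the session and is the one that closes it (the type, function names, and constant values below are illustrative, not juju's actual worker API):

    package sketch

    import (
        mgo "gopkg.in/mgo.v2"
    )

    // Assumed values for the jujuDB / txnLogC constants mentioned above.
    const (
        jujuDB  = "juju"
        txnLogC = "txns.log"
    )

    // txnLogWatcher is an illustrative worker that owns a private copy of
    // the mongo session, so a dead socket on the shared State session
    // cannot poison it across restarts.
    type txnLogWatcher struct {
        session *mgo.Session
        coll    *mgo.Collection
    }

    // newTxnLogWatcher copies the shared session; because the worker made
    // the copy, the worker is responsible for closing it.
    func newTxnLogWatcher(shared *mgo.Session) *txnLogWatcher {
        session := shared.Copy()
        return &txnLogWatcher{
            session: session,
            coll:    session.DB(jujuDB).C(txnLogC),
        }
    }

    // stop closes the worker's private session copy, which settles the
    // lifecycle question: whoever starts the worker only has to stop it.
    func (w *txnLogWatcher) stop() {
        w.session.Close()
    }

With that shape, the runner's existing restart-after-1s behaviour would pick up a fresh socket on each restart instead of inheriting the dead one.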
We're probably actually making this worse in 2.3 with the StatePool, which ensures that we'll be using the same State objects for longer, and for more purposes. That is good in the sense that we don't redo lots of work, but bad in the sense that any fatal error ends up being fatal forever.