mongo outage causes huge logging output
Bug #1744374 reported by Roger Peppe
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Bug Description
I just noticed this in a machine agent log:
ERROR juju.state.presence pruner.go:93 error removing old entries from "presence.pings": read tcp 172.31.
There were about 1.6 million such lines, produced at a rate of about 133 messages per second.
This was after a mongo outage, but I don't think Juju should be this verbose when that happens - it will fill up disk space very quickly (and this was on a controller with only WARNING level enabled).
I've attached a summary of the log lines, with the timestamps removed, sorted, uniq'ed, and then sorted in reverse order of frequency count.
The logs spanned about 3h30m.
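For reference, the kind of summary described above can be reproduced with a short program. The sketch below is not part of the bug report; it assumes each agent log line starts with a date field and a time field, counts the remaining message text, and prints the counts in reverse frequency order.

```go
// summarize: count unique log messages, assuming the timestamp is the
// first two whitespace-separated fields (date and time) of each line.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // allow long log lines
	for scanner.Scan() {
		// Drop the assumed date and time prefix, keep the message text.
		fields := strings.SplitN(scanner.Text(), " ", 3)
		msg := fields[len(fields)-1]
		counts[msg]++
	}
	type entry struct {
		msg string
		n   int
	}
	var entries []entry
	for msg, n := range counts {
		entries = append(entries, entry{msg, n})
	}
	// Reverse frequency order, like `sort | uniq -c | sort -rn`.
	sort.Slice(entries, func(i, j int) bool { return entries[i].n > entries[j].n })
	for _, e := range entries {
		fmt.Printf("%8d %s\n", e.n, e.msg)
	}
}
```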
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
So, we shouldn't actually be filling up the disk very much, because
a) we can't talk to the database, so that isn't filling up
b) we should be rotating our log files to a maximum size (a rotation sketch follows below).
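As background for point (b), the sketch below shows size-capped log rotation using a library such as gopkg.in/natefinch/lumberjack.v2; the file path and limits here are illustrative assumptions, not Juju's actual agent configuration.

```go
// Minimal sketch of size-capped, rotated logging (illustrative only).
package main

import (
	"log"

	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	log.SetOutput(&lumberjack.Logger{
		Filename:   "/var/log/juju/machine-0.log", // illustrative path
		MaxSize:    300,                           // megabytes per file before rotation
		MaxBackups: 2,                             // keep at most 2 rotated files
		Compress:   true,
	})
	log.Println("logging now goes to a size-capped, rotated file")
}
```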
That said, we don't want to be spamming the log file, but I'm less
concerned about that particular issue than about the fact that it looks
like we're trying to re-use a closed connection.
Namely things like:
12096 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout
12,000 lines using exactly the same source port. Now, there are 3,000
different lines, so maybe it's just a small pool that we pull from? I'm not
really sure.
We could throttle how often we try to reconnect "txnlog", which is
actually the only place with 10,000+ entries. Everything else seems to
have fewer than 30.
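To illustrate the kind of throttling suggested here, below is a generic exponential-backoff retry loop. It is not Juju's actual worker runner (which has its own restart handling); the connect function is a hypothetical stand-in for re-establishing the txnlog watcher's mongo session.

```go
// Throttled reconnection sketch: retries back off exponentially up to a cap,
// so a prolonged mongo outage yields a few log lines per minute rather than
// ~133 per second.
package main

import (
	"errors"
	"log"
	"time"
)

var attempts int

// connect is a hypothetical stand-in for reconnecting to mongo; in this
// sketch it fails a few times and then succeeds.
func connect() error {
	attempts++
	if attempts < 5 {
		return errors.New("read tcp ...:37017: i/o timeout")
	}
	return nil
}

func main() {
	delay := time.Second
	const maxDelay = time.Minute
	for {
		err := connect()
		if err == nil {
			break
		}
		log.Printf("cannot connect, retrying in %v: %v", delay, err)
		time.Sleep(delay)
		delay *= 2 // exponential backoff between attempts...
		if delay > maxDelay {
			delay = maxDelay // ...capped at one minute
		}
	}
	log.Println("connected")
}
```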
John
=:->