wal-e wal-push monitoring

Bug #1889697 reported by Haw Loeung
Affects: PostgreSQL Charm

Bug Description

We've had WAL-E wal-push stuck for weeks, and it was only noticed by a monitoring check predicting the disk would be full in X days.

postgres 20209 0.0 0.0 142312 4248 ? Ss Apr15 25:35 \_ postgres: archiver process archiving 000000030000B78400000050
postgres 29424 0.0 0.0 4456 676 ? S Jul16 0:00 | \_ sh -c /var/lib/postgresql/venv/bin/envdir /etc/postgresql/10/main/wal-e.env /var/lib/postgresql/venv/bin/wal-e wal-push pg_wal/000000030000B78400000050
postgres 29425 0.0 0.0 132964 4200 ? S Jul16 0:00 | \_ /var/lib/postgresql/venv/bin/python3 /var/lib/postgresql/venv/bin/wal-e wal-push pg_wal/000000030000B78400000050
postgres 29431 0.0 0.0 0 0 ? Z Jul16 0:00 | \_ [lzop] <defunct>

I think the WAL-E process and the PostgreSQL charm should do two things:

 1. Update a "last updated" file somewhere (touch a stamp file?) on each successful push, and ship a monitoring check that alerts when it hasn't been updated recently.

 2. Wrap wal-push in some 'timeout' so the process is killed off if it takes too long, with the next run recovering where it left off.
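Both ideas could be combined in a small archive_command wrapper. This is only a sketch: the 300-second limit and the stamp-file path are illustrative assumptions, not anything the charm currently ships, though the wal-e.env and venv paths match the process listing above.

```shell
#!/bin/sh
# Hypothetical archive_command wrapper, e.g.:
#   archive_command = '/usr/local/bin/wal-e-push-wrapper %p'
# The timeout value and stamp path are assumptions for illustration.
WAL_ENV=/etc/postgresql/10/main/wal-e.env
VENV=/var/lib/postgresql/venv/bin
STAMP=/var/lib/postgresql/wal-e-last-push

# Kill wal-push if it runs longer than 5 minutes (SIGKILL 30s after
# SIGTERM). A non-zero exit makes PostgreSQL retry the same WAL
# segment on the archiver's next cycle, so nothing is lost.
timeout --kill-after=30 300 \
    "$VENV/envdir" "$WAL_ENV" "$VENV/wal-e" wal-push "$1" || exit 1

# Record the last successful push; a freshness check (e.g. NRPE) can
# alert when this file hasn't been touched recently.
touch "$STAMP"
```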

Revision history for this message
Stuart Bishop (stub) wrote :

The number of .ready files in $PGDATA/pg_wal would be a good thing to monitor, and the age of the oldest .ready file. This will catch all stuck WAL archiving (not just WAL-E). Maybe in the telegraf subordinate, or maybe just have the PostgreSQL charm bring up its own Prometheus scrape target.
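A minimal sketch of that check, counting `.ready` files under `pg_wal/archive_status` and reporting the age of the oldest. The default pg_wal path, the function name, and the GNU `find -printf` usage are assumptions; thresholds would be up to whatever consumes the output (Nagios, telegraf, a Prometheus textfile collector).

```shell
#!/bin/sh
# check_ready [PGWAL_DIR] -- print how many WAL segments are still
# waiting to be archived, and the age in seconds of the oldest one.
# Relies on GNU find's -printf (fine on Ubuntu).
check_ready() {
    status="${1:-/var/lib/postgresql/10/main/pg_wal}/archive_status"
    # Each un-archived segment has a matching *.ready marker file.
    count=$(find "$status" -name '*.ready' 2>/dev/null | wc -l)
    # Oldest marker's mtime as epoch seconds (fractional part kept).
    oldest=$(find "$status" -name '*.ready' -printf '%T@\n' 2>/dev/null \
             | sort -n | head -n 1)
    if [ -n "$oldest" ]; then
        age=$(( $(date +%s) - ${oldest%.*} ))
    else
        age=0
    fi
    echo "ready_count=$count oldest_ready_age_seconds=$age"
}
```

A stuck archiver shows up as a steadily growing `ready_count` and an `oldest_ready_age_seconds` climbing into hours or days, which would have caught the wal-push hang above long before the disk-full prediction fired.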

Changed in postgresql-charm:
status: New → Triaged
importance: Undecided → High