PostgreSQL charm should detect when replication is failing

Bug #1930442 reported by Paul Goins
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
PostgreSQL Charm
New
Undecided
Unassigned

Bug Description

I have a 2 node PostgreSQL cluster backing landscape-server, running on KVMs. I was preparing for a clean shutdown of the host node and so tried doing a switchover to make the master run on the machine on the host which was remaining online.

However, this failed.

Upon investigation, I determined that the previous master had been encountering replication errors for quite some time (days). Example:

2021-05-10 15:53:56 UTC [6790]: [3-1] db=[unknown],user=_juju_repl ERROR: requested timeline 4 is not in this server's history

After the attempted switchover, the intended new master unit (which was also the juju leader) ended up in an error state due to failing its replication-relation-changed hook. Checking logs from systemctl on that unit, here's an excerpt of what I found there:

May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:01 UTC [5744]: [1-1] db=[unknown],user=[unknown] LOG: connection received: host=[local]
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:01 UTC [5744]: [2-1] db=postgres,user=postgres FATAL: the database system is starting up
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:03 UTC [5758]: [1-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested timeline 4 is not in this server's
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:08 UTC [5760]: [1-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested timeline 4 is not in this server's
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:13 UTC [5771]: [1-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested timeline 4 is not in this server's
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: 2021-05-19 20:39:18 UTC [5779]: [1-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested timeline 4 is not in this server's
May 19 20:39:18 landscapesql-2 postgresql@10-main[905]: pg_ctl: server did not start in time

However, before I performed any maintenance here, "juju status" didn't indicate any problems, nor did nagios alert us for any issues. "juju status" for those units looked like this:

landscape-postgresql/0 active idle 3 10.192.1.9 5432/tcp Live master (10.16)
landscape-postgresql/1* active idle 8 10.192.1.10 5432/tcp Live secondary (10.16)

The charm deployed was cs:postgresql-208, on top of Bionic.

I think the charm should do something to detect replication failures, so as to avoid this type of issue.

Revision history for this message
Paul Goins (vultaire) wrote :

While I think this is tangential to the bug's intended issue (lack of alerting or "juju status" indication of when replication is not healthy), if you wish to know the exact context of the Juju error encountered, here is a traceback from the failed replication-relation-changed hook:

2021-06-01 16:11:04 ERROR juju-log replication:3: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/charm/reactive/postgresql/replication.py", line 635, in drain_master_and_promote_anointed
    local_offset = postgresql.wal_received_offset(postgresql.connect())
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/charm/reactive/postgresql/postgresql.py", line 130, in connect
    return psycopg2.connect(user=user, database=database, host=host, port=port_)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: the database system is starting up

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.