Hi,
This was encountered on cs:postgresql-208. I'm not sure how frequent of an occurance this may be, or whether it might already be repaired by a newer version (I see rev 233 is available), but I wanted to share the finding just in case.
I recently had to restart some PostgreSQL units for applying package upgrades which required reboots. The cluster is a 2 node cluster for supporting Landscape.
I used the switchover action to change the master to whichever node was going to remain online. The first time I did this, things seemed to work fine. The second time I did this, apparently it did not go well.
I've analyzed logs, and if I'm reading this correctly, it looks like:
* The demoted box didn't restart into read only mode for about 45 minutes after the switchover action occurred.
* However, the promoted box restarted into read/write mode within about 18 seconds.
* This seems to indicate that both boxes may have been in read/write mode for some time, resulting in replication errors once the demoted box finally did restart.
I've provided my analysis below; I hope this is helpful.
Best Regards,
Paul Goins
----
First attempt: switchover from landscape-postgresql/0 to /1 (apparently successful)
juju: 2021-05-03T20:26:17Z: juju run -m lma landscape-postgresql/leader switchover master=landscape-postgresql/1
landscape-postgresql/0: 2021-05-03 20:26:45 UTC [1183]: [17026-1] db=,user= LOG: database system is shut down
2021-05-03 20:26:46 UTC [13363]: [4-1] db=,user= LOG: database system is ready to accept read only connections
landscape-postgresql/1: 2021-05-03 20:27:05 UTC [1507]: [8-1] db=,user= LOG: received promote request
2021-05-03 20:27:06 UTC [1396]: [16986-1] db=,user= LOG: database system is ready to accept connections
landscape-postgresql/0: Some later restarts occurred here that I was not aware of; not sure regarding the cause, but noting:
2021-05-03 22:10:10 UTC [13363]: [7-1] db=,user= LOG: database system is shut down
2021-05-03 22:11:44 UTC [4805]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-03 22:10:10 UTC
2021-05-03 22:26:20 UTC [4770]: [7-1] db=,user= LOG: database system is shut down
2021-05-03 22:29:53 UTC [1402]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-03 22:26:20 UTC
----------------------------------------------------------------------
Second attempt: switchover from landscape-postgresql/1 to /0 (not successful; /1's restart delayed, replication errors occurred)
juju: 2021-05-04T16:06:22Z: juju run -m lma landscape-postgresql/leader switchover master=landscape-postgresql/0
landscape-postgresql/0: 2021-05-04 16:06:40 UTC [1402]: [6-1] db=,user= LOG: received promote request
2021-05-04 16:06:40 UTC [1235]: [5-1] db=,user= LOG: database system is ready to accept connections
landscape-postgresql/1: 2021-05-04 16:06:40 UTC [16393]: [3-1] db=[unknown],user=_juju_repl LOG: disconnection: session time: 17:36:44.912 user=_juju_repl database= host=10.192.1.9 port=52284
[lots more disconnections]
2021-05-04 16:50:54 UTC [1396]: [17237-1] db=,user= LOG: database system is shut down
2021-05-04 16:52:03 UTC [19434]: [1-1] db=,user= LOG: database system was shut down at 2021-05-04 16:50:53 UTC
2021-05-04 16:52:03 UTC [19385]: [4-1] db=,user= LOG: database system is ready to accept read only connections
ERRORS START OCCURRING:
landscape-postgresql/0: 2021-05-04 16:52:03 UTC [26310]: [1-1] db=[unknown],user=[unknown] LOG: connection received: host=10.192.1.10 port=54244
2021-05-04 16:52:03 UTC [26310]: [2-1] db=[unknown],user=_juju_repl LOG: replication connection authorized: user=_juju_repl SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, compression=off)
2021-05-04 16:52:03 UTC [26310]: [3-1] db=[unknown],user=_juju_repl ERROR: requested starting point D8/D2000000 on timeline 2 is not in this server's history
2021-05-04 16:52:03 UTC [26310]: [4-1] db=[unknown],user=_juju_repl DETAIL: This server's history forked from timeline 2 at D8/D1270B50.
2021-05-04 16:52:03 UTC [26310]: [5-1] db=[unknown],user=_juju_repl LOG: disconnection: session time: 0:00:00.015 user=_juju_repl database= host=10.192.1.10 port=54244
(Repeats with different PIDs)
landscape-postgresql/1:
2021-05-04 16:52:03 UTC [19438]: [1-1] db=,user= LOG: fetching timeline history file for timeline 3 from primary server
2021-05-04 16:52:03 UTC [19438]: [2-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested starting point D8/D2000000 on timeline 2 is not in this server's history
DETAIL: This server's history forked from timeline 2 at D8/D1270B50.
(repeats a lot)
2021-05-04 16:52:03 UTC [19385]: [4-1] db=,user= LOG: database system is ready to accept read only connections
2021-05-04 17:22:14 UTC [19385]: [7-1] db=,user= LOG: database system is shut down
2021-05-04 17:25:39 UTC [1451]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-04 17:22:14 UTC
(retries startup, keeps failing because of the same error though.)
I'm not so sure about my judgment. Take it with a grain of salt.
Another likely useful bit of info is "juju show-status-log".
Here's output from landscape- postgresql/ 0: https:/ /pastebin. ubuntu. com/p/6Cp49CwcW b/ postgresql/ 1: https:/ /pastebin. ubuntu. com/p/F926MCyjt H/
And landscape-
Still not sure how we ended up with the timeline errors though...