Instance fails to start, race condition prevents guestagent stopping postgresql
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack DBaaS (Trove) | New | Undecided | Unassigned |
Bug Description
This is master from a week or so ago (last commit 24 Jun). I'm running it via devstack. The host VM is Ubuntu bionic and the guest image is xenial with Postgres 9.6. The scenario is:
- create a database instance
- the instance goes to FAILED status (Service not active, status: failed to spawn)
It hits the guestagent timeout waiting for shutdown (from guestagent log):
2019-07-25 03:06:08.872 ERROR trove.guestagen
2019-07-25 03:06:08.873 INFO trove.guestagen
2019-07-25 03:06:08.873 DEBUG trove.guestagen
2019-07-25 03:06:08.874 DEBUG trove.guestagen
2019-07-25 03:06:08.875 ERROR trove.guestagen
2019-07-25 03:06:08.955 ERROR oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
2019-07-25 03:06:08.955 TRACE oslo_messaging.
Looking at the state of the postgresql service, it thinks postgres is down:
ubuntu@
● postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/lib/systemd/
Active: inactive (dead) since Thu 2019-07-25 02:56:06 UTC; 1h 4min ago
Process: 1468 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 1468 (code=exited, status=0/SUCCESS)
Jul 25 02:56:02 db0 systemd[1]: Starting PostgreSQL RDBMS...
Jul 25 02:56:02 db0 systemd[1]: Started PostgreSQL RDBMS.
Jul 25 02:56:06 db0 systemd[1]: Stopped PostgreSQL RDBMS.
However pg_isready says it is up (which is correct):
ubuntu@db0:
/var/run/
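To make the mismatch between the two views easy to reproduce, here is a minimal sketch that compares what systemd reports with what pg_isready reports. The function names and the unit name default are my own choices for illustration, not anything from Trove; it only assumes `systemctl` and `pg_isready` are on the PATH in the guest.

```python
import subprocess


def command_succeeds(argv):
    """Run a command and return True if it exits with status 0."""
    try:
        return subprocess.run(argv, check=False).returncode == 0
    except FileNotFoundError:
        # The tool is not installed; treat that as "not OK".
        return False


def unit_is_active(unit="postgresql"):
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    return command_succeeds(["systemctl", "is-active", "--quiet", unit])


def accepting_connections():
    # `pg_isready -q` exits 0 only when the server accepts connections.
    return command_succeeds(["pg_isready", "-q"])


def describe_state(systemd_active, server_ready):
    """Classify the combination of the two views (hypothetical labels)."""
    if systemd_active and server_ready:
        return "consistent: running"
    if not systemd_active and not server_ready:
        return "consistent: stopped"
    if server_ready:
        return "inconsistent: systemd says stopped, server is up"
    return "inconsistent: systemd says active, server not ready"


if __name__ == "__main__":
    print(describe_state(unit_is_active(), accepting_connections()))
```

On the failed instance described here I would expect this to report the "systemd says stopped, server is up" case, since that is what the two outputs above show.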
My suspicion is that the guestagent is attempting to shut Postgres down just as the service is starting it, which leaves systemd and Postgres itself in inconsistent states.
I can 'fix' this by making the guestagent wait 100s before trying to stop postgres (see patch). However, that is just a workaround: we need to figure out how to stop the guestagent from trying to stop postgres before it is properly up.
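One possible direction, instead of a fixed 100s sleep, would be to poll until Postgres is actually accepting connections and only then issue the stop. The following is a rough sketch of that idea, not the attached patch and not existing Trove code: `pg_is_ready` and `wait_until` are hypothetical helpers, and the 100s default timeout is simply borrowed from the workaround.

```python
import subprocess
import time


def pg_is_ready(timeout_s=3):
    """Return True if pg_isready reports the server accepts connections."""
    try:
        result = subprocess.run(
            ["pg_isready", "-q", "-t", str(timeout_s)],
            check=False,
        )
    except FileNotFoundError:
        # pg_isready is not installed; treat as not ready.
        return False
    return result.returncode == 0


def wait_until(predicate, timeout=100.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    if wait_until(pg_is_ready):
        print("postgres is up; it should now be safe to request a stop")
    else:
        print("postgres did not come up within the timeout")
```

This still does not explain why the guestagent and the service start overlap in the first place, but it would bound the wait by actual readiness rather than by a guess.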
Guestagent log