Comment 3 for bug 550792

Anthony Lenton (elachuni) wrote :

Notes for QA:
Until we have a second database set up on staging, the whole site will become unavailable as soon as we drop the master database. Not much can be tested about automatic database failover, but we *can* test that the site recovers automatically once the database comes back, provided it comes back while the site is still trying to reconnect.

So, steps to test:
1. Check that the site is generally available
2. Ask IS to disable the DB momentarily. Stopping postgresql altogether should do. If this is not an option, we can wait until there's a scheduled LP staging rollout, as the DB should be out for a while for each of those.
3. Check that the site is unavailable during the database outage. Ideally we'd be switched over to read-only mode here, but that's only possible with at least one slave database.
4. Bring the database back up.
5. Check that the site becomes available again without manual intervention. It could take a short while to reconnect, depending on how long the database was unavailable. One rough (generous!) estimate is that the site shouldn't take longer to come back than the total time the database was unavailable.
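
Step 5 could be checked with a small polling script rather than by hand. The following is only a sketch: the names site_is_up and wait_for_recovery are made up for illustration, the URL in the usage comment is a placeholder, and treating any non-error HTTP response as "available" is an assumption about what counts as the site being up.

```python
import time
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_for_recovery(probe, deadline_seconds, poll_every=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or deadline_seconds elapse.

    Returns the number of seconds waited on success, None on timeout.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_seconds:
            return None
        sleep(poll_every)


# Hypothetical usage, run right after the database is brought back up.
# The deadline is the length of the outage, per the rough estimate in step 5:
#
#   waited = wait_for_recovery(
#       lambda: site_is_up("https://staging.example.invalid/"),
#       deadline_seconds=outage_seconds)
#   print("recovered after", waited, "s" if waited is not None else "(timed out)")
```

The probe is passed in as a callable so the same loop can be pointed at a specific page or a health-check endpoint, whichever staging actually exposes.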

If the site doesn't reconnect automatically, verify that the database wasn't down long enough to have exhausted all reconnection attempts. That window is sum(DBRECOVER_INTERVAL * DBRECOVER_MULTIPLIER ** x for x in range(DBRECOVER_ATTEMPTS)). With the current settings for staging (DBRECOVER_INTERVAL=15, DBRECOVER_MULTIPLIER=2, DBRECOVER_ATTEMPTS=10), and taking the interval to be in seconds, this works out to 15345 seconds, or around 4 hours 15'.
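
The arithmetic behind that figure can be double-checked with a few lines of Python. This is a sketch, not the site's actual code: the function name max_recovery_window is made up, and the interval is assumed to be in seconds (the unit isn't stated in the settings, but it is the one that matches the quoted 4 hours 15').

```python
from datetime import timedelta

# Staging values quoted in this comment; the unit (seconds) is an assumption.
DBRECOVER_INTERVAL = 15
DBRECOVER_MULTIPLIER = 2
DBRECOVER_ATTEMPTS = 10


def max_recovery_window(interval, multiplier, attempts):
    """Total wait across all reconnection attempts, in the interval's unit."""
    return sum(interval * multiplier ** x for x in range(attempts))


total = max_recovery_window(
    DBRECOVER_INTERVAL, DBRECOVER_MULTIPLIER, DBRECOVER_ATTEMPTS)
print(total, timedelta(seconds=total))  # prints: 15345 4:15:45
```

Because the waits double each time, the last attempt alone accounts for half the window (15 * 2**9 = 7680 seconds), so an outage of a few minutes is nowhere near the limit.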