Failover backend should recover automatically

Bug #550792 reported by Tom Haddon
This bug affects 1 person
Affects: Canonical SSO provider
Status: Fix Released
Importance: Medium
Assigned to: Anthony Lenton

Bug Description

The failover backend currently requires a manual reset once it is triggered. This causes problems whenever the database is temporarily unavailable (restarts, intermittent network outages, etc.), so IS have asked us to implement an automatic recovery mechanism.

There should be "some kind of exponential backoff" to help avoid flapping. If recovery keeps failing, the backoff should eventually give up permanently and require manual recovery.

We should also be able to force a state where manual recovery is required so we can manually switch to read-only mode for maintenance.

All failover state changes should be logged for diagnosis (oops) and we should be notified (nagios) of each state change.

The various failover and recovery conditions should be configurable including the ability to disable automatic recovery.

IS will assist us with setting up failure conditions on staging for testing.

Current behaviour is documented here: https://wiki.canonical.com/InformationInfrastructure/ISD/Docs/SSO/Failover . That page should be updated with the new behaviours before this bug is closed.

Testcase ISD_161

summary: - Staging login.ubuntu.com service doesn't deal gracefully with staging DB updates
+ Failover backend should recover automatically
description: updated
Changed in canonical-identity-provider:
milestone: none → 2.5.0
Changed in canonical-identity-provider:
status: New → Confirmed
importance: Undecided → Medium
Changed in canonical-identity-provider:
milestone: 2.5.0 → 2.6.0
Tom Haddon (mthaddon)
tags: added: canonical-losa-isd
Changed in canonical-identity-provider:
milestone: 2.6.0 → 2.7.0
tags: added: 2-sp
Changed in canonical-identity-provider:
assignee: nobody → Anthony Lenton (elachuni)
status: Confirmed → In Progress
Changed in canonical-identity-provider:
status: In Progress → Fix Committed
Dave Morley (davmor2) wrote :

Passes on EC2

Anthony created some scripts to break the DB. Restoring it ended the readonly mode.

Changed in canonical-isd-qa:
status: New → Confirmed
assignee: nobody → Dave Morley (davmor2)
Dave Morley (davmor2) wrote :

The readonly screen also tells you why the system is down and how long till the next check.

Anthony Lenton (elachuni) wrote :

Notes for QA:
Until we have a second database set up on staging, the whole site will become unavailable as soon as we drop the master database. Not much can be tested during an automatic database failover, but we *can* test that the site recovers automatically once the database comes back, provided it does so while the site is still trying to reconnect.

So, steps to test:
1. Check that the site is generally available
2. Ask IS to disable the DB momentarily. Stopping postgresql altogether should do. If this is not an option, we can wait until there's a scheduled LP staging rollout, as the DB should be out for a while for each of those.
3. Check that the site is unavailable during the database outage. Ideally here we'd be switched over to readonly mode, but that's only possible with at least one slave database.
4. Bring the database back up.
5. Check that the site becomes available again without manual intervention. It could take a short while to attempt to reconnect, depending on how much time the database was unavailable. One rough (generous!) estimate is that it shouldn't take longer to come back than the total amount of time that the database was unavailable.

If the site doesn't reconnect to the database automatically, verify that the database wasn't down long enough to have exhausted all reconnection attempts. This time is sum(DBRECOVER_INTERVAL * DBRECOVER_MULTIPLIER ** x for x in range(DBRECOVER_ATTEMPTS)). As of now, with the current settings for staging (DBRECOVER_INTERVAL=15, DBRECOVER_MULTIPLIER=2, DBRECOVER_ATTEMPTS=10), this time is around 4 hours 15 minutes.
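
For anyone who wants to check that figure against other settings, here is a minimal sketch of the calculation. It is not taken from the SSO codebase: the helper names recovery_delays and recovery_window are made up for illustration, and treating DBRECOVER_INTERVAL as seconds is an assumption inferred from the "around 4 hours 15" figure quoted above.

```python
from datetime import timedelta

# Staging values quoted in the comment above.
DBRECOVER_INTERVAL = 15    # assumed to be in seconds
DBRECOVER_MULTIPLIER = 2
DBRECOVER_ATTEMPTS = 10


def recovery_delays(interval, multiplier, attempts):
    """Return the wait before each reconnection attempt (hypothetical helper)."""
    return [interval * multiplier ** x for x in range(attempts)]


def recovery_window(interval, multiplier, attempts):
    """Total time before all reconnection attempts are exhausted."""
    return sum(recovery_delays(interval, multiplier, attempts))


delays = recovery_delays(
    DBRECOVER_INTERVAL, DBRECOVER_MULTIPLIER, DBRECOVER_ATTEMPTS)
total = recovery_window(
    DBRECOVER_INTERVAL, DBRECOVER_MULTIPLIER, DBRECOVER_ATTEMPTS)

print(delays)                    # individual waits, doubling each time
print(timedelta(seconds=total))  # total window as h:mm:ss
```

With the staging values the waits run from 15 up to 7680 and add up to 15345, which is 4:15:45 if the unit is seconds. An outage longer than that would need manual recovery.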

Dave Morley (davmor2) wrote :

Passes on staging. The readonly text appears, but the system oopses because there is no DB at all. The readonly text goes away once the system comes back up, though.

description: updated
Changed in canonical-isd-qa:
status: Confirmed → Fix Committed
Danny Tamez (zematynnad)
Changed in canonical-isd-qa:
milestone: none → canonical-identity-provider+2.7.0
Changed in canonical-identity-provider:
status: Fix Committed → Fix Released