Canonical SSO provider

Failover backend should recover automatically

Bug #550792 reported by Tom Haddon on 2010-03-29

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Canonical SSO provider	Fix Released	Medium	Anthony Lenton	Canonical SSO provider 2.7.0

Bug Description

The failover backend currently requires a manual reset once it is triggered. This is causing a problem when we see temporary non-availability of the database (restarts, intermittent network outages, etc) so IS have asked us to implement an automatic recovery mechanism.

There should be "some kind of exponential backoff" to help avoid flapping. If recovery keeps failing, the backend should eventually fail permanently and require manual recovery.

We should also be able to force a state where manual recovery is required so we can manually switch to read-only mode for maintenance.

All failover state changes should be logged for diagnosis (oops) and we should be notified (nagios) of state change.

The various failover and recovery conditions should be configurable including the ability to disable automatic recovery.
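The requirements above can be sketched as a small state tracker. This is only an illustration of the requested behaviour, not the code that was committed for this bug: the class names, state names and logger name are hypothetical, and the policy fields simply mirror the DBRECOVER_* settings mentioned later in this report.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("failover")  # hypothetical logger name


@dataclass
class RecoveryPolicy:
    # Hypothetical container; fields mirror the DBRECOVER_* settings.
    interval: float = 15.0     # wait before the first retry
    multiplier: float = 2.0    # backoff growth factor
    attempts: int = 10         # retries before giving up
    auto_recover: bool = True  # False disables automatic recovery


class FailoverState:
    """Tracks whether the site is live, retrying, or waiting for an operator."""

    def __init__(self, policy):
        self.policy = policy
        self.state = "live"
        self.failed_attempts = 0

    def _set(self, new_state, reason):
        if new_state != self.state:
            # Log every state change so it can be diagnosed and alerted on.
            log.warning("failover: %s -> %s (%s)", self.state, new_state, reason)
            self.state = new_state

    def trigger(self, reason="database unavailable"):
        """Enter failover; retry automatically only if the policy allows it."""
        self.failed_attempts = 0
        if self.policy.auto_recover and self.policy.attempts > 0:
            self._set("recovering", reason)
        else:
            self._set("manual", reason)

    def force_manual(self, reason="maintenance"):
        """Force the state in which only a manual reset brings the site back."""
        self._set("manual", reason)

    def reset(self):
        """Manual recovery by an operator."""
        self.failed_attempts = 0
        self._set("live", "manual reset")

    def next_delay(self):
        """Wait before the next probe, or None when not retrying."""
        if self.state != "recovering":
            return None
        return self.policy.interval * self.policy.multiplier ** self.failed_attempts

    def record_probe(self, ok):
        """Record the result of one reconnection attempt."""
        if self.state != "recovering":
            return
        if ok:
            self.failed_attempts = 0
            self._set("live", "database reachable")
            return
        self.failed_attempts += 1
        if self.failed_attempts >= self.policy.attempts:
            # Backoff exhausted: fail permanently until an operator resets.
            self._set("manual", "recovery attempts exhausted")
```

A caller would invoke `trigger()` when the database becomes unreachable, sleep for `next_delay()`, probe the database, and pass the result to `record_probe()`. The sketch resets its attempt counter on every successful recovery, so it does not by itself detect repeated flapping across separate outages.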

IS will assist us with setting up failure conditions on staging for testing.

Current behaviour is documented here: https://wiki.canonical.com/InformationInfrastructure/ISD/Docs/SSO/Failover . That page should be updated with the new behaviours before this bug is closed.

Testcase ISD_161


Stuart Metcalfe (stuartmetcalfe) on 2010-03-31

summary:	- Staging login.ubuntu.com service doesn't deal gracefully with staging DB updates
	+ Failover backend should recover automatically
description:	updated

Stuart Metcalfe (stuartmetcalfe) on 2010-04-22

Changed in canonical-identity-provider:
milestone:	none → 2.5.0

Anthony Lenton (elachuni) on 2010-05-03

Changed in canonical-identity-provider:
status:	New → Confirmed
importance:	Undecided → Medium

Anthony Lenton (elachuni) on 2010-05-04

Changed in canonical-identity-provider:
milestone:	2.5.0 → 2.6.0

Tom Haddon (mthaddon) on 2010-05-28

tags:	added: canonical-losa-isd

Stuart Metcalfe (stuartmetcalfe) on 2010-06-03

Changed in canonical-identity-provider:
milestone:	2.6.0 → 2.7.0

Anthony Lenton (elachuni) on 2010-06-17

tags:	added: 2-sp

Anthony Lenton (elachuni) on 2010-07-08

Changed in canonical-identity-provider:
assignee:	nobody → Anthony Lenton (elachuni)
status:	Confirmed → In Progress

Anthony Lenton (elachuni) on 2010-07-14

Changed in canonical-identity-provider:
status:	In Progress → Fix Committed

Dave Morley (davmor2) wrote on 2010-07-19:

Passes on EC2

Anthony created some scripts to break the DB. Restoring it ended the readonly mode.

Changed in canonical-isd-qa:
status:	New → Confirmed
assignee:	nobody → Dave Morley (davmor2)

Dave Morley (davmor2) wrote on 2010-07-19:

The readonly screen also tells you why the system is down and how long till the next check.

Anthony Lenton (elachuni) wrote on 2010-07-21:

Notes for QA:
Until we have a second database set up on staging, the whole site will become unavailable as soon as we drop the master database. Not much can be tested during an automatic database failover, but we *can* test that the site recovers automatically once the database comes back, provided it comes back while the site is still trying to reconnect.

So, steps to test:
1. Check that the site is generally available
2. Ask IS to disable the DB momentarily. Stopping postgresql altogether should do. If this is not an option, we can wait until there's a scheduled LP staging rollout, as the DB should be out for a while for each of those.
3. Check that the site is unavailable during the database outage. Ideally here we'd be switched over to readonly mode, but that's only possible with at least one slave database.
4. Bring the database back up.
5. Check that the site becomes available again without manual intervention. It could take a short while to attempt to reconnect, depending on how much time the database was unavailable. One rough (generous!) estimate is that it shouldn't take longer to come back than the total amount of time that the database was unavailable.

If the site doesn't reconnect to the database automatically, verify that the database wasn't down long enough to have exhausted all reconnection attempts. This time is sum(DBRECOVER_INTERVAL * DBRECOVER_MULTIPLIER ** x for x in range(DBRECOVER_ATTEMPTS)). As of now, with the current settings for staging (DBRECOVER_INTERVAL=15, DBRECOVER_MULTIPLIER=2, DBRECOVER_ATTEMPTS=10), this time is around 4 hours 15 minutes.
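For reference, the expression above can be evaluated directly. The snippet below is just that arithmetic with the staging values quoted in this comment; it assumes DBRECOVER_INTERVAL is expressed in seconds, which is consistent with the stated total.

```python
# Staging values quoted above; the interval is assumed to be in seconds.
DBRECOVER_INTERVAL = 15
DBRECOVER_MULTIPLIER = 2
DBRECOVER_ATTEMPTS = 10

# Delay before each successive reconnection attempt.
delays = [DBRECOVER_INTERVAL * DBRECOVER_MULTIPLIER ** x
          for x in range(DBRECOVER_ATTEMPTS)]
total = sum(delays)

hours, rest = divmod(total, 3600)
minutes, seconds = divmod(rest, 60)

print(delays)
# [15, 30, 60, 120, 240, 480, 960, 1920, 3840, 7680]
print(f"{total} s = {hours} h {minutes} min {seconds} s")
# 15345 s = 4 h 15 min 45 s
```

Note that the last wait alone (7680 s, just over 2 hours) accounts for about half of the total window.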

Dave Morley (davmor2) wrote on 2010-07-29:

Passes on staging. The readonly text appears, but the system oopses because there is no DB at all. The readonly text goes away once the system comes back up, though.

description:	updated
Changed in canonical-isd-qa:
status:	Confirmed → Fix Committed

Danny Tamez (zematynnad) on 2010-08-03

Changed in canonical-isd-qa:
milestone:	none → canonical-identity-provider+2.7.0

Ricardo Kirkner (ricardokirkner) on 2010-08-03

Changed in canonical-identity-provider:
status:	Fix Committed → Fix Released
