restore-backup SASL auth failure when machine 0 of ha removed

Bug #1771657 reported by Peter Matulis
This bug affects 3 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

I am experiencing errors when using restore-backup with a degraded HA cluster.

https://paste.ubuntu.com/p/vB6Kxy9hJF/

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Restore in an HA context is no longer allowed, per https://github.com/juju/juju/pull/8783. To restore, remove controller machines until you're in a single-controller context, restore the backup, and then run enable-ha again.
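The workaround described above can be sketched as a small script. This is only an illustration, not an official procedure: the function names `run` and `restore_single_controller` and the `DRY_RUN` variable are hypothetical, the model name `controller` is taken from the commands quoted elsewhere in this report, and the wait for the removed machines to disappear is left out.

```shell
# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: restore_single_controller BACKUP_FILE MACHINE_ID...
# The machine IDs are the controller machines to remove; the caller is
# assumed to pass IDs that leave exactly one controller behind.
restore_single_controller() {
    local backup_file="$1"
    shift
    run juju remove-machine -m controller "$@"
    # NOTE: a real run has to wait here until the machines are gone.
    run juju restore-backup -m controller --file "$backup_file"
    run juju enable-ha
}
```

Calling `restore_single_controller juju-backup.tar.gz 1 2` in dry-run mode prints the three juju commands in order, which makes it easy to review them before running anything against a live controller.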

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
status: New → Triaged
milestone: none → 2.4.0
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

@pmatulis, can you reproduce the "server returned error on SASL authentication step" errors in a non-HA environment?

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Peter Matulis (petermatulis) wrote :

Yes.

I tried to restore (to the now single controller) with a backup made of the cluster.

http://paste.ubuntu.com/p/NdXVB87SG4/

Revision history for this message
Peter Matulis (petermatulis) wrote :

Trying to restore from a backup taken before enable-ha also failed with the same error. I've included --debug this time:

http://paste.ubuntu.com/p/DBpjG7SK3x/

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Some of the prior attempts were using a juju version before 2.4-rc1. I was able to reproduce the SASL cert issue, as shown here: https://paste.ubuntu.com/p/MdDdxwW549/

To reproduce it, one of the machines removed must be machine 0.

~$ juju restore-backup --file juju-backup-20180612-202828.tar.gz
ERROR could not clean up after failed restore attempt: cannot complete restore: <nil>: juju restore is in progress - API is disabled to prevent data loss
ERROR cannot perform restore: <nil>: restore failed: error restoring state from backup: setting special user permission in db: server returned error on SASL authentication step: Authentication failed.

summary: - restore-backup doesn't work with a degraded HA cluster
+ restore-backup SASL auth failure when machine 0 of ha removed
Changed in juju:
status: Incomplete → Triaged
importance: Undecided → High
Changed in juju:
milestone: 2.4.0 → none
assignee: Heather Lanigan (hmlanigan) → nobody
tags: added: backup-restore
Revision history for this message
Alexander Litvinov (alitvinov) wrote :

Field is seeing exactly this issue during pre-cloud-handover testing at a customer site.

$ juju --version
2.6.3-disco-amd64

Steps:
1) juju bootstrap
2) juju enable-ha
After it settles, all 3 controllers are ha-enabled
3) juju create-backup -m controller
4) juju remove-machine --force 0 1
After they are gone (~2 min)
5) juju restore-backup -m controller --file juju-backup.tar.gz
ERROR could not clean up after failed restore attempt: cannot complete restore: <nil>: Restore did not finish successfully
ERROR cannot perform restore: <nil>: restore failed: error restoring state from backup: setting special user permission in db: server returned error on SASL authentication step: Authentication failed.

The issue is also 100% reproducible on the AWS provider.

Juju backup-restore documentation says:
"We begin by removing machines ‘1’ and ‘2’ but you can remove any two:
juju remove-machine -m aws:controller 1 2"

https://docs.jujucharms.com/controller-backups

However, restore after removing machines 0 and 1 does not work, while restore after removing machines 1 and 2 works fine.
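Until this is fixed, one way to avoid the failing case is to choose the machines to remove so that machine 0 always stays. The helper below is a minimal sketch based only on the behaviour observed in this report; the function name `machines_to_remove` is hypothetical, and it does not cover a cluster where machine 0 is already gone.

```shell
# Usage: machines_to_remove CONTROLLER_MACHINE_ID...
# Prints the IDs to pass to "juju remove-machine", keeping machine 0.
machines_to_remove() {
    local id found_zero=0 result=""
    for id in "$@"; do
        if [ "$id" = "0" ]; then
            found_zero=1
        else
            # Append with a single space separator, no leading space.
            result="${result:+$result }$id"
        fi
    done
    if [ "$found_zero" -ne 1 ]; then
        echo "machine 0 is not among the controllers; not handled by this sketch" >&2
        return 1
    fi
    echo "$result"
}
```

For a three-node cluster, `machines_to_remove 0 1 2` prints `1 2`, which matches the combination reported as working.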

After trying multiple times, the controller falls into this state:

$ juju show-controller
juju restore is in progress - API is disabled to prevent data loss
juju restore is in progress - API is disabled to prevent data loss
{}

restore with --debug:
https://pastebin.canonical.com/p/jmJJ9rKJGG/
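When scripting around these failures, it can help to tell the two messages quoted in this thread apart. The following is a rough sketch that matches on the exact message text from the output above; the function name `classify_restore_output` and the labels it prints are made up for illustration, and the message strings may differ in other Juju versions.

```shell
# Usage: classify_restore_output "TEXT OF JUJU OUTPUT"
# Prints a short label for the failure seen in the output.
classify_restore_output() {
    case "$1" in
        *"server returned error on SASL authentication step"*)
            # The restore itself failed while authenticating to the database.
            echo "sasl-auth-failure" ;;
        *"juju restore is in progress - API is disabled to prevent data loss"*)
            # The controller is still locked by an earlier restore attempt.
            echo "restore-in-progress" ;;
        *)
            echo "unknown" ;;
    esac
}
```

The SASL pattern is checked first because a single failed run can print both messages, and the authentication error is the underlying cause in that case.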

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

subscribed ~field-medium

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

After giving it a second thought, I'm subscribing ~field-high, as this is major documented functionality that is not available.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I can reproduce the problem consistently by following the steps: https://paste.ubuntu.com/p/tXpm4nDwsP/

tags: added: sts
Tim Penhey (thumper)
tags: added: paris
Revision history for this message
John A Meinel (jameinel) wrote :

An interesting part from the pastebin:
20:19:20 DEBUG juju.rpc server.go:329 error closing codec: write tcp 192.168.1.157:46182->3.80.228.179:17070: i/o timeout
ERROR cannot perform restore: <nil>: restore failed: error restoring state from backup: setting special user permission in db: server returned error on SASL authentication step: Authentication failed.
20:19:20 DEBUG cmd supercommand.go:496 error stack:
restore failed: error restoring state from backup: setting special user permission in db: server returned error on SASL authentication step: Authentication failed.

That makes me think 'restore' is trying to log in to Mongo as the "admin" user, rather than as a machine agent (e.g., logging in as machine-2 when restoring on machine 2).

I believe "juju bootstrap" sets up "admin" credentials on the initial database, but "enable-ha" only sets up machine agent credentials on the other machines. (I could be wrong, as it seems like users should be replicated in HA anyway, but maybe there is a local-admin.)

tags: added: restore-backup
Ian Booth (wallyworld)
tags: removed: restore-backup
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot