juju upgrade failures
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-core | Expired | Critical | Unassigned |
Bug Description
This is a meta bug to capture the results of the analysis for RT 85463
https:/
Separate bugs will likely be opened to cover individual fixes.
-------
* Test upgrade path 1.20.14 -> 1.24.6
* Case 1: without ignore-machine-addresses
- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju get-env ignore-machine-addresses
ERROR key "ignore-machine-addresses" not found in "bootstack-staging" environment.
$ juju set-env tools-url=https:/
$ juju upgrade-juju --version="1.24.6"
- The upgrade did complete, but most agents were not upgraded, i.e. they are left on 1.20.14
- Unrecoverable hook errors
Initial analysis:
- unit agents never received upgrade notification
- host machine never upgraded
- tools could not be retrieved
2015-10-15 12:33:03 ERROR juju.worker.
Looking on the state server machine 0 where the above request is processed:
2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:257 <- [5B6] machine-0-lxc-14 {"RequestId"
2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:271 -> [5B6] machine-0-lxc-14 2.633245ms {"RequestId"
2015-10-15 12:33:00 DEBUG juju.apiserver utils.go:71 validate env uuid: state server environment - 5c8be479-
2015-10-15 12:33:00 ERROR juju.apiserver tools.go:59 GET(/environmen
2015-10-15 12:33:00 DEBUG juju.apiserver tools.go:119 sending error: 400 failed to open GridFS file "abafdc81-
2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:257 <- [5D1] machine-1-lxc-11 {"RequestId"
This implies that the underlying Juju blobstore has become corrupt - somehow a previously stored tools blob is not there.
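One way to confirm this directly (a minimal sketch only, assuming juju's default blobstore layout of a "blobstore" database with a "blobstore" GridFS prefix; the mongod address, auth and TLS details are environment specific and omitted, and the resource id below is a placeholder, not the real one from the error):

package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
)

func main() {
	// Juju's mongod normally listens on 37017 on the state server; auth and
	// TLS options are omitted here and would need to match the environment.
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Assumed layout: database "blobstore", GridFS prefix "blobstore".
	gfs := session.DB("blobstore").GridFS("blobstore")

	// Placeholder id; the real one is the GridFS file name quoted in the
	// "failed to open GridFS file" error above.
	if _, err := gfs.Open("RESOURCE-ID-FROM-ERROR"); err != nil {
		fmt.Println("blob is indeed missing from GridFS:", err)
	} else {
		fmt.Println("blob is present; the failure is elsewhere")
	}
}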
Looking further up the log file:
2015-10-15 12:19:30 DEBUG juju.apiserver apiserver.go:271 -> [1B] machine-0 1.451437574s {"RequestId"
2015-10-15 12:19:30 DEBUG juju.cmd.jujud machine.go:1604 worker "certupdater" exited with retrieving initial server addesses: EOF
2015-10-15 12:19:30 INFO juju.worker runner.go:275 stopped "certupdater", err: retrieving initial server addesses: EOF
2015-10-15 12:19:30 DEBUG juju.worker runner.go:203 "certupdater" done: retrieving initial server addesses: EOF
2015-10-15 12:19:30 INFO juju.cmd.jujud util.go:139 error pinging *state.State: EOF
...
...
2015-10-15 12:19:33 ERROR juju.cmd.jujud util.go:217 closeWorker: close error: closing state failed: error stopping transaction watcher: watcher iteration error: EOF
2015-10-15 12:19:33 INFO juju.worker runner.go:275 stopped "state", err: retrieving initial server addesses: EOF
2015-10-15 12:19:33 DEBUG juju.worker runner.go:203 "state" done: retrieving initial server addesses: EOF
2015-10-15 12:19:33 ERROR juju.worker runner.go:223 exited "state": retrieving initial server addesses: EOF
2015-10-15 12:19:33 INFO juju.worker runner.go:261 restarting "state" in 3s
2015-10-15 12:19:33 DEBUG juju.storage managedstorage.
2015-10-15 12:19:33 ERROR juju.apiserver tools.go:59 GET(/environmen
2015-10-15 12:19:33 DEBUG juju.apiserver tools.go:119 sending error: 400 error fetching tools: error caching tools: cannot store tools tarball: cannot add resource "environs/
So the entire Juju model and blobstore mongo databases have become corrupt.
Need to look into why.
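Given the run of EOF errors above, one thing worth ruling out before concluding corruption is mongod itself flapping. A minimal health-check sketch (again via mgo; port, auth and TLS details are environment specific and omitted):

package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// replSetGetStatus reports member states and recent elections; repeated
	// EOFs in the agent logs would line up with the primary stepping down.
	var status bson.M
	if err := session.Run(bson.D{{"replSetGetStatus", 1}}, &status); err != nil {
		log.Fatal(err)
	}
	fmt.Println("myState:", status["myState"], "members:", status["members"])
}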
As an aside, unit agents are unnecessarily bouncing:
ceilometer-
2015-10-15 12:33:26 INFO juju.worker.uniter uniter.go:144 unit "ceilometer-
2015-10-15 12:33:26 ERROR juju.worker.
2015-10-15 12:33:26 DEBUG juju.worker.uniter runlistener.go:97 juju-run listener stopping
2015-10-15 12:33:26 DEBUG juju.worker.uniter runlistener.go:117 juju-run listener stopped
2015-10-15 12:33:26 ERROR juju.worker runner.go:218 exited "uniter": ModeAbide: cannot set invalid status "started"
2015-10-15 12:33:26 INFO juju.worker runner.go:252 restarting "uniter" in 3s
The restart is unfortunate - it is due to an unrecognised status value "started" being set - this should not cause an agent restart.
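A minimal sketch of the behaviour that would avoid the bounce (illustrative only, not the actual uniter code; the setStatus stand-in and the string matching are assumptions): downgrade the invalid-status error to a warning at the point it is returned, so it never propagates to the worker runner and triggers the 3s restart.

package main

import (
	"fmt"
	"log"
	"strings"
)

// setStatus stands in for whatever call the uniter makes to record unit
// status; here it always rejects the legacy "started" value, mimicking the
// error seen in the log above.
func setStatus(status string) error {
	return fmt.Errorf("cannot set invalid status %q", status)
}

// setStatusNonFatal downgrades an invalid-status error to a logged warning
// so it does not bubble up and restart the agent; any other error is still
// returned as fatal.
func setStatusNonFatal(status string) error {
	err := setStatus(status)
	if err != nil && strings.Contains(err.Error(), "cannot set invalid status") {
		log.Printf("WARNING ignoring %v", err)
		return nil
	}
	return err
}

func main() {
	if err := setStatusNonFatal("started"); err != nil {
		// With the filter in place we never get here for the "started" case.
		fmt.Println("fatal:", err)
	}
}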
-------
* Test upgrade path 1.22.6 -> 1.24.6 with ignore-machine-addresses
- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju set-env ignore-machine-addresses=true
$ juju upgrade-juju --version="1.24.6"
- After a few minutes, the upgrade is done
- Errors:
+ many hook errors
+ some units have not been upgraded
+ some units don't seem to have a public address set at all
$ juju ssh mysql/0
ERROR unit "mysql/0" has no internal address
Initial analysis:
- logs show the units in error failed to run the config-changed hook due to:
2015-10-19 09:44:13 DEBUG worker.uniter.jujuc server.go:159 hook context id "ceilometer/
2015-10-19 09:44:13 INFO config-changed error: private-address not set
2015-10-19 09:44:13 INFO config-changed Traceback (most recent call last):
2015-10-19 09:44:13 INFO config-changed File "/var/lib/
2015-10-19 09:44:13 INFO config-changed hooks.execute(
When ignore-
* Test upgrade path 1.20.14 -> 1.24.6
* Case 2: with ignore-machine-addresses
- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju set-env ignore-machine-addresses=true
$ juju set-env tools-url=https:/
$ juju upgrade-juju --version="1.24.6"
- Upgrade completes (i.e., the apiserver accepts connections again) after a few minutes
- Errors:
+ hook errors on 56 of 102 units
+ 5 units have not been upgraded
+ some units don't seem to have a public address set at all
$ juju ssh mysql/0
ERROR unit "mysql/0" has no internal address
Initial analysis:
Logs show the same root cause as the previous 1.22 -> 1.24 upgrade:
- machine private address being reset.
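For reference, a minimal sketch of what the ignore-machine-addresses mitigation amounts to (illustrative only, not juju's machiner code; the function and parameter names are made up): when the flag is set, the addresses already held in state are never replaced by what the machine agent reports, which is what stops the private address being reset during the upgrade.

package main

import "fmt"

// updateAddresses decides which addresses end up in state for a machine.
// stored is what state already holds, observed is what the machine agent
// reports after the upgrade.
func updateAddresses(ignoreMachineAddresses bool, stored, observed []string) []string {
	if ignoreMachineAddresses {
		// Flag set: leave the recorded addresses alone.
		return stored
	}
	if len(observed) == 0 {
		// Defensive: never replace known addresses with nothing; an empty
		// report is what leaves units with "private-address not set".
		return stored
	}
	return observed
}

func main() {
	stored := []string{"10.0.0.12"}
	fmt.Println(updateAddresses(true, stored, nil))  // [10.0.0.12]
	fmt.Println(updateAddresses(false, stored, nil)) // [10.0.0.12] thanks to the guard
}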
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Changed in juju-core:
milestone: 1.26-alpha1 → 1.26-alpha2
Changed in juju-core:
assignee: Horacio Durán (hduran-8) → Wayne Witzel III (wwitzel3)
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
milestone: 1.26-alpha2 → 1.26-beta1
Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → nobody
I had a long look at the logs and talked with the bootstack people who originally reported the error. To be certain we will need the db (which will be provided tomorrow) and also some stats of the machine while the upgrade was running, such as the free RAM.
I have a couple of working theories:
From matching the error lines around the EOF appearances with the files in the actual code, the agent seems to be running 1.20 at that point.
I think it could be either:
- Not enough memory.
- The agent not being able to authenticate.
- Related to the previous one: the db is upgraded but the agent still runs the old version.
I cannot say more without a closer inspection of the db, given the old version we are departing from.