juju-core

Bug #1463480
Comment #16

Cheryl Jennings (cherylj) wrote on 2015-06-11:

Looking at this a bit more, I'm increasingly convinced that the upgrade failure is due to bug #1416928. All of the containers I've sampled attempt to fetch the tools from the 10.0.3.1 address. The others eventually succeed where machine-0-lxc-0 fails because they receive an update that corrects the apiserver IPs so they no longer include 10.0.3.1 (presumably because the state servers have been updated).

However, on machine-0-lxc-0, the watcher's connection to the state server dies before it gets the update:
2015-06-09 10:48:25 ERROR juju.worker.upgrader upgrader.go:157 failed to fetch tools from "https://10.0.3.1:17070/environment/c0b9fa19-1546-4fad-8bd9-06f8926f717c/tools/1.22.1-trusty-amd64": Get https://10.0.3.1:17070/environment/c0b9fa19-1546-4fad-8bd9-06f8926f717c/tools/1.22.1-trusty-amd64: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 10:48:30 INFO juju.worker.upgrader upgrader.go:134 upgrade requested from 1.20.14.1-trusty-amd64 to 1.22.1
2015-06-09 11:05:01 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
...
2015-06-09 11:05:01 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-09 11:05:01 INFO juju.cmd.jujud agent.go:177 error pinging *api.State: connection is shut down
2015-06-09 11:05:01 ERROR juju.worker runner.go:207 fatal "upgrader": error receiving message: read tcp 172.20.168.4:17070: connection timed out

Since machine-0-lxc-0 is still running 1.20.14, it doesn't filter out the 10.0.3.1 addresses when it tries to reconnect to the state servers:
2015-06-09 11:05:01 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-06-09 11:05:04 INFO juju.worker runner.go:260 start "api"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"

At this point, it never reconnects to the state servers because it's using the wrong IP. This should be resolved by the fix that has already been released for bug #1416928.
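
For illustration only, here is a rough sketch of the kind of address filtering that 1.20.14 is missing. This is not the actual juju code or the actual fix for bug #1416928; the function name filterBridgeAddrs is hypothetical, and it assumes the 10.0.3.1 address comes from the default lxcbr0 subnet (10.0.3.0/24). The sample addresses are the two that appear in the logs above.

```go
package main

import (
	"fmt"
	"net"
)

// lxcBridgeNet is assumed to be the default lxcbr0 subnet (10.0.3.0/24).
// A real deployment may use a different bridge subnet.
var lxcBridgeNet = &net.IPNet{
	IP:   net.IPv4(10, 0, 3, 0),
	Mask: net.CIDRMask(24, 32),
}

// filterBridgeAddrs (hypothetical helper) returns the host:port entries
// whose host is not an IP inside the assumed LXC bridge subnet.
func filterBridgeAddrs(addrs []string) []string {
	var kept []string
	for _, addr := range addrs {
		host, _, err := net.SplitHostPort(addr)
		if err != nil {
			// Keep anything we cannot parse rather than silently dropping it.
			kept = append(kept, addr)
			continue
		}
		if ip := net.ParseIP(host); ip != nil && lxcBridgeNet.Contains(ip) {
			// Bridge-local address: not reachable from other hosts' containers.
			continue
		}
		kept = append(kept, addr)
	}
	return kept
}

func main() {
	// The two API addresses seen in the logs above.
	addrs := []string{"10.0.3.1:17070", "172.20.168.4:17070"}
	fmt.Println(filterBridgeAddrs(addrs)) // prints [172.20.168.4:17070]
}
```

With filtering along these lines applied before dialing, the agent would only attempt the 172.20.168.4 address when restarting the "api" worker.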