Looking at this a bit more, I'm increasingly convinced that the upgrade failure is due to bug #1416928. All of the containers I've sampled attempt to fetch the tools from the 10.0.3.1 address. The others eventually succeed where machine-0-lxc-0 fails because they get an update that corrects the apiserver IPs so they no longer include 10.0.3.1 (presumably because the state servers have been updated).
However, on machine-0-lxc-0, the watcher's connection to the state server dies before it gets the update:
2015-06-09 10:48:25 ERROR juju.worker.upgrader upgrader.go:157 failed to fetch tools from "https://10.0.3.1:17070/environment/c0b9fa19-1546-4fad-8bd9-06f8926f717c/tools/1.22.1-trusty-amd64": Get https://10.0.3.1:17070/environment/c0b9fa19-1546-4fad-8bd9-06f8926f717c/tools/1.22.1-trusty-amd64: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 10:48:30 INFO juju.worker.upgrader upgrader.go:134 upgrade requested from 1.20.14.1-trusty-amd64 to 1.22.1
2015-06-09 11:05:01 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
...
2015-06-09 11:05:01 ERROR juju.state.api.watcher watcher.go:68 error trying to stop watcher: connection is shut down
2015-06-09 11:05:01 INFO juju.cmd.jujud agent.go:177 error pinging *api.State: connection is shut down
2015-06-09 11:05:01 ERROR juju.worker runner.go:207 fatal "upgrader": error receiving message: read tcp 172.20.168.4:17070: connection timed out
Since machine-0-lxc-0 is still running 1.20.14, it doesn't filter out the 10.0.3.1 addresses when it tries to reconnect to the state servers:
2015-06-09 11:05:01 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-06-09 11:05:04 INFO juju.worker runner.go:260 start "api"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://10.0.3.1:17070/": websocket.Dial wss://10.0.3.1:17070/: dial tcp 10.0.3.1:17070: connection timed out
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://10.0.3.1:17070/"
At this point it never reconnects to the state servers, because the only address it dials is the wrong IP. The fix that has already been released for bug #1416928 should resolve this.
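
To illustrate the kind of filtering the 1.20.14 agent is missing, here is a rough Go sketch. It is not the actual juju code or the actual fix for bug #1416928. It assumes that 10.0.3.1 is the default lxcbr0 bridge address (subnet 10.0.3.0/24), and the names filterBridgeAddrs and lxcBridgeNet are made up for this example.

```go
package main

import (
	"fmt"
	"net"
)

// lxcBridgeNet is assumed to be the default lxcbr0 subnet; the real bridge
// subnet on a given host may differ.
var lxcBridgeNet = mustParseCIDR("10.0.3.0/24")

// mustParseCIDR parses a CIDR string and panics if it is invalid.
func mustParseCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// filterBridgeAddrs (hypothetical helper) returns the host:port entries whose
// host is not an IP inside the assumed bridge subnet. Malformed entries are
// dropped; hostnames that are not literal IPs are kept.
func filterBridgeAddrs(addrs []string) []string {
	var out []string
	for _, hp := range addrs {
		host, _, err := net.SplitHostPort(hp)
		if err != nil {
			continue
		}
		if ip := net.ParseIP(host); ip != nil && lxcBridgeNet.Contains(ip) {
			continue
		}
		out = append(out, hp)
	}
	return out
}

func main() {
	// The two API addresses that appear in the logs above.
	addrs := []string{"10.0.3.1:17070", "172.20.168.4:17070"}
	fmt.Println(filterBridgeAddrs(addrs)) // prints [172.20.168.4:17070]
}
```

With a filter like this applied before dialing, the agent would be left with only 172.20.168.4:17070, the address the upgrader was previously connected to in the log above.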