It looks like 0-lxc-0 lost its connection to state during the upgrade to 1.22. It tried to download tools from a location that didn't work and didn't try anywhere else. 0-lxc-1 encountered the same problem, but then tried another tools location. I haven't worked out why there is a difference between the two containers, but that would be my next step.
<dooferlad> so, looking at machine-0-lxc-0.log, after the upgrade Juju sees "local-cloud:172.20.171.204" "local-cloud:172.20.168.104"
<dooferlad> this is the same as before the upgrade
<chrome0> where local-cloud:172.20.168.104 is the VIP
<dooferlad> yes
<dooferlad> is there basic network connectivity after the upgrade? Is it a case of "something picked the wrong IP address from that pair"?
<chrome0> basic network conn. is there, yes, if not using the VIP
<chrome0> and "something picked the wrong IP from that pair" sounds right as well :-)
<dooferlad> so the VIP isn't usable inside the cluster? Just outside the cluster it does work?
<chrome0> the VIP should be usable from the cluster, but I remember not being able to ssh to it
<chrome0> "juju ssh"
<chrome0> that is
<dooferlad> so, 172.20.168.104 doesn't show up in the unit-mysql-0.log after the upgrade, neither does 172.20.171.204.
<dooferlad> in fact it looks like 0-lxc-0 doesn't actually come back up with network connectivity
<dooferlad> so I would save "ip route" (or "route -n") pre and post upgrade for machine 0 and machine 0-lxc-0
<dooferlad> and also "sudo iptables-save" (pre and post)
<chrome0> on staging we've had to reboot lxc's because of https://bugs.launchpad.net/juju-core/+bug/1416928
<dooferlad> but that should have been fixed after the upgrade
<chrome0> apparently it hit during/before the upgrade...
<chrome0> aiui
<dooferlad> OK, so from machine-0-lxc-0.log, line 16203 onwards we are trying to upgrade, but fail because we can't download the tools. So, that unit didn't upgrade at all.
<dooferlad> machine-0, machine-1 and all other LXCs got the new jujud.
<dooferlad> so if the VIP was pointing at machine-0-lxc-0, then that is a big problem. The VIP needed to point to another machine when 0-lxc-0 didn't upgrade
<dooferlad> Both 0-lxc-0 and 0-lxc-1 tried to download from the same location at the same time, both failed, only 0-lxc-1 tried another location.
<dooferlad> it looks like 0-lxc-0 lost its connection to state and crapped out.
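The pre/post comparison of routes and firewall rules suggested in the discussion could be scripted roughly as follows. This is only a sketch, not something that was run as part of this investigation: it assumes Python 3 is available on the machine or container being inspected, and the names `snapshot`, `diff_snapshots` and the `<name>-<tag>.txt` file naming are made up for illustration. It runs `ip route` and `sudo iptables-save`, writes the output to files, and diffs two saved outputs while skipping the `#` comment lines of `iptables-save`, which carry timestamps.

```python
import difflib
import subprocess

# Commands to snapshot; the dictionary keys are illustrative names only.
COMMANDS = {
    "routes": ["ip", "route"],
    "iptables": ["sudo", "iptables-save"],
}


def capture(command):
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def snapshot(tag):
    """Write each command's output to <name>-<tag>.txt (hypothetical naming)."""
    for name, command in COMMANDS.items():
        with open(f"{name}-{tag}.txt", "w") as handle:
            handle.write(capture(command))


def diff_snapshots(pre_text, post_text, name="snapshot"):
    """Return unified-diff lines between two saved outputs.

    Lines starting with '#' are dropped first, because iptables-save
    puts timestamps in its comment lines and they would always differ.
    """
    def keep(text):
        return [line for line in text.splitlines() if not line.startswith("#")]

    return list(difflib.unified_diff(
        keep(pre_text), keep(post_text),
        fromfile=f"{name}-pre", tofile=f"{name}-post", lineterm=""))
```

Calling `snapshot("pre")` before the upgrade and `snapshot("post")` afterwards, on both machine 0 and 0-lxc-0, then passing the contents of each file pair to `diff_snapshots`, would show any route or rule that disappeared. The chain policy lines in `iptables-save` output include packet counters, so those lines may show up in the diff even when the rules themselves are unchanged.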