quickstart thinks the unit is started when it's still being installed

Bug #1450191 reported by Curtis Hovey
This bug affects 1 person
Affects          Status        Importance  Assigned to  Milestone
juju-core        Fix Released  Critical    Ian Booth
  1.24           Fix Released  Critical    Ian Booth
juju-quickstart  Invalid       Undecided   Unassigned

Bug Description

As seen in
     http://reports.vapour.ws/releases/2576/job/maas17-quickstart-bundle/attempt/561

quickstart cannot talk to the state-server after it is bootstrapped. We know this MAAS 1.7 works with the same juju for deployments, upgrades, and deployer. We know it works with stable juju and the 1.23.2 release candidate. Only quickstart is broken.

This issue was observed after the fix for bug 1441826 where we saw all the other quickstart jobs pass. Only this one failed. The quickstart job is deploying the landscape-scalable bundle, but since we never get past bootstrap, the bundle is moot.

Revision history for this message
John George (jog) wrote :

With export JUJU=/tmp/temp_qs_workspace/extracted-bin/usr/lib/juju-1.24-alpha1/bin/juju
Where the extracted juju binary is from revision build 2576 (Revision ID: 31762378d472fbe596b5b4c15e46e5489cc32bb9)
The following manual execution of juju quickstart demonstrates the error, with debugging enabled.

juju quickstart --debug -e maas-env1 --constraints 'mem=2G arch=amd64' --no-browser /var/lib/jenkins/repository/landscape-scalable.yaml

Please see the attached logs.

Revision history for this message
Ian Booth (wallyworld) wrote :

I don't know much about quickstart, but here are my thoughts.

Juju reports the known node addresses as:

2015-04-29 20:40:59 INFO juju.worker.certupdater certupdater.go:127 State Server cerificate addresses updated to ["public:maas-node-214.maas" "local-cloud:10.0.40.156"]
I don't think this is a Juju issue - it appears DNS related.

quickstart is attempting to connect to the MAAS node on its cloud-internal (private) address, not the public address, which is a DNS name.

i.e. units and other things can connect to wss://maas-node-214.maas:17070/environment/bcef503c-45ef-42f3-804f-7e6b34040cba/api

quickstart is attempting to connect to wss://10.0.40.156:443/ws/environment/bcef503c-45ef-42f3-804f-7e6b34040cba/api

The quickstart logs show:

20:41:16 WARNING watchers:106 cannot resolve public maas-node-214.maas address, looking for another candidate: [Errno -2] Name or service not known
unit placed on 10.0.40.156

It appears quickstart can't resolve the DNS name, so instead chooses the private IP address.
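
The fallback described above can be illustrated with a minimal Python sketch. This is not quickstart's actual code: the function name pick_reachable_address, the (scope, host) candidate format and the injectable resolver parameter are assumptions made for illustration. Only the general behaviour, trying the public name first and moving to the next candidate when resolution fails, comes from the log above.

```python
import socket


def pick_reachable_address(candidates, resolver=socket.getaddrinfo):
    """Return the first (scope, host) pair whose host resolves, else None.

    candidates is an ordered list of (scope, host) pairs, for example the
    "public" DNS name first and the "local-cloud" IP address second.
    Hypothetical helper: names and structure are illustrative only.
    """
    for scope, host in candidates:
        try:
            # Only checks that the name resolves; it does not open a connection.
            resolver(host, None)
        except socket.gaierror as err:
            print("cannot resolve {} {} address, looking for another "
                  "candidate: {}".format(scope, host, err))
            continue
        return scope, host
    return None
```

Called with [("public", "maas-node-214.maas"), ("local-cloud", "10.0.40.156")] on a machine that cannot resolve .maas names, a helper like this would skip the public name and return the local-cloud address, which matches the "unit placed on 10.0.40.156" line in the quickstart log.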

Revision history for this message
John George (jog) wrote :

10.0.40.* is the public subnet in this environment. If the test driver were configured to use the MAAS server for name resolution, maas-node-214.maas would resolve to 10.0.40.156.

dig +short @10.0.40.100 maas-node-214.maas
10.0.40.156

Revision history for this message
Ian Booth (wallyworld) wrote :

This issue isn't MAAS-specific, but it probably shows up there because MAAS runs faster on real hardware than cloud deployments do.

megawatcher reports that the unit is started before it is ready. Older charms that don't yet set status never set it to active, so a best guess is made. Clearly the guess is sometimes wrong.
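
To make the "best guess" problem concrete, here is a rough Python sketch. It is not juju's megawatcher logic, and the thread does not describe how the fix was implemented. The function guess_unit_state and its parameters agent_running, charm_status and install_done are hypothetical names used only to show why guessing from the agent state alone can report "started" too early.

```python
def guess_unit_state(agent_running, charm_status=None, install_done=False):
    """Return "started" or "pending" for a unit (illustrative only).

    charm_status is the status a newer charm sets itself; None models an
    older charm that never sets status, which forces a guess.
    """
    if charm_status is not None:
        # The charm reports its own status, so no guessing is needed.
        return "started" if charm_status == "active" else "pending"
    # Older charm: guessing from agent_running alone would say "started"
    # while the install hook is still running. This sketch assumes the
    # guess also waits for install to finish.
    return "started" if agent_running and install_done else "pending"
```

With this sketch, guess_unit_state(True) stays "pending" until install_done is true, whereas a guess based only on the running agent would already have said "started".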

summary: - quickstart cannot talk juju on maas 1.7
+ quickstart thinks the unit is started when it's still being installed
Revision history for this message
Ian Booth (wallyworld) wrote :
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Changed in juju-quickstart:
status: New → Invalid
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Curtis Hovey (sinzui) wrote :

The revision did not pass. We need to know if juju is still broken or if the maas cannot talk to the charm store. I see this error:

charm URL: cs:trusty/juju-gui-27
requesting juju-gui deployment
juju-quickstart: error: bad API response: charm "cs:trusty/juju-gui-27" not found
2015-04-30 14:43:01 ERROR juju.cmd supercommand.go:430 subprocess encountered error code 1
2015-04-30 14:43:01 ERROR Command '('juju', '--show-log', 'quickstart', '-e', 'maas17-quickstart-bundle', '--constraints', 'mem=2G arch=amd64', '--no-browser', '/var/lib/jenkins/repository/landscape-scalable.yaml')' returned non-zero exit status 1

^ That is an old version of the juju-gui charm. It cannot disappear from the store.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → In Progress
Revision history for this message
John George (jog) wrote :

This same test, using a different juju version, passed on the MAAS environment before and after this 1.24 failure. Since these other tests installed cs:trusty/juju-gui-27 successfully, I think that rules out network setup issues with contacting the charm store.

Revision history for this message
Ian Booth (wallyworld) wrote :

If there's a different issue, we need to raise a new bug. The issue which resulted in this bug being raised, that quickstart thought the charm had started when in fact it was still installing, has been fixed.

The testing of the revision with this fix failed in some cases deploying the bundle due to:

    containers:
      2/lxc/0:
        agent-state-info: 'failed to retrieve the template to clone: template container
          "juju-trusty-lxc-template" did not stop'
        instance-id: pending
        series: trusty
      2/lxc/1:
        agent-state-info: 'lxc container cloning failed: cannot clone a running container'
        instance-id: pending
        series: trusty

This is already reported as bug 1441319.
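
For anyone triaging similar status output, a small Python sketch like the following can list the containers that report an error. It assumes the status has already been parsed into a dict shaped like the "containers" section above. The helper name failed_containers is hypothetical, and the sample data is copied from the status output in this comment.

```python
def failed_containers(containers):
    """Map container name to its agent-state-info message, when present."""
    return {
        name: info["agent-state-info"]
        for name, info in containers.items()
        if info.get("agent-state-info")
    }


# Sample data taken from the status output quoted above.
containers = {
    "2/lxc/0": {
        "agent-state-info": 'failed to retrieve the template to clone: '
                            'template container "juju-trusty-lxc-template" '
                            'did not stop',
        "instance-id": "pending",
        "series": "trusty",
    },
    "2/lxc/1": {
        "agent-state-info": "lxc container cloning failed: cannot clone a "
                            "running container",
        "instance-id": "pending",
        "series": "trusty",
    },
}

for name, message in sorted(failed_containers(containers).items()):
    print("{}: {}".format(name, message))
```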

I'm marking this as fix committed again as this issue was indeed fixed.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Curtis Hovey (sinzui) wrote :

I hesitate to say this is Fix Released when all the quickstart jobs are now broken, and the job that was broken is still broken. For the sake of managing bugs, I am closing this bug and opening a blocker that describes the mutation. See bug 1450912.

Changed in juju-core:
status: Fix Committed → Fix Released