machines can be half-added and thereby unable to be removed
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju | Triaged | Medium | Unassigned | | |
Bug Description
Hi,
I was facing an issue around adding/removing machines.
My setup is a local juju (client), a canonistack juju controller, and a canonistack machine that I wanted to add with add-machine.
I managed to get there and had the machine added:
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
default manual-
Machine State DNS Inst id Series AZ Message
0 started 10.48.131.170 manual:
Then I wanted to deploy a charm and forgot that it is only a subordinate.
$ juju deploy /tmp/charm-
Located local charm "ntp", revision 0
ERROR cannot use --num-units or --to with subordinate application
Fine, you'd think, then use it differently.
I thought I'd deploy something else that isn't a subordinate
$ juju deploy ubuntu --to 0
Located charm "ubuntu" in charm-hub, revision 19
Deploying "ubuntu" from charm-hub charm "ubuntu", revision 19 in channel stable
ERROR cannot deploy "ubuntu" to machine 0: machine 0 not found
But the machine was gone.
No ID 0 anymore ??
Well ok let us add it again
$ juju add-machine ssh:ubuntu@
ERROR machine is already provisioned
Hmm, ok then let us remove it to re-add cleanly
$ juju remove-machine 0
removing machine 0 failed: machine 0 not found
$ juju machines
Machine State DNS Inst id Series AZ Message
So my machine is gone and I can't use it, but I also can't add it.
This is locking me out of everything, and all I can do right now is purge all configuration and retry.
tags: | added: manual-provider |
Changed in juju: | |
importance: | Undecided → Medium |
status: | New → Triaged |
milestone: | none → 2.9-next |
Changed in juju: | |
importance: | Low → Medium |
milestone: | 2.9-next → 3.2-beta1 |
Changed in juju: | |
milestone: | 3.2-beta1 → 3.2-rc1 |
Changed in juju: | |
milestone: | 3.2-rc1 → 3.2.0 |
Changed in juju: | |
milestone: | 3.2.0 → 3.2.1 |
Changed in juju: | |
milestone: | 3.2.1 → 3.2.2 |
Changed in juju: | |
milestone: | 3.2.2 → 3.2.3 |
Changed in juju: | |
milestone: | 3.2.3 → 3.2.4 |
I found this on the target:
4 0 149497 1 20 0 9068 3592 - Ss ? 0:00 bash /etc/systemd/system/jujud-machine-0-exec-start.sh
4 0 149502 149497 20 0 846588 92660 - SLl ? 1:41 \_ /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
So I cleared things via:
$ sudo /usr/sbin/remove-juju-services
That allowed me to add it again.
IMHO this is a situation that can be detected and handled much better.
I'd ask to:
a) At least offer the user a better error message than "ERROR machine is already provisioned", based on the metadata you found there, like
"ERROR machine is already provisioned - for controller X on IP Y, at data Z"
b) It would be very helpful to then offer "do you want to clean and re-add the machine", which would then call remove-juju-services for the user.
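As a rough illustration of the kind of detection asked for in (a), the sketch below scans a directory for leftover machine-agent start scripts and prints a more descriptive message. It is not how Juju implements the check. The function name and message wording are hypothetical, and the directory and file pattern are only assumed from the process listing in this report.

```shell
#!/bin/sh
# Sketch only, not Juju's own check: look for leftover machine-agent start
# scripts and report which machine-id they belong to.
# Assumption: agents leave scripts named jujud-machine-<id>-exec-start.sh in
# the systemd unit directory, as seen in the process listing in this report.
find_leftover_juju_agents() {
    unit_dir="${1:-/etc/systemd/system}"
    status=1                                 # 1 = nothing found
    for script in "$unit_dir"/jujud-machine-*-exec-start.sh; do
        [ -e "$script" ] || continue         # glob matched nothing
        name="${script##*/}"                 # strip the directory part
        id="${name#jujud-machine-}"          # strip the prefix
        id="${id%-exec-start.sh}"            # strip the suffix, leaving the id
        echo "machine is already provisioned: leftover agent for machine-id $id ($script)"
        status=0                             # 0 = at least one agent found
    done
    return "$status"
}
```

Run on the target as `find_leftover_juju_agents`, a non-empty result would indicate that cleanup is needed before add-machine can succeed again.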
Right now I realize that bug 1933819 and this one come down to almost the same root cause, just once for machine-add and once for controller bootstrap. If you want to implement/fix this in one, then feel free to dup the two together.