cloud-init doesn't retry metadata lookups and hangs forever if metadata is down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloud-init | Fix Released | Medium | Mike Gerdts |
cloud-init (Ubuntu) | Fix Released | Medium | Unassigned |
Precise | Won't Fix | Medium | Unassigned |
Trusty | Confirmed | Medium | Unassigned |
Bug Description
If a host SmartOS server is rebooted and the metadata service is not available, a KVM VM instance that uses cloud-init (via the SmartOS datasource) will fail to start.
If the metadata agent on the host server is not available, the Python code for cloud-init blocks forever waiting for data it will never receive. This causes the boot process for an instance to hang on cloud-init.
This is problematic whenever the metadata agent is unavailable, for any reason, while a SmartOS KVM VM that relies on cloud-init is booting.
From the engineer who worked on this (note that the svcadm command is run on the host SmartOS server):
You can reproduce this by disabling the metadata service on the SmartOS host:
svcadm disable metadata
and then boot a KVM VM running an Ubuntu Certified Cloud image such as:
c864f104-
When you do this, the VM's boot process will hang at cloud-init. If you then start the metadata service, cloud-init will not recover.
One of our engineers who looked at this was able to cause forward progress by applying this patch:
--- /usr/lib/
+++ /usr/lib/
@@ -286,7 +286,7 @@
if not seed_device:
raise AttributeError(
- ser = serial.
+ ser = serial.
if not ser.isOpen():
raise SystemError("Unable to open %s" % seed_device)
which causes the following strace output:
[pid 2119] open("/dev/ttyS1", O_RDWR|
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_
[pid 2119] write(5, "GET user-script\n", 16) = 16
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
[pid 2119] open("/dev/ttyS1", O_RDWR|
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_
[pid 2119] write(5, "GET iptables_
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
instead of:
[pid 1977] open("/dev/ttyS1", O_RDWR|
[pid 1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_
[pid 1977] write(5, "GET base64_keys\n", 16) = 16
[pid 1977] select(6, [5], [], [], NULL
which you get without the patch (notice the NULL for the timeout parameter). The code that gets blocked in this version of cloud-init is:
ser.write("GET %s\n" % noun.rstrip())
status = str(ser.
in cloudinit/
(https:/
Be careful when using readline(). Do specify a timeout when opening the serial port otherwise it could block forever if no newline character is received. Also note that readlines() only works with a timeout. readlines() depends on having a timeout and interprets that as EOF (end of file). It raises an exception if the port is not opened correctly.
which is exactly the situation we've hit here.
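The difference between the two strace outputs (a NULL timeout versus {10, 0}) can be illustrated with a small stand-alone sketch. It uses only the Python standard library, with a pipe standing in for /dev/ttyS1; `read_line_with_timeout` is a hypothetical helper written for this illustration, not code from cloud-init or pySerial.

```python
import os
import select


def read_line_with_timeout(fd, timeout):
    """Read one newline-terminated line from fd.

    timeout=None blocks forever (the select(..., NULL) case in the strace).
    A number makes each wait give up after that many seconds and return None.
    """
    buf = b""
    while not buf.endswith(b"\n"):
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            return None  # timed out: nobody answered
        chunk = os.read(fd, 1)
        if not chunk:
            return None  # EOF: the writer went away
        buf += chunk
    return buf.decode()


# A pipe stands in for the serial device. Nothing is written at first,
# which mimics a disabled metadata service.
read_fd, write_fd = os.pipe()
silent = read_line_with_timeout(read_fd, timeout=0.2)

# Once the "service" answers, the same call returns the line.
os.write(write_fd, b"NOTFOUND\n")
answered = read_line_with_timeout(read_fd, timeout=0.2)
```

With `timeout=None` the first call would never return, which matches the hang described above. Note that the timeout here applies to each wait for data, not to the line as a whole.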
It might be better to have a timeout, but when the timeout is hit, the GET should be retried rather than moving on to the next key. A negative answer (NOTFOUND, for example) should not be retried; retry only when there is no answer at all (because metadata is unavailable).
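A minimal sketch of that retry behaviour follows. It is an illustration under stated assumptions, not the fix that was merged: `FakeSerial` and `get_with_retry` are hypothetical names, an empty string from `readline()` is assumed to mean the read timed out, and any non-empty status line (including NOTFOUND) is treated as a final answer.

```python
import time


class FakeSerial:
    """Hypothetical stand-in for the serial port (not cloud-init or pySerial)."""

    def __init__(self, replies):
        self.replies = list(replies)  # one entry per readline() call
        self.written = []

    def write(self, data):
        self.written.append(data)

    def readline(self):
        # An empty string models a read that hit its timeout.
        return self.replies.pop(0) if self.replies else ""


def get_with_retry(port, noun, max_attempts=None, delay=0.0):
    """Send "GET <noun>" until the other side answers.

    Returns the status line, or None if max_attempts is exhausted.
    max_attempts=None retries forever.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        port.write("GET %s\n" % noun.rstrip())
        status = port.readline().rstrip()
        if status:
            return status  # any answer, including NOTFOUND, is final
        time.sleep(delay)  # no answer: wait, then retry the same key
    return None


# Two timeouts, then the metadata service comes back with a negative answer.
port = FakeSerial(["", "", "NOTFOUND\n"])
status = get_with_retry(port, "user-script")

# A bounded caller gives up instead of blocking the boot indefinitely.
gave_up = get_with_retry(FakeSerial([]), "user-script", max_attempts=2)
```

Retrying the same key keeps the request order intact, so a slow metadata service delays the boot instead of causing keys to be skipped.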
Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time.
Related branches
- Server Team CI bot: Approve (continuous-integration)
- cloud-init Commiters: Pending requested
Diff: 1818 lines (+897/-236), 35 files modified
MANIFEST.in (+1/-0)
bash_completion/cloud-init (+77/-0)
cloudinit/analyze/__main__.py (+1/-1)
cloudinit/config/cc_apt_configure.py (+1/-1)
cloudinit/config/cc_disable_ec2_metadata.py (+12/-2)
cloudinit/config/cc_power_state_change.py (+1/-1)
cloudinit/config/cc_rsyslog.py (+2/-2)
cloudinit/config/tests/test_disable_ec2_metadata.py (+50/-0)
cloudinit/distros/freebsd.py (+3/-3)
cloudinit/net/network_state.py (+5/-6)
cloudinit/netinfo.py (+273/-72)
cloudinit/sources/DataSourceSmartOS.py (+103/-16)
cloudinit/tests/helpers.py (+14/-26)
cloudinit/tests/test_netinfo.py (+101/-85)
cloudinit/util.py (+3/-3)
debian/changelog (+13/-0)
doc/examples/cloud-config-disk-setup.txt (+2/-2)
packages/redhat/cloud-init.spec.in (+1/-0)
packages/suse/cloud-init.spec.in (+1/-0)
setup.py (+1/-0)
tests/cloud_tests/testcases/base.py (+1/-1)
tests/data/netinfo/netdev-formatted-output (+10/-0)
tests/data/netinfo/new-ifconfig-output (+18/-0)
tests/data/netinfo/old-ifconfig-output (+18/-0)
tests/data/netinfo/route-formatted-output (+22/-0)
tests/data/netinfo/sample-ipaddrshow-output (+13/-0)
tests/data/netinfo/sample-iproute-output-v4 (+3/-0)
tests/data/netinfo/sample-iproute-output-v6 (+11/-0)
tests/data/netinfo/sample-route-output-v4 (+5/-0)
tests/data/netinfo/sample-route-output-v6 (+13/-0)
tests/unittests/test_datasource/test_smartos.py (+102/-1)
tests/unittests/test_filters/test_launch_index.py (+5/-5)
tests/unittests/test_merging.py (+1/-1)
tests/unittests/test_runs/test_merge_run.py (+1/-1)
tests/unittests/test_util.py (+9/-7)
- Server Team CI bot: Approve (continuous-integration)
- Scott Moser: Approve
Diff: 353 lines (+204/-16), 2 files modified
cloudinit/sources/DataSourceSmartOS.py (+102/-15)
tests/unittests/test_datasource/test_smartos.py (+102/-1)
Changed in cloud-init (Ubuntu): | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Changed in cloud-init: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Changed in cloud-init (Ubuntu): | |
status: | Confirmed → Fix Released |
Changed in cloud-init (Ubuntu Precise): | |
status: | New → Confirmed |
Changed in cloud-init (Ubuntu Trusty): | |
status: | New → Confirmed |
Changed in cloud-init (Ubuntu Precise): | |
importance: | Undecided → Medium |
Changed in cloud-init (Ubuntu Trusty): | |
importance: | Undecided → Medium |
Changed in cloud-init (Ubuntu): | |
status: | Fix Released → Confirmed |
Changed in cloud-init (Ubuntu Precise): | |
status: | Confirmed → Won't Fix |
Changed in cloud-init: | |
assignee: | nobody → Mike Gerdts (mgerdts) |
"Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time."
That doesn't really make sense. Timing out after 10 seconds and retrying wouldn't really change anything.
Either you block the boot waiting for the other end of the read() or you don't. "Have the boot process complete at that time" is, I think, what we have in precise and trusty right now.
Boot will hang until the read() comes back. I'm not sure whether upstart will actually wait forever for cloud-init-local to return; it may well do that. I suspect that systemd will kill cloud-init after something like 90 seconds.
It's very arguable that the *right* thing to do is wait forever on the metadata service. Cloud-init only knows what it should do next based on information from the metadata service. If this is the first boot of an instance, then some things will happen; if it is a reboot, then things such as ssh key generation or user creation will not occur.
So this is *kind of* working by design. Sure, we can change the setting to give up, but that will just result in cloud-init not finding a datasource. If that was an instance's first boot, then the user wouldn't be able to access the system. If it was any other boot, then things would be fine, but cloud-init has no way of knowing that this is a "second boot" other than through the metadata service.