Comment 5 for bug 1473527

Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

I spent a while trying to reproduce this. Here is what I've found so far:

  - I tried many vm configurations running 'serial-shell-looper', both manually
    and started by cloud-init, but it didn't break
  - I wrote a similar script using util.meta_log to see if the difference in
    implementation between python open() and shell piping would make a
    difference, but I wasn't able to find anything useful (a rough sketch of
    the idea is below this list).
    http://paste.ubuntu.com/15843875/
  - I was able to reproduce about 9 times out of 10 using XenialTestBasic with
    no modifications. After removing most of the functionality of the test
    other than basic booting (no curtin cmd, no curtin archive, no extra
    disks), it still failed just as reliably, though there were still
    occasional runs with no failure
  - Since these failures have only started occurring recently, I reverted the
    net.ifnames=0 removal and ran vmtests several times, and did not see any
    failures. I enabled and disabled this parameter many times to make sure,
    and the issue appears almost always with ifnames enabled and never with
    them disabled, suggesting that the naming of network devices is somehow
    shifting timing enough to toggle this error on and off, though I haven't
    figured out how yet
  - I was able to reproduce it just as often using a modified version of the
    cloud-init vmtests, using both a cloud-init deb built from the current
    revision of cloud-init and a deb built from cloud-init at revision 1188,
    before the new networking code was merged in. With both versions, the
    error almost always occurred with ifnames enabled and never occurred with
    ifnames disabled.
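
A rough standalone sketch of the kind of comparison described in the second
item (not the actual pasted script; the device path, message, and loop count
are just placeholders):

    #!/usr/bin/env python
    # Sketch: write the same message to the console two ways in a loop, to see
    # whether the write path (direct python open() vs. a shell pipeline) makes
    # any difference to the hang. Placeholders only, not the real test script.
    import subprocess
    import time

    CONSOLE = '/dev/console'

    for i in range(1000):
        # direct python open()/write()
        with open(CONSOLE, 'w') as f:
            f.write('%d (python) console write test\n' % i)
        # shell piping, similar to what a serial-shell-looper style script does
        subprocess.call('echo "%d (shell) console write test" > %s'
                        % (i, CONSOLE), shell=True)
        time.sleep(0.1)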

I'm not really sure how to reproduce this error on a small scale yet. I am
going to try to figure out what could be running concurrently with
cc_ssh_authkey_fingerprints and see if I can learn anything else from there. I
haven't yet tried disabling StandardOutput=journal+console in
cloud-final.service, but I will give that a try as well, although it is
already present in wily and wily does not seem to have this issue.
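
If the console output turns out to be involved, one way to try disabling it
would be a systemd drop-in for cloud-final.service, something like the
following (using the standard drop-in mechanism; the file name is arbitrary),
followed by a 'systemctl daemon-reload' before the unit next runs:

    # /etc/systemd/system/cloud-final.service.d/no-console.conf
    [Service]
    StandardOutput=journal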

The only idea I have so far for the underlying cause is flow control on
/dev/console. Since the serial console is being forwarded to a file over ipmi
by qemu and is write-only, it may be that something (maybe agetty?) expects to
read from there and flow control is causing writes to block. I'm not sure that
makes sense though. The main thing suggesting it is a series of reports on
several mailing lists about syslog-ng writing directly to /dev/console and
hanging in some situations, such as when traffic from /dev/console is being
forwarded to a device that temporarily goes offline, causing the write to
block.
http://comments.gmane.org/gmane.comp.syslog-ng/10561
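
If flow control really is the cause, one way to probe for it from inside a
stuck guest might be a non-blocking write to /dev/console (just a sketch; if
output is stalled and the tty buffer has filled, the write should fail with
EAGAIN instead of hanging):

    import errno
    import os

    # Open write-only and non-blocking so a blocked console cannot hang us.
    fd = os.open('/dev/console', os.O_WRONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        os.write(fd, b'console probe\n')
        print('write succeeded - console output does not appear blocked')
    except OSError as e:
        if e.errno == errno.EAGAIN:
            print('write would block - output to /dev/console is stalled')
        else:
            raise
    finally:
        os.close(fd)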