I've found a few cases of this:
https://ci.lscape.net/job/landscape-system-tests/4940
https://ci.lscape.net/job/landscape-system-tests/4952
https://ci.lscape.net/job/landscape-system-tests/4963
The symptom is an lxd unit left in the 'pending' state even though the deployment is several hours old (the deployment times out after 4 hours). The process listing also shows the lxc container still running cloud-init (see https://pastebin.canonical.com/175702/ for better formatting):
[from ps-fauxww.txt]
root 5617 0.0 0.0 73384 4024 ? Ss 14:44 0:00 [lxc monitor] /var/lib/lxd/containers juju-8a4dbf-1-lxd-1
100000 5655 0.0 0.0 37556 5560 ? Ss 14:44 0:01 \_ /sbin/init
100000 5870 0.0 0.0 41720 3240 ? Ss 14:44 0:00 \_ /lib/systemd/systemd-udevd
100000 5879 0.0 0.0 52052 14920 ? Ss 14:44 0:02 \_ /lib/systemd/systemd-journald
100000 6231 0.0 0.0 65520 5868 ? Ss 14:44 0:00 \_ /usr/sbin/sshd -D
100001 6237 0.0 0.0 26044 2192 ? Ss 14:44 0:00 \_ /usr/sbin/atd -f
100107 6243 0.0 0.0 42892 3972 ? Ss 14:44 0:00 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
100000 6249 0.0 0.0 209044 12548 ? Ssl 14:44 0:00 \_ /usr/lib/snapd/snapd
100000 6269 0.0 0.0 26068 2560 ? Ss 14:44 0:00 \_ /usr/sbin/cron -f
100000 6274 0.0 0.0 272940 5912 ? Ssl 14:44 0:00 \_ /usr/lib/accountsservice/accounts-daemon
100000 6277 0.0 0.0 20100 1176 ? Ss 14:44 0:00 \_ /lib/systemd/systemd-logind
100104 6282 0.0 0.0 186900 3304 ? Ssl 14:44 0:00 \_ /usr/sbin/rsyslogd -n
100000 6314 0.0 0.0 277180 6136 ? Ssl 14:44 0:00 \_ /usr/lib/policykit-1/polkitd --no-debug
100000 6357 0.0 0.0 12844 1776 pts/1 Ss+ 14:44 0:00 \_ /sbin/agetty --noclear --keep-baud console 115200 38400 9600 linux
100000 6724 0.0 0.1 93124 30556 ? Ss 14:44 0:00 \_ /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
100000 6882 0.0 0.0 4508 800 ? S 14:44 0:00 \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
100000 6883 0.0 0.0 4384 764 ? S 14:44 0:00 | \_ tee -a /var/log/cloud-init-output.log
100000 6998 0.0 0.0 4508 1668 ? S 14:44 0:00 \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
100000 713392 0.0 0.0 4380 652 ? S 18:16 0:00 \_ sleep 15
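A stuck container like this can be spotted mechanically in a saved process listing. The following is only a minimal sketch, assuming the `ps fauxww` column layout shown above (user, then PID, with the command at the end); the function name `find_cloud_init_final` is hypothetical and not part of any existing tooling:

```python
def find_cloud_init_final(ps_text):
    """Return PIDs of processes still running cloud-init's final stage."""
    pids = []
    for line in ps_text.splitlines():
        # Only look at lines for the final cloud-init stage seen in the listing.
        if "cloud-init modules --mode=final" not in line:
            continue
        fields = line.split()
        # In `ps aux`-style output the second column is the PID.
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids
```

Run against the listing above, this would be expected to flag PID 6724. It only shows that the final stage is still present, not why it is stuck.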
The next step is to catch this on a live system and extract the cloud-init logs from the host. Unfortunately, these are not currently saved in the log dump.
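One way to grab those logs when the problem is caught live is `lxc file pull`, run on the machine hosting the container. This is a rough sketch, not part of the existing log-dump tooling: the helper names are hypothetical, and it assumes the `lxc` client is available on the host and that the destination directory already exists. The two log paths are the standard cloud-init locations; `/var/log/cloud-init-output.log` also appears in the process listing.

```python
import subprocess

# Standard cloud-init log locations inside the container.
CLOUD_INIT_LOGS = (
    "/var/log/cloud-init.log",
    "/var/log/cloud-init-output.log",
)


def pull_commands(container, dest_dir):
    """Build one `lxc file pull` command per cloud-init log file."""
    return [
        ["lxc", "file", "pull", f"{container}{path}", f"{dest_dir}/"]
        for path in CLOUD_INIT_LOGS
    ]


def pull_logs(container, dest_dir):
    """Run the pull commands; keep going if one of the logs is missing."""
    for cmd in pull_commands(container, dest_dir):
        subprocess.run(cmd, check=False)
```

For the container in the listing, this would be called as `pull_logs("juju-8a4dbf-1-lxd-1", "/tmp/cloud-init-logs")`, where the destination path is only an example.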
I've hit this a few times as well. In all cases, only the first LXD container to spawn was affected (first in time, not necessarily by container number). All containers that spawned seconds later started OK.
It looks like containers are launched before the network is fully set up.
I'm attaching cloud-init logs for the last failure I've hit.