MaaS/Curtin fail to unmount /target/run way too often
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | Medium | Unassigned |
curtin | Fix Released | Undecided | Unassigned |
Bug Description
I'm in the process of iteratively rebuilding OpenStack on Juju+MaaS, and every few deploys I get anywhere from one to all of the physical hosts failing with:
```
TIMED subp(['udevadm', 'settle']): 0.010
Running command ['umount', '/tmp/tmpj0gpgg
Running command ['umount', '/tmp/tmpj0gpgg
Running command ['umount', '/tmp/tmpj0gpgg
umount: /tmp/tmpj0gpggk
finish: cmd-install/
finish: cmd-install/
Traceback (most recent call last):
```
or variants thereof wherein unmounting `.../target/run` fails, which fails to provision the node, and breaks juju's attempts at openstack deployment. Juju's inability to `retry-
The problem is intermittent - some runs, all nodes are good, run after run. Then suddenly one or several, or ALL, will fail with varying versions of log files leading up to the failure to unmount `target/run`. After that, several subsequent deployments fail with these errors, and then it seems to go away.
We see something similar in our Arch Linux process, where gpg-agent processes can get stuck running in the chroot context (well, nspawn) during package operations and have to be killed manually before the target can be unmounted from the parent. We addressed this in our workflows by checking for running processes before exiting the chroot: we wait up to a minute for them to finish, then kill whatever remains.
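The check described above could look roughly like the following. This is a minimal sketch, not our actual workflow code and not anything Curtin ships: the function names `pids_rooted_in` and `drain_chroot` are made up for illustration, and it assumes a plain chroot on Linux where `/proc/<pid>/root` points at the target directory (an nspawn container may need a different lookup).

```python
import os
import signal
import time


def pids_rooted_in(target, proc_dir="/proc"):
    """Return PIDs whose root directory is `target` or lies beneath it."""
    target = os.path.realpath(target)
    found = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            root = os.readlink(os.path.join(proc_dir, entry, "root"))
        except OSError:
            # Process already exited, or we may not inspect it.
            continue
        if root == target or root.startswith(target + os.sep):
            found.append(int(entry))
    return sorted(found)


def drain_chroot(target, grace=60.0, poll=1.0, proc_dir="/proc",
                 kill=os.kill, sleep=time.sleep, clock=time.monotonic):
    """Wait up to `grace` seconds for stragglers, then SIGKILL what remains.

    Returns the list of PIDs that had to be killed (empty if all exited).
    """
    deadline = clock() + grace
    while True:
        remaining = pids_rooted_in(target, proc_dir)
        if not remaining or clock() >= deadline:
            break
        sleep(poll)
    for pid in remaining:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # exited between the scan and the kill
    return remaining
```

Calling `drain_chroot("/path/to/target")` right before the unmount sequence gives leftover processes a minute to finish and then removes them, so nothing holds `target/run` open when `umount` runs.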
Related branches
- Ryan Harper (community): Approve
- Dan Bungert: Approve
- Server Team CI bot: Approve (continuous-integration)
- Diff: 118 lines (+35/-9), 2 files modified
  - curtin/util.py (+20/-1)
  - tests/unittests/test_curthooks.py (+15/-8)
Changed in maas:
milestone: none → next
Changed in maas:
milestone: 3.2.0 → 3.2.0-beta3
status: Fix Committed → Fix Released
It appears that the last command to be run is `apt-get clean`, forked, in a namespace inside the chroot:
```
Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmphi85v827/target', 'apt-get', 'clean'] with allowed return codes [0] (capture=False)
Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
TIMED subp(['udevadm', 'settle']): 0.013
Running command ['umount', '/tmp/tmphi85v827/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
Running command ['umount', '/tmp/tmphi85v827/target/sys'] with allowed return codes [0] (capture=False)
Running command ['umount', '/tmp/tmphi85v827/target/run'] with allowed return codes [0] (capture=False)
umount: /tmp/tmphi85v827/target/run: target is busy.
```
My guess is that this is still executing when the unmounts occur.
Happens on XFS, EXT4, or my preferred ZFS (which for some reason is run with no features enabled, resulting in significantly higher numbers of IOs issued to the underlying media).
The OS storage media for the compute/OSD nodes is 128G of SD card on the iLO (leaving storage bus free for Ceph), and these all passed even the abuse FIO threw at them via MaaS testing. They showed ~20MB/s on sync-io, and depending on what apt cached, are probably still unlinking inodes when Curtin tries to unmount /run from the target.
It's not uncommon for commercial hypervisors to be deployed this way, and I think this comes down to a missing safety latch in Curtin: it does not wait for slower install media at the end of the deployment.
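To illustrate the kind of safety latch I mean, here is a minimal sketch of an unmount that retries with a growing delay instead of failing on the first "target is busy". The helper name `umount_with_retry` and its parameters are hypothetical; this is not Curtin's API and not necessarily how the linked branch addresses the problem.

```python
import subprocess
import time


def umount_with_retry(path, attempts=5, delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run `umount path` up to `attempts` times, doubling the delay between tries.

    Returns the attempt number that succeeded; raises RuntimeError if all fail.
    """
    for attempt in range(1, attempts + 1):
        result = run(["umount", path], capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt == attempts:
            raise RuntimeError(
                "umount %s failed after %d attempts: %s"
                % (path, attempts, result.stderr.strip()))
        # Give slow media time to finish outstanding I/O before retrying.
        sleep(delay)
        delay *= 2
```

With the defaults this waits 1, 2, 4 and 8 seconds between the five attempts, which would be cheap on fast disks and might be enough slack for an SD card that is still flushing after `apt-get clean`.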