systemd randomly fails to activate mount units in Ubuntu Core 18

Bug #1949089 reported by Alberto Mardegan
Affects | Status | Importance | Assigned to | Milestone
systemd (Ubuntu) | New | Undecided | Unassigned |
systemd (Ubuntu Bionic) | New | Undecided | Lukas Märdian |

Bug Description

For the past month or so, we've been seeing random failures in our snapd spread tests where systemd could not start the mount unit associated with a snap because of a failed dependency.

The issue is described in the comments to PR https://github.com/snapcore/snapd/pull/10935, but I'll summarize it here.

When starting a snap, snapd creates a mount unit to mount the snap's squashfs (the template is https://github.com/snapcore/snapd/blob/release/2.53/systemd/systemd.go#L1186-L1205). Snapd then asks systemd to reload its configuration, and starts the mount unit.
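For reference, a mount unit generated from that template looks roughly like this (an illustrative sketch, not the exact snapd output; the snap name, revision, and option list here are assumptions):

```
[Unit]
Description=Mount unit for hello-world, revision 29

[Mount]
What=/var/lib/snapd/snaps/hello-world_29.snap
Where=/snap/hello-world/29
Type=squashfs
Options=nodev,ro,x-gdu.hide

[Install]
WantedBy=multi-user.target
```

The `What=` loop-backed squashfs is what later gives systemd a reason to synthesize a dev-loopN.device dependency for the unit.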

The failure we've observed is that sometimes systemd decides to stop our mount unit (search for "Unmounting Mount unit for test-snapd-svc-flip-flop" in the attached log), and then tries to reactivate it again, and at that point it fails.

When I asked for help, Lukas pointed out that the latest update contains a patch related to reload handling and mount units: http://launchpadlibrarian.net/555420796/systemd_237-3ubuntu10.51_237-3ubuntu10.52.diff.gz (the patch itself is easier to read at https://github.com/systemd/systemd/commit/f0831ed2a03fcef582660be1c3b1a9f3e267e656). When looking at the systemd git log, though, I noticed another patch that was applied shortly after this one, which also seems related but was not backported: https://github.com/systemd/systemd/commit/04eb582acc203eab0bc5c2cc5e13986f16e09df0

Since the stopping of our mount unit happens immediately after a systemd reload, it actually seems very likely that the inclusion of f0831ed2a03fcef582660be1c3b1a9f3e267e656 in the systemd update is what causes our woes (though, indeed, the issue is not reliably reproducible, so we cannot be sure).

Revision history for this message
Alberto Mardegan (mardy) wrote :
tags: added: rls-jj-incoming
Revision history for this message
Lukas Märdian (slyon) wrote :

Not having a reliable reproducer is always a tricky situation to be in...

Indeed that other commit might be related. I've backported the following patches and created a custom systemd build in https://launchpad.net/~slyon/+archive/ubuntu/lp1949089

https://github.com/systemd/systemd-stable/commit/cc6271f17d3c5c200e8a391f10d0afb71403fc28
https://github.com/systemd/systemd-stable/commit/04eb582acc203eab0bc5c2cc5e13986f16e09df0

Can you confirm that the problems you observed are gone using this systemd build?

There's also a custom UC18 image I assembled, that contains this version of systemd: https://people.ubuntu.com/~slyon/uc18/

Revision history for this message
Ian Johnson (anonymouse67) wrote :

@slyon could you make available the core18 snap you built with this systemd?

Revision history for this message
Lukas Märdian (slyon) wrote :

Sure, I put that custom core18 snap into https://people.ubuntu.com/~slyon/uc18/ as well.

tags: added: fr-1832
tags: removed: rls-jj-incoming
Revision history for this message
Michael Vogt (mvo) wrote :

We created a test PR that uses this core18 snap build in https://github.com/snapcore/snapd/pull/11015 - we are still running tests but it seems the error is still there :-(

Revision history for this message
Lukas Märdian (slyon) wrote :

Sounds like we cannot have an easy fix by simply applying 04eb582acc203eab0bc5c2cc5e13986f16e09df0 (and its dependencies).

So let's first try to pinpoint if f0831ed2a03fcef582660be1c3b1a9f3e267e656 really introduced the regression. I've built a new core18 snap with my previous work and the inclusion of f0831ed2a03fcef582660be1c3b1a9f3e267e656 (and dependencies) reverted:
https://people.ubuntu.com/~slyon/uc18/core18_20211105_amd64.snap

Could you please give this another try?

Changed in systemd (Ubuntu Bionic):
assignee: nobody → Lukas Märdian (slyon)
Revision history for this message
Michael Vogt (mvo) wrote :

Thanks Lukas for providing this revert of f0831ed2a03fcef582660be1c3b1a9f3e267e656. Using https://people.ubuntu.com/~slyon/uc18/core18_20211105_amd64.snap I can no longer reproduce the issue.

To make it easier to reproduce what I did:
"""
$ git status
On branch tests-use-core18-from-gce
Your branch is up to date with 'sergiocazzolato/tests-use-core18-from-gce'.
...
$ git diff
diff --git a/tests/lib/prepare.sh b/tests/lib/prepare.sh
index e6b984c4d0..5c58a55b2f 100755
--- a/tests/lib/prepare.sh
+++ b/tests/lib/prepare.sh
@@ -973,7 +973,7 @@ EOF
     fi

     if os.query is-core18; then
- curl -s -o core18.snap https://storage.googleapis.com/snapd-spread-tests/snaps/core18_20211102_amd64.snap
+ curl -s -o core18.snap https://people.ubuntu.com/~slyon/uc18/core18_20211105_amd64.snap
         EXTRA_FUNDAMENTAL="$EXTRA_FUNDAMENTAL --snap $PWD/core18.snap"
     fi

$ $GOPATH/bin/spread -repeat 100 google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy
...
2021-11-08 11:39:41 Preparing google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081028-580805)...
2021-11-08 11:39:53 Executing google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081028-580805) (1/1)...
2021-11-08 11:40:49 Restoring google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081028-580805)...
2021-11-08 11:41:00 Preparing google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081028-580805)...
...

"""

To validate my findings I'm also running the same test against core18 in stable and it fails very quickly:
"""
$ git status
HEAD detached at upstream/master
$ git describe
2.53.1-480-g2c39794030
$ $GOPATH/bin/spread -repeat 100 google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy
....
2021-11-08 11:58:46 Error executing google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081048-716592) :
-----
+ . /home/gopath/src/github.com/snapcore/snapd/tests/lib/disabled-svcs.sh
++ SVC_MISSING_ERR_MSG='state.json is missing last-active-disabled-services in it:'
+ echo 'CASE 1'
CASE 1
+ echo 'Install the snap'
Install the snap
+ snap install --dangerous disabled-svcs-kept_1.0_all.snap
disabled-svcs-kept 1.0 installed
+ echo 'Check that state.json doesn'\''t contain last-active-disabled-services'
Check that state.json doesn't contain last-active-disabled-services
+ check_state_json_no_disabled_svcs
+ /home/gopath/src/github.com/snapcore/snapd/tests/lib/tools/snapd-state check-state '.data.snaps."disabled-svcs-kept" | ."last-active-disabled-services"?' = null 'state.json has invalid last-active-disabled-services in it:'
+ echo 'CASE 2'
CASE 2
+ echo 'Disable a service in the snap'
Disable a service in the snap
+ snap stop --disable disabled-svcs-kept.svc
Stopped.
+ echo 'Check that it was actually disabled'
Check that it was actually disabled
+ retry -n 10 --wait 1 sh -c 'snap services disabled-svcs-kept | MATCH "disabled-svcs-kept\\.svc\\s+disabled\\s+inactive"'
+ echo 'Check that state.json still doesn'\''t contain last-active-disabled-services'
Check that state.json still doesn't contain last-active-disabled-services
+ check_state_json_no_disabled_svcs
+ /home/gopath/src/gith...


Revision history for this message
Michael Vogt (mvo) wrote :

Just for the record, I also did another run with: https://storage.googleapis.com/snapd-spread-tests/snaps/core18_20211102_amd64.snap and that also fails right away:

"""
$ git status
On branch tests-use-core18-from-gce
Your branch is up to date with 'sergiocazzolato/tests-use-core18-from-gce'.
$ git diff
[empty]
$ $GOPATH/bin/spread -repeat 100 google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy
...
2021-11-08 12:13:57 Error executing google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy (nov081101-381415) :
-----
+ . /home/gopath/src/github.com/snapcore/snapd/tests/lib/disabled-svcs.sh
++ SVC_MISSING_ERR_MSG='state.json is missing last-active-disabled-services in it:'
+ echo 'CASE 1'
CASE 1
+ echo 'Install the snap'
Install the snap
+ snap install --dangerous disabled-svcs-kept_1.0_all.snap
disabled-svcs-kept 1.0 installed
+ echo 'Check that state.json doesn'\''t contain last-active-disabled-services'
Check that state.json doesn't contain last-active-disabled-services
+ check_state_json_no_disabled_svcs
+ /home/gopath/src/github.com/snapcore/snapd/tests/lib/tools/snapd-state check-state '.data.snaps."disabled-svcs-kept" | ."last-active-disabled-services"?' = null 'state.json has invalid last-active-disabled-services in it:'
+ echo 'CASE 2'
CASE 2
+ echo 'Disable a service in the snap'
Disable a service in the snap
+ snap stop --disable disabled-svcs-kept.svc
Stopped.
+ echo 'Check that it was actually disabled'
Check that it was actually disabled
+ retry -n 10 --wait 1 sh -c 'snap services disabled-svcs-kept | MATCH "disabled-svcs-kept\\.svc\\s+disabled\\s+inactive"'
+ echo 'Check that state.json still doesn'\''t contain last-active-disabled-services'
Check that state.json still doesn't contain last-active-disabled-services
+ check_state_json_no_disabled_svcs
+ /home/gopath/src/github.com/snapcore/snapd/tests/lib/tools/snapd-state check-state '.data.snaps."disabled-svcs-kept" | ."last-active-disabled-services"?' = null 'state.json has invalid last-active-disabled-services in it:'
+ echo 'Disable the whole snap'
Disable the whole snap
+ snap disable disabled-svcs-kept
disabled-svcs-kept disabled
+ echo 'Check that state.json DOES contain last-active-disabled-services'
Check that state.json DOES contain last-active-disabled-services
+ check_state_json_yes_disabled_svcs
+ /home/gopath/src/github.com/snapcore/snapd/tests/lib/tools/snapd-state check-state '.data.snaps."disabled-svcs-kept" | ."last-active-disabled-services"?' '!=' null 'state.json has invalid last-active-disabled-services in it:'
+ echo 'Enable the whole snap'
Enable the whole snap
+ snap enable disabled-svcs-kept
disabled-svcs-kept enabled
+ echo 'Check that the service is still disabled'
Check that the service is still disabled
+ MATCH 'disabled-svcs-kept\.svc\s+disabled\s+inactive'
+ snap services disabled-svcs-kept
+ echo 'Check that state.json still doesn'\''t contain last-active-disabled-services'
Check that state.json still doesn't contain last-active-disabled-services
+ check_state_json_no_disabled_svcs
+ /home/gopath/src/github.com/snapcore/snapd/tests/lib/tools/snapd-state check-state '.data.snaps."disabled-svcs-kept" | ."last-active-disabled-s...

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Thank you Michael for verifying this! As per my earlier question during last week's sync: do you think it would be possible to have this running, for instance, as part of the coreXX-edge testing that Sergio is doing? We'd like to make sure we don't reintroduce this issue at some point by accident.

Revision history for this message
Lukas Märdian (slyon) wrote :

Thank you for testing the revert!
Do we know if the same issue happens in Focal/UC20 as well? Or can we somehow falsify that? The regressing patches had been introduced to Hirsute/Focal/Bionic to fix LP: #1934147

Revision history for this message
Michael Vogt (mvo) wrote :

Fwiw, the patch debian/lp1934147/0001-core-add-a-new-unit-method-catchup.patch is part of a fairly large patchset (https://github.com/systemd/systemd/pull/9200/commits), so maybe something is missing there?

Revision history for this message
Lukas Märdian (slyon) wrote :

That's certainly possible. In the first attempt (that failed according to mvo's tests) I only backported

https://github.com/systemd/systemd-stable/commit/04eb582acc203eab0bc5c2cc5e13986f16e09df0
and
https://github.com/systemd/systemd-stable/commit/cc6271f17d3c5c200e8a391f10d0afb71403fc28 (as a dependency)

It is quite possible that we need to backport even more commits to fix this without reverting LP: #1934147

Revision history for this message
Michael Vogt (mvo) wrote :

Regarding the question about Focal/Hirsute/UC20: we have not observed the issue there.

Lukas Märdian (slyon)
no longer affects: systemd (Ubuntu Jammy)
Revision history for this message
Alberto Mardegan (mardy) wrote :

I'm coming with bad news. Today Michael and I each retested both snaps you provided (20211102 and 20211105), and the bug happened with both of them. It's possible that yesterday Michael accidentally used QEMU for his tests without realizing it (and in QEMU this bug seems not to be reproducible).

Here are the spread logs (plus "journalctl -xe") obtained with the 20211105 snap: https://paste.ubuntu.com/p/CnnyXNC2Tt/

Revision history for this message
Dan Streetman (ddstreet) wrote :

> and in QEMU it seems that this bug is not reproducible

can you provide steps on how this is reproducible then?

Revision history for this message
Alberto Mardegan (mardy) wrote :

> can you provide steps on how this is reproducible then?

So far we have not been able to isolate the problem, so the only way to reproduce it is to run the spread tests from the snapd directory (on the master branch):

    spread -debug -repeat 20 google:ubuntu-core-18-64:tests/main/services-disabled-kept-happy

Revision history for this message
Alberto Mardegan (mardy) wrote :

An update to the comment above. We have found a sequence of commands that seems to reproduce the issue reliably: in the spread session started by the command from the comment above, type this:

    systemctl stop 'snap-disabled\x2dsvcs\x2dkept-x1.mount'
    systemctl daemon-reload

    # The two commands above are just to ensure that the unit is in the unmounted state
    systemctl start 'snap-disabled\x2dsvcs\x2dkept-x1.mount'
    systemctl daemon-reload
    systemctl stop 'snap-disabled\x2dsvcs\x2dkept-x1.mount'
    systemctl start 'snap-disabled\x2dsvcs\x2dkept-x1.mount'

The last command will hang for 90 seconds, after which the job will fail and `journalctl -xe` will show that systemd timed out while activating a loop device. In reality, the loop device is already there and available, but somehow the `daemon-reload` operation broke systemd's internal state.

If you remove the `systemctl daemon-reload` line (or, conversely, add one after the `stop` command as well), then everything proceeds normally. I'm EOD now, but tomorrow I'll verify whether this happens on classic too.
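As an aside, the `\x2d` sequences in the unit names above come from systemd's path escaping (what `systemd-escape --path` produces): a literal `-` inside a path component is escaped as `\x2d`, and the `/` separators become `-`. A minimal sketch of that rule for simple paths like these (`escape_mount_path` is a hypothetical helper for illustration, not a real tool; the real `systemd-escape` handles more cases, e.g. leading dots and non-ASCII bytes):

```shell
# Simplified systemd path escaping for mount unit names:
# drop the leading '/', escape '-' as \x2d, then map '/' to '-'.
escape_mount_path() {
    echo "$1" | sed -e 's:^/::' -e 's:-:\\x2d:g' -e 's:/:-:g'
}

escape_mount_path /snap/hello-world/29
# → snap-hello\x2dworld-29 (append ".mount" to get the unit name)
```

This is why the mount point /snap/disabled-svcs-kept/x1 corresponds to the unit `snap-disabled\x2dsvcs\x2dkept-x1.mount`.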

Revision history for this message
Alberto Mardegan (mardy) wrote :
Revision history for this message
Alberto Mardegan (mardy) wrote (last edit ):

One more update: I verified that the steps from comment #17 can be used to reproduce the issue in an Ubuntu Core image started in QEMU, without using spread. I verified it with the 20211105 image provided by Lukas at https://people.ubuntu.com/~slyon/uc18/ and also with the default amd64 core image from https://cdimage.ubuntu.com/ubuntu-core/18/stable/current/

To reproduce it, start the image in QEMU:

    qemu-system-x86_64 -enable-kvm -smp 2 -m 1500 \
        -netdev user,id=mynet0,hostfwd=tcp::8022-:22,hostfwd=tcp::8090-:80 \
        -device virtio-net-pci,netdev=mynet0 \
        -drive file=~/Downloads/uc18-20211105+systemd+3G.img,format=raw \
        -drive file=/usr/share/OVMF/OVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on

configure it as usual, and install a snap. Then:

    sudo snap install hello-world
    sudo systemctl start 'snap-hello\x2dworld-29.mount'
    sudo systemctl daemon-reload
    sudo systemctl stop 'snap-hello\x2dworld-29.mount'
    sudo systemctl start 'snap-hello\x2dworld-29.mount'

The last command will hang for 90 seconds, after which the logs (`sudo journalctl -xe`) will say:

```
Nov 15 06:10:20 localhost systemd[1]: dev-loop6.device: Job dev-loop6.device/start timed out.
Nov 15 06:10:20 localhost systemd[1]: Timed out waiting for device /dev/loop6.
-- Subject: Unit dev-loop6.device has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit dev-loop6.device has failed.
--
-- The result is RESULT.
Nov 15 06:10:20 localhost systemd[1]: Dependency failed for Mount unit for hello-world, revision 29.
-- Subject: Unit snap-hello\x2dworld-29.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit snap-hello\x2dworld-29.mount has failed.
--
-- The result is RESULT.
Nov 15 06:10:20 localhost systemd[1]: snap-hello\x2dworld-29.mount: Job snap-hello\x2dworld-29.mount/start failed with result 'dependency'.
Nov 15 06:10:20 localhost systemd[1]: dev-loop6.device: Job dev-loop6.device/start failed with result 'timeout'.
```

Revision history for this message
Lukas Märdian (slyon) wrote :

Thank you, having a reliable reproducer is very helpful! I can confirm this reproduces locally for me on QEMU.

I prepared a new systemd package [0], containing backports from the above-mentioned PR, which indeed sounds very related (I wonder why the issue didn't happen before, though, as this is unrelated to any recent SRU). Using this package, I prepared a new '20211115' core18 base snap [1] and a corresponding UC18 image [2] for further testing.

With this systemd version the reproducer passes for me with no problem. So I guess it's worth giving this another try with the full spread test.

[0] https://launchpad.net/~slyon/+archive/ubuntu/lp1949089
[1] https://people.ubuntu.com/~slyon/uc18/core18_20211115_amd64.snap
[2] https://people.ubuntu.com/~slyon/uc18/uc18-20211115+systemd+3G.img.xz

Revision history for this message
Alberto Mardegan (mardy) wrote :

Thanks Lukas! It's very likely that there are several issues in this area, because while I can confirm that your latest image fixes the reproducer I pasted before (in the Google cloud, too), the spread test failure is still reproducible with the "tests/main/services-disabled-kept-happy" test.

I need to have a closer look at this test case, to understand what exact steps snapd is taking, and see if we can again find a reproducer without involving snapd.

Revision history for this message
Dan Streetman (ddstreet) wrote :

@slyon you might want to see if commit 918e6f1c0151429f5095355f4f3f74f16e79724a fixes this

Revision history for this message
Michael Vogt (mvo) wrote :

I spent a bit of quality time with this bug today and it seems there is (also?) a kernel dimension to it. I ran the following:
"""
$ git status
On branch tests-use-core18-from-gce
Your branch is up to date with 'sergiocazzolato/tests-use-core18-from-gce'.
$ git diff
diff --git a/overlord/devicestate/devicestate.go b/overlord/devicestate/devicestate.go
index d40b32bcbd..77ec3fc0a8 100644
--- a/overlord/devicestate/devicestate.go
+++ b/overlord/devicestate/devicestate.go
@@ -117,6 +117,8 @@ func canAutoRefresh(st *state.State) (bool, error) {
        if !seeded {
                return false, nil
        }
+ // HACK
+ return false, nil

        // Try to ensure we have an accurate time before doing any
        // refreshy stuff. Note that this call will not block.
diff --git a/spread.yaml b/spread.yaml
index 0aef5dea7c..68cbdc09b7 100644
--- a/spread.yaml
+++ b/spread.yaml
@@ -37,7 +37,7 @@ environment:
     MANAGED_DEVICE: "false"
     # a global setting for LXD channel to use in the tests
     LXD_SNAP_CHANNEL: "latest/edge"
- UBUNTU_IMAGE_SNAP_CHANNEL: "latest/candidate"
+ UBUNTU_IMAGE_SNAP_CHANNEL: "beta/1.11"
     CORE_CHANNEL: '$(HOST: echo "${SPREAD_CORE_CHANNEL:-edge}")'
     BASE_CHANNEL: '$(HOST: echo "${SPREAD_BASE_CHANNEL:-edge}")'
     KERNEL_CHANNEL: '$(HOST: echo "${SPREAD_KERNEL_CHANNEL:-edge}")'
diff --git a/tests/lib/prepare.sh b/tests/lib/prepare.sh
index e6b984c4d0..215d0611be 100755
--- a/tests/lib/prepare.sh
+++ b/tests/lib/prepare.sh
@@ -973,8 +973,8 @@ EOF
     fi

     if os.query is-core18; then
- curl -s -o core18.snap https://storage.googleapis.com/snapd-spread-tests/snaps/core18_20211102_amd64.snap
- EXTRA_FUNDAMENTAL="$EXTRA_FUNDAMENTAL --snap $PWD/core18.snap"
+ curl --insecure -o pc-kernel.snap https://people.canonical.com/~mvo/tmp/pYVQrBcKmBa0mZ4CCN7ExT6jH8rY1hza_831.snap
+ EXTRA_FUNDAMENTAL="$EXTRA_FUNDAMENTAL --snap $PWD/pc-kernel.snap"
     fi

     /snap/bin/ubuntu-image snap \
"""
which essentially moves the image to a 4.14 kernel that is two months older. With that I could not reproduce the error over three subsequent runs of 2 repeats each.

However, when I disabled this code and used the current "18/edge" pc-kernel (or stable, it's the same revision), I could trigger the bug right away.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> @slyon you might want to see if commit 918e6f1c0151429f5095355f4f3f74f16e79724a fixes this

(and there are a couple follow on commits to this)

Revision history for this message
Lukas Märdian (slyon) wrote :

Thanks ddstreet, I agree 918e6f1c0151429f5095355f4f3f74f16e79724a could be related as well, but unfortunately that commit produced quite a lot of fallout.

Nevertheless, I've prepared a new systemd build in my PPA and the "core18_20211118_amd64.snap" at https://people.ubuntu.com/~slyon/uc18/ cherry-picking the changes from

https://github.com/systemd/systemd/pull/8675 (incl. 918e6f1c)

As I said, there are quite a few other PRs related to this that clean up after it:
https://github.com/systemd/systemd/pull/8798
https://github.com/systemd/systemd/pull/9200
https://github.com/systemd/systemd/pull/9229
https://github.com/systemd/systemd/pull/9366

In total that should be approx. 1000 changed LOC... so I'm not sure how suitable this is for an SRU. But let's first give this newest build a try and check whether the initial commit/PR changes anything with respect to our core problem here; if so, we can go into the details of those backports (and hopefully skip many of those commits).

Revision history for this message
Dan Streetman (ddstreet) wrote :

Note one interesting point: on a core18 system, a squashfs mount (any loop-based mount, really) includes a dependency from the mount to its loop device, while a non-core system does not:

ddstreet@localhost:~$ cat /etc/issue
Ubuntu Core 18 on \4 (\l)

ddstreet@localhost:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● ├─dev-loop3.device
● ├─snap.mount
● ├─system.slice
● └─var-lib-snapd.mount
ddstreet@localhost:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─dev-loop3.device
● └─multi-user.target
● └─graphical.target

ubuntu@test-b-full:~$ cat /etc/issue
Ubuntu 18.04.6 LTS \n \l

ubuntu@test-b-full:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● └─system.slice
ubuntu@test-b-full:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● └─multi-user.target
● └─graphical.target

Revision history for this message
Dan Streetman (ddstreet) wrote :

Sorry, to clarify: the dependency is added to the mount unit over a daemon-reload, but only on core:

ddstreet@localhost:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● ├─snap.mount
● ├─system.slice
● └─var-lib-snapd.mount
ddstreet@localhost:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● └─multi-user.target
● └─graphical.target
ddstreet@localhost:~$ sudo systemctl daemon-reload
ddstreet@localhost:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● ├─dev-loop2.device
● ├─snap.mount
● ├─system.slice
● └─var-lib-snapd.mount
ddstreet@localhost:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─dev-loop2.device
● └─multi-user.target
● └─graphical.target

ubuntu@test-b-full:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● └─system.slice
ubuntu@test-b-full:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● └─multi-user.target
● └─graphical.target
ubuntu@test-b-full:~$ sudo systemctl daemon-reload
ubuntu@test-b-full:~$ systemctl list-dependencies snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● ├─-.mount
● └─system.slice
ubuntu@test-b-full:~$ systemctl list-dependencies --reverse snap-hello\\x2dworld-29.mount
snap-hello\x2dworld-29.mount
● └─multi-user.target
● └─graphical.target

Revision history for this message
Dan Streetman (ddstreet) wrote :

This added BindsTo is what's causing the problem:

ddstreet@localhost:~$ systemctl show -p BindsTo snap-hello\\x2dworld-29.mount
BindsTo=
ddstreet@localhost:~$ sudo systemctl daemon-reload
ddstreet@localhost:~$ systemctl show -p BindsTo snap-hello\\x2dworld-29.mount
BindsTo=dev-loop6.device

ubuntu@test-b-full:~$ systemctl show -p BindsTo snap-hello\\x2dworld-29.mount
BindsTo=
ubuntu@test-b-full:~$ sudo systemctl daemon-reload
ubuntu@test-b-full:~$ systemctl show -p BindsTo snap-hello\\x2dworld-29.mount
BindsTo=

Once the mount unit BindsTo= the loop device, systemd won't start the mount unit (which actually runs 'mount') until the loop device is 'started' but of course the loop device won't ever be started until 'mount' is called.

Revision history for this message
Michael Vogt (mvo) wrote :

@Dan Thanks, this find about BindsTo is super interesting! It's not something that snapd writes out, so it's curious where it comes from.

Revision history for this message
Michael Vogt (mvo) wrote :

I spent a bit more time bisecting the various core18 snaps to see at what revision the failure started to become reproducible. Interestingly, I found that it started with r2208, AFAICT. It's all a bit annoying because it's a race, so I don't trust it 100%, but the previous r2206 got 2 good runs in a row while r2208 failed twice in a row. The only change there is cloud-init (https://people.canonical.com/~mvo/core18-changes/html/edge/'20210928'r2206_'20210930'r2208.html), which might explain why we see the failure in our integration tests in GCE but not inside QEMU. Unfortunately the cloud-init diff is pretty big.

For reference, here is what I ran:
"""
$ git status
On branch tests-use-core18-from-gce
Your branch is up to date with 'sergiocazzolato/tests-use-core18-from-gce'.
$ git describe
2.53.1-460-gd171d434ed
diff --git a/overlord/devicestate/devicestate.go b/overlord/devicestate/devicestate.go
index d40b32bcbd..77ec3fc0a8 100644
--- a/overlord/devicestate/devicestate.go
+++ b/overlord/devicestate/devicestate.go
@@ -117,6 +117,8 @@ func canAutoRefresh(st *state.State) (bool, error) {
        if !seeded {
                return false, nil
        }
+ // HACK
+ return false, nil

        // Try to ensure we have an accurate time before doing any
        // refreshy stuff. Note that this call will not block.
diff --git a/spread.yaml b/spread.yaml
index 0aef5dea7c..68cbdc09b7 100644
--- a/spread.yaml
+++ b/spread.yaml
@@ -37,7 +37,7 @@ environment:
     MANAGED_DEVICE: "false"
     # a global setting for LXD channel to use in the tests
     LXD_SNAP_CHANNEL: "latest/edge"
- UBUNTU_IMAGE_SNAP_CHANNEL: "latest/candidate"
+ UBUNTU_IMAGE_SNAP_CHANNEL: "beta/1.11"
     CORE_CHANNEL: '$(HOST: echo "${SPREAD_CORE_CHANNEL:-edge}")'
     BASE_CHANNEL: '$(HOST: echo "${SPREAD_BASE_CHANNEL:-edge}")'
     KERNEL_CHANNEL: '$(HOST: echo "${SPREAD_KERNEL_CHANNEL:-edge}")'
diff --git a/tests/lib/prepare.sh b/tests/lib/prepare.sh
index e6b984c4d0..14b9608fb9 100755
--- a/tests/lib/prepare.sh
+++ b/tests/lib/prepare.sh
@@ -973,7 +973,26 @@ EOF
     fi

     if os.query is-core18; then
- curl -s -o core18.snap https://storage.googleapis.com/snapd-spread-tests/snaps/core18_20211102_amd64.snap
+ # GOOD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2081.snap
+ # GOOD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2128.snap
+ # BAD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2239.snap
+ # GOOD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2141.snap
+ # GOOD (3 good runs in a row)
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2191.snap
+ # BAD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_2213.snap
+ # BAD
+ #curl --insecure -o core18.snap https://people.canonical.com/~mvo/tmp/CSO04Jhav2yK0uz...


Revision history for this message
Michael Vogt (mvo) wrote :
Revision history for this message
Michael Vogt (mvo) wrote :
Revision history for this message
Michael Vogt (mvo) wrote :

I attached the /run/cloud-init and /var/log/cloud-init logs of the good and bad runs. I looked over the diff via "diff -u <(cut -f 4- -d: /tmp/cloud-init-good-core18-r2206/cloud-init.log) <(cut -f 4- -d: /tmp/cloud-init-bad-core18-r2208/cloud-init.log)" but couldn't see anything standing out there (but I'm really not a cloud-init guy).

Alberto mentioned that latest cloud-init adds "hotplug" (https://github.com/canonical/cloud-init/pull/936) which might be related.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> It's not something that snapd writes out, it's curious where this comes from

it's not written out, it's internally handled/added by pid1, and it *should* be there, because you do want (for example) the mount to be stopped if its backing device is removed. However, when the mount is stopped, the BindsTo should be removed (internally, by pid1). So there's some problem with non-core bionic systemd that isn't correctly adding the relation at all, and some problem with core bionic systemd that isn't properly removing the relation when the mount is stopped.

Revision history for this message
Michael Vogt (mvo) wrote :

Fwiw, it looks like the latest core18 edge build, which includes https://launchpad.net/ubuntu/+source/cloud-init/21.4-0ubuntu1~18.04.1, makes the problem go away. It seems https://bugs.launchpad.net/cloud-init/+bug/1946003 is what caused it.

The old code had a udev rule that called into the cloud-init Python code for each "net|block" device that got added; this spiked the CPU when a block device was added, which in turn made the (apparently pre-existing) race much more likely to appear.

Given what Dan said above, something deeper still seems to be lurking here that probably needs fixing (though it will be harder to trigger now).
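For context, the kind of udev rule described above is roughly of this shape (an illustrative reconstruction, not the exact rule file cloud-init shipped; the hook path is an assumption):

```
# Illustrative udev rule: every add/remove of a net or block device
# spawns a cloud-init hook process, which on a small device is enough
# CPU load to widen the race window around daemon-reload.
SUBSYSTEM=="net|block", ACTION=="add|remove", RUN+="/usr/lib/cloud-init/hook-hotplug"
```

Since snap installation creates a loop block device, every mount in the spread tests would trigger this hook at exactly the moment systemd was juggling the mount unit's device dependency.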
