LXD snap auto refresh stuck at copy snap data phase for 16+ hours without making progress
Affects | Status | Importance | Assigned to | Milestone
snapd | Triaged | High | Unassigned |
Bug Description
Automatic "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck.
A host reboot caused the refresh to finish and upgrade to the new version within a few minutes, before we even got back into the host. So it appears it was genuinely stuck in some way (rather than working through a large amount of data or similar).
I could not find any evidence that it was still actually copying data, as neither lsof nor strace showed activity in the relevant directories. lsof from the sosreport shows that the only files open under /snap/lxd/20330/ are various binaries and libraries used by the [lxc monitor] processes that are left running for the active container even when lxd itself is stopped (/snap/
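For anyone wanting to repeat the same checks, something along these lines should show whether a copy is actually in flight (revision as above; the snapd PID is a placeholder):
  lsof +D /snap/lxd/20330/                      # open files under the mounted snap
  lsof +D /var/snap/lxd/                        # open files under the snap's data directories
  pgrep -a -P <snapd-pid>                       # is snapd currently running a child "cp"?
  strace -f -p <snapd-pid> -e trace=read,write,openat,renameat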
There is also no obvious logging in syslog, the snapd logs, etc. that indicates anything about why it was stuck, as best I can tell from a fairly comprehensive review. There is minimal logging from snapd about its activity in general. The host itself was otherwise working fine: no hardware or disk/filesystem issues, kernel errors, etc.
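In case it is useful for triage, the additional output snapd itself can provide (command availability hedged; it may depend on the snapd version) would be along these lines:
  snap change 28                   # per-task status of the stuck change
  snap debug timings 28            # per-task timing breakdown, if supported by this snapd
  journalctl -u snapd --no-pager   # snapd's own log output
and, for a future occurrence, more verbose logging can apparently be enabled by setting SNAPD_DEBUG=1 in snapd's environment (e.g. "systemctl edit snapd" adding Environment=SNAPD_DEBUG=1 under [Service]) and restarting snapd, though restarting snapd would of course disturb the stuck state.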
This issue, or something similar, appears to have been reported at least twice before that I could find: once with LXD and once with the core snap:
https:/
https:/
The following debug information was collected and is available, but I do not wish to post it all publicly on the bug since it is from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly.
(1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default)
(2) strace of snapd for a small period of time
(3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump (see the inspection sketch after this list)
(4) some extra info collected by juju-crashdump
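For reference, the goroutine stacks can be extracted from the core in (3) with standard tooling, roughly along these lines (the snapd binary path and core file name are illustrative; delve is only needed if gdb's Go support is insufficient):
  gcore -o snapd.core <snapd-pid>
  gdb /usr/lib/snapd/snapd snapd.core.<snapd-pid> -ex "thread apply all bt" -ex quit
  dlv core /usr/lib/snapd/snapd snapd.core.<snapd-pid>    # alternative, with delve installed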
Application: Canonical Anbox Cloud (https:/
Ubuntu: 18.04.5
Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu
snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton)
[Log Excerpts]
root@ip-
ID Status Spawn Ready Summary
28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd"
root@ip-
Status Spawn Ready Summary
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable"
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd"
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable
Doing yesterday at 20:26 UTC - Copy snap "lxd" data
Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles
Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system
Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd"
Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd"
Do yesterday at 20:26 UTC - Setup snap "lxd" aliases
Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services
Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install
Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Run health check of "lxd" snap
Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd"
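(For completeness, and untested here: short of rebooting, a stuck change can in principle be abandoned with "snap abort" and the refresh retried, although if the task handler itself is hung the abort may never take effect either.)
  snap abort 28
  snap changes          # confirm the change has left the Doing state
  snap refresh lxd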
[relevant /var/log/syslog]
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon...
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities.
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD
Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly
Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon.
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
tags: added: sts
summary: changed from "LXD snap refresh stuck at copy snap data phase for 16+ hours without making progress" to "LXD snap auto refresh stuck at copy snap data phase for 16+ hours without making progress"
description: updated
Changed in snappy: status: New → Triaged; importance: Undecided → High
Changed in snapd: status: New → Triaged; importance: Undecided → High
no longer affects: snappy
I looked at snapd's data copying code but don't see anything obvious - for the snap refresh case it's actually rather simple and executes "cp -av .." followed by "sync" for the snap's data dirs (/var/snap/lxd/<revision>, /home/root/snap/lxd/<revision>, /home/<user>/snap/lxd/<revision>; in the case of lxd only the first two have content on my system) to copy them to the new-revision directories. Any errors would appear either in the system log or as task errors (resulting in an overall Error of the task and the refresh).
I wonder if there is anything in the /var/snap/lxd/<revision> directory that would confuse "cp". Can you try running "cp -av /var/snap/lxd/<revision> /a/temporary/directory" manually (it may make sense to stop the lxd services first, leaving only the monitor running, to mimic the real refresh)? How much data do you have there?
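Concretely, something along these lines should cover both questions (the temporary target is illustrative; the explicit timing and sync are just to make a hang or a slow copy obvious):
  snap stop lxd                                    # stop lxd services; the monitor processes keep running
  du -sh /var/snap/lxd/<revision> /home/root/snap/lxd/<revision>
  mkdir /tmp/lxd-copy-test
  time cp -av /var/snap/lxd/<revision> /tmp/lxd-copy-test/
  time sync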