Activity log for bug #1943419

Date Who What changed Old value New value Message
2021-09-13 07:55:19 Trent Lloyd bug added bug
2021-09-13 07:55:29 Trent Lloyd tags sts
2021-09-13 07:58:25 Trent Lloyd summary LXD snap refresh stuck at copy snap data phase for 16+ hours without making progress LXD snap auto refresh stuck at copy snap data phase for 16+ hours without making progress
2021-09-13 07:58:55 Trent Lloyd description "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck. A host reboot caused it to finish refreshing and upgrade to the new version before we got back into the host - rapidly within a few minutes. So it appears likely it was truly stuck in some way (rather than being due to a large amount of data or similar). I could not find any evidence that it was still actually copying data as both lsof and strace did not seem to show activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are various binaries and libraries being used by the [lxc monitor] processes that are left running for the active container even when the lxd itself is stopped (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*). Nothing was open in the new version path (/snap/lxd/21032). There is also no obvious logging from the syslog, snapd logs, etc.. that indicate anything about why it was stuck as best I can tell from a fairly comprehensive review. There is minimal logging at all from snapd about it's activity really. The host in general was working fine, no hardware or disk/filesystem issues, kernel errors, etc. This issue or something similar seems to have been reported at least twice before that I could find. Once with LXD and once with the core snap: https://bugs.launchpad.net/snappy/+bug/1843589 https://github.com/lxc/lxd/issues/6098 The following debug information was generated and available but do not wish to post it all publicly to the bug since it's from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly. (1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default) (2) strace of snapd for a small period of time (3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump (4) some extra info collected by juju-crashdump Application: Canoncial Anbox Cloud (https://anbox-cloud.io/) - LXD Cluster node Ubuntu: 18.04.5 Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton) [Log Excerpts] root@ip-172-31-39-228:~# snap changes ID Status Spawn Ready Summary 28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd" root@ip-172-31-39-228:~# snap tasks 28 Status Spawn Ready Summary Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable" Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032) Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032) Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd" Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable Doing yesterday at 20:26 UTC - Copy snap "lxd" data Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd" Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd" Do yesterday at 20:26 UTC - Setup snap "lxd" aliases Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present Do yesterday at 20:26 UTC - Run health check of "lxd" snap Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd" [relevant /var/log/syslog] Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd" Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509 Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon. Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day. Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon... Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities. Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon. Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. Automatic "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck. A host reboot caused it to finish refreshing and upgrade to the new version before we got back into the host - rapidly within a few minutes. So it appears likely it was truly stuck in some way (rather than being due to a large amount of data or similar). I could not find any evidence that it was still actually copying data as both lsof and strace did not seem to show activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are various binaries and libraries being used by the [lxc monitor] processes that are left running for the active container even when the lxd itself is stopped (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*). Nothing was open in the new version path (/snap/lxd/21032). There is also no obvious logging from the syslog, snapd logs, etc.. that indicate anything about why it was stuck as best I can tell from a fairly comprehensive review. There is minimal logging at all from snapd about it's activity really. The host in general was working fine, no hardware or disk/filesystem issues, kernel errors, etc. This issue or something similar seems to have been reported at least twice before that I could find. Once with LXD and once with the core snap: https://bugs.launchpad.net/snappy/+bug/1843589 https://github.com/lxc/lxd/issues/6098 The following debug information was generated and available but do not wish to post it all publicly to the bug since it's from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly. (1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default) (2) strace of snapd for a small period of time (3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump (4) some extra info collected by juju-crashdump Application: Canoncial Anbox Cloud (https://anbox-cloud.io/) - LXD Cluster node Ubuntu: 18.04.5 Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton) [Log Excerpts] root@ip-172-31-39-228:~# snap changes ID Status Spawn Ready Summary 28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd" root@ip-172-31-39-228:~# snap tasks 28 Status Spawn Ready Summary Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable" Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032) Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032) Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd" Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable Doing yesterday at 20:26 UTC - Copy snap "lxd" data Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd" Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd" Do yesterday at 20:26 UTC - Setup snap "lxd" aliases Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present Do yesterday at 20:26 UTC - Run health check of "lxd" snap Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd" [relevant /var/log/syslog] Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd" Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509 Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032... Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032. Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon. Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day. Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon... Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities. Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon. Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
2021-09-13 08:59:37 Dominique Poulain bug added subscriber Dominique Poulain
2021-09-13 09:39:37 Paweł Stołowski snappy: status New Triaged
2021-09-13 09:39:41 Paweł Stołowski snappy: importance Undecided High
2021-09-13 11:09:40 Paweł Stołowski bug task added snapd
2021-09-13 11:09:46 Paweł Stołowski snapd: status New Triaged
2021-09-13 11:09:48 Paweł Stołowski snapd: importance Undecided High
2021-09-13 11:09:52 Paweł Stołowski bug task deleted snappy