2021-09-13 07:58:55 |
Trent Lloyd |
description |
"snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck.
A host reboot caused it to finish refreshing and upgrade to the new version before we got back into the host - rapidly within a few minutes. So it appears likely it was truly stuck in some way (rather than being due to a large amount of data or similar).
I could not find any evidence that it was still actually copying data as both lsof and strace did not seem to show activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are various binaries and libraries being used by the [lxc monitor] processes that are left running for the active container even when the lxd itself is stopped (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*). Nothing was open in the new version path (/snap/lxd/21032).
There is also no obvious logging from the syslog, snapd logs, etc.. that indicate anything about why it was stuck as best I can tell from a fairly comprehensive review. There is minimal logging at all from snapd about it's activity really. The host in general was working fine, no hardware or disk/filesystem issues, kernel errors, etc.
This issue or something similar seems to have been reported at least twice before that I could find. Once with LXD and once with the core snap:
https://bugs.launchpad.net/snappy/+bug/1843589
https://github.com/lxc/lxd/issues/6098
The following debug information was generated and available but do not wish to post it all publicly to the bug since it's from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly.
(1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default)
(2) strace of snapd for a small period of time
(3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump
(4) some extra info collected by juju-crashdump
Application: Canoncial Anbox Cloud (https://anbox-cloud.io/) - LXD Cluster node
Ubuntu: 18.04.5
Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu
snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton)
[Log Excerpts]
root@ip-172-31-39-228:~# snap changes
ID Status Spawn Ready Summary
28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd"
root@ip-172-31-39-228:~# snap tasks 28
Status Spawn Ready Summary
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable"
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd"
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable
Doing yesterday at 20:26 UTC - Copy snap "lxd" data
Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles
Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system
Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd"
Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd"
Do yesterday at 20:26 UTC - Setup snap "lxd" aliases
Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services
Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install
Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Run health check of "lxd" snap
Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd"
[relevant /var/log/syslog]
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd"
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon...
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities.
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD
Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly
Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon.
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. |
Automatic "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck.
A host reboot caused it to finish refreshing and upgrade to the new version before we got back into the host - rapidly within a few minutes. So it appears likely it was truly stuck in some way (rather than being due to a large amount of data or similar).
I could not find any evidence that it was still actually copying data as both lsof and strace did not seem to show activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are various binaries and libraries being used by the [lxc monitor] processes that are left running for the active container even when the lxd itself is stopped (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*). Nothing was open in the new version path (/snap/lxd/21032).
There is also no obvious logging from the syslog, snapd logs, etc.. that indicate anything about why it was stuck as best I can tell from a fairly comprehensive review. There is minimal logging at all from snapd about it's activity really. The host in general was working fine, no hardware or disk/filesystem issues, kernel errors, etc.
This issue or something similar seems to have been reported at least twice before that I could find. Once with LXD and once with the core snap:
https://bugs.launchpad.net/snappy/+bug/1843589
https://github.com/lxc/lxd/issues/6098
The following debug information was generated and available but do not wish to post it all publicly to the bug since it's from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly.
(1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default)
(2) strace of snapd for a small period of time
(3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump
(4) some extra info collected by juju-crashdump
Application: Canoncial Anbox Cloud (https://anbox-cloud.io/) - LXD Cluster node
Ubuntu: 18.04.5
Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu
snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton)
[Log Excerpts]
root@ip-172-31-39-228:~# snap changes
ID Status Spawn Ready Summary
28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd"
root@ip-172-31-39-228:~# snap tasks 28
Status Spawn Ready Summary
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable"
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd"
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable
Doing yesterday at 20:26 UTC - Copy snap "lxd" data
Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles
Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system
Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd"
Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd"
Do yesterday at 20:26 UTC - Setup snap "lxd" aliases
Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services
Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install
Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Run health check of "lxd" snap
Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd"
[relevant /var/log/syslog]
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd"
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon...
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities.
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD
Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly
Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon.
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading. |
|