multiple problems with undo for 'snap remove'
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
snapd | In Progress | High | Paweł Stołowski | | |
Bug Description
Undo for 'snap remove' is not fully implemented and leads to an inconsistent/broken snap that can be neither removed nor refreshed. This becomes an issue if a snap fails to remove, e.g. due to a problem with the removal of its data. It is easy to reproduce with the lxd snap:
snap install lxd
snap stop lxd
snap start lxd
snap refresh --edge lxd
snap remove lxd --purge
The last step fails:
error: cannot perform the following tasks:
- Stop snap "lxd" services ([is-enabled snap.lxd.
)
- Remove security profile for snap "lxd" (17605) (cannot find installed snap "lxd" at revision 17605: missing file /snap/lxd/
- Remove data for snap "lxd" (17597) (unlinkat /var/snap/
- Disconnect lxd:lxd-support from core:lxd-support (snap "lxd" has no "lxd-support" plug)
... (remaining plugs listed)
The failing change:
Status Spawn Ready Summary
Error today at 11:18 UTC today at 11:18 UTC Stop snap "lxd" services
Undone today at 11:18 UTC today at 11:18 UTC Run remove hook of "lxd" snap if present
Done today at 11:18 UTC today at 11:18 UTC Disconnect interfaces of snap "lxd"
Undone today at 11:18 UTC today at 11:18 UTC Remove aliases for snap "lxd"
Done today at 11:18 UTC today at 11:18 UTC Make snap "lxd" unavailable to the system
Error today at 11:18 UTC today at 11:18 UTC Remove security profile for snap "lxd" (17605)
Done today at 11:18 UTC today at 11:18 UTC Remove data for snap "lxd" (17605)
Done today at 11:18 UTC today at 11:18 UTC Remove snap "lxd" (17605) from the system
Error today at 11:18 UTC today at 11:18 UTC Remove data for snap "lxd" (17597)
Hold today at 11:18 UTC today at 11:18 UTC Remove snap "lxd" (17597) from the system
Error today at 11:18 UTC today at 11:18 UTC Disconnect lxd:lxd-support from core:lxd-support
Error today at 11:18 UTC today at 11:18 UTC Disconnect lxd:system-observe from core:system-observe
Error today at 11:18 UTC today at 11:18 UTC Disconnect lxd:network-bind from core:network-bind
Error today at 11:18 UTC today at 11:18 UTC Disconnect lxd:network from core:network
.......
Stop snap "lxd" services
2020-10-
2020-10-
.......
Remove security profile for snap "lxd" (17605)
2020-10-
.......
Remove data for snap "lxd" (17597)
2020-10-
.......
Disconnect lxd:lxd-support from core:lxd-support
2020-10-
.......
Disconnect lxd:system-observe from core:system-observe
2020-10-
.......
Disconnect lxd:network-bind from core:network-bind
2020-10-
.......
Disconnect lxd:network from core:network
2020-10-
I've identified the following fundamental problems with undo for remove:
1. The clear-snap data task (Remove data for snap "lxd"...) can fail if it cannot remove a file that belongs to the snap; in this case it fails on /var/snap/
2. The unlink-snap task (Make snap "lxd" unavailable to the system) doesn't have an undo handler, so the "current" symlink is not restored even when it could be (i.e. if the snap itself wasn't removed).
3. When we remove all the revisions on snap remove, we don't pay attention to the order, and afaict the current revision appears first. In the above example, 17605 was the current revision and it got successfully and completely removed; we failed later on removing the snap data of an inactive old revision 17597 (before removing the snap itself). This means that this revision becomes "current" in a sense, but the task's snap-setup doesn't reflect it, and existing undo handlers (such as undo for setup-profiles) don't expect it as we roll everything back; on the task we still remember the old (now completely gone) revision 17605.
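To illustrate problem 2, here is a minimal toy model of a change made of tasks with "do" and optional "undo" handlers. It is only a sketch and not snapd's actual task runner: the names `Task`, `run_change`, `unlink`, `relink` and the dictionary-based state are all hypothetical. It shows that when a later task fails, a task without an undo handler leaves its effect in place (here, the "current" pointer stays unset), whereas the same change with an undo handler restores it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict


@dataclass
class Task:
    summary: str
    do: Callable[[State], None]
    # None models a task that has no undo handler at all.
    undo: Optional[Callable[[State], None]] = None


def run_change(tasks, state):
    """Run tasks in order; on the first failure undo finished tasks in reverse."""
    done = []
    for task in tasks:
        try:
            task.do(state)
        except RuntimeError:
            for finished in reversed(done):
                if finished.undo is not None:
                    finished.undo(state)
            return False
        done.append(task)
    return True


def unlink(state):
    # Hypothetical stand-in for dropping the "current" symlink.
    state["previous"] = state["current"]
    state["current"] = None


def relink(state):
    # Hypothetical undo: point "current" back at the remembered revision.
    state["current"] = state.pop("previous")


def failing_clear_data(state):
    raise RuntimeError("cannot remove snap data")


# Without an undo handler the rollback cannot restore "current".
state_no_undo = {"current": 17605}
ok_no_undo = run_change(
    [Task("unlink", unlink), Task("clear data", failing_clear_data)],
    state_no_undo,
)

# With an undo handler the rollback restores "current".
state_with_undo = {"current": 17605}
ok_with_undo = run_change(
    [Task("unlink", unlink, undo=relink), Task("clear data", failing_clear_data)],
    state_with_undo,
)
```

In this model both changes fail, but only the second one ends with "current" pointing at a revision again.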
In general, undoing remove is tricky and not always possible, but we should strive to keep things consistent and not leave a snap in a state in which it is "broken" and nothing can be done with it, even if its snap data was already removed.
I think the following could be done to rectify:
- make clear-snap data robust and ignore errors when removing snap data.
- implement undo for unlink-snap, so if we fail to remove some revisions, we restore the "current" symlink properly. Perhaps set a 'broken' flag on the snap if we have already removed its snap data.
- reorder tasks for removing all revisions so that current revision is last. This should fix the 3rd problem.
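The first and third suggestions above can be sketched roughly as follows. This is an illustration under assumptions, not snapd code: `removal_order` and `remove_data_best_effort` are hypothetical helper names, and the demo operates on a temporary directory rather than on real snap data. The idea is that data removal records failures instead of aborting, and that the current revision is always queued last.

```python
import os
import tempfile


def removal_order(revisions, current):
    """Return revisions with the current one moved to the end (hypothetical helper)."""
    others = [rev for rev in revisions if rev != current]
    return others + ([current] if current in revisions else [])


def remove_data_best_effort(path):
    """Remove a directory tree, collecting errors instead of raising them."""
    failures = []
    # Walk bottom-up so that directories are empty by the time we reach them.
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            target = os.path.join(root, name)
            try:
                os.unlink(target)
            except OSError as err:
                failures.append((target, err))
        for name in dirs:
            target = os.path.join(root, name)
            try:
                # Symlinks to directories are listed in dirs but must be unlinked.
                if os.path.islink(target):
                    os.unlink(target)
                else:
                    os.rmdir(target)
            except OSError as err:
                failures.append((target, err))
    try:
        os.rmdir(path)
    except OSError as err:
        failures.append((path, err))
    return failures


# Demo on a throwaway directory (an assumption; not a real snap data path).
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "common", "nested"))
with open(os.path.join(demo_dir, "common", "nested", "file.txt"), "w") as fh:
    fh.write("data")

failures = remove_data_best_effort(demo_dir)
order = removal_order([17605, 17597], current=17605)
```

A caller could log the collected failures and still mark the task as done, so a leftover file does not leave the whole change in an error state.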
Changed in snapd: | |
assignee: | nobody → Paweł Stołowski (stolowski) |
importance: | Undecided → High |
Changed in snapd: | |
status: | New → In Progress |
2 out of 3 points mentioned above have PRs:
https://github.com/snapcore/snapd/pull/9511
https://github.com/snapcore/snapd/pull/9522