Sporadic failures for Ubuntu EKS nodes joining clusters with certain AMI versions: Make current revision for snap "aws-cli" unavailable

Bug #2036848 reported by Wesley Yep
This bug affects 1 person
Affects: cloud-images
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We had some sporadic issues over the last 2-3 weeks with our EKS Ubuntu nodes failing to join our clusters.
We use Karpenter to manage nodes; the affected nodes fail to join and are eventually killed once they reach the 15-minute registration TTL.

It seems to have been affecting nodes built from the following AMIs:
- ami-0a07e041af3e47600 ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230903
- ami-064589d23768e0c2c ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230906

A new AMI built yesterday from the following AMI no longer seems to have any issues:
- ami-0384e492712cd1d70 ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230919

From the user-data.log output of the nodes that failed to join the cluster, I see the following errors:
```
Configuring kubelet snap
error: cannot perform the following tasks:
- Make current revision for snap "aws-cli" unavailable (snap "aws-cli" has running apps (aws), pids: 1132)
```

A healthy node that joins the cluster logs this:
```
Configuring kubelet snap
Starting k8s kubelet daemon
Started.
```
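To tell the two cases apart when triaging many nodes, something like the following could be run over each node's user-data.log. This is only a minimal sketch: the function name `classify_user_data_log` and the returned dictionary keys are illustrative assumptions, and the pattern is derived solely from the two log excerpts quoted above.

```python
import re

# Matches the snapd error quoted above, e.g.
#   snap "aws-cli" has running apps (aws), pids: 1132
# The group names are illustrative, not part of any snapd interface.
RUNNING_APPS = re.compile(
    r'snap "(?P<snap>[^"]+)" has running apps \((?P<apps>[^)]*)\), pids: (?P<pids>[\d, ]+)'
)


def classify_user_data_log(text):
    """Return ("failed", details), ("healthy", None) or ("unknown", None)."""
    match = RUNNING_APPS.search(text)
    if match:
        details = {
            "snap": match.group("snap"),
            "apps": [app.strip() for app in match.group("apps").split(",")],
            "pids": [int(pid) for pid in re.findall(r"\d+", match.group("pids"))],
        }
        return "failed", details
    # The healthy excerpt ends with the kubelet daemon being started.
    if "Starting k8s kubelet daemon" in text and "Started." in text:
        return "healthy", None
    return "unknown", None
```

A log that matches neither excerpt is reported as "unknown" rather than guessed at.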

Just wondering whether there were any previous bugs or issues which may have caused these errors in the AMIs mentioned above?

Thomas Bechtold (toabctl) wrote :

Thanks for the bug report, Wesley,

We had a related bug in the past (see https://bugs.launchpad.net/cloud-images/+bug/2012689) and added fixes for that in August 2023. I think the problem is unrelated to the AMI. It depends on whether an update for a snap (in this case the aws-cli snap) is available.

I think this problem is a new one. We'll investigate.

summary: Sporadic failures for Ubuntu EKS nodes joining clusters with certain AMI
- versions
+ versions: Make current revision for snap "aws-cli" unavailable
Raoni Timo de Castro Cambiaghi (raonitimo) wrote :

I work with Wesley, and we got a bit more information. These are example log lines we got:

Sep 26, 2023 @ 10:53:02.109 storehelpers.go:773: cannot refresh snap "aws-cli": snap has no updates available
Sep 26, 2023 @ 10:52:59.295 storehelpers.go:773: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core18", "core20", "core22", "kubectl-eks", "kubelet-eks", "snapd"
Sep 26, 2023 @ 10:52:58.524 handlers.go:677: Reported install problem for "aws-cli" as Crash report successfully submitted.
Sep 26, 2023 @ 10:52:58.089 taskrunner.go:299: [change 3 "Make current revision for snap \"aws-cli\" unavailable" task] failed: snap "aws-cli" has running apps (aws), pids: 1216
Sep 26, 2023 @ 10:52:47.992 storehelpers.go:773: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core18", "core20", "core22", "kubectl-eks", "kubelet-eks", "snapd"
Sep 26, 2023 @ 10:52:46.821 backends.go:58: AppArmor status: apparmor is enabled and all features are available (using snapd provided apparmor_parser)
Sep 26, 2023 @ 10:52:46.029 daemon.go:340: adjusting startup timeout by 1m10s (pessimistic estimate of 30s plus 5s per snap)
Sep 26, 2023 @ 10:52:45.961 daemon.go:247: started snapd/2.60.3 (series 16; classic) ubuntu/20.04 (amd64) linux/5.15.0-1045-aws.
Sep 26, 2023 @ 10:52:45.857 overlord.go:272: Acquiring state lock file
Sep 26, 2023 @ 10:52:45.857 overlord.go:277: Acquired state lock file
Sep 26, 2023 @ 10:52:27.445 main.go:124: Loading profiles [/var/lib/snapd/apparmor/profiles/snap-confine.snapd.20092 /var/lib/snapd/apparmor/profiles/snap-update-ns.amazon-ssm-agent /var/lib/snapd/apparmor/profiles/snap-update-ns.aws-cli /var/lib/snapd/apparmor/profiles/snap-update-ns.kubectl-eks /var/lib/snapd/apparmor/profiles/snap-update-ns.kubelet-eks /var/lib/snapd/apparmor/profiles/snap.amazon-ssm-agent.amazon-ssm-agent /var/lib/snapd/apparmor/profiles/snap.amazon-ssm-agent.ssm-cli /var/lib/snapd/apparmor/profiles/snap.aws-cli.aws /var/lib/snapd/apparmor/profiles/snap.kubectl-eks.kubectl /var/lib/snapd/apparmor/profiles/snap.kubelet-eks.daemon /var/lib/snapd/apparmor/profiles/snap.kubelet-eks.hook.configure /var/lib/snapd/apparmor/profiles/snap.kubelet-eks.kubelet]
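The log lines above are listed newest first, which makes the boot sequence harder to follow. A rough sketch for putting them in chronological order and picking out the failed snapd task is shown here. The function names `parse_entries` and `failed_tasks` are hypothetical, the line format is assumed from the excerpt only, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   Sep 26, 2023 @ 10:52:58.089 taskrunner.go:299: <message>
LINE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2}, \d{4} @ \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<src>\S+?:\d+): (?P<msg>.*)$"
)


def parse_entries(text):
    """Parse log lines into (timestamp, source, message) tuples, oldest first."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        stamp = datetime.strptime(match.group("ts"), "%b %d, %Y @ %H:%M:%S.%f")
        entries.append((stamp, match.group("src"), match.group("msg")))
    # sorted() is stable, so lines with identical timestamps keep their order.
    return sorted(entries, key=lambda entry: entry[0])


def failed_tasks(entries):
    """Return the entries that snapd's task runner reported as failed."""
    return [
        entry for entry in entries
        if entry[1].startswith("taskrunner.go") and "failed:" in entry[2]
    ]
```

Reading the sorted output from top to bottom follows the boot in order, from the snapd start through to the failed task.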

My guess is we're trying to update aws-cli during boot while `aws` is running.

IMO, we shouldn't try to update `aws-cli` on boot. Should we disable this update in the AMI?
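For anyone wanting to experiment with that idea on their own nodes, one possible approach (not something confirmed in this report) is to hold refreshes for the snap with `snap refresh --hold`, which recent snapd releases provide. The sketch below only builds the command and runs it when explicitly asked; the helper names `hold_refresh_command` and `hold_refresh` and the `dry_run` parameter are made up for illustration.

```python
import subprocess


def hold_refresh_command(snap_name, duration="forever"):
    """Build the argv for `snap refresh --hold=<duration> <snap>`."""
    return ["snap", "refresh", f"--hold={duration}", snap_name]


def hold_refresh(snap_name, duration="forever", dry_run=True):
    """Hold refreshes for one snap; with dry_run=True only return the argv."""
    argv = hold_refresh_command(snap_name, duration)
    if not dry_run:
        # Needs root and a snapd version that supports --hold.
        subprocess.run(argv, check=True)
    return argv
```

For example, `hold_refresh("aws-cli", duration="24h")` returns the command that would hold aws-cli refreshes for a day without running it.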

Thanks!
