Sporadic failures for Ubuntu EKS nodes joining clusters with certain AMI versions: Make current revision for snap "aws-cli" unavailable
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloud-images | New | Undecided | Unassigned |
Bug Description
Over the last 2-3 weeks we have had sporadic issues with our EKS Ubuntu nodes failing to join our clusters.
We use Karpenter to manage nodes; the affected nodes fail to join and are eventually killed after reaching the 15-minute registration TTL.
It seems to have been affecting nodes built from the following AMIs:
- ami-0a07e041af3
- ami-064589d2376
A new AMI built yesterday from the following AMI no longer seems to have any issues:
- ami-0384e492712
From the user-data.log output of the nodes that failed to join the cluster, I see the following errors:
```
Configuring kubelet snap
error: cannot perform the following tasks:
- Make current revision for snap "aws-cli" unavailable (snap "aws-cli" has running apps (aws), pids: 1132)
```
A healthy node that joins the cluster logs this:
```
Configuring kubelet snap
Starting k8s kubelet daemon
Started.
```
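For anyone triaging a batch of affected nodes, the difference between the two logs above can be checked mechanically. The following is only a minimal sketch: the function name `find_snap_refresh_conflicts` is hypothetical, and the pattern is derived solely from the single error line quoted above, so it may not match other variants of the snapd message.

```python
import re

# Pattern based only on the error line quoted in this report; other snapd
# versions or messages may be worded differently (assumption).
SNAP_BUSY = re.compile(
    r'Make current revision for snap "(?P<snap>[^"]+)" unavailable '
    r'\(snap "[^"]+" has running apps \((?P<apps>[^)]*)\), pids: (?P<pids>[\d,\s]+)\)'
)


def find_snap_refresh_conflicts(log_text):
    """Return one dict per 'has running apps' failure found in the log text.

    Hypothetical helper for illustration; not part of any AMI tooling.
    """
    conflicts = []
    for match in SNAP_BUSY.finditer(log_text):
        # PIDs may be a comma-separated list; int() tolerates stray spaces.
        pids = [int(p) for p in match.group("pids").split(",") if p.strip()]
        conflicts.append(
            {"snap": match.group("snap"), "apps": match.group("apps"), "pids": pids}
        )
    return conflicts
```

Passing the contents of a node's user-data.log to this function returns an empty list for a healthy node and one entry per blocked snap otherwise, which makes it easier to confirm whether every failed node hit the same `aws-cli` conflict.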
Just wondering whether there were any previous bugs or issues which may have caused these errors in the AMIs mentioned above?
Thanks for the bug report, Wesley,
We had a related bug in the past (see https://bugs.launchpad.net/cloud-images/+bug/2012689) and added fixes for that in August 2023. I think the problem is unrelated to the AMI. It depends on whether an update for a snap (in this case the aws-cli snap) is available.
I think this problem is a new one. We'll investigate.