Latest image breaks Juju on LXD 4.0

Bug #1991051 reported by Simon Fels
This bug affects 1 person
Affects              Status        Importance  Assigned to  Milestone
cloud-images         Fix Released  Undecided   Unassigned
cloud-init           Invalid       Undecided   Unassigned
lxd                  New           Undecided   Unassigned
cloud-init (Ubuntu)  Triaged       Undecided   Unassigned

Bug Description

Ubuntu focal comes with LXD 4.0 installed by default. When using Juju to create containers on a local LXD with focal-based instances, the deployment breaks due to an incorrect cloud-init configuration (see https://github.com/lxc/lxd/issues/10951, a bug which has existed since 2015!). A fix is not expected to arrive in LXD 4.0 until November (see https://github.com/lxc/lxd/issues/10951#issuecomment-1258691195).

The new behavior in cloud-init has real implications: right now our commercial offering of the Anbox Cloud Appliance on the AWS marketplace (https://aws.amazon.com/marketplace/pp/prodview-aqmdt52vqs5qk) is broken for people who buy it, with no simple path to a fix other than forcing users to upgrade to LXD 5.0. We will hotfix this and ask people at installation time to upgrade to LXD 5.0.

However, have we considered rolling back the cloud-init update as it's causing issues for a variety of people? I know about one other product which broke due to this (OSM).

Also see https://bugs.launchpad.net/juju/+bug/1990594

Simon Fels (morphis)
description: updated
Christian Ehrhardt  (paelzer) wrote :

Given that LXD will take quite a while to resolve this, I wonder if there is something between "waiting 5 weeks for fixed LXD" and "rolling back all of cloud-init".
Maybe something small in cloud-init that helps to mitigate it until it is fixed in LXD?

Up to Chad + Squad to discuss and decide.

DUFOUR Olivier (odufourc) wrote :

It has also broken the ability of a Bionic host, running LXD 3.0.3, to create any new container in a deployment.

This has been reproduced on my side with :
* Bionic as host
* Focal as host with LXD 4.24

I'm attaching a log file from one of the containers, in which cloud-init fails to configure networking.

Bas de Bruijne (basdbruijne) wrote :

We're seeing this with SQA too. Bionic deployments can't get lxd machines and the controller is showing these messages:

1/baremetal/var/log/juju/machine-1.log:2022-09-28 05:16:00 DEBUG juju.apiserver.client status.go:1052 no IP addresses fetched for machine "juju-a46b91-3-lxd-0"
1/baremetal/var/log/juju/machine-1.log-2022-09-28 05:16:00 DEBUG juju.apiserver.client status.go:1046 error fetching public address: "no public address(es)"

Logs and configs for this testrun can be found here:
https://oil-jenkins.canonical.com/artifacts/f807cbc3-d539-4f15-8bcd-4a56de341d09/index.html

Brett Holman (holmanb)
Changed in cloud-init:
status: New → Invalid
Simon Fels (morphis) wrote :

Turns out this is a regression introduced by https://bugs.launchpad.net/cloud-images/+bug/1988401

Brett Holman (holmanb) wrote :

I just added LXD to this bug and set the downstream cloud-init status to "invalid" since this is a bug in lxd and expected behavior in cloud-init.

I'm leaving the Ubuntu cloud-init package as triaged for now.

More details to come in a followup comment.

Changed in cloud-init (Ubuntu):
status: New → Triaged
Brett Holman (holmanb) wrote :

Cloud-init behavior is governed by datasource detection. In this case the behavior change is due to the nocloud templates having been removed from the images. Because those templates no longer exist on the filesystem of these images, nothing tells cloud-init to use the old datasource, so there is no "roll back" possible for cloud-init, since that would result in an unbootable system.

It might be possible to reinstate the template files in our images to revert this change in behavior, however whether that is actually a better solution than waiting for lxd to backport the bugfix is unknown.

summary: - Latest cloud-init release breaks Juju on LXD 4.0
+ Latest image breaks Juju on LXD 4.0
Chad Smith (chad.smith) wrote :

Yes, per #4. The shift to drop NoCloud metadata templates from cloud image streams forces cloud-init to use the DatasourceLXD and consume data from `/dev/lxd/sock` instead of the rendered NoCloud seed files that get written to /var/lib/cloud/seed/nocloud-net/. It turns out the data surfaced from `/dev/lxd/sock` was formatted using Printf instead of Print (already fixed in upstream LXD), which only affects the `/dev/lxd/sock` path.

There is a related upstream issue that Simon filed,
https://github.com/lxc/lxd/issues/10951, which looks like it already has a backport commit from @tomparrott. If that commit gets released to the LXD 4.0 stable branch easily, then we can keep the cloud images in their current state. But if a stable backport of the fix in LXD takes a while, I suggest we have cloud-images revert https://github.com/lxc/lxd/issues/10951
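The Printf versus Print difference mentioned in this comment can be illustrated with a short Go sketch. It does not reproduce LXD's actual code; the function names and the sample payload are hypothetical. It only shows the general effect in Go's `fmt` package: when data is passed as the format string, any `%` verb in it is interpreted, and a verb with no matching argument is replaced by a `%!verb(MISSING)` marker.

```go
package main

import "fmt"

// renderAsFormat mimics the buggy pattern: the payload is used as the
// format string, so any '%' verb inside it is interpreted by fmt.
// The empty argument slice stands in for "no arguments supplied".
func renderAsFormat(payload string) string {
	var noArgs []any
	return fmt.Sprintf(payload, noArgs...)
}

// renderVerbatim mimics the fixed pattern: the payload is treated as
// plain data and is returned unchanged.
func renderVerbatim(payload string) string {
	return fmt.Sprint(payload)
}

func main() {
	// Hypothetical configuration fragment containing a '%' verb.
	payload := "printf '%s' ready"
	fmt.Println(renderAsFormat(payload)) // prints printf '%!s(MISSING)' ready
	fmt.Println(renderVerbatim(payload)) // prints printf '%s' ready
}
```

Payloads without a `%` character come out identical either way, which is one plausible reason a bug of this kind can go unnoticed for a long time.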

Brett Holman (holmanb) wrote :

I am reverting duplicate status to make this visible for anybody that sees similar symptoms. Due to the nature of this bug with multiple projects interacting together to cause unwanted behavior, it's probably best to keep this discoverable across projects.

To avoid breaking users, a change is being reverted[1] in image build to avoid hitting this bug in lxd. There are mitigations happening in multiple other projects as well.

[1] https://bugs.launchpad.net/cloud-images/+bug/1988401

Brett Holman (holmanb)
Changed in cloud-images:
status: New → In Progress
Brett Holman (holmanb) wrote :

This was caused by the following change https://bugs.launchpad.net/cloud-images/+bug/1988401 and is expected to be reverted in the next image build. This should be available within the next couple of days.

John Chittum (jchittum)
Changed in cloud-images:
status: In Progress → Fix Released
John Chittum (jchittum) wrote :

Setting cloud-images to Fix Released. We have released the reversion, fixing the bug.

The backport to LXD 4.0 was due in November, and cloud-images now needs to track if the ability to use the cloud-init datasource is available for Focal hosts (4.X). I'll keep this open with a link across to https://bugs.launchpad.net/cloud-images/+bug/1988401

That way, wherever folks land, they get some info somehow.

James Falcon (falcojr) wrote :