machines fail to boot if MAAS doesn't respond to cloud-init

Bug #1910552 reported by Laurent Sesquès
38
This bug affects 6 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Medium
Adam Collard
MAAS documentation
New
Undecided
Unassigned
cloud-init
Fix Released
Medium
Unassigned

Bug Description

We have a recurring issue on a MAAS 2.3.7 (xenial), where once in a while we need to restart rackd and regiond to make maas respond to machines rebooting.
This itself would be a different bug though.
What I'd like to report here is that a machine should be able to finish its boot sequence even if it can't talk to the MAAS API.

Observed behaviour:

[ OK ] Started Raise network interfaces.
[ OK ] Reached target Network.
         Starting Initial cloud-init job (metadata service crawler)...
(stuck here indefinitely)

(restart rackd and regiond)

the machine reboots successfully.

Revision history for this message
Alberto Donato (ack) wrote :

Does this happen during deployment or regular machine boot after deployment?

Changed in maas:
status: New → Incomplete
Revision history for this message
Laurent Sesquès (sajoupa) wrote :

This is during regular reboots after deployment.

Changed in maas:
status: Incomplete → New
Revision history for this message
Björn Tillenius (bjornt) wrote :

I agree that we should set things up so that they rely on MAAS as little as possible after they are deployed.

We'll have to look into whether we can configure cloud-init not to talk to MAAS completely, or at least fail gracefullly.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Subscribing field-high as we're seeing this repeatedly in prod, preventing nodes from coming up

Revision history for this message
James Falcon (falcojr) wrote :

Can we get the cloud-init logs post reboot? "cloud-init collect-logs" (with a -u if there's no sensitive userdata involved).

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Laurent Sesquès (sajoupa) wrote :

I don't have the logs from the issue which led to the original report here.
However it should be straightforward to reproduce.
FTR, I tested locally on a setup using MAAS 2.8.6, and couldn't reproduce. The logs show failures to post cloud-init events, but they're not blocking. I attached the logs.

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

The reason why I couldn't reproduce is because I stopped the regionds, so cloud-init gets an immediate connection refused and moves on. To reproduce, we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce.

Revision history for this message
James Falcon (falcojr) wrote :

"we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce."

Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

"Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?"

Yes. We also observed that if the packets are DROPed between the node and MAAS, then the boot process will be stuck as well.
REJECTing the packets allows the connection to be refused so the boot sequence immediately moves on.
(this happened to us on a node whose MAAS had been decommissioned, and corresponding firewall rules removed. Adding a rule to REJECT the packets had unblocked the boot sequence).

I could reproduce in a local test environment, DROPing packets along the way. I attached the logs. (The logs should show that after a few minutes, the node succeeds contacting MAAS. That's because I removed the rule at that point.).

Changed in cloud-init:
status: Incomplete → New
Revision history for this message
James Falcon (falcojr) wrote :

Hey Laurent. I looked over the logs, and I'm not sure where the issue occurred. There are a number of large time gaps, but they seem to be between invocations of cloud-init. Given how long the log is, can you help me pinpoint where the issue occurred?

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

The issue is visible when the machine boots on 2021-09-15 09:49. cloud-init fails to communicate with MAAS until 10:06:20, which is when I removed the DROP rule.

Revision history for this message
James Falcon (falcojr) wrote :

Thanks. It looks like the issue is that a reporting URL has been setup, so logs are setup to be posted back to MAAS. This happens synchronously, so even though each request has a 20 second timeout, when there are dozens/hundreds of logs to send, each timeout adds up to look like cloud-init has stalled.

Changed in cloud-init:
status: New → Triaged
importance: Undecided → Medium
Changed in maas:
milestone: none → next
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

We want to change configuration of cloud-init to prevent attempts to contact MAAS after the first boot.

Changed in maas:
milestone: next → 3.3.0
Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Triaged → Fix Committed
Revision history for this message
Brett Holman (holmanb) wrote : Fixed in cloud-init version 22.3.

This bug is believed to be fixed in cloud-init in version 22.3. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Changed in maas:
milestone: 3.3.0 → 3.4.0
no longer affects: maas/3.3
Revision history for this message
Adam Collard (adam-collard) wrote (last edit ):

Deployed a machine using an alpha build of MAAS 3.4.0 with cloud-init 23.1.1-0ubuntu0~20.04.1

From initial deployment

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:21:24.408999
    Kernel ended boot at: 2023-04-21 10:21:28.556996
    Kernel time to boot (seconds): 4.14799690246582
    Cloud-init activated by systemd at: 2023-04-21 10:21:33.759920
    Time between Kernel end boot and Cloud-init activation (seconds): 5.202924013137817
    Cloud-init start: 2023-04-21 10:21:36.521000
successful

Then rebooted with MAAS still available

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:29:39.458092
    Kernel ended boot at: 2023-04-21 10:29:43.598001
    Kernel time to boot (seconds): 4.139909029006958
    Cloud-init activated by systemd at: 2023-04-21 10:29:48.546367
    Time between Kernel end boot and Cloud-init activation (seconds): 4.948365926742554
    Cloud-init start: 2023-04-21 10:29:51.154000
successful

Then turned off MAAS, and rebooted a second time

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:34:03.435137
    Kernel ended boot at: 2023-04-21 10:34:07.571495
    Kernel time to boot (seconds): 4.136358022689819
    Cloud-init activated by systemd at: 2023-04-21 10:34:12.638482
    Time between Kernel end boot and Cloud-init activation (seconds): 5.066987037658691
    Cloud-init start: 2023-04-21 10:34:15.266000
successful

and confirmed in the cloud-init logs that we saw the expected

2023-04-21 10:34:27,759 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.

Changed in maas:
assignee: nobody → Adam Collard (adam-collard)
status: Triaged → Won't Fix
status: Won't Fix → Fix Released
status: Fix Released → Won't Fix
status: Won't Fix → Invalid
Revision history for this message
Adam Collard (adam-collard) wrote :

For the docs please note this as a fixed bug

Revision history for this message
Junien Fridrick (axino) wrote :

@adam-collard hello ! Thanks for progressing this issue.

I believe stopping MAAS isn't a good way to verify that this issue is fixed, see comments 7 to 9 as to why. Basically, we experience this bug in two different scenarios :

a) packets to the MAAS servers are iptables-DROPed. Adding a REJECT rules works around the problem.

b) MAAS accepts the TCP connection, but does nothing with it (because it's spinning on CPU for some reason). Restarting (or stopping) MAAS works around the problem.

Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta1
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.