MAAS power-on timeout is too low for LXD
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS | Triaged | Medium | Unassigned | | |
Bug Description
Hey :-)
We have an LXD cluster running on six Arm servers (Ampere eMAG and Altra). On the cluster we have a set of VMs that we registered manually with MAAS and deploy regularly from our CI for testing. We don't use the built-in pod functionality.
Quite often we run into the following problem: our CI allocates and deploys a machine, and MAAS starts to power on the VM. After some time MAAS decides the VM never started, stops the process, and marks the deployment as failed. The event log shows the following (newest entry first):
Wed, 05 Jul. 2023 10:09:01 TFTP Request - bootaa64.efi
Wed, 05 Jul. 2023 10:08:22 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
Wed, 05 Jul. 2023 10:08:22 Node changed status - From 'Deploying' to 'Failed deployment'
Wed, 05 Jul. 2023 10:08:22 Marking node failed - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
Wed, 05 Jul. 2023 10:07:45 Powering on
Wed, 05 Jul. 2023 10:07:35 Deploying
MAAS started to power on the VM at 10:07:45 and detected at 10:08:22 that it was never successfully powered on. This roughly matches the DEFAULT_
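As a quick check of the window between those two log entries, the elapsed time can be computed directly. This is only arithmetic on the timestamps copied from the log above; the variable names are ours.

```python
from datetime import datetime

# Timestamps copied from the MAAS event log above (same day, so time-of-day is enough).
FMT = "%H:%M:%S"
powering_on = datetime.strptime("10:07:45", FMT)   # "Powering on"
marked_failed = datetime.strptime("10:08:22", FMT)  # "Failed to power on node"

# Seconds MAAS waited before giving up on the power-on.
elapsed = (marked_failed - powering_on).total_seconds()
print(f"MAAS gave up after {elapsed:.0f} seconds")
```

So MAAS waited roughly 37 seconds before marking the deployment as failed.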
The LXD logs show that MAAS does power on the VM, but the start operation finishes 30s later than MAAS expects:
2023-07-
l-machine project=default stateful=false
[...]
2023-07-
Some of the VMs have PCI passthrough enabled and may run on a busy system. We tried to shorten the time it takes to finish the start operation, but that is not easy. Is there a way to raise the timeout or make it configurable?
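To illustrate what we mean by a configurable timeout, here is a minimal sketch of a power-state poll whose deadline is a parameter. This is not MAAS code and does not reflect how MAAS actually implements its power checks; `wait_for_power_on`, `query_state`, and the default values are all hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable


def wait_for_power_on(
    query_state: Callable[[], str],
    timeout: float = 60.0,   # hypothetical default, in seconds
    interval: float = 1.0,   # hypothetical pause between polls, in seconds
) -> bool:
    """Poll query_state() until it returns "on" or the timeout expires.

    query_state is a caller-supplied function (an assumption for this
    sketch) that returns the current power state as a string.
    """
    deadline = time.monotonic() + timeout
    while True:
        if query_state() == "on":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a fake VM that reports "off" three times before reporting "on".
states = iter(["off", "off", "off", "on"])
started = wait_for_power_on(lambda: next(states), timeout=5.0, interval=0.0)
print(started)

# Example: a VM that never starts, with the timeout already exhausted.
never_started = wait_for_power_on(lambda: "off", timeout=0.0, interval=0.0)
print(never_started)
```

With something like this, a slow VM (PCI passthrough, busy host) would only need a larger `timeout` value rather than changes on the LXD side.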
This is with MAAS 3.3.4.
Thanks!
description: | updated |
Changed in maas: | |
status: | New → Triaged |
importance: | Undecided → Medium |
milestone: | none → 3.5.0 |