Activity log for bug #2026181

Date Who What changed Old value New value Message
2023-07-05 14:41:54 Simon Fels bug added bug
2023-07-05 14:42:50 Simon Fels description
Old value:
Hey :-)

We have a LXD cluster running on six Arm servers (Ampere eMAG and Altra). On the cluster we have a set of VMs that we manually registered to MAAS and deploy regularly from our CI for testing. We don't use the built-in pod functionality.

Quite often we run into the following problem: our CI allocates and deploys a machine and MAAS starts to power on the VM. After some time it decides the VM never started and stops the process, marking the deployment as failed.

Looking into the logs we see the following:

Wed, 05 Jul. 2023 10:09:01 TFTP Request - bootaa64.efi
Wed, 05 Jul. 2023 10:08:22 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
Wed, 05 Jul. 2023 10:08:22 Node changed status - From 'Deploying' to 'Failed deployment'
Wed, 05 Jul. 2023 10:08:22 Marking node failed - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
Wed, 05 Jul. 2023 10:07:45 Powering on
Wed, 05 Jul. 2023 10:07:35 Deploying

MAAS started to power on the VM at 10:07:45 and concluded at 10:08:22 that it was never successfully powered on. This roughly matches the DEFAULT_WAITING_POLICY (35s) in src/provisioningserver/drivers/power/__init__.py

Checking the LXD logs, the VM is powered on by MAAS and finishes the start operation about 30s later than MAAS expects:

2023-07-05T10:07:46Z lxd.daemon[3211129]: time="2023-07-05T10:07:46Z" level=debug msg="Start started" instance=vm16 instanceType=virtual-machine project=default stateful=false
[...]
2023-07-05T10:08:53Z lxd.daemon[3211129]: time="2023-07-05T10:08:53Z" level=debug msg="Start finished" instance=vm16 instanceType=virtual-machine project=default stateful=false

Some of the VMs have PCI passthrough enabled and may run on a busy system. We tried to shorten the time it takes to finish the start operation, but that is not easy.

Is there a way to raise the timeout or make it configurable?

Thanks!

New value:
Hey :-)

We have a LXD cluster running on six Arm servers (Ampere eMAG and Altra). On the cluster we have a set of VMs that we manually registered to MAAS and deploy regularly from our CI for testing. We don't use the built-in pod functionality.

Quite often we run into the following problem: our CI allocates and deploys a machine and MAAS starts to power on the VM. After some time it decides the VM never started and stops the process, marking the deployment as failed.

Looking into the logs we see the following:

 Wed, 05 Jul. 2023 10:09:01 TFTP Request - bootaa64.efi
 Wed, 05 Jul. 2023 10:08:22 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
 Wed, 05 Jul. 2023 10:08:22 Node changed status - From 'Deploying' to 'Failed deployment'
 Wed, 05 Jul. 2023 10:08:22 Marking node failed - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
 Wed, 05 Jul. 2023 10:07:45 Powering on
 Wed, 05 Jul. 2023 10:07:35 Deploying

MAAS started to power on the VM at 10:07:45 and concluded at 10:08:22 that it was never successfully powered on. This roughly matches the DEFAULT_WAITING_POLICY (35s) in src/provisioningserver/drivers/power/__init__.py

Checking the LXD logs, the VM is powered on by MAAS and finishes the start operation about 30s later than MAAS expects:

2023-07-05T10:07:46Z lxd.daemon[3211129]: time="2023-07-05T10:07:46Z" level=debug msg="Start started" instance=vm16 instanceType=virtual-machine project=default stateful=false
[...]
2023-07-05T10:08:53Z lxd.daemon[3211129]: time="2023-07-05T10:08:53Z" level=debug msg="Start finished" instance=vm16 instanceType=virtual-machine project=default stateful=false

Some of the VMs have PCI passthrough enabled and may run on a busy system. We tried to shorten the time it takes to finish the start operation, but that is not easy.

Is there a way to raise the timeout or make it configurable?

This is with MAAS 3.3.4.

Thanks!
2023-07-06 06:48:05 Alberto Donato maas: status New Triaged
2023-07-06 06:48:07 Alberto Donato maas: importance Undecided Medium
2023-07-06 06:48:15 Alberto Donato maas: milestone 3.5.0
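
The timing argument in the description can be checked with a little arithmetic. The sketch below is only an illustration: it uses the timestamps quoted in the MAAS event log and the LXD debug log, the helper name `seconds_between` and the variable names are made up for this example, and none of it is MAAS or LXD code.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return the whole seconds elapsed between two timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the logs quoted in the bug description.
maas_power_on = "2023-07-05 10:07:45"       # MAAS event: "Powering on"
maas_gave_up = "2023-07-05 10:08:22"        # MAAS event: "Failed to power on node"
lxd_start_started = "2023-07-05 10:07:46"   # LXD: msg="Start started"
lxd_start_finished = "2023-07-05 10:08:53"  # LXD: msg="Start finished"

# How long MAAS waited before marking the deployment as failed.
maas_wait = seconds_between(maas_power_on, maas_gave_up)

# How long LXD actually needed to finish the start operation.
lxd_start = seconds_between(lxd_start_started, lxd_start_finished)

# How long after MAAS gave up the VM start actually finished.
overshoot = seconds_between(maas_gave_up, lxd_start_finished)

print(f"MAAS waited {maas_wait}s, LXD start took {lxd_start}s, overshoot {overshoot}s")
```

With these values MAAS waited 37s (close to the 35s policy mentioned in the description), the LXD start took 67s, and the start finished 31s after MAAS had given up, so for this VM a wait of roughly twice the current length would have been needed.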