LXD containers die (service killed with status SEGV).

Bug #2028827 reported by Donghun
This bug affects 1 person
Affects: Anbox Cloud
Status: Fix Released
Importance: High
Assigned to: Simon Fels
Milestone: 1.19.0

Bug Description

Hello. I am a cloud engineer currently evaluating the adoption of Anbox Cloud. We are running tests in which we launch as many containers executing games as possible. However, within an hour, approximately half of the containers die.

First, let me share our testing environment and scenario.

---------------------------------------------------------------------------------------------------------------------------------------------------
[master server]

CPU : Intel(R) Xeon(R) Platinum 8452Y, 144 cores (HT on)
Mem : 64G
no GPU

[worker server]

CPU : AMD EPYC 7763 64-Core Processor, 128 cores (HT on)
Mem : 1T
no GPU

[master server] and [worker server] connected as a Kubernetes cluster.

We created a VM with KubeVirt and installed Anbox Cloud in it. The VM runs on the worker server. Below are the specifications of the VM:

[vm]
cores : 120
Mem : 512G
Storage : 500G
---------------------------------------------------------------------------------------------------------------------------------------------------

[Test scenario]
The command below is executed every 2 minutes to create a total of 40 containers. These containers run games as applications, and various types of games are being executed.

$ amc launch bs-stress -p webrtc --userdata '{"display_width":1280,"display_height":720,"fps":60}' --no-wait
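
The scenario above can be sketched as a small loop. This is only an illustration based on the clarification given later in this thread (one container is added every two minutes, without deletion, until 40 exist). The helper name launch_staggered is hypothetical and not part of amc; the amc command line itself is the one from the report.

```shell
# Sketch: run a launch command COUNT times, waiting INTERVAL seconds
# between launches. launch_staggered is a hypothetical helper name.
launch_staggered() {
  local count="$1" interval="$2"
  shift 2
  local i=1
  while [ "$i" -le "$count" ]; do
    # "$@" is the launch command; stop at the first failure.
    "$@" || { echo "launch $i failed" >&2; return 1; }
    # No need to wait after the final launch.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}

# Usage matching the report (commented out so that sourcing this file
# does not launch anything):
# launch_staggered 40 120 amc launch bs-stress -p webrtc \
#   --userdata '{"display_width":1280,"display_height":720,"fps":60}' --no-wait
```

With these numbers the last container starts 78 minutes after the first (39 intervals of 2 minutes).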

I have attached the relevant logs below. Please review them.

Given that we cannot change our environment, is there any way to improve the situation? I'll await your response. Thank you.

Simon Fels (morphis) wrote :

Hey Donghun,

thanks for your bug report.

> The command below is executed every 2 minutes to create a total of 40 containers. These containers run games as applications, and various types of games are being executed.

Can you clarify this? Do you mean you create an additional 40 containers every 2 minutes, or do you mean that you delete and recreate 40 containers every 2 minutes?

Looking at your logs, it seems like you're overcommitting your system quite a bit. Your containers use the a8.3 instance type, and as you don't have a GPU installed they will use LLVMPipe software rendering. LLVMPipe will spin up multiple processing threads, but only as many as there are CPU cores assigned to the container. So in this case LLVMPipe will use 8 CPU cores and render at 720p and 60 FPS. Doing that with the BombSquad stress test gives it a decent workload that uses quite a bit of the available CPU time.

Given that your system has 120 cores, you can fit 15 containers without overcommitting. As you spin up 40, you almost triple the load on each core.
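
The arithmetic behind those two numbers can be checked with a short sketch. The function names are hypothetical helpers, not part of amc or LXD; the inputs (120 host cores, 8 cores per a8.3 container, 40 containers) come from this thread.

```shell
# Largest number of containers that fit without overcommitting:
# floor(host_cores / cores_per_container).
max_containers() {
  local host_cores="$1" cores_per_container="$2"
  echo $(( host_cores / cores_per_container ))
}

# Ratio of requested cores to available cores, two decimals.
# Values above 1.00 mean the host is overcommitted.
overcommit_ratio() {
  local host_cores="$1" cores_per_container="$2" containers="$3"
  LC_ALL=C awk -v h="$host_cores" -v c="$cores_per_container" -v n="$containers" \
    'BEGIN { printf "%.2f\n", (n * c) / h }'
}

max_containers 120 8        # prints 15
overcommit_ratio 120 8 40   # prints 2.67
```

40 containers at 8 cores each request 320 cores on a 120-core VM, so each core carries about 2.67 times the load it would without overcommitment.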

By default, Anbox Cloud currently uses CPU time assignments to limit CPU access for the containers, but you can force pinning to be used by setting

$ amc config set cpu.limit_mode pinning

That will pin cores to your containers.

What I recommend is that you assign fewer CPU cores to your containers and enable CPU pinning (you have to restart all containers for this to apply). a4.3 or a2.3 are good starting points.
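
As a rough guide to what the smaller instance types buy you, the sketch below estimates how many containers fit on 120 cores without overcommitting. It assumes that the number after "a" in the instance type name is the vCPU count; this thread confirms that only for a8.3 (8 cores), so treat the a4.3 and a2.3 values as an assumption. The helper names are hypothetical, and memory and the host's own CPU needs are ignored.

```shell
# Parse the vCPU count out of an instance type name such as "a4.3".
# ASSUMPTION: the number after "a" is the vCPU count (the thread
# confirms this only for a8.3 = 8 cores).
cores_for_type() {
  local name="${1#a}"   # strip the leading "a"     -> "4.3"
  echo "${name%%.*}"    # strip from the first "."  -> "4"
}

# Containers that fit on the host without overcommitting.
containers_for_type() {
  local host_cores="$1" itype="$2"
  echo $(( host_cores / $(cores_for_type "$itype") ))
}

for t in a8.3 a4.3 a2.3; do
  echo "$t: $(containers_for_type 120 "$t") containers without overcommit"
done
```

Under that assumption a4.3 fits 30 containers and a2.3 fits 60, so of the three types only a2.3 would hold all 40 containers from the test scenario without sharing cores.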

Looking at the logs you provided in https://bugs.launchpad.net/anbox-cloud/+bug/2028827/comments/1 the .dmp file seems to be corrupt. How did you export it from AMS?

However, this is likely a crash caused by a bug in the latest upstream Mesa that we use in Anbox Cloud 1.18.2, which will be resolved in the upcoming 1.19 release.

Furthermore, for an actual evaluation of Anbox Cloud as a cloud gaming solution I strongly recommend using GPUs. The software backend is not necessarily optimized to run games efficiently and will decrease achievable density, as both rendering and video encoding happen on the CPU. For a real benchmark you will want to extend your system with GPUs.

Simon Fels (morphis)
Changed in anbox-cloud:
milestone: none → 1.19.0
importance: Undecided → High
assignee: nobody → Simon Fels (morphis)
status: New → Fix Committed
Donghun (dhchantels) wrote :

Yes, containers are continuously added at two-minute intervals without deletion until there are a total of 40 containers.

Dmp files are practically meaningless unless someone is a developer capable of analyzing them.

I have received a very satisfactory answer and appreciate you taking the time to check.

Simon Fels (morphis)
Changed in anbox-cloud:
status: Fix Committed → Fix Released