[SRU] OOM errors with new kernels on resuming
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
ec2-hibinit-agent (Ubuntu) | Fix Released | Undecided | Unassigned | |
Xenial | Incomplete | Undecided | Unassigned | |
Bionic | Fix Released | Undecided | Unassigned | |
Eoan | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
* When EC2 instances resume from hibernation, processes are sometimes killed by the kernel's OOM killer.
[Test Case]
* Set up an EC2 instance to allow hibernation as the stop instance action.
* Start the attached Python script in a screen session to reserve 85% of the memory:
python3 mem-waster-pct.py -p 85
* Log out, hibernate, then resume the instance.
* Verify that the Python script is still running after the resume.
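The attached `mem-waster-pct.py` is not reproduced in this report. As a rough illustration of what such a test helper could look like, the sketch below reserves a percentage of total RAM and keeps it resident. It is not the attached script: only the `-p` flag comes from the test case above, while the function names, the `--percent` long option, the 4096-byte page size, and the use of `/proc/meminfo` are assumptions.

```python
import argparse
import time

PAGE_SIZE = 4096  # assumed page size; adjust for other architectures


def parse_mem_total_bytes(meminfo_text):
    """Return MemTotal in bytes from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The value is reported in kB, e.g. "MemTotal:  2048 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def bytes_to_reserve(total_bytes, percent):
    """Number of bytes that make up `percent` of `total_bytes`."""
    return total_bytes * percent // 100


def reserve(nbytes, page_size=PAGE_SIZE):
    """Allocate nbytes and write to every page so the memory is really used."""
    buf = bytearray(nbytes)
    for offset in range(0, nbytes, page_size):
        buf[offset] = 1
    return buf


def main(argv=None):
    """Hypothetical entry point: reserve memory, then sleep until killed."""
    parser = argparse.ArgumentParser(description="Reserve a share of RAM")
    parser.add_argument("-p", "--percent", type=int, default=85)
    args = parser.parse_args(argv)
    with open("/proc/meminfo") as meminfo:
        total = parse_mem_total_bytes(meminfo.read())
    buf = reserve(bytes_to_reserve(total, args.percent))
    print(f"holding {len(buf)} bytes; hibernate and resume now")
    while True:
        time.sleep(60)
```

To use it as a script, call `main()` at the end of the file and start it inside a screen session as described in the test case. If the process is still alive after the resume, the OOM killer did not pick it.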
[Regression Potential]
* The fix sets the memory overcommit policy to 'always overcommit' while the swap file is being removed. This helps the system cope with the shrinking swap space during swap removal. No side effects are expected, since processes trying to allocate an excessive amount of memory would fail under stricter policies, too.
The fix introduces a potential race condition for processes that detect the overcommit policy:
The policy in effect when hibernation took place is saved shortly after resuming and is restored after the swap file is removed. During this time window other processes see the policy as 'always overcommit', even though it may not have been set that way before hibernation and may be restored to a different policy once the swap file is removed. Hitting this race condition seems unlikely, and there seems to be no good way of avoiding it.
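The save, override, and restore sequence described above can be sketched as follows. This is a minimal illustration and not the agent's actual code: the helper names and the injectable `swapoff` callable are assumptions. The sysctl file `/proc/sys/vm/overcommit_memory` and its value `1` for 'always overcommit' are standard Linux behaviour.

```python
import os
import subprocess
from contextlib import contextmanager
from pathlib import Path

OVERCOMMIT_PATH = "/proc/sys/vm/overcommit_memory"
ALWAYS_OVERCOMMIT = "1"  # 0 = heuristic, 1 = always overcommit, 2 = never


@contextmanager
def always_overcommit(policy_path=OVERCOMMIT_PATH):
    """Temporarily switch to 'always overcommit', then restore the old policy."""
    path = Path(policy_path)
    saved = path.read_text().strip()  # policy in effect at hibernation time
    path.write_text(ALWAYS_OVERCOMMIT + "\n")
    try:
        # Race window: other processes see policy 1 until the block exits.
        yield saved
    finally:
        path.write_text(saved + "\n")


def run_swapoff(swap_path):
    """Default swapoff step (assumed): call the swapoff(8) command."""
    subprocess.run(["swapoff", swap_path], check=True)


def remove_swap_file(swap_path, policy_path=OVERCOMMIT_PATH, swapoff=run_swapoff):
    """Disable and delete the swap file while overcommit is relaxed."""
    with always_overcommit(policy_path):
        swapoff(swap_path)  # pages are moved back into RAM here
        os.remove(swap_path)
```

Restoring the policy in a `finally` clause keeps the override from outliving a failed `swapoff`. It does not close the race window, which lasts for as long as the swap removal takes.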
Changed in ec2-hibinit-agent (Ubuntu):
assignee: nobody → Balint Reczey (rbalint)
Changed in ec2-hibinit-agent (Ubuntu):
status: New → Incomplete
assignee: Balint Reczey (rbalint) → nobody
tags: added: id-5e459f823f8a2435d44842eb
description: updated
@fginther Could you please add reproduction steps for the SRU process?