aws: fix hibernation issues on c5.18xlarge
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux-aws (Ubuntu) | New | Undecided | Unassigned | |
Focal | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
Hibernation is still unreliable on c5.18xlarge instances. The system usually hibernates correctly, but on resume it either performs a regular reboot instead of resuming from hibernation, or it hangs completely after the hibernated kernel is loaded into memory (more precisely, it hangs while the resume callbacks of the hibernated kernel are executed).
[Test plan]
Create a c5.18xlarge instance, run the memory stress test script (the same test script that we are using to stress test hibernation), trigger the hibernate event, then trigger the resume event. Repeat a few times; the problem is very likely to reproduce.
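The stress script itself is not included in this report. As a rough illustration only, a memory check around a hibernate/resume cycle could look like the sketch below. The function names, the 4 KiB page size, and the page-tagging scheme are all assumptions made for this example, not the actual test script.

```python
import hashlib

PAGE = 4096  # assumed page size in bytes (hypothetical, for illustration)


def fill_pattern(size_mib):
    """Allocate size_mib MiB and dirty every page with a distinct value."""
    buf = bytearray(size_mib * 1024 * 1024)
    for index, offset in enumerate(range(0, len(buf), PAGE)):
        # Store the page index in the first 8 bytes of each page so that
        # every page has unique content and must be saved in the image.
        buf[offset:offset + 8] = index.to_bytes(8, "little")
    return buf


def digest(buf):
    """Return a SHA-256 hex digest of the buffer contents."""
    return hashlib.sha256(buf).hexdigest()
```

One would call `fill_pattern` and record `digest(buf)` before triggering hibernation, then recompute the digest after resume and compare the two values. A mismatch, a hang, or a fresh boot (the process is gone) each indicate a failed cycle.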
[Fix]
Amazon pointed out two fixes that should address both issues:
1) upstream patch "PM: hibernate: flush swap writer after marking": this prevents the regular reboot issue, because it ensures that the I/O is always flushed after, not before, writing the hibernation signature
2) we need to reserve more space for HVC_BOOT_
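To make the ordering issue in the first fix concrete, here is a toy model (not kernel code) of a writer whose data reaches the disk only on flush. The `ToyDisk` class, the `hibernate` helper, and the signature string are hypothetical stand-ins chosen for this sketch.

```python
class ToyDisk:
    """Toy block device: writes sit in a queue until flush() commits them."""

    def __init__(self):
        self.pending = []
        self.committed = {}

    def write(self, key, value):
        self.pending.append((key, value))

    def flush(self):
        for key, value in self.pending:
            self.committed[key] = value
        self.pending.clear()

    def power_off(self):
        # Anything still queued at power-off is lost.
        self.pending.clear()
        return dict(self.committed)


def hibernate(disk, flush_after_marking):
    """Write the image and the signature, flushing before or after marking."""
    disk.write("image", "memory pages")
    if not flush_after_marking:
        disk.flush()  # old ordering: the signature write is never flushed
    disk.write("signature", "S1SUSPEND")
    if flush_after_marking:
        disk.flush()  # fixed ordering: the signature is on disk before power-off
    return disk.power_off()


def boot_action(on_disk):
    """Decide what the next boot does based on what reached the disk."""
    return "resume" if on_disk.get("signature") == "S1SUSPEND" else "regular boot"
```

In this model, `boot_action(hibernate(ToyDisk(), False))` yields a regular boot because the signature never reaches the disk, while passing `True` yields a resume. The real swap writer is considerably more involved; the sketch only shows why the flush has to come after the signature is marked.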
[Regression potential]
The first patch touches only the hibernation code, so potential regressions could be experienced only in the hibernation scenario. The second patch is more of a hack at the moment, and it affects kvmclock. Increasing HVC_BOOT_ARRAY_SIZE could potentially introduce regressions on small KVM systems; a better solution would be to allocate the hv_clock_boot array dynamically. This is the proper fix that Amazon is currently working on. Once that fix is published upstream, we should apply it and drop this SAUCE patch.
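The trade-off between a larger static array and dynamic allocation can be sketched with some simple arithmetic. The figures below are assumptions for illustration and are not stated in this report: a 4 KiB page, 64-byte per-CPU clock entries, a static array sized to one page, and 72 vCPUs on a c5.18xlarge.

```python
def static_capacity(page_size=4096, entry_size=64):
    """Entries that fit in a one-page static array (assumed sizes)."""
    return page_size // entry_size


def cpus_beyond_static_array(num_cpus, capacity):
    """CPUs whose entries do not fit in the static array."""
    return max(0, num_cpus - capacity)


def pages_needed(num_cpus, page_size=4096, entry_size=64):
    """Pages a dynamically sized array would need for num_cpus entries."""
    total_bytes = num_cpus * entry_size
    return (total_bytes + page_size - 1) // page_size  # round up
```

Under these assumptions the static array holds 64 entries, so 8 of the 72 vCPUs fall outside it. A dynamically sized array would take 2 pages on that instance but only 1 page on a 2-vCPU guest, whereas a larger static array reserves its full size on every system regardless of CPU count.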
Changed in linux-aws (Ubuntu Focal):
status: New → Fix Committed
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!