kernel Firmware Bug: TSC ADJUST differs failures during suspend

Bug #2025616 reported by Weichen Wu
This bug affects 1 person
Affects: linux-nvidia (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

[Summary]
Discovered kernel error messages during a suspend stress test.
Test case ID: power-management/suspend_30_cycles_with_reboots

Collected log:
~~~
High failures:
  s3: 180 failures
========================================
    HIGH Kernel message: [38779.612837] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5266754042. Restoring (x 3)
    HIGH Kernel message: [38839.622411] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5295340520. Restoring (x 3)
    HIGH Kernel message: [38868.467564] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270514862. Restoring (x 3)
    HIGH Kernel message: [38897.419897] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5275834616. Restoring (x 3)
    HIGH Kernel message: [38926.135900] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5249511520. Restoring (x 3)
    HIGH Kernel message: [38955.114760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5252454094. Restoring (x 3)
    HIGH Kernel message: [38983.860142] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270474360. Restoring (x 3)
    HIGH Kernel message: [39012.819868] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5264045578. Restoring (x 3)
~~~

[Failure rate]
1/1

[Additional information]
CID: 201711-25989
SKU: DGX-1 Station
system-manufacturer: NVIDIA
system-product-name: DGX Station
bios-version: 0406
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (40x)
GPU: 07:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
nvidia-driver-version: 525.105.17
kernel-version: 5.15.0-1028-nvidia

[Stage]
Issue reported and logs collected at a later stage

Revision history for this message
Weichen Wu (weichenwu) wrote :

Automatically attached

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/2025616/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
dann frazier (dannf)
affects: ubuntu → linux-nvidia (Ubuntu)
Revision history for this message
dann frazier (dannf) wrote :

I was curious why this wasn't a problem with focal/5.4. I took a look at the last cert run that used focal/5.4[*], and I see these errors in the logs as well. I then went back to the cert run that was used to award focal certification[**] and those errors do *not* appear there.

So either this is an intermittent failure, or likely one of 3 things happened in the interim:
 - The test changed (or was introduced)
 - The kernel changed
 - The firmware changed

The test does not appear to be new - I have not checked if it has changed.

The kernel version between these runs changed from 5.4.0-37.41-generic to 5.4.0-121.137-generic. A change to this kernel code was introduced in between, in 5.4.0-100.113-generic:

commit 7dcfa07b500834c75a4f5043a43f409a3f02bd5e
Author: Feng Tang <email address hidden>
Date: Wed Nov 17 10:37:50 2021 +0800

    x86/tsc: Add a timer to make sure TSC_adjust is always checked

    BugLink: https://bugs.launchpad.net/bugs/1956381

    commit c7719e79347803b8e3b6b50da8c6db410a3012b5 upstream.

That causes the code that *might* print this warning to also run from a timer every 10 minutes, in addition to when a CPU enters idle. But the log shows these messages appearing every 28-29 seconds, so this being the cause seems unlikely.

As for the firmware, the dmidecode output is identical between these runs, which suggests the firmware has not changed.

[*] https://certification.canonical.com/hardware/201711-25989/submission/269263/
[**] https://certification.canonical.com/hardware/201711-25989/submission/172650/
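
The interval argument above can be checked against the log in the bug description. The following is a minimal Python sketch, not part of the original report: the names `parse_tsc_adjust` and `intervals` are hypothetical, and the regular expression assumes the exact message format shown in the collected log.

```python
import re

# Pattern is an assumption based on the message format in this report, e.g.:
#   [38779.612837] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5266754042. Restoring
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[Firmware Bug\]: TSC ADJUST differs: "
    r"CPU(?P<cpu>\d+)\s+(?P<old>-?\d+)\s+-->\s+(?P<new>-?\d+)"
)

# Log lines copied from the bug description.
LOG = """\
HIGH Kernel message: [38779.612837] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5266754042. Restoring (x 3)
HIGH Kernel message: [38839.622411] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5295340520. Restoring (x 3)
HIGH Kernel message: [38868.467564] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270514862. Restoring (x 3)
HIGH Kernel message: [38897.419897] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5275834616. Restoring (x 3)
HIGH Kernel message: [38926.135900] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5249511520. Restoring (x 3)
HIGH Kernel message: [38955.114760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5252454094. Restoring (x 3)
HIGH Kernel message: [38983.860142] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270474360. Restoring (x 3)
HIGH Kernel message: [39012.819868] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5264045578. Restoring (x 3)
"""


def parse_tsc_adjust(lines):
    """Return (timestamp, cpu, old_value, new_value) for each matching line."""
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            events.append(
                (float(m["ts"]), int(m["cpu"]), int(m["old"]), int(m["new"]))
            )
    return events


def intervals(events):
    """Return the seconds elapsed between consecutive messages."""
    return [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]


if __name__ == "__main__":
    events = parse_tsc_adjust(LOG.splitlines())
    for gap in intervals(events):
        print(f"{gap:.1f} s")
```

On the eight lines quoted in the description, the first gap is about 60 seconds and the remaining six are between 28.7 and 29.0 seconds, all far shorter than the 10-minute timer period.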

Revision history for this message
Stephen Carr (truck-adel) wrote :

I have the same problem - see attached.

Linux Lenovo-ideapad-520 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Oct 02 08:11:31 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 08:11:44 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 08:41:45 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 08:41:57 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 09:11:58 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 09:12:12 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 09:42:12 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 09:42:26 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming

Revision history for this message
Stephen Carr (truck-adel) wrote :

I have discovered that the bug prevents Ubuntu 22.04 from suspending to the S3 (deep) state. Setting the suspend state to s2idle works.
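
For anyone wanting to check which suspend variant a machine is using: the kernel lists the available variants in `/sys/power/mem_sleep`, with the currently selected one in square brackets (for example `s2idle [deep]`). A minimal Python sketch for reading it, assuming a Linux system that exposes this file; the helper names `parse_mem_sleep` and `read_mem_sleep` are hypothetical.

```python
from pathlib import Path

# Kernel sysfs file listing suspend variants; the selected one is bracketed.
MEM_SLEEP = Path("/sys/power/mem_sleep")


def parse_mem_sleep(text):
    """Return (selected, available) from text such as 's2idle [deep]'."""
    selected = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the selection
            selected = token
        available.append(token)
    return selected, available


def read_mem_sleep(path=MEM_SLEEP):
    """Read and parse the sysfs file (the path is a parameter to ease testing)."""
    return parse_mem_sleep(path.read_text())


if __name__ == "__main__":
    if MEM_SLEEP.exists():
        selected, available = read_mem_sleep()
        print(f"selected: {selected}, available: {', '.join(available)}")
    else:
        print(f"{MEM_SLEEP} is not available on this system")
```

One way to switch variants is to write one of the listed names to the same file as root, or to set the `mem_sleep_default=` kernel parameter to make the choice persistent. The comment above does not say which method was used.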
