Soft lockup due to interrupt storm from smbus

Bug #1931001 reported by vcarceler
This bug affects 5 people
Affects                  Status        Importance  Assigned to  Milestone
Linux                    Fix Released  Medium
Fedora                   Confirmed     Undecided
linux (Ubuntu)           Incomplete    Undecided   Unassigned
linux-hwe-5.11 (Ubuntu)  Confirmed     Undecided   Unassigned
linux-hwe-5.13 (Ubuntu)  Confirmed     Undecided   Unassigned

Bug Description

Ubuntu 20.04 LTS and Ubuntu 21.04 occasionally boot with very poor performance and are very unresponsive to user input on the Lenovo 300e 2nd Gen 81M9 laptop (LENOVO_MT_81M9_BU_idea_FM_300e 2nd G).

When this happens, messages like these appear in the journal:

---
root@alumne-1-58:~# journalctl | grep "BUG: soft"
may 20 21:44:35 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
may 20 21:44:35 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
may 22 09:33:34 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
may 24 16:45:14 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [prometheus-node:4220]
may 24 16:45:14 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
jun 03 00:01:09 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
jun 03 00:01:09 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
jun 03 00:01:09 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
jun 03 00:01:09 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
jun 03 00:02:15 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [swapper/0:0]
jun 05 08:22:58 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/138-iwlwifi:1044]
jun 05 08:25:06 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0]
jun 05 08:25:06 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/138-iwlwifi:1044]
jun 05 08:26:42 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lxd:3975]
jun 05 08:26:42 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [swapper/2:0]
jun 05 08:26:42 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [irq/138-iwlwifi:1044]
jun 05 08:27:38 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [irq/138-iwlwifi:1044]
jun 05 08:28:34 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/138-iwlwifi:1044]
jun 05 08:29:46 alumne-1-58 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/138-iwlwifi:1044]
root@alumne-1-58:~#
---

Usually everything works fine after a reboot, but it is very annoying when it happens.
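
To see which CPUs and tasks are hit most often, watchdog lines like the ones above can be tallied. The following is only a minimal sketch: the function name `summarize_lockups` is made up for illustration, and the regular expression assumes the exact message format shown in the journal excerpt.

```python
import re
from collections import Counter

# Matches the watchdog lines quoted above, for example:
# "... watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft-lockup reports per (CPU number, task name)."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("task"))] += 1
    return counts
```

One way to use it is to pass the lines of `journalctl -k` output to `summarize_lockups()` and print the resulting counter.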

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

I've noticed that when I enable CONFIG_SENSORS_JC42, either as a module or built into
my kernel, it causes a very high rate of interrupts on i801_smbus: about
6000-8000 per second according to /proc/interrupts. After 20 minutes, about
5 million interrupts had been generated on i801_smbus.
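
For reference, 5 million interrupts in 20 minutes is an average of roughly 4,200 per second. A rate like this can be measured by sampling /proc/interrupts twice. The sketch below is an illustration only: the helper names `irq_count` and `irq_rate` are hypothetical, and it assumes the usual /proc/interrupts layout (IRQ label, one counter per CPU, then controller and device names).

```python
import time


def irq_count(interrupts_text, name):
    """Sum the per-CPU counters of every /proc/interrupts row mentioning `name`."""
    total = 0
    for line in interrupts_text.splitlines():
        if name not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("18:"); the per-CPU counters follow
        # until the first non-numeric field (the interrupt controller name).
        for field in fields[1:]:
            if not field.isdigit():
                break
            total += int(field)
    return total


def irq_rate(name="i801_smbus", interval=5.0, path="/proc/interrupts"):
    """Interrupts per second for `name`, averaged over `interval` seconds."""
    with open(path) as f:
        before = irq_count(f.read(), name)
    time.sleep(interval)
    with open(path) as f:
        after = irq_count(f.read(), name)
    return (after - before) / interval
```

Calling `irq_rate()` on an affected machine should return a value in the thousands if the storm is active, and a value near zero otherwise.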

When I unload the jc42 module, the interrupts do not stop until I do a
complete reboot.

Mainboard: Supermicro A1SRM-2758F
Kernel: Gentoo-Sources 4.8.1 (Happens also with Vanilla 4.8.1 and older kernel
versions)

dmesg:
[ 8.319900] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
[ 8.321864] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 8.326098] ismt_smbus 0000:00:13.0: enabling device (0140 -> 0142)

lspci:
00:1f.3 SMBus: Intel Corporation Atom processor C2000 PCU SMBus (rev 02)

When the module is loaded, I am also getting these errors:
[ 73.934901] ismt_smbus 0000:00:13.0: completion wait timed out
[ 74.974970] ismt_smbus 0000:00:13.0: completion wait timed out
[ 76.014949] ismt_smbus 0000:00:13.0: completion wait timed out
[ 77.054903] ismt_smbus 0000:00:13.0: completion wait timed out
[ 78.094961] ismt_smbus 0000:00:13.0: completion wait timed out
[ 79.134982] ismt_smbus 0000:00:13.0: completion wait timed out
[ 80.175116] ismt_smbus 0000:00:13.0: completion wait timed out
[ 81.215057] ismt_smbus 0000:00:13.0: completion wait timed out

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

The jc42 module seems to work, as lm_sensors finds the sensors after loading it:

Galactica ~ # sensors
jc42-i2c-1-19
Adapter: SMBus I801 adapter at e000
temp1: +30.8°C (low = +0.0°C) ALARM (HIGH, CRIT)
                       (high = +0.0°C, hyst = +0.0°C)
                       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-1a
Adapter: SMBus I801 adapter at e000
temp1: +29.5°C (low = +0.0°C) ALARM (HIGH, CRIT)
                       (high = +0.0°C, hyst = +0.0°C)
                       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-18
Adapter: SMBus I801 adapter at e000
temp1: +27.2°C (low = +0.0°C) ALARM (HIGH, CRIT)
                       (high = +0.0°C, hyst = +0.0°C)
                       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-1b
Adapter: SMBus I801 adapter at e000
temp1: +28.2°C (low = +0.0°C) ALARM (HIGH, CRIT)
                       (high = +0.0°C, hyst = +0.0°C)
                       (crit = +0.0°C, hyst = +0.0°C)

In , linux (linux-linux-kernel-bugs) wrote :

You need to set the temperature limits correctly. Without limits, the chips will persistently generate alarms which is the likely cause of the interrupts.

That won't solve the completion interrupt timeouts, though. That may be another problem.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Guenter Roeck from comment #2)
> You need to set the temperature limits correctly. Without limits, the chips
> will persistently generate alarms which is the likely cause of the
> interrupts.
>
> That won't solve the completion interrupt timeouts, though. That may be
> another problem.

Hi!
Thanks for your answer. I gave it a try and set those limits, so sensors no longer shows any ALARM. That does not seem to be the cause, though, because with the limits set the interrupts are still generated massively.

jc42-i2c-1-1b
Adapter: SMBus I801 adapter at e000
RAM: +30.0°C (low = +0.0°C)
                       (high = +80.0°C, hyst = +80.0°C)
                       (crit = +80.0°C, hyst = +80.0°C)

jc42-i2c-1-19
Adapter: SMBus I801 adapter at e000
RAM: +32.0°C (low = +0.0°C)
                       (high = +80.0°C, hyst = +80.0°C)
                       (crit = +80.0°C, hyst = +80.0°C)
jc42-i2c-1-1a
Adapter: SMBus I801 adapter at e000
RAM: +31.0°C (low = +0.0°C)
                       (high = +80.0°C, hyst = +80.0°C)
                       (crit = +80.0°C, hyst = +80.0°C)

jc42-i2c-1-18
Adapter: SMBus I801 adapter at e000
RAM: +28.0°C (low = +0.0°C)
                       (high = +80.0°C, hyst = +80.0°C)
                       (crit = +80.0°C, hyst = +80.0°C)

Cheers
Conrad

In , linux (linux-linux-kernel-bugs) wrote :

Weird, especially since the chips should not generate interrupts in the first place unless that is explicitly enabled (which the driver doesn't do, or at least shouldn't do). My wild guess is that taking the chips out of shutdown mode for some reason enables the interrupt.

Can you send the output of "i2cdump -y -f 1 0x18 w" ? Also, do the interrupts stop when you unload the driver ?

Thanks,
Guenter

In , linux (linux-linux-kernel-bugs) wrote :

Please forget the question about the unload, as you already answered it.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Guenter Roeck from comment #4)
> Weird, especially since the chips should not generate interrupts in the
> first place unless it is explicitly enabled (which the driver doesn't do, or
> at least shouldn't do). My wild guess is that taking the chips out of
> shutdown mode for some reasons enables the interrupt.
>
> Can you send the output of "i2cdump -y -f 1 0x18 w" ?

Here we go:

╭─root@Galactica ~
╰─➤ i2cdump -y -f 1 0x18 w
     0,8  1,9  2,a  3,b  4,c  5,d  6,e  7,f
00: ef00 0000 0005 0000 0005 c801 1f00 0182
08: 0000 0000 0000 0000 0000 0000 0000 0000
10: 0000 0000 0000 0000 0000 0000 0000 0000
18: 0000 0000 0000 0000 0000 0000 0000 0000
20: 0000 0000 0000 0000 0000 0000 0000 0000
28: 0000 0000 0000 0000 0000 0000 0000 0000
30: 0000 0000 0000 0000 0000 0000 0000 0000
38: 0000 0000 0000 0000 0000 0000 0000 0000
40: 0000 0000 0000 0000 0000 0000 0000 0000
48: 0000 0000 0000 0000 0000 0000 0000 0000
50: 0000 0000 0000 0000 0000 0000 0000 0000
58: 0000 0000 0000 0000 0000 0000 0000 0000
60: 0000 0000 0000 0000 0000 0000 0000 0000
68: 0000 0000 0000 0000 0000 0000 0000 0000
70: 0000 0000 0000 0000 0000 0000 0000 0000
78: 0000 0000 0000 0000 0000 0000 0000 0000
80: 0000 0000 0000 0000 0000 0000 0000 0000
88: 0000 0000 0000 0000 0000 0000 0000 0000
90: 0000 0000 0000 0000 0000 0000 0000 0000
98: 0000 0000 0000 0000 0000 0000 0000 0000
a0: 0000 0000 0000 0000 0000 0000 0000 0000
a8: 0000 0000 0000 0000 0000 0000 0000 0000
b0: 0000 0000 0000 0000 0000 0000 0000 0000
b8: 0000 0000 0000 0000 0000 0000 0000 0000
c0: 0000 0000 0000 0000 0000 0000 0000 0000
c8: 0000 0000 0000 0000 0000 0000 0000 0000
d0: 0000 0000 0000 0000 0000 0000 0000 0000
d8: 0000 0000 0000 0000 0000 0000 0000 0000
e0: 0000 0000 0000 0000 0000 0000 0000 0000
e8: 0000 0000 0000 0000 0000 0000 0000 0000
f0: 0000 0000 0000 0000 0000 0000 0000 0000
f8: 0000 0000 0000 0000 0000 0000 0000 0000

>Also, do the interrupts stop when you unload the driver ?

No, they only stop when I do a complete server reboot.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Ah, forgot to add. Loading the old "eeprom"-module causes the same problem with the interrupts, see [1]. Maybe this is somehow connected?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=177291

In , linux (linux-linux-kernel-bugs) wrote :

This is an Atmel AT30TS00. Per configuration register, events are disabled, and there is no event pending, meaning it should not really be the JC42s generating the interrupts.
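
The first row of the dump can be decoded along these lines. This is a sketch under stated assumptions, not the driver's code: it assumes the JC-42.4 register order (0 capabilities, 1 configuration, 2 upper limit, 3 lower limit, 4 critical limit, 5 temperature, 6 manufacturer ID, 7 device ID), and it assumes that i2cdump in word mode shows each 16-bit register with its bytes swapped, because the chip sends the high byte first while SMBus words are little-endian. The helper names are made up for illustration.

```python
def swap16(word):
    """Swap the two bytes of a 16-bit word."""
    return ((word & 0xFF) << 8) | (word >> 8)


def temp_celsius(reg):
    """Convert a temperature or limit register to degrees Celsius.

    Bits 12..0 hold a two's-complement value in units of 1/16 degree C;
    the upper three bits are flags and are masked off.
    """
    value = reg & 0x1FFF
    if value & 0x1000:  # sign bit (bit 12)
        value -= 0x2000
    return value / 16.0


def decode_row(row):
    """Decode the first i2cdump row ("ef00 0000 0005 ...") into named registers."""
    regs = [swap16(int(word, 16)) for word in row.split()]
    return {
        "capabilities": regs[0],
        "config": regs[1],
        "upper_limit_c": temp_celsius(regs[2]),
        "lower_limit_c": temp_celsius(regs[3]),
        "critical_limit_c": temp_celsius(regs[4]),
        "temperature_c": temp_celsius(regs[5]),
        "manufacturer_id": regs[6],
        "device_id": regs[7],
    }
```

With the row from the dump above, `ef00 0000 0005 0000 0005 c801 1f00 0182`, this gives manufacturer ID 0x001f and device ID 0x8201, a configuration register of 0x0000 (no bits set at all, so no event output enabled and no event pending), a temperature of 28.5°C, and upper and critical limits of 80.0°C, which matches the limits set earlier.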

Another question: If you only load the i801 module after boot (ie prevent the jc42 module from loading, eg by blacklisting it, but still load the i801 module), do you still get the interrupts ?

Thanks,
Guenter

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Guenter Roeck from comment #8)
> Another question: If you only load the i801 module after boot (ie prevent
> the jc42 module from loading, eg by blacklisting it, but still load the i801
> module), do you still get the interrupts ?

That's my current situation ;-) jc42 is built only as a module, which is currently not loaded at system startup, and i801 is compiled into my kernel. In that case, zero interrupts are generated on i801_smbus.

Cheers
Conrad

In , linux (linux-linux-kernel-bugs) wrote :

#7 suggests a problem with the i801 driver and its interrupt handling. #9 contradicts that a bit, though.

Maybe the C2000 has problems with interrupts, or implements it differently than handled by the driver. This may be triggered by an actual access on the bus. You could try to confirm it by running the i2cdump command after booting without the jc42 module loaded (i2cdetect -y 1 should show no reserved addresses) and see if the interrupts start happening.

Thanks,
Guenter

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Guenter Roeck from comment #10)
> #7 suggests a problem with the i801 driver and its interrupt handling. #9
> contradicts that a bit, though.
>
> Maybe the C2000 has problems with interrupts, or implements it differently
> than handled by the driver. This may be triggered by an actual access on the
> bus. You could try to confirm it by running the i2cdump command after
> booting without the jc42 module loaded (i2cdetect -y 1 should show no
> reserved addresses) and see if the interrupts start happening.
>
> Thanks,
> Guenter

You nailed it ;-) Right after executing "i2cdump -y -f 1 0x18 w", the interrupts start massively, even though jc42 wasn't loaded.

Cheers
Conrad

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Sorry, but I don't know what you mean by "reserved" here.

Before/after executing i2cdump (output is the same):

╭─root@Galactica ~
╰─➤ i2cdetect -y 1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- 08 -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- 18 19 1a 1b -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2e --
30: 30 31 32 33 -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- 49 -- -- -- -- -- --
50: 50 51 52 53 -- -- -- -- -- -- -- -- -- -- -- --
60: -- 61 -- -- -- -- -- -- -- 69 -- -- 6c -- -- --
70: -- -- -- -- -- -- -- --

A simple "i2cdetect -y 1" also triggers the interrupts.

In , linux (linux-linux-kernel-bugs) wrote :

With "reserved" I meant "a driver for a chip is loaded". After you load the jc42 driver (or the eeprom driver), you'll see that some of the addresses show up as "UU".

Anyway, I think the conclusion is that the i801 driver has problems with interrupt support on your hardware, as I suspected in #10. Issue #177291 is really the same problem. Jean maintains that driver as well, so he should be able to help.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Guenter Roeck from comment #13)
> With "reserved" I meant "a driver for a chip is loaded". After you load the
> jc42 driver (or the eeprom driver), you'll see that some of the addresses
> show up as "UU".

Ah I see. Yes, after loading jc42, I can see "UU".

╭─root@Galactica ~
╰─➤ i2cdetect -y 1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- 08 -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- UU UU UU UU -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2e --
30: 30 31 32 33 -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- 49 -- -- -- -- -- --
50: 50 51 52 53 -- -- -- -- -- -- -- -- -- -- -- --
60: -- 61 -- -- -- -- -- -- -- 69 -- -- 6c -- -- --
70: -- -- -- -- -- -- -- --

> Anyway, I think the conclusion is that the i801 driver has problems with
> interrupt support on your hardware, as I suspected in #10. Issue #177291 is
> really the same problem. Jean maintains that driver as well, so he should be
> able to help.

Should I close #177291 as a duplicate, since it's my ticket?
Thanks for your support. I hope Jean has an idea :)

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Thanks Guenter for stepping in. I always suspected the problem was with the SMBus controller (i2c-i801 driver) and I intended to comment about it long ago but then forgot, sorry about that :-(

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Conrad, I need detailed information about the SMBus PCI devices and the IRQs on your machine. Please attach the output of:

$ /sbin/lspci -nn | grep SMBus

$ /sbin/lspci -xxx -s <device>
(for each device listed above)

$ cat /proc/interrupts

Also look for any message related to i2c, SMBus, i801 or the PCI devices above in the kernel logs.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Hello Jean!

(In reply to Jean Delvare from comment #16)
> $ /sbin/lspci -nn | grep SMBus

00:13.0 System peripheral [0880]: Intel Corporation Atom processor C2000 SMBus 2.0 [8086:1f15] (rev 02)
00:1f.3 SMBus [0c05]: Intel Corporation Atom processor C2000 PCU SMBus [8086:1f3c] (rev 02)

> $ /sbin/lspci -xxx -s <device>
> (for each device listed above)

╭─root@Galactica /home/kostecki
╰─➤ lspci -xxx -s 00:13.0
00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 02)
00: 86 80 15 1f 46 05 10 00 02 00 80 08 00 00 00 00
10: 04 40 f1 ff 0f 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 20 08
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 10 80 92 00 01 80 00 10 20 08 04 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 8c 03 00 00 00 00 00 00 00 00 00 05 00 81 01
90: 0c f0 ef fe 00 00 00 00 a6 41 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 01 00 10 00 10 80
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

╭─root@Galactica /home/kostecki
╰─➤ lspci -xxx -s 00:1f.3
00:1f.3 SMBus: Intel Corporation Atom processor C2000 PCU SMBus (rev 02)
00: 86 80 3c 1f 43 01 98 02 02 00 05 0c 00 00 00 00
10: 00 00 50 df 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 e0 00 00 00 00 00 00 00 00 00 00 d9 15 20 08
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 02 00 00
40: 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 03 04 04 00 00 00 08 08 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 0f 02 01 03 03 03 00

> $ cat /proc/interrupts

See attachment.

> Also look for any message related to i2c, SMBus, i801 or the PCI devices
> above in the kernel logs.

╭─root@Galactica /
╰─➤ dmesg|grep -i smbus

[ 7.968653] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
[ 7.970338] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 7.974068] ismt_smbus 0000:00:13.0: enabling device (0140 -> 0142)
[ 974.471917] ismt_smbus 0000:00:13.0: completion wait timed out
[ 975.512022] ismt_smbus 0000:00:13.0: completion wait timed out
[ 976.552097] ismt_smbus 0000:00:13.0: completion wait timed out
[ 977.592124] ismt_smbus 0000:00:13.0: completion wait timed out
[ 978.632168] ismt_smbus 0000:00:13.0: completion wait timed out
[ 979.682207] ismt_smbus 0000:00:13.0: completion wait timed out
[ 980.712251] ismt_smbus 0000:00:13.0: completion wait timed out
[ 981.752310] ismt_smbus 0000:00:13...


In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Created attachment 246221
cat /proc/interrupts

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Created attachment 246231
dmesg output

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Can you blacklist ismt-msi, reboot and see if it makes any difference?

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jean Delvare from comment #20)
> Can you blacklist ismt-msi, reboot and see if it makes any difference?

No, that didn't change anything. I've compiled a new kernel without ismt-msi (CONFIG_I2C_ISMT=n), and the interrupt rate still goes very high after loading jc42.

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

OK, thanks. I have added Intel folks to Cc. I can't find the register descriptions for the Atom C2000 SMBus function, so there's not so much I can do.

Conrad, support for the SMBus in this CPU family was added several years ago to the i2c-i801 driver, so I am wondering why this bug is only reported now.

Is this new hardware for you? Or have you had it for some time, and it was working fine until it broke with a kernel or OS update?

In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

I found a datasheet through the Avoton C2750 product page:
http://ark.intel.com/products/77987/Intel-Atom-Processor-C2750-4M-Cache-2_40-GHz
->
https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-c2000-microserver-datasheet.pdf

I guess both C2758 and C2750 are compatible as they are listed in C2000 Product Family for Communications.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jean Delvare from comment #22)
> Is this new hardware for you? Or you have it for some time, and it was
> working fine so far, and broke with a kernel or OS update?

Yes, this is new hardware. I bought it a few weeks before opening this ticket, so I can't tell whether it was working before.

(In reply to Jarkko Nikula from comment #23)
> I found some datasheet through Avoton C2750
> http://ark.intel.com/products/77987/Intel-Atom-Processor-C2750-4M-Cache-2_40-
> GHz
> ->
> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/
> atom-c2000-microserver-datasheet.pdf
>
> I guess both C2758 and C2750 are compatible as they are listed in C2000
> Product Family for Communications.

The C2750 has Turbo Boost; the C2758 has a QuickAssist accelerator instead. (I don't know whether this makes a difference for the registers.)

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Jarkko, I found the same document, however it doesn't appear to contain register definitions, or I am blind.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jean Delvare from comment #25)
> Jarkko, I found the same document, however it doesn't appear to contain
> register definitions, or I am blind.

Maybe chapters 15.8 and 18.5? Sorry if that's wrong; I don't know whether that's what you are looking for.

In , linux (linux-linux-kernel-bugs) wrote :

Problem is that only the register addresses are provided, not the register definitions. Sure, there is a status register, and we know its address, but we don't know how the bits are defined and if they are defined exactly like in other Intel CPUs.

With the C2000 being a different micro-architecture than the "mainline" Intel CPUs, there is a real possibility that the register definitions are different.

In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Sorry, I looked at it too quickly. Indeed, the definitions are missing. I'll ask via http://ark.intel.com/ whether a more detailed datasheet is available.

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Conrad, until we sort it out, you may be able to work around the problem by passing option disable_features=0x10 to the i2c-i801 driver.
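
The disable_features value is a bit mask. The sketch below decodes such a mask; the table of bit meanings is an assumption based on the module parameter description of recent i2c-i801 drivers (older kernels may not know every bit), so check `modinfo i2c_i801` on the kernel in question. The name `describe` is made up for illustration.

```python
# Bit meanings as listed in the i2c-i801 module parameter description.
# Treat this table as an assumption and verify it with `modinfo i2c_i801`.
DISABLE_FEATURES = {
    0x01: "disable SMBus PEC",
    0x02: "disable the block buffer",
    0x08: "disable the I2C block read functionality",
    0x10: "don't use interrupts",
    0x20: "disable SMBus Host Notify",
}


def describe(mask):
    """Return the descriptions of all known bits set in `mask`, lowest bit first."""
    return [text for bit, text in sorted(DISABLE_FEATURES.items()) if mask & bit]
```

Here `describe(0x10)` yields only "don't use interrupts", which makes the driver fall back to polling. When the driver is built into the kernel, the option would typically be passed on the kernel command line as `i2c_i801.disable_features=0x10`; when it is a module, as `options i2c_i801 disable_features=0x10` in a file under /etc/modprobe.d/.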

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jean Delvare from comment #29)
> Conrad, until we sort it out, you may be able to work around the problem by
> passing option disable_features=0x10 to the i2c-i801 driver.

Hey Jean,
disabling the interrupts for i2c-i801 seems to help as a workaround.

[ 7.950079] i801_smbus 0000:00:1f.3: Interrupt disabled by user
[ 7.951624] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
[ 7.953270] i801_smbus 0000:00:1f.3: SMBus using polling

Cheers
Conrad

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

*** Bug 177291 has been marked as a duplicate of this bug. ***

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Any news for me? :)

In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

Jarkko, were you able to get your hands on a datasheet? It doesn't need to be public, if you can check the register definitions for us.

In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

I got one contact info back in December but no response. Maybe busy before holidays and I forgot to ping again. I'll ask again.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jarkko Nikula from comment #34)
> I got one contact info back in December but no response. Maybe busy before
> holidays and I forgot to ping again. I'll ask again.

Did you get any reply?

In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Only an out-of-office reply back in March, but I pinged again now.

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Jarkko Nikula from comment #36)
> Just only out of office reply back in March but pinged again now.

And now? ;-)

In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

Hmm... It seems this one somehow got abandoned. Jarkko, any news on this? Same question to Conrad: have you had any luck with v5.11-based kernels (or closer to latest)?

In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to Andy Shevchenko from comment #38)
> Hmm... Seems this one gets somehow abandoned. Jarkko, any news on this? Same
> question to Conrad, do you have any luck with v5.11 based kernels (or closer
> to latest)?

Nope. No news. Problem still exists with latest kernel.

In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Unfortunately I don't have any updates on this.

vcarceler (vcarceler-b) wrote :
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1931001/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
vcarceler (vcarceler-b)
affects: ubuntu → linux (Ubuntu)
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, vcarceler.

Can you give complete dmesg from when this happens?

Thanks for your report.
Cascardo.

vcarceler (vcarceler-b) wrote :

Hello Cascardo.

Here you will find dmesg.tgz with:

dmesg/dmesg-2021-05-07-08-22.txt
dmesg/dmesg-normal.txt
dmesg/dmesg-unresponsive.txt
dmesg/dmesg-2021-05-20.txt

dmesg-normal.txt is a full dmesg when the computer works fine.

dmesg-unresponsive.txt is a full dmesg from a boot in which the computer very soon becomes unresponsive to the trackpad, and you can even type faster than the computer manages to process. Usually this happens on the first boot after the laptop has been shut down for a full day.

dmesg-2021-05-07-08-22.txt and dmesg-2021-05-20.txt are full dmesg outputs in which you can see the message CPU# stuck for 22s!

When this happens nothing works well. I even deployed a small script to reboot the laptop when this happens.

We are a school with hundreds of desktops and laptops running Ubuntu 20.04 without problems. But we have received a large number of these Lenovo laptops, which don't work well with Ubuntu 20.04 or 21.04.

I don't know if it may help you but with Fedora 34 the laptop works fine.

Thank you for your attention.

In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

This bug gave me the idea to try MSI on i801, but it appears that none of the platforms have MSI capability on this device. I'm not sure if this is useful information, but I think it's better to share it anyway.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1931001

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jan Herold (yzle)
affects: linux (Ubuntu) → linux-hwe-5.11 (Ubuntu)
Changed in linux-hwe-5.11 (Ubuntu):
status: Incomplete → Confirmed
Jan Herold (yzle) wrote : Re: kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s!

I also have this problem. After the automatic update from kernel 5.8 to 5.11 this error occurs.
The error only occurs during a cold boot. When rebooting the system, this error does not occur.

Here is an interesting thread about this problem: https://bbs.minisforum.com/threads/ubuntu-stuck-booting.1830/

In , byron.c.hawkins (byron.c.hawkins-redhat-bugs) wrote :

1. Please describe the problem:

Fedora 34 is totally unusable on an Acer Aspire 1 A114-32-P9MN (laptop), which probably does not have a quality BIOS implementation, but does work fine with Debian 11, Oracle Linux 8.4, etc. It only has problems with Fedora 34. The machine constantly reports "soft lockup" and something about a watchdog, which I know nothing about, really. The "soft lockup" occurs in many different modules and contexts (as indicated by the vast number of stack traces in the system logs). Booting from a live USB of Fedora 34, it often took more than 30 minutes to reach the initial desktop, whereas Oracle Linux boots in about 10 seconds and never causes a "soft lockup". I tried dozens of configuration adjustments to work around the problem, but nothing improved. Considering the large number of user reports mentioning "soft lockup" on Fedora 34, it seems to me that something is seriously wrong with the build. For now, I have moved to Oracle Linux and will not install Fedora again on any machine.

2. What is the Version-Release number of the kernel:

5.13.4-200.fc34.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear? Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I didn't notice any problems on Fedora 32. After upgrading to Fedora 34 (5.13.4-200.fc34.x86_64), the machine is totally unusable because of constant "soft lockups" occurring in many different components and contexts.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Install Fedora 34 on an Acer Aspire 1 A114-32-P9MN, or just boot from a live USB. It will hang with soft lockups.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Sorry, I switched to Oracle Linux, and am in the process of migrating all my machines. Fedora is not an option if it has such severe problems on basic commodity hardware.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No, just a plain live USB will trigger the problem at its fullest severity.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Sorry, the system has been wiped clean for an install of Oracle Linux, which works with no problems. But in any case, the machine was responding so poorly under Fedora 34 that it would have been nearly impossible to obtain the logs, even by simply copying them to a USB drive. The machine is entirely crippled under Fedora 34.

Revision history for this message
paul janssen (pauluswaulus) wrote (last edit ):

Same same, after upgrade from 5.8 to 5.11.
Soft lockups during boot, long boot time, and after that a very slow machine.
My machine is also Intel Celeron based, like the previous reports.

After logging in to the desktop environment, the "atop" program shows that almost all CPU time is spent in irq, where normally this is close to 0 percent (see attachment).
Logging in was hard because not all keyboard input was processed.
The "old" ubuntu is still working as expected (see attachment).

See attachment for logs.

I tried the following:
* Ubuntu Live image on usb: same problem
* Fedora Live image on usb: same problem
* wait until the boot process comes through and collect the logs (kernel params: nomodeset debug verbose)
* perform an apt-get update; apt-get upgrade, reboot, problem still present
* fsck , all was okay.
* tried kernel parameter intel_idle.max_cstate=1 (https://wiki.archlinux.org/title/Intel_graphics#Baytrail_complete_freeze), same problem
* tried kernel parameter noapic (following a hunch), same problem

Revision history for this message
paul janssen (pauluswaulus) wrote (last edit ):

Since I was not able to use the "Also affects distribution/package" functionality for at least Arch Linux, I am adding the following links to very similar bug reports in distros other than Ubuntu:
* https://bugs.archlinux.org/task/71575
* https://bugs.archlinux.org/task/70236
* https://bbs.archlinux.org/viewtopic.php?id=264127
* https://forum.manjaro.org/t/soft-lockup-during-boot/64257
* https://bugzilla.redhat.com/show_bug.cgi?id=2009977
* https://bugzilla.redhat.com/show_bug.cgi?id=1980928
* https://bugzilla.redhat.com/show_bug.cgi?id=1977553

The match I looked for was:
* soft lockups during boot
* If boot log available: RIP at either "__do_softirq" or "cpuidle_enter_state" , although this might be HW dependent.

Revision history for this message
paul janssen (pauluswaulus) wrote (last edit ):

Also tried:
* kernel parameter watchdog_thresh=20, same problem
* BIOS setting fast boot=disabled (was enabled), same problem

Revision history for this message
paul janssen (pauluswaulus) wrote :

Possible workaround (not a fix): blacklist the module i2c_i801. It works for me ...

Since I noticed a high amount of CPU time spent in interrupt handling I looked at /proc/interrupts (right after the slow boot and slow login):
$ cat /proc/interrupts
            CPU0 CPU1
   0: 9 0 IR-IO-APIC 2-edge timer
   1: 0 249 IR-IO-APIC 1-edge i8042
   8: 1 0 IR-IO-APIC 8-fasteoi rtc0
   9: 0 1017 IR-IO-APIC 9-fasteoi acpi
  14: 0 591 IR-IO-APIC 14-fasteoi INT3453:00, INT3453:01, INT3453:03
  15: 0 0 IR-IO-APIC 15-fasteoi INT3453:02
  20: 190734634 0 IR-IO-APIC 20-fasteoi i801_smbus
  31: 8350 0 IR-IO-APIC 31-fasteoi idma64.0, i2c_designware.0
  39: 0 84628 IR-IO-APIC 39-fasteoi mmc0
 120: 0 0 DMAR-MSI 0-edge dmar0
 121: 0 0 DMAR-MSI 1-edge dmar1
 122: 0 0 IR-PCI-MSI 311296-edge PCIe PME
 123: 0 0 IR-PCI-MSI 315392-edge PCIe PME
 124: 0 0 IR-PCI-MSI 317440-edge PCIe PME
 125: 0 0 IR-PCI-MSI 294912-edge ahci[0000:00:12.0]
 126: 0 3 IR-PCI-MSI 1048576-edge rtsx_pci
 127: 4171 0 IR-PCI-MSI 344064-edge xhci_hcd
 128: 0 296 INT3453:00 18 ELAN0503:00
 129: 0 0 IR-PCI-MSI 1050624-edge enp2s0f1
 130: 0 44 IR-PCI-MSI 245760-edge mei_me
 131: 18279 0 IR-PCI-MSI 1572864-edge ath10k_pci
 132: 0 669 IR-PCI-MSI 229376-edge snd_hda_intel:card0
 NMI: 690 49 Non-maskable interrupts
 LOC: 693366 704015 Local timer interrupts
 SPU: 0 0 Spurious interrupts
 PMI: 690 49 Performance monitoring interrupts
 IWI: 31340 91937 IRQ work interrupts
 RTR: 0 0 APIC ICR read retries
 RES: 23071 21772 Rescheduling interrupts
 CAL: 10091 3666 Function call interrupts
 TLB: 2750 4570 TLB shootdowns
 TRM: 0 0 Thermal event interrupts
 THR: 0 0 Threshold APIC interrupts
 DFR: 0 0 Deferred Error APIC interrupts
 MCE: 0 0 Machine check exceptions
 MCP: 10 11 Machine check polls
 ERR: 0
 MIS: 0
 PIN: 0 0 Posted-interrupt notification event
 NPI: 0 0 Nested posted-interrupt event
 PIW: 0 0 Posted-interrupt wakeup event

This led me to the module i801_smbus which depends on the i2c_i801 module (found this out using lsmod).
Following this ~similar~ issue (https://bbs.archlinux.org/viewtopic.php?id=254885) I decided to give blacklisting i2c_i801 a try.

I added "module_blacklist=i2c_i801" to the kernel parameters (via the edit action in the GRUB boot menu), and "voilà!" the problem was gone.

Note: I do not fully understand the consequences of not having the i2c_i801 and i801_smbus modules.
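
The /proc/interrupts check that pointed at i801_smbus can be scripted. The following is only a sketch: the helper names irq_total and irq_rate and the one-second sampling interval are illustrative choices, not something taken from this report. It sums the per-CPU columns for a named interrupt source and prints how far the counter moved between two samples.

```shell
#!/bin/sh
# irq_total NAME [FILE]: sum the per-CPU counters of every line in FILE
# (default /proc/interrupts) whose last field is exactly NAME.
# Hypothetical helper, for illustration only.
irq_total() {
    awk -v name="$1" '
        $NF == name {
            # Field 1 is the IRQ number; the per-CPU counts follow and end
            # at the first non-numeric field (the interrupt chip name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
        }
        END { printf "%.0f\n", sum }
    ' "${2:-/proc/interrupts}"
}

# irq_rate NAME [FILE]: sample twice, one second apart, and print the
# number of interrupts counted in between.
irq_rate() {
    first=$(irq_total "$1" "${2:-/proc/interrupts}")
    sleep 1
    second=$(irq_total "$1" "${2:-/proc/interrupts}")
    echo $(( second - first ))
}
```

On an affected machine, running "irq_rate i801_smbus" should print a very large number, while a healthy system is expected to stay near zero. Shared interrupt lines that list several devices in the last column are not matched by this simple comparison.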

The ?bett...

Revision history for this message
In , stephane.poignant (stephane.poignant-linux-kernel-bugs) wrote :

Not sure that's completely related, but would assume at least partially.
I have two mini-servers, one with a Supermicro A2SDi-8C-HLN4F (Atom C3758), and the other one with an older Supermicro A1SRM-2758F (Atom C2758F).

I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye (5.10.46-5). No issue on the C3758, but i was faced with severe performance regression on the C2758F.

When running 5.10 on the C2758F, /proc/interrupts shows about 100k interrupts per second for 'IO-APIC 18-fasteoi i801_smbus', and overall performance suffers a lot (e.g. iperf between two KVM virtual machines bridged together is 93% slower with 5.10 than with 4.19).

So far i was getting around the issue by blocklisting i2c_i801. After i found this, i tried adding the disable_features=0x10 option, and that worked too.

I'm not using jc42 at all, sensors thresholds are set to correct values by the distro tools.

# i2cdetect -l

# sensors
nvme-pci-0400
Adapter: PCI adapter
Composite: +30.9°C (low = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1: +30.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +48.0°C (high = +98.0°C, crit = +98.0°C)
Core 1: +48.0°C (high = +98.0°C, crit = +98.0°C)
Core 2: +48.0°C (high = +98.0°C, crit = +98.0°C)
Core 3: +48.0°C (high = +98.0°C, crit = +98.0°C)
Core 4: +47.0°C (high = +98.0°C, crit = +98.0°C)
Core 5: +46.0°C (high = +98.0°C, crit = +98.0°C)
Core 6: +47.0°C (high = +98.0°C, crit = +98.0°C)
Core 7: +47.0°C (high = +98.0°C, crit = +98.0°C)

# dmesg | egrep -i '(smbus|i801)'
[ 2.226240] ismt_smbus 0000:00:13.0: enabling device (0000 -> 0002)
[ 2.229927] i801_smbus 0000:00:1f.3: enabling device (0000 -> 0003)
[ 2.230089] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 2.230136] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt

~# lspci -nn | grep SMBus
00:13.0 System peripheral [0880]: Intel Corporation Atom processor C2000 SMBus 2.0 [8086:1f15] (rev 03)
00:1f.3 SMBus [0c05]: Intel Corporation Atom processor C2000 PCU SMBus [8086:1f3c] (rev 03)

# lspci -xxx -s 00:13.0
00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 03)
00: 86 80 15 1f 06 04 10 00 03 00 80 08 00 00 00 00
10: 04 70 31 df 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 20 08
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 10 80 92 00 01 80 00 10 20 08 04 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 8c 03 00 00 00 00 00 00 00 00 00 05 00 81 01
90: 04 00 e4 fe 00 00 00 00 21 40 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 01 00 10 00 10 80
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -xxx -s 00:1f.3
00:1f.3 SMBus: Intel Corporat...

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Yes, this is the same problem here. But Intel doesn't seem to be interested here :-(

Revision history for this message
paul janssen (pauluswaulus) wrote (last edit ):

I also tried blacklisting only "i801_smbus" but that gave the same issue.
Only blacklisting "i2c_i801" is currently the best workaround.

Changed in fedora:
importance: Unknown → Undecided
status: Unknown → Confirmed
Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

(In reply to stephane.poignant from comment #42)
> I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye
> (5.10.46-5). No issue on the C3758, but i was faced with severe performance
> regression on the C2758F.
>
Interesting, so was the 4.19 working on the C2758F without interrupt storm?

Revision history for this message
In , stephane.poignant (stephane.poignant-linux-kernel-bugs) wrote :

(In reply to Jarkko Nikula from comment #44)
> (In reply to stephane.poignant from comment #42)
> > I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye
> > (5.10.46-5). No issue on the C3758, but i was faced with severe performance
> > regression on the C2758F.
> >
> Interesting, so was the 4.19 working on the C2758F without interrupt storm?

I haven't checked the /proc/interrupts when running 4.19 so i cannot tell for sure that the interrupts were not there. The performance regression was not there for sure. I can check this in a couple of weeks (server at a remote location with no oobm network).

Dmesg when running 4.19 shows it had interrupts enabled:

[ 0.000000] Linux version 4.19.0-17-amd64 (<email address hidden>) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.194-3 (2021-07-18)
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.0-17-amd64 root=/dev/mapper/vg1--hrbpsrv01-h--hrbpsrv01 ro quiet rd.luks.options=discard
...
[ 1.434097] Run /init as init process
[ 1.782787] dca service started, version 1.12.1
[ 1.783203] ismt_smbus 0000:00:13.0: enabling device (0000 -> 0002)
[ 1.796694] cryptd: max_cpu_qlen set to 1000
[ 1.801177] i801_smbus 0000:00:1f.3: enabling device (0000 -> 0003)
[ 1.801317] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 1.801356] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 1.805199] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 1.805202] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 1.805246] igb 0000:00:14.0: enabling device (0000 -> 0002)
[ 1.816722] SSE version of gcm_enc/dec engaged.
...

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

The problem does persist in kernel 4.19 and other versions. It only depends on whether a different driver triggers the interrupts. If so, the counts climb very high. So it's possible that you had no driver in 4.19 using those interrupts and, as a consequence, the bug did not trigger.

@Jarkko Nikula: Since you are still replying, could you please keep trying to get the needed docs, as requested by Jean Delvare?

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

@Conrad Kostecki: Yeah, I agree with you that it's unlikely the problem was absent in 4.19, as it was present well before it.

I was in contact with our sales support and they told me the Atom C2758 with the F suffix is custom to Supermicro. Unfortunately they didn't find an explicit specification for its SMBus controller, but they said it's based on the same 22 nm Silvermont architecture as Bay Trail. I suppose the SMBus IO should be compatible.

Unfortunately public datasheets for Bay Trail seem scarce too, but I was able to find something when searching for datasheets for the Bay Trail E3825 used in the MinnowBoard Max. The following document seems to be available to registered ark.intel.com users or via search engines:

"Intel Atom® Processor E3800 Product Family", Document Number 538136, Chapter 33 "PCU – System Management Bus (SMBus)"

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Created attachment 299193
Debug patch for the i2c-i801 interrupts

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Could you try the attached patch and see which interrupt statuses it prints during an interrupt storm? It's a rate-limited debug print, so it shouldn't flood dmesg.

You need to have CONFIG_DYNAMIC_DEBUG=y in your kernel config and either enable the debug print at runtime with the following:

mount none /sys/kernel/debug -t debugfs
echo -n "func i801_isr +p" >/sys/kernel/debug/dynamic_debug/control

or by appending that to your kernel command line:
i2c_i801.dyndbg="func i801_isr +p"

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Here is the output:

pcicst 0x298, SMBHSTSTS 0x60
[ 359.205884] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 359.205918] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210031] i801_isr: 375367 callbacks suppressed
[ 364.210043] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210085] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210126] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210142] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210178] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210217] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210234] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210253] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210292] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210329] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220035] i801_isr: 380909 callbacks suppressed
[ 369.220047] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220069] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220109] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220146] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220185] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220222] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220262] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220278] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220317] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220333] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230078] i801_isr: 393736 callbacks suppressed
[ 374.230109] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230151] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230191] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230210] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230248] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230283] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230297] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230332] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230345] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 374.230358] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240037] i801_isr: 382705 callbacks suppressed
[ 379.240068] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240090] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240110] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240130] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240150] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240186] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240205] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240242] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240281] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 379.240297] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 384.250032] i801_isr: 387109 callback...

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Thanks. Those debug prints confirm the interrupt is really coming from the SMBus controller (bit 3 is set in PCI status) and the SMB alert bit is set.
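
The two values from the debug print can be decoded bit by bit. This is a minimal sketch, not part of the debug patch: the function name decode_i801 is made up for illustration, and the masks are assumptions based on the PCI status register layout (bit 3, interrupt status) and on the bit definitions in the upstream i2c-i801 driver (SMBALERT_STS as bit 5, INUSE_STS as bit 6).

```shell
#!/bin/sh
# decode_i801 PCISTS SMBHSTSTS: name the status bits relevant to this bug.
# The masks are assumptions (PCI status layout, upstream i2c-i801 driver),
# not values quoted in this report.
decode_i801() {
    pcists=$(( $1 ))
    hststs=$(( $2 ))
    [ $(( pcists & 0x08 )) -ne 0 ] && echo "PCI status: interrupt pending (bit 3)"
    [ $(( hststs & 0x20 )) -ne 0 ] && echo "SMBHSTSTS: SMBALERT_STS (bit 5)"
    [ $(( hststs & 0x40 )) -ne 0 ] && echo "SMBHSTSTS: INUSE_STS (bit 6)"
    return 0
}

# Values from the log above: pcicst 0x298, SMBHSTSTS 0x60
decode_i801 0x298 0x60
```

With the logged values, all three lines are printed, which matches the reading that the interrupt comes from the SMBus controller with the alert status set.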

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Created attachment 299201
Experimental patch disabling SMB_ALERT signal

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

@Conrad Kostecki: Could you check whether the attached experimental patch, which disables SMB_ALERT, helps here?

Revision history for this message
In , stephane.poignant (stephane.poignant-linux-kernel-bugs) wrote :

Thanks for the follow up, i will test the patch on my setup as well by next week.

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

I just tested the patch and can confirm, it works. After applying patch, interrupts dropped nearly to zero on i801_smbus.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to Conrad Kostecki from comment #55)
> I just tested the patch and can confirm, it works. After applying patch,
> interrupts dropped nearly to zero on i801_smbus.

According to the specification, the host (if it implements ALERT) should issue a special byte read command to see which device wants to send something. If a proper implementation doesn't fix this, it might be a pin configuration issue (like a pull-down sitting on the respective pin) or a PCB or firmware (BIOS) issue.
It would be nice to understand, if it can be done without much effort, what exactly is causing ALERT to be asserted.

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

I was also wondering whether there should be proper acknowledging of SMB_ALERT, but since the driver currently doesn't support it, I wanted to see whether simply disabling it helps.

Fortunately I was able to reproduce the issue locally on another platform where SMB_ALERT was connected to a resistor, so I could pull the signal down with a wire. The interrupt storm begins when SMB_ALERT goes low for a moment and continues afterwards.

I'll test a bit more and make a proper patch. One thing I'm wondering is whether the driver should restore the original disable status on driver removal, like what is done for host notify in i801_disable_host_notify().

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

Created attachment 299217
2nd version of patch disabling SMB_ALERT signal

I moved the SMB_ALERT signal disabling to i801_enable_host_notify() since the SMBSLVCMD register is available on ICH3 and later. Also it keeps the original value prior to driver load.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to Jarkko Nikula from comment #58)
> 2nd version of patch disabling SMB_ALERT signal

Side remark: Looking into this code, shouldn't you first clean current notifications and only after that enable IRQ?

Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Patch v2 works for me. Interrupts still are fine and do not go crazy.

Revision history for this message
In , stephane.poignant (stephane.poignant-linux-kernel-bugs) wrote :

I can confirm that i am getting the same results with the two patches on my setup with the Debian kernels.
Debug patch produces the same messages, and with SMB_ALERT disable patch there was no longer any interrupt triggered.

Also when booting into the previous kernel i was using (linux-image-4.19.0-17-amd64 4.19.194-3), the module loads with the default config but i am not getting any interrupt. So for my particular setup the issue only appeared after upgrading from Debian kernel 4.19 to 5.10.

Will test the second version of the patch ASAP and provide you with the results.

## Kernel 4.19

# uname -a
Linux hrbpsrv01.intra.lan 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

# cat /proc/interrupts | grep i801
 18: 0 0 0 0 0 0 0 0 IO-APIC 18-fasteoi i801_smbus

# dmesg
...
[ 6652.023634] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 6652.023689] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
...

## Debian linux-image-5.10.0-9-amd64 (5.10.70-1) + Debug patch

# uname -a
Linux hrbpsrv01.intra.lan 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

# cat /proc/interrupts | grep i801
 18: 0 0 0 0 0 7358862 0 0 IO-APIC 18-fasteoi i801_smbus
(increase at about 100k interrupts/sec)

# dmesg
...
[ 516.429120] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 516.429140] i801_smbus 0000:00:1f.3: An interrupt is pending!
[ 516.429161] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 516.429933] i2c i2c-1: 4/4 memory slots populated (from DMI)
[ 516.430337] at24 1-0050: supply vcc not found, using dummy regulator
[ 516.431043] at24 1-0050: 256 byte spd EEPROM, read-only
[ 516.431078] i2c i2c-1: Successfully instantiated SPD at 0x50
[ 516.431455] at24 1-0051: supply vcc not found, using dummy regulator
[ 516.432148] at24 1-0051: 256 byte spd EEPROM, read-only
[ 516.432174] i2c i2c-1: Successfully instantiated SPD at 0x51
[ 516.432576] at24 1-0052: supply vcc not found, using dummy regulator
[ 516.433284] at24 1-0052: 256 byte spd EEPROM, read-only
[ 516.433325] i2c i2c-1: Successfully instantiated SPD at 0x52
[ 516.433748] at24 1-0053: supply vcc not found, using dummy regulator
[ 516.434454] at24 1-0053: 256 byte spd EEPROM, read-only
[ 516.434497] i2c i2c-1: Successfully instantiated SPD at 0x53
[ 525.513104] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513133] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513161] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513185] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513209] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513234] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513258] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513281] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513316] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513352] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514207] i801_isr: 297603 callbacks suppressed
[ 530.5...

Revision history for this message
In , stephane.poignant (stephane.poignant-linux-kernel-bugs) wrote :

Patch V2 works for me too.

# cat /proc/interrupts | grep i801
 18: 0 0 0 0 0 8 0 0 IO-APIC 18-fasteoi i801_smbus

Revision history for this message
In , jarkko.nikula (jarkko.nikula-linux-kernel-bugs) wrote :

(In reply to Andy Shevchenko from comment #59)
> (In reply to Jarkko Nikula from comment #58)
> > 2nd version of patch disabling SMB_ALERT signal
>
> Side remark: Looking into this code, shouldn't you first clean current
> notifications and only after that enable IRQ?

That's a good question and made me debugging more. In fact disabling doesn't disable detection and SMBALERT_STS will be set and cause short burst of interrupts during driver load and unload time if SMB_ALERT signal was asserted. Looks like it's better to add basic acknowledging for it into i801_isr().

I'm not sure would clearing pending interrupts at the probe time cause any regression but acknowledging the SMBALERT_STS in i801_isr() makes sure the status doesn't stay forever if it occurs after probe.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to Jarkko Nikula from comment #63)
> (In reply to Andy Shevchenko from comment #59)
> > (In reply to Jarkko Nikula from comment #58)
> > > 2nd version of patch disabling SMB_ALERT signal
> >
> > Side remark: Looking into this code, shouldn't you first clean current
> > notifications and only after that enable IRQ?
>
> That's a good question and made me debugging more. In fact disabling doesn't
> disable detection and SMBALERT_STS will be set and cause short burst of
> interrupts during driver load and unload time if SMB_ALERT signal was
> asserted. Looks like it's better to add basic acknowledging for it into
> i801_isr().
>
> I'm not sure would clearing pending interrupts at the probe time cause any
> regression but acknowledging the SMBALERT_STS in i801_isr() makes sure the
> status doesn't stay forever if it occurs after probe.

It also makes sense to test it with DEBUG_SHIRQ enabled (yes, I know that more than a half of the drivers in the Linux kernel will either crash or behave badly on this, not many developers know about the debugging feature).

Revision history for this message
paul janssen (pauluswaulus) wrote : Re: kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s!

I started kernel bisecting in an attempt to find the commit that causes this issue.
Painful process.

I found another workaround (not a solution) on Stack Overflow (the post has since been deleted). The workaround was to disable virtualization in the BIOS: Intel VTX -> disabled, Intel VTD -> disabled. This "worked" for me. The machine booted. But /proc/interrupts still showed about 50,000 interrupts/sec from the smbus. So the flood of interrupts is still there, but it is somehow rate limited, allowing the machine to boot and be sufficiently responsive. For now I prefer blacklisting i2c_i801.

Revision history for this message
paul janssen (pauluswaulus) wrote :

New best workaround: instead of blacklisting i2c-i801, keep it loaded but disable interrupts so that it uses polling.

Step 1: Temporary (to be able to boot for step 2).
a. Enter the GRUB menu (press ESC once during boot).
b. Select the (Ubuntu) Linux entry you want to boot and press "e" to edit it.
c. Edit the line starting with " linux /boot/vmlinuz ....."
d. At the end of this line add " i2c-i801.disable_features=0x10"
e. Press F10.
Now the machine will boot with this new i2c-i801 module parameter. This applies to one boot only; the next boot will be without this parameter (unless you manually add it again by repeating the above steps).

Step 2: After booting and logging in, make it permanent:
a. Run "sudo vi /etc/modprobe.d/i2c-i801.conf"
b. Add the line "options i2c-i801 disable_features=0x10"
c. To make sure it's used at boot time, run: "sudo update-initramfs -u"

With this workaround the module i2c-i801 is still loaded but uses polling instead of interrupts. I think this is better than no i2c-i801 at all.
I can boot, the issue does not occur.

Still a workaround ..
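
Whether the parameter actually took effect after a reboot can be checked from a shell. The sketch below rests on an assumption: that the module exposes its parameter at /sys/module/i2c_i801/parameters/disable_features, a path this report does not confirm. The helper name uses_polling is likewise made up for illustration. It only tests whether bit 0x10, the value used in the workaround, is set.

```shell
#!/bin/sh
# uses_polling VALUE: succeed if bit 0x10 (the value used in the
# workaround above to turn off interrupts) is set in VALUE.
# Hypothetical helper, for illustration only.
uses_polling() {
    [ $(( $1 & 0x10 )) -ne 0 ]
}

# Assumed sysfs location of the module parameter; adjust if it differs.
param=/sys/module/i2c_i801/parameters/disable_features
if [ -r "$param" ]; then
    if uses_polling "$(cat "$param")"; then
        echo "i2c_i801: interrupts disabled, polling in use"
    else
        echo "i2c_i801: interrupts still enabled"
    fi
else
    echo "i2c_i801 not loaded (or parameter not exposed)"
fi
```

As a second check, the i801_smbus line in /proc/interrupts should no longer be counting up once the driver is polling.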

Revision history for this message
Tobias Karnat (tobiaskarnat) wrote :

My Lenovo Ideapad Duet 3i with Ubuntu 22.04 (Kernel 5.13) is also affected (Current workaround disable_features=0x10).

Changed in linux:
importance: Unknown → Medium
status: Unknown → Incomplete
Revision history for this message
Tobias Karnat (tobiaskarnat) wrote : Re: kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s!

I cannot take any logs with apport-collect, because the boot is too slow to finish when this happens.
So please change the status to Confirmed.

Jan Herold (yzle)
Changed in linux-hwe-5.13 (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
status: New → Confirmed
status: Confirmed → Incomplete
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Tobias Karnat (tobiaskarnat) wrote :

The proposed patch from bugzilla.kernel.org

tags: added: patch
Revision history for this message
In , jdelvare (jdelvare-linux-kernel-bugs) wrote :

This bug is believed to be fixed in kernel v5.16 by the following 2 commits:

commit 03a976c9afb5e3c4f8260c6c08a27d723b279c92
Author: Jarkko Nikula
Date: Wed Nov 17 11:45:09 2021 +0200

    i2c: i801: Fix interrupt storm from SMB_ALERT signal

commit 9b5bf5878138293fb5b14a48a7a17b6ede6bea25
Author: Jean Delvare
Date: Tue Nov 9 16:02:57 2021 +0100

    i2c: i801: Restore INTREN on unload
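
A rough way to tell whether a running kernel is new enough to contain these commits is to compare its version against 5.16. This is a sketch under stated assumptions: the helper name has_i801_fix is made up, and the check only looks at the version number, so it cannot see a fix that a distribution has backported to an older kernel.

```shell
#!/bin/sh
# has_i801_fix RELEASE: succeed if RELEASE (e.g. the output of `uname -r`)
# is version 5.16 or newer. Backports to older distribution kernels are
# not detected. Hypothetical helper, for illustration only.
has_i801_fix() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -ge 16 ]; }
}

if has_i801_fix "$(uname -r)"; then
    echo "kernel version is 5.16 or newer"
else
    echo "kernel version predates 5.16; check for a backport"
fi
```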

Revision history for this message
paul janssen (pauluswaulus) wrote :

Believed to be fixed in the kernel by two commits.
See: https://bugzilla.kernel.org/show_bug.cgi?id=177311#c65

Changed in linux:
status: Incomplete → Fix Released
summary: - kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s!
+ Soft lockup due to interrupt storm from smbus
Revision history for this message
In , ck+kernelbugzilla (ck+kernelbugzilla-linux-kernel-bugs) wrote :

Upgraded to kernel 5.16 today; no more IRQ noise. Thank you!

Revision history for this message
In , bcotton (bcotton-redhat-bugs) wrote :

This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version'
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora Linux 34 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Revision history for this message
Dave Jones (waveform) wrote (last edit ):

Also affects my Acer Aspire TravelMate Spin B118 on Ubuntu 22.04. The i2c-i801 workaround from comment 14 above (https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.11/+bug/1931001/comments/14) works nicely.
