workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND

Bug #2038492 reported by Robert Ross
This bug affects 12 people
Affects                 Status        Importance  Assigned to        Milestone
linux (Ubuntu)          Fix Released  Medium      Benjamin Wheeler
linux-oem-6.5 (Ubuntu)  New           Undecided   Unassigned

Bug Description

dmesg has multiple messages:
[ 852.580542] evict_inodes inode 00000000be24a21a, i_count = 1, was skipped!
[ 852.580543] evict_inodes inode 000000002aea522e, i_count = 1, was skipped!
[ 852.580544] evict_inodes inode 0000000048d75a5c, i_count = 1, was skipped!
[ 852.580545] evict_inodes inode 00000000d8a9e4c2, i_count = 1, was skipped!
[ 852.580546] evict_inodes inode 000000003d2c905c, i_count = 1, was skipped!
[ 852.580547] evict_inodes inode 00000000e5b1b232, i_count = 1, was skipped!
[ 852.580548] evict_inodes inode 0000000097383a6b, i_count = 1, was skipped!
[ 852.580549] evict_inodes inode 00000000ca8e2b44, i_count = 1, was skipped!
[ 1751.869281] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[ 1781.467278] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 1806.065364] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[ 1901.993975] evict_inodes inode 00000000abaa740e, i_count = 1, was skipped!
[ 1901.993981] evict_inodes inode 00000000515b9bf8, i_count = 1, was skipped!
[ 1901.993983] evict_inodes inode 000000001a69d536, i_count = 1, was skipped!
[ 1901.993984] evict_inodes inode 000000001403f675, i_count = 1, was skipped!
[ 1901.993985] evict_inodes inode 00000000757de21a, i_count = 1, was skipped!
[ 1901.993986] evict_inodes inode 000000000cee9028, i_count = 1, was skipped!
[ 1901.993987] evict_inodes inode 00000000d2827e77, i_count = 1, was skipped!

Description: Ubuntu 22.04.3 LTS
Release: 22.04

linux-oem-22.04d:
  Installed: 6.5.0.1004.4
  Candidate: 6.5.0.1004.4

This seems related to bug #2037214.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-oem-6.5 (Ubuntu):
status: New → Confirmed
Revision history for this message
Nazar Mokrynskyi (nazar-pc) wrote :

Same is happening to me, probably related to BTRFS?

Revision history for this message
Gordon Lack (gordon-lack) wrote :

The duplicate status is *clearly wrong*: #2037214 refers to the "evict_inodes" issue (which has since been fixed), whereas this bug report is, according to its title, about the "workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us" issue.

This has not been fixed (I'm seeing it on 6.8.0-31-generic in Noble).

There are solutions posted for it.
See:
 https://lkml.org/lkml/2023/8/30/613
 https://gitlab.freedesktop.org/drm/intel/-/issues/9245
 https://<email address hidden>/T/

which I picked up via: https://ubuntuforums.org/showthread.php?t=2490640

Revision history for this message
Davor Grubisa (horzadome) wrote :

Having this issue too.
For me the easiest way to reproduce it is to trigger a btrfs balance and wait a minute or two, with no other workloads running.

root@build01:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04 LTS
Release: 24.04
Codename: noble

Fully updated, running the latest available kernel
Linux build01 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

FWIW, I haven't noticed this issue on any Arch kernels running on a similar setup and platform (btrfs with single data and mirrored metadata, on NVMe drives, AMD 5950X).

Revision history for this message
Matthias Wallnöfer (mwallnoefer) wrote (last edit ):

In my case I see the same issue, but with "delayed_fput". And yes, I am also running btrfs (OS Ubuntu 22.04.4 LTS / Kernel 6.5.0-45-generic).

[ 208.207228] workqueue: delayed_fput hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[ 259.794567] workqueue: delayed_fput hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 378.217600] workqueue: delayed_fput hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[ 485.636678] workqueue: delayed_fput hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND

Revision history for this message
maclenin (maclenin) wrote :

I am also facing a similar "delayed_fput" issue. However, I am *not* running btrfs. (Xubuntu 22.04 / ext4 / 6.5.0-45-generic / Intel i5-9600K 3.7 GHz / 32 GB RAM).

journalctl output demonstrating an earlier occurrence of the pattern:

Apr 26 10:44:56 server kernel: workqueue: delayed_fput hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Apr 26 11:18:25 server kernel: workqueue: delayed_fput hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
Apr 26 16:53:22 server kernel: workqueue: delayed_fput hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
Apr 27 15:12:28 server kernel: workqueue: delayed_fput hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
Apr 27 19:25:50 server kernel: workqueue: delayed_fput hogged CPU for >10000us 64 times, consider switching to WQ_UNBOUND
Apr 29 12:14:07 server kernel: workqueue: delayed_fput hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND
Apr 30 09:46:43 server kernel: workqueue: delayed_fput hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
May 01 13:29:35 server kernel: workqueue: delayed_fput hogged CPU for >10000us 512 times, consider switching to WQ_UNBOUND

However, the issue persists:

Jul 29 15:10:55 server kernel: workqueue: delayed_fput hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Jul 29 15:13:26 server kernel: workqueue: delayed_fput hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND

This could be workload-related: running Chrome alone did not trigger it, but adding Chromium eventually did (about 4 hours later).

I have experienced two recent crashes (22 July and 25 July) requiring a hard restart. I am not certain whether they are related to the "hog".

Revision history for this message
DrewM (drewm27) wrote (last edit ):

Also having this issue. After about 4 days of uptime the network adapter seems to develop delays, so pinging other systems on our external interface shows wildly swinging ping times. The problem does not appear on our internal network, which is strange because it uses the same model of network card. A reboot seems to fix it for a few days.

This is on a Dell T630 with all firmware up to date, running Ubuntu 24.04 with kernel 6.8.0-44-generic.

[356601.173313] workqueue: psi_avgs_work hogged CPU for >10000us 2048 times, consider switching to WQ_UNBOUND
[358332.263215] workqueue: process_srcu hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[358997.675712] workqueue: kfree_rcu_work hogged CPU for >10000us 128 times, consider switching to WQ_UNBOUND

root@nike:~# ping www.google.com
PING www.google.com (142.250.189.196) 56(84) bytes of data.
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=1 ttl=119 time=1.25 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=2 ttl=119 time=1.17 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=3 ttl=119 time=51.3 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=4 ttl=119 time=11.3 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=5 ttl=119 time=1.18 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=7 ttl=119 time=105 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=8 ttl=119 time=85.4 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=9 ttl=119 time=1.24 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=10 ttl=119 time=6.30 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=11 ttl=119 time=2.14 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=12 ttl=119 time=1.28 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=13 ttl=119 time=39.3 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=14 ttl=119 time=45.8 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=15 ttl=119 time=1.20 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=16 ttl=119 time=47.6 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=17 ttl=119 time=37.1 ms
64 bytes from sfo03s25-in-f4.1e100.net (142.250.189.196): icmp_seq=18 ttl=119 time=12.2 ms

Revision history for this message
David Leibovic (dleibovic) wrote :

BACKGROUND
-------------
We were also seeing these log lines on some of our web servers at Etsy:

workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND

Kernel version: Linux 6.8.0-1019-gcp #21~22.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

When this happens, we see it correlated with elevated system CPU: https://i.imgur.com/yi6tuor.png

In htop, we see kernel workers using lots of CPU in the function inode_switch_wbs: https://i.imgur.com/ftajwgb.png

Flame graphs from perf recordings confirm the same: https://i.imgur.com/s2nswZR.png
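
For anyone who wants to capture a similar profile, a rough sketch of the perf commands (the sampling rate, duration, and the FlameGraph scripts from https://github.com/brendangregg/FlameGraph are our choices here, not something prescribed by this bug):

# Sample all CPUs with call graphs for 30 seconds while the spike is happening
sudo perf record -F 99 -a -g -- sleep 30

# Quick look in the terminal; inode_switch_wbs* frames should dominate the kworkers
sudo perf report --stdio

# Or turn the stacks into a flame graph with the FlameGraph scripts
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > inode_switch_wbs.svg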

DIGGING IN
-------------
We see this elevated system CPU usage 30 seconds to 5 minutes after writing thousands of small files to disk (part of our web server bootstrapping process). We believe the elevated system CPU is caused by the function inode_switch_wbs_work_fn ( https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L490 ), which is responsible for moving an inode from one cgroup to another in the context of writeback.

Our understanding is: in the context of cgroups, page cache writeback is tracked at the inode level. A given inode is assigned to whichever cgroup contains the process that is responsible for the majority of writes to the inode’s file. If a new process starts writing a lot to a file, the file’s inode may switch to the new process’s cgroup. Likewise, if a process managed by systemd is terminated, systemd will remove the process’s cgroup, at which point any inodes assigned to the process’s cgroup will be moved to the parent cgroup.

If we have a oneshot systemd service that writes thousands of files to disk, initially the inodes corresponding to those files are assigned to the oneshot service's cgroup: https://i.imgur.com/elGeb51.png

After the oneshot systemd service exits, its cgroup is removed, and inodes are moved to the parent cgroup: https://i.imgur.com/RuL7e7R.png
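
As a quick way to watch that lifecycle for yourself, something along these lines should do (the transient unit name matched by the glob is whatever systemd-run creates; nothing here is specific to our environment):

# Start a short-lived oneshot writer; systemd-run prints the transient unit name it created
sudo systemd-run --property=Type=oneshot bash -c 'dd if=/dev/urandom of=/var/tmp/wbtest bs=1M count=512'

# While it runs, the unit has its own cgroup under system.slice...
ls -d /sys/fs/cgroup/system.slice/run-*.service

# ...and once it exits, systemd removes that cgroup, after which any inodes it
# wrote have to be switched to the parent cgroup's writeback structure.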

MINIMAL REPRODUCTION
---------------------
sudo mkdir -p /var/random-files && sudo systemd-run --property=Type=oneshot bash -c 'dd if=/dev/urandom bs=1024 count=400000 | split -a 16 -b 1k - /var/random-files/file.'

This command creates 400,000 files, each consisting of 1,024 random bytes. The files have names like /var/random-files/file.aaaaaaaaaaaaaaaa and /var/random-files/file.aaaaaaaaaaaaaaab. This command is run as a systemd oneshot service. Within anywhere from 30 seconds to 5 minutes after this command finishes, we see a spike in system CPU. Viewing htop will confirm this (press shift + k to show kernel threads in htop) – we see kernel workers using lots of CPU in the function inode_switch_wbs.
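
A simple way to watch for the effect after the repro finishes (nothing here is specific to our setup):

# Follow the kernel log and wait for the workqueue warning to show up
journalctl -kf | grep --line-buffered 'inode_switch_wbs_work_fn hogged CPU'

# In another terminal, watch for kworkers burning CPU (htop with Shift+K also works)
top -b -d 5 -n 60 | grep kworker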

DEBUGGING
-------------
Using bpftrace, we can get more debugging information with a tracepoint:

sudo bpftrace -e '
tracepoint:writeback:inode_switch_wbs {
  printf(
    "[%s] inode is switching! inode: %d old cgroup: %d new cgroup: %d\n",
    strftime("%H:%M:%S", nsecs),
    args->ino,
    args->old_cgroup_ino,
    args->new_cgroup_ino
  );
}'

When running our minimal repro...

Read more...

Revision history for this message
gerald.yang (gerald-yang-tw) wrote :

I suspect the issue is caused by this commit:

commit c22d70a162d3cc177282c4487be4d54876ca55c8
Author: Roman Gushchin <email address hidden>
Date: Mon Jun 28 19:36:03 2021 -0700

    writeback, cgroup: release dying cgwbs by switching attached inodes

It introduces a work item, cleanup_offline_cgwbs_work, which leverages inode_switch_wbs_work_fn to keep moving inodes from dying cgwbs to their parents, so that more memory can be reclaimed from wb structures.

Moving inodes off a wb requires taking its lock (wb->list_lock), which also needs to be taken in inode_switch_wbs; that could be where the contention happens.
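
If anyone wants to check the contention theory directly, here is a rough bpftrace sketch (it assumes a kernel new enough to expose the lock:contention_begin tracepoint, roughly 5.19+; the map name is arbitrary):

sudo bpftrace -e '
tracepoint:lock:contention_begin
/strncmp(comm, "kworker/", 8) == 0/
{
  // Count kernel stacks where a kworker starts waiting on a contended lock.
  // Stacks containing inode_switch_wbs or inode_switch_wbs_work_fn would point
  // at wb->list_lock being the hot spot.
  @stacks[kstack] = count();
}'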

I built a test kernel based on the gcp kernel (6.8.0-1019) that disables this work item, for testing only:

- queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);
+ //queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);

https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/400986-debug

If you would like to help with testing, please add the above test PPA and install the following packages (a sketch of the apt commands follows the list):
linux-gcp-6.8-headers-6.8.0-1019
linux-gcp-6.8-tools-6.8.0-1019
linux-headers-6.8.0-1019-gcp
linux-image-unsigned-6.8.0-1019-gcp
linux-modules-6.8.0-1019-gcp
linux-modules-extra-6.8.0-1019-gcp
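
The apt commands would look roughly like this (the PPA name comes from the URL above; the exact package versions are whatever the PPA currently publishes):

sudo add-apt-repository ppa:gerald-yang-tw/400986-debug
sudo apt update
sudo apt install linux-image-unsigned-6.8.0-1019-gcp linux-modules-6.8.0-1019-gcp \
    linux-modules-extra-6.8.0-1019-gcp linux-headers-6.8.0-1019-gcp \
    linux-gcp-6.8-headers-6.8.0-1019 linux-gcp-6.8-tools-6.8.0-1019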

Please note:
This test kernel is only meant to confirm whether this work item is the cause of the issue; it is not a fix.

Thanks,
Gerald

Revision history for this message
David Leibovic (dleibovic) wrote :

Thanks for sharing all of this, Gerald. I confirm that I am unable to reproduce the issue after installing your test kernel. In case anyone is interested, I wrote a blog post about our debugging journey on this issue: https://dasl.cc/2025/01/01/debugging-our-new-linux-kernel/

Revision history for this message
gerald.yang (gerald-yang-tw) wrote :

Thanks David for the feedback on the test kernel and the awesome blog post!

Trent Lloyd (lathiat)
Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Benjamin Wheeler (benjaminwheeler)
Changed in linux-oem-6.5 (Ubuntu):
assignee: nobody → Benjamin Wheeler (benjaminwheeler)
Changed in linux-oem-6.5 (Ubuntu):
status: Confirmed → New
assignee: Benjamin Wheeler (benjaminwheeler) → nobody
Changed in linux (Ubuntu):
importance: Undecided → Low
Revision history for this message
Benjamin Wheeler (benjaminwheeler) wrote :

It sounds like there are appropriate workarounds, and later kernels seem to improve this issue. For now, I will mark this as fixed, and we can reopen later if anyone is still having issues. Thanks all!

Changed in linux (Ubuntu):
importance: Low → Medium
status: Confirmed → Fix Released
Revision history for this message
David Leibovic (dleibovic) wrote :

Interesting, are you able to share which kernel versions include the fixes for this issue?
