[SRU] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU

Bug #2008745 reported by Michael Reed
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
In Progress
Medium
Michael Reed
Jammy
Fix Released
Medium
Michael Reed
Kinetic
Won't Fix
Medium
Michael Reed
Lunar
Fix Released
Medium
Michael Reed

Bug Description

[Impact]

Currently Ubuntu kernel has this kernel config disabled.
But in some cases, Intel's Sapphire Rapids High Bandwith
Memory (SPR-HBM) needs this option.

Memory bandwidth has been a bottleneck of increasingly memory bound
workloads. Sapphire Rapids plus HBM is specifically targeted to
cater to these workloads, traditionally served using overprovisioning
of memory devices.

Please search the keyword "fake numa" in https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Enabling-High-Bandwidth-Memory-for-HPC-and-AI-Applications-for/post/1335100

[Fix]

Enable CONFIG_NUMA_EMU in our kernel config for 5.15 and later

[Test Plan]
 Use "STREAM-triadd" algorithm* in Intel MLC** to benchmark 3 scenarios (no fake NUMA, 2U fake NUMA and 4U fake NUMA).

* https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-memory-bandwidth-on-stream-triad.html
** https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html

An improvement to performance on a Sapphire Rapids CPU with HBM should be observed

[Where problems could occur]

The regression risk is low

[Other Info]
Jammy
https://code.launchpad.net/~mreed8855/ubuntu/+source/linux/+git/jammy/+ref/lp_2008745_config_numa_emu_2

Kinetic
https://code.launchpad.net/~mreed8855/ubuntu/+source/linux/+git/kinetic/+ref/lp_2008745_config_numa_emu_kinetic

Lunar
https://code.launchpad.net/~mreed8855/ubuntu/+source/linux/+git/lunar/+ref/config_numa_emu_lunar_2

Michael Reed (mreed8855)
Changed in linux (Ubuntu):
assignee: nobody → Michael Reed (mreed8855)
importance: Undecided → Medium
description: updated
summary: - Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU
+ [SRU] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU
Michael Reed (mreed8855)
description: updated
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2008745

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Jammy):
status: New → Incomplete
Jeff Lane  (bladernr)
description: updated
Revision history for this message
Michael Reed (mreed8855) wrote :
Revision history for this message
Michael Reed (mreed8855) wrote :
description: updated
Changed in linux (Ubuntu Kinetic):
assignee: nobody → Michael Reed (mreed8855)
Changed in linux (Ubuntu Jammy):
assignee: nobody → Michael Reed (mreed8855)
description: updated
Michael Reed (mreed8855)
Changed in linux (Ubuntu Kinetic):
importance: Undecided → Medium
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Michael Reed (mreed8855)
description: updated
Revision history for this message
Keng-Yu Lin (lexical) wrote :

As described in https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Enabling-High-Bandwidth-Memory-for-HPC-and-AI-Applications-for/post/1335100,
for Intel Sapphire Rapids CPUs with HBM, it is possible to utilize the HBM as cache (so called "2LM mode"). It works in the way to create N fake NUMA domains in a size aligned the HBM so that HBM can act like direct-mapped L4 cache.

Since this is not really a "bug" (but is an officially proposed use case from Intel), it's not likely to provide a test case. But from HPE's testing of the test kernel from comment 2, there is improvement to performance on a Sapphire Rapids CPU with HBM.

Changed in linux (Ubuntu Jammy):
status: Incomplete → Confirmed
Revision history for this message
Michael Reed (mreed8855) wrote :

Hi Keng-Yu,

In comment #4, you mentioned there was an improvement in the performance on the Sapphire Rapids CPU with HBM, how exactly did you quantify that?

Revision history for this message
Keng-Yu Lin (lexical) wrote :

@mreed8855:
  We used "STREAM-triadd" algorithm* in Intel MLC** to benchmark 3 scenarios (no fake NUMA, 2U fake NUMA and 4U fake NUMA).

* https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-memory-bandwidth-on-stream-triad.html
** https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html

Revision history for this message
Michael Reed (mreed8855) wrote :

Hi Keng-Yu,

Here is a test kernel for lunar. I had only included the source before and not the debs. Can you please test and verify the config_numa_emu option works and provides the performance boost.

https://people.canonical.com/~mreed/hpe/lp_2008745_config_numa_emu/lunar/

description: updated
Michael Reed (mreed8855)
description: updated
description: updated
description: updated
Changed in linux (Ubuntu Kinetic):
status: New → In Progress
Changed in linux (Ubuntu Lunar):
status: Incomplete → In Progress
Changed in linux (Ubuntu Jammy):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Michael Reed (mreed8855)
description: updated
Changed in linux (Ubuntu Kinetic):
status: In Progress → Won't Fix
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Lunar):
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-79.86 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux verification-needed-jammy
Revision history for this message
Keng-Yu Lin (lexical) wrote :

Kernel 5.15.0-79.86 from jammy-proposed is verified to fix this bug.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.2.0-27.28 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar' to 'verification-done-lunar'. If the problem still exists, change the tag 'verification-needed-lunar' to 'verification-failed-lunar'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux verification-needed-lunar
Revision history for this message
Keng-Yu Lin (lexical) wrote :

Kernel 6.2.0-27.28 in lunar-proposed is verified to fix the bug.

tags: added: verification-done-lunar
removed: verification-needed-lunar
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra-igx/5.15.0-1002.2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-tegra-igx verification-needed-jammy
removed: verification-done-jammy
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra-5.15/5.15.0-1016.16~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-nvidia-tegra-5.15 verification-needed-focal
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra/5.15.0-1016.16 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-tegra
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (44.9 KiB)

This bug was fixed in the package linux - 6.2.0-27.28

---------------
linux (6.2.0-27.28) lunar; urgency=medium

  * lunar/linux: 6.2.0-27.28 -proposed tracker (LP: #2026488)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] update annotations scripts

  * CVE-2023-2640 // CVE-2023-32629
    - Revert "UBUNTU: SAUCE: overlayfs: handle idmapped mounts in
      ovl_do_(set|remove)xattr"
    - Revert "UBUNTU: SAUCE: overlayfs: Skip permission checking for
      trusted.overlayfs.* xattrs"
    - SAUCE: overlayfs: default to userxattr when mounted from non initial user
      namespace

  * UNII-4 5.9G Band support request on 8852BE (LP: #2023952)
    - wifi: rtw89: 8851b: add 8851B basic chip_info
    - wifi: rtw89: introduce realtek ACPI DSM method
    - wifi: rtw89: regd: judge UNII-4 according to BIOS and chip
    - wifi: rtw89: support U-NII-4 channels on 5GHz band

  * Disable hv-kvp-daemon if /dev/vmbus/hv_kvp is not present (LP: #2024900)
    - [Packaging] disable hv-kvp-daemon if needed

  * A deadlock issue in scsi rescan task while resuming from S3 (LP: #2018566)
    - ata: libata-scsi: Avoid deadlock on rescan after device resume

  * [SRU] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU (LP: #2008745)
    - [Config] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU

  * Lunar update: v6.2.15 upstream stable release (LP: #2025067)
    - ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15
    - ASoC: Intel: soc-acpi: add table for Intel 'Rooks County' NUC M15
    - ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm
    - x86/hyperv: Block root partition functionality in a Confidential VM
    - ASoC: amd: yc: Add DMI entries to support Victus by HP Laptop 16-e1xxx
      (8A22)
    - iio: adc: palmas_gpadc: fix NULL dereference on rmmod
    - ASoC: Intel: bytcr_rt5640: Add quirk for the Acer Iconia One 7 B1-750
    - ASoC: da7213.c: add missing pm_runtime_disable()
    - net: wwan: t7xx: do not compile with -Werror
    - wifi: mt76: mt7921: Fix use-after-free in fw features query.
    - selftests mount: Fix mount_setattr_test builds failed
    - scsi: mpi3mr: Handle soft reset in progress fault code (0xF002)
    - net: sfp: add quirk enabling 2500Base-x for HG MXPD-483II
    - platform/x86: thinkpad_acpi: Add missing T14s Gen1 type to s2idle quirk list
    - wifi: ath11k: reduce the MHI timeout to 20s
    - tracing: Error if a trace event has an array for a __field()
    - asm-generic/io.h: suppress endianness warnings for readq() and writeq()
    - asm-generic/io.h: suppress endianness warnings for relaxed accessors
    - x86/cpu: Add model number for Intel Arrow Lake processor
    - wifi: mt76: mt7921e: Set memory space enable in PCI_COMMAND if unset
    - ASoC: amd: ps: update the acp clock source.
    - arm64: Always load shadow stack pointer directly from the task struct
    - arm64: Stash shadow stack pointer in the task struct on interrupt
    - powerpc/boot: Fix boot wrapper code generation with CONFIG_POWER10_CPU
    - PCI: kirin: Select REGMAP_MMIO
    - PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock
    - PC...

Changed in linux (Ubuntu Lunar):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (83.7 KiB)

This bug was fixed in the package linux - 5.15.0-79.86

---------------
linux (5.15.0-79.86) jammy; urgency=medium

  * jammy/linux: 5.15.0-79.86 -proposed tracker (LP: #2026531)

  * Jammy update: v5.15.111 upstream stable release (LP: #2025095)
    - ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15
    - ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm
    - x86/hyperv: Block root partition functionality in a Confidential VM
    - iio: adc: palmas_gpadc: fix NULL dereference on rmmod
    - ASoC: Intel: bytcr_rt5640: Add quirk for the Acer Iconia One 7 B1-750
    - selftests mount: Fix mount_setattr_test builds failed
    - asm-generic/io.h: suppress endianness warnings for readq() and writeq()
    - x86/cpu: Add model number for Intel Arrow Lake processor
    - wireguard: timers: cast enum limits members to int in prints
    - wifi: mt76: mt7921e: Set memory space enable in PCI_COMMAND if unset
    - arm64: Always load shadow stack pointer directly from the task struct
    - arm64: Stash shadow stack pointer in the task struct on interrupt
    - PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock
    - PCI: qcom: Fix the incorrect register usage in v2.7.0 config
    - IMA: allow/fix UML builds
    - USB: dwc3: fix runtime pm imbalance on probe errors
    - USB: dwc3: fix runtime pm imbalance on unbind
    - hwmon: (k10temp) Check range scale when CUR_TEMP register is read-write
    - hwmon: (adt7475) Use device_property APIs when configuring polarity
    - posix-cpu-timers: Implement the missing timer_wait_running callback
    - blk-mq: release crypto keyslot before reporting I/O complete
    - blk-crypto: make blk_crypto_evict_key() return void
    - blk-crypto: make blk_crypto_evict_key() more robust
    - ext4: use ext4_journal_start/stop for fast commit transactions
    - staging: iio: resolver: ads1210: fix config mode
    - tty: Prevent writing chars during tcsetattr TCSADRAIN/FLUSH
    - xhci: fix debugfs register accesses while suspended
    - tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
    - MIPS: fw: Allow firmware to pass a empty env
    - ipmi:ssif: Add send_retries increment
    - ipmi: fix SSIF not responding under certain cond.
    - kheaders: Use array declaration instead of char
    - wifi: mt76: add missing locking to protect against concurrent rx/status
      calls
    - pwm: meson: Fix axg ao mux parents
    - pwm: meson: Fix g12a ao clk81 name
    - soundwire: qcom: correct setting ignore bit on v1.5.1
    - pinctrl: qcom: lpass-lpi: set output value before enabling output
    - ring-buffer: Sync IRQ works before buffer destruction
    - crypto: api - Demote BUG_ON() in crypto_unregister_alg() to a WARN_ON()
    - crypto: safexcel - Cleanup ring IRQ workqueues on load failure
    - rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-
      ed
    - reiserfs: Add security prefix to xattr name in reiserfs_security_write()
    - KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted
    - relayfs: fix out-of-bounds access in relay_file_read
    - writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
 ...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.15.0-1044.49 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws' to 'verification-done-jammy-linux-aws'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws' to 'verification-failed-jammy-linux-aws'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-aws-v2 verification-needed-jammy-linux-aws
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-6.2/6.2.0-1011.11~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws-6.2' to 'verification-done-jammy-linux-aws-6.2'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws-6.2' to 'verification-failed-jammy-linux-aws-6.2'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-aws-6.2-v2 verification-needed-jammy-linux-aws-6.2
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/6.2.0-1011.11 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar-linux-azure' to 'verification-done-lunar-linux-azure'. If the problem still exists, change the tag 'verification-needed-lunar-linux-azure' to 'verification-failed-lunar-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-azure-v2 verification-needed-lunar-linux-azure
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-riscv/6.2.0-31.31.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar-linux-riscv' to 'verification-done-lunar-linux-riscv'. If the problem still exists, change the tag 'verification-needed-lunar-linux-riscv' to 'verification-failed-lunar-linux-riscv'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-riscv-v2 verification-needed-lunar-linux-riscv
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-starfive/6.2.0-1003.3 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar-linux-starfive' to 'verification-done-lunar-linux-starfive'. If the problem still exists, change the tag 'verification-needed-lunar-linux-starfive' to 'verification-failed-lunar-linux-starfive'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-starfive-v2 verification-needed-lunar-linux-starfive
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1046.53 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-azure' to 'verification-done-jammy-linux-azure'. If the problem still exists, change the tag 'verification-needed-jammy-linux-azure' to 'verification-failed-jammy-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure-v2 verification-needed-jammy-linux-azure
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-5.15/5.15.0-1046.51~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-done-focal-linux-aws-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-failed-focal-linux-aws-5.15'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-aws-5.15-v2 verification-needed-focal-linux-aws-5.15
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-xilinx-zynqmp/5.15.0-1024.28 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-done-jammy-linux-xilinx-zynqmp'. If the problem still exists, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-failed-jammy-linux-xilinx-zynqmp'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-xilinx-zynqmp-v2 verification-needed-jammy-linux-xilinx-zynqmp
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.