[SRU] Duplicate Device_dax ids Created and hence Probing is Failing.

Bug #2028158 reported by Gnanendra Kolla
This bug affects 1 person
Affects          Status                     Importance  Assigned to
linux (Ubuntu)   Status tracked in Mantic
  Jammy          In Progress                Medium      Michael Reed
  Lunar          Fix Released               Undecided   Unassigned
  Mantic         Fix Released               Undecided   Unassigned

Bug Description

[Impact]
Description of problem:

device_dax probe errors appear in dmesg when the HBM CPU is set to flat mode. Duplicate device_dax ids are created, and probing of the duplicates fails.

How reproducible:
Frequently

Version-Release
Release: 22.04.2, 22.10

[Test Case]

Steps to Reproduce:
1. Set the HBM CPU to flat mode in the BIOS memory settings.
2. Boot into the OS.
3. Perform an OS warm boot cycle test.
4. Observe the dax2.0/dax3.0/dax4.0/dax5.0 probe errors in dmesg.

Actual results:
device_dax related errors appear in dmesg: device dax creates dummy/duplicate devices, and probing them fails.

Expected results:
Dummy/duplicate devices should not be created.

[Fix]
Upstream Fix
https://lore.kernel<email address hidden>/T/

CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is enabled by default, but it causes a failure when
reconfiguring device dax memory, so it is being disabled:
Set CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n
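
The config check can be scripted. The following is a minimal sketch, assuming the usual kernel config file conventions (an enabled option appears as "OPTION=y", a disabled one as "# OPTION is not set"). The function name hotplug_default_online is hypothetical and only for illustration.

```python
def hotplug_default_online(config_text: str) -> bool:
    """Return True if the config text enables CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE."""
    option = "CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE"
    for raw in config_text.splitlines():
        line = raw.strip()
        # Enabled options appear as "OPTION=y"; disabled ones appear as
        # "# OPTION is not set", which this comparison does not match.
        if line == f"{option}=y":
            return True
    return False


# Sample inputs; the first matches the grep output quoted in this report.
enabled = "CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y\n"
disabled = "# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set\n"
print(hotplug_default_online(enabled))   # True
print(hotplug_default_online(disabled))  # False
```

To check a running kernel, pass in the contents of /boot/config-<release>, where <release> is the output of uname -r.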

[Where problems could occur]

[Other Info]
Additional info:
The SUT has 2 x 32C HBM CPUs. Only two devices (dax0.0 and dax1.0) should be eligible for conversion to system-ram mode, but the first "daxctl list -u" shows four devices (dax0.0, dax1.0, dax2.0, dax3.0). Two of them report "state":"disabled"; the other two report "mode":"devdax" and are the actual devices that can be converted from devdax to system-ram. After running reconfigure-device on dax0.0 and dax1.0, listing the devices again shows additional dummy/duplicate devices with "state":"disabled" (e.g. dax4.0, dax5.0).

root@ubuntu:/home/ubuntu# daxctl list -u
[
  {
    "chardev":"dax1.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3,
    "align":2097152,
    "mode":"devdax"---------------> HBM CPU 1, This we can change the devdax to
                                    system-ram
  },
  {
    "chardev":"dax2.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax3.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax0.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2,
    "align":2097152,
    "mode":"devdax" ---------------> HBM CPU 1, This we can change the devdax to
                                    system-ram
  }
]
root@ubuntu:/home/ubuntu# dmesg | grep -i error
[ 12.748884] device_dax: probe of dax2.0 failed with error -16
[ 12.748902] device_dax: probe of dax3.0 failed with error -16
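
The duplicates in the listing above can be picked out mechanically. This is a minimal sketch, assuming the daxctl JSON is captured without the hand-written arrow annotations, and using a heuristic drawn only from this report: a device marked "state":"disabled" that shares a target_node with another device is treated as a suspected duplicate. On other systems several dax devices may legitimately share a node, so the result is a hint rather than a diagnosis. The function name is hypothetical.

```python
import json
from collections import defaultdict


def find_disabled_duplicates(daxctl_json: str) -> list:
    """Return chardev names of disabled devices that share a target_node."""
    devices = json.loads(daxctl_json)
    by_node = defaultdict(list)
    for dev in devices:
        by_node[dev["target_node"]].append(dev)

    suspects = []
    for devs in by_node.values():
        if len(devs) > 1:
            # Only devices explicitly reported as disabled are flagged.
            suspects.extend(d["chardev"] for d in devs if d.get("state") == "disabled")
    return sorted(suspects)


# Sample data: the four devices from the listing in this report.
sample = """
[
  {"chardev": "dax1.0", "target_node": 3, "mode": "devdax"},
  {"chardev": "dax2.0", "target_node": 2, "mode": "devdax", "state": "disabled"},
  {"chardev": "dax3.0", "target_node": 3, "mode": "devdax", "state": "disabled"},
  {"chardev": "dax0.0", "target_node": 2, "mode": "devdax"}
]
"""
print(find_disabled_duplicates(sample))  # ['dax2.0', 'dax3.0']
```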

Results after reconfiguring the devices from devdax to system-ram:
-------------------------------------------------------------------
root@ubuntu:/home/ubuntu# daxctl reconfigure-device -m system-ram dax0.0 -u
{
  "chardev":"dax0.0",
  "size":"64.00 GiB (68.72 GB)",
  "target_node":2,
  "align":2097152,
  "mode":"system-ram",
  "online_memblocks":32,
  "total_memblocks":32,
  "movable":true
}
reconfigured 1 device
root@ubuntu:/home/ubuntu# daxctl reconfigure-device -m system-ram dax1.0 -u
{
  "chardev":"dax1.0",
  "size":"64.00 GiB (68.72 GB)",
  "target_node":3,
  "align":2097152,
  "mode":"system-ram",
  "online_memblocks":32,
  "total_memblocks":32,
  "movable":true
}
reconfigured 1 device
root@ubuntu:/home/ubuntu# daxctl list -u
[
  {
    "chardev":"dax4.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax1.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3,
    "align":2097152,
    "mode":"system-ram",-----------> Converted from devdax - system-ram
    "online_memblocks":32,
    "total_memblocks":32,
    "movable":true
  },
  {
    "chardev":"dax5.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax2.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax3.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3, --------------------> Duplicate device
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax0.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2,
    "align":2097152,
    "mode":"system-ram", -----------> Converted from devdax - system-ram
    "online_memblocks":32,
    "total_memblocks":32,
    "movable":true
  }
]

root@ubuntu:/home/ubuntu# dmesg | grep -i dax
[ 12.748880] device_dax dax2.0: mapping0: 0x2080000000-0x307fffffff could not reserve range
[ 12.748884] device_dax: probe of dax2.0 failed with error -16
[ 12.748901] device_dax dax3.0: mapping0: 0x5080000000-0x607fffffff could not reserve range
[ 12.748902] device_dax: probe of dax3.0 failed with error -16
[ 812.677056] device_dax dax4.0: mapping0: 0x2080000000-0x307fffffff could not reserve range
[ 812.677070] device_dax: probe of dax4.0 failed with error -16
[ 821.092762] device_dax dax5.0: mapping0: 0x5080000000-0x607fffffff could not reserve range
[ 821.092774] device_dax: probe of dax5.0 failed with error -16
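
The dmesg lines above can be summarized per device. This is a minimal sketch, assuming the two message formats shown in this report; the regular expressions and the summarize helper are illustrative, not part of any kernel or daxctl tooling. It maps error -16 to its errno name (EBUSY) and computes the size of the range that could not be reserved.

```python
import errno
import re

PROBE_RE = re.compile(r"device_dax: probe of (dax\d+\.\d+) failed with error (-?\d+)")
RANGE_RE = re.compile(
    r"device_dax (dax\d+\.\d+): mapping\d+: (0x[0-9a-f]+)-(0x[0-9a-f]+) could not reserve range"
)


def summarize(dmesg_text: str) -> dict:
    """Map each failed device to (errno name, (start, end) of its range or None)."""
    ranges, failures = {}, {}
    for line in dmesg_text.splitlines():
        m = RANGE_RE.search(line)
        if m:
            ranges[m.group(1)] = (int(m.group(2), 16), int(m.group(3), 16))
            continue
        m = PROBE_RE.search(line)
        if m:
            code = abs(int(m.group(2)))
            failures[m.group(1)] = errno.errorcode.get(code, str(code))
    return {dev: (name, ranges.get(dev)) for dev, name in failures.items()}


# Sample lines copied from the dmesg output in this report.
sample_dmesg = """\
[ 12.748880] device_dax dax2.0: mapping0: 0x2080000000-0x307fffffff could not reserve range
[ 12.748884] device_dax: probe of dax2.0 failed with error -16
[ 812.677056] device_dax dax4.0: mapping0: 0x2080000000-0x307fffffff could not reserve range
[ 812.677070] device_dax: probe of dax4.0 failed with error -16
"""
for dev, (name, rng) in summarize(sample_dmesg).items():
    size_gib = (rng[1] - rng[0] + 1) / 2**30
    print(dev, name, f"{size_gib:.0f} GiB")
```

In the sample, dax2.0 and dax4.0 fail on the same 64 GiB physical range, which matches the duplicate devices reported for target_node 2.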

information type: Public → Private
description: updated
Gnanendra Kolla (gnanendrakollaa) wrote:

Found an upstream kernel patch for the hmem duplicate dax_device creation. Applied on top of Ubuntu 22.04.2 and 22.10, it works as expected: no dax related errors were observed in dmesg and no duplicate dax devices were created.
https://lore.kernel<email address hidden>/T/

Please consider pulling the above patch into the Ubuntu release.

Test results after applying the patch:
-------------------------------
root@ubuntu:/home/ubuntu# daxctl list -u
[
  {
    "chardev":"dax1.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":3,
    "align":2097152,
    "mode":"devdax"
  },
  {
    "chardev":"dax0.0",
    "size":"64.00 GiB (68.72 GB)",
    "target_node":2,
    "align":2097152,
    "mode":"devdax"
  }
]

ubuntu@ubuntu:~$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0
0x0000000100000000-0x000000607fffffff 382G online yes 2-192

Memory block size: 2G
Total online memory: 384G
Total offline memory: 0B

description: updated
Gnanendra Kolla (gnanendrakollaa) wrote:

Observation:
=============
When reconfiguring dax devices, we see the error below about auto-onlining of hot-plugged memory:

root@ubuntu:/home/ubuntu# daxctl reconfigure-device -m system-ram dax0.0 -u
dax0.0: error: kernel policy will auto-online memory, aborting.
error reconfiguring devices: Device or resource busy
reconfigured 0 devices.

I checked the kernel config: "auto onlines hot plugged memory" was enabled and the policy was online. After I set it to offline and reconfigured the dax device again, the reconfiguration succeeded.

There may be udev rules that interfere with memory onlining. They may race to online memory into ZONE_NORMAL rather than movable.

Is this working as expected, or does it lead to a bug when we reconfigure device dax memory?
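
The runtime policy can be inspected before calling daxctl. This is a minimal sketch, assuming the policy described above is the one exposed in /sys/devices/system/memory/auto_online_blocks and that any value other than "offline" means hot-plugged memory will be onlined automatically. Both helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the runtime auto-online policy.
AUTO_ONLINE = Path("/sys/devices/system/memory/auto_online_blocks")


def auto_online_policy(path=AUTO_ONLINE) -> str:
    """Read the current policy string, e.g. 'online' or 'offline'."""
    return Path(path).read_text().strip()


def will_auto_online(policy: str) -> bool:
    """True if the kernel would online hot-plugged memory on its own."""
    return policy != "offline"


if AUTO_ONLINE.exists():
    policy = auto_online_policy()
    print(f"policy={policy} auto_online={will_auto_online(policy)}")
```

Setting the policy to offline requires root and amounts to writing the string "offline" to the same file. This is presumably what "made it offline" refers to above.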

Michael Reed (mreed8855) wrote :

# grep ONLINE /boot/config-$(uname -r)
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

Michael Reed (mreed8855) wrote :
Gnanendra Kolla (gnanendrakollaa) wrote :

Hi Michael,

Does the test kernel you provided include the fix from comment #2, or only the upstream patch? Our internal team has raised a JIT for the same issue.

Gnanendra Kolla (gnanendrakollaa) wrote :

Hello Michael,

Creation of duplicate 'hmem' (dax) devices is fixed in the provided test kernel.

When reconfiguring dax devices, we see the error below about auto-onlining of hot-plugged memory:

root@ubuntu:/home/ubuntu# daxctl reconfigure-device -m system-ram dax0.0 -u
dax0.0: error: kernel policy will auto-online memory, aborting.
error reconfiguring devices: Device or resource busy
reconfigured 0 devices.

# grep ONLINE /boot/config-$(uname -r)
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

Michael Reed (mreed8855) wrote :

I have created a new test kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE disabled:

https://people.canonical.com/~mreed/dell/lp_2028158_device_dax/2nd/

Gnanendra Kolla (gnanendrakollaa) wrote:

Hi Michael,

I have tested the test kernel linux-5.15.0-83-generic_5.15.0-83.92.

CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is disabled.

It works as expected; no issues observed.

Michael Reed (mreed8855) wrote:

This patch is already in Lunar and Mantic.

Changed in linux (Ubuntu Jammy):
status: New → In Progress
Changed in linux (Ubuntu Mantic):
status: New → Fix Released
Michael Reed (mreed8855)
Changed in linux (Ubuntu Lunar):
status: New → Fix Released
Michael Reed (mreed8855)
description: updated
Michael Reed (mreed8855)
summary: - Observed device_dax related probe errors in dmesg when HBM CPU is set to
- flat mode and creating duplicate device_dax ids and hence probe is
- failing.
+ [SRU] Duplicate device_dax ids created and hence probing is failing.
summary: - [SRU] Duplicate device_dax ids created and hence probing is failing.
+ [SRU] Duplicate Device_dax ids Created and hence Probing is Failing.
description: updated
information type: Private → Public
Changed in linux (Ubuntu Jammy):
assignee: nobody → Michael Reed (mreed8855)
importance: Undecided → Medium
Michael Reed (mreed8855)
description: updated
description: updated