performance drop with ATS enabled
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | High | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | High | Joseph Salisbury |
Bionic | Fix Released | High | Joseph Salisbury |
Bug Description
== Comment: #0 - Michael Ranweiler <email address hidden> - 2018-08-16 09:58:02 ==
Witherspoon cluster now has ATS enabled with driver 396.42, CUDA version 9.2.148. They are running the CORAL benchmark LULESH with and without ATS, and they see a significant performance drop with ATS enabled.
========
Below is the run with ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12
Elapsed time = 153.00 (s)
Grind time (us/z/c) = 0.37352393 (per dom) (0.046690491 overall)
FOM = 21417.637 (z/s)
========
Here is the run without ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12
Elapsed time = 13.27 (s)
Grind time (us/z/c) = 0.032394027 (per dom) (0.0040492534 overall)
FOM = 246959.11 (z/s)
========
Using ATS on a single node slows down the OpenACC version more than 10 times, and for the version with OpenMP 4.5 and managed memory, they observe a 2x slowdown.
Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):
We think we have found where the issue is.
This behavior reproduces for any two concurrent processes that create CUDA contexts on the GPUs and heavily unmap memory (no need to launch any work on the GPUs). When the problem reproduces, perf shows that most of the time is spent in mmio_invalidate. However, this only happens when processes register GPUs attached to the same NPU. Thus, if process A initializes GPU 0 and/or 1, and process B initializes GPU 2 and/or 3, we don't see the slowdown. This makes sense, because ATSDs on different NPUs are issued independently.
After some code inspection in npu-dma.c (powerpc backend in the Linux kernel), Mark noticed that the problem could be in the utilization of test_and_
#define DEFINE_TESTOP(fn, op, prefix, postfix, eh) \
static __inline__ unsigned long fn( \
unsigned long mask, \
volatile unsigned long *_p) \
{ \
unsigned long old, t; \
unsigned long *p = (unsigned long *)_p; \
__asm__ __volatile__ ( \
prefix \
"1:" PPC_LLARX(
stringify_
PPC405_
PPC_STLCX "%1,0,%3\n" \
"bne- 1b\n" \
postfix \
: "=&r" (old), "=&r" (t) \
: "r" (mask), "r" (p) \
: "cc", "memory"); \
return (old & mask); \
}
According to the PowerPC manual, ldarx creates a memory reservation, and a subsequent stdcx. instruction from the same processor completes an atomic read-modify-write operation only if that reservation is still held. However, the reservation can be lost if a different processor executes any store instruction on the same address. That is why "bne- 1b" checks whether the store-conditional was successful and jumps back to retry if it was not. Since DEFINE_TESTOP doesn't implement any back-off mechanism, two different processors trying to get an ATSD register can starve each other.
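The retry behavior described here can be approximated in user space with C11 atomics. The following is only a sketch under stated assumptions: `test_and_set_bit_sketch` is a hypothetical name, not a kernel function, and the weak compare-exchange stands in for the load-reserve/store-conditional pair. Like the macro, it retries with no back-off, and it performs a store even when the bit is already set, which is the kind of store that can cancel another processor's reservation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space analogue of a test-and-set bit operation.
 * The weak compare-exchange may fail when another thread modifies *word
 * (or spuriously), which mirrors a lost load-reserve reservation. */
static bool test_and_set_bit_sketch(unsigned bit, _Atomic unsigned long *word)
{
    assert(bit < sizeof(unsigned long) * 8);
    unsigned long mask = 1UL << bit;
    unsigned long old = atomic_load_explicit(word, memory_order_relaxed);

    /* Retry until the store succeeds; there is no back-off between tries. */
    while (!atomic_compare_exchange_weak_explicit(word, &old, old | mask,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        /* 'old' now holds the freshly observed value; loop and try again. */
    }
    return (old & mask) != 0;   /* true if the bit was already set */
}
```

Calling it twice on the same bit of a zeroed word returns false the first time (the caller acquired the bit) and true the second time (the bit was already taken).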
Mark compiled a custom kernel which surrounds the calls to test_and_
ATS OFF
Elapsed time = 16.87 (s)
ATS ON
Elapsed time = 215.56 (s)
ATS ON + Spinlock
Elapsed time = 18.14 (s)
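To put the three elapsed times in proportion, the slowdown relative to the ATS-off baseline can be computed directly. The helper below is a trivial sketch; `slowdown` is a name introduced here for illustration only.

```c
#include <assert.h>

/* Ratio of an elapsed time to a baseline elapsed time (both in seconds). */
static double slowdown(double elapsed_s, double baseline_s)
{
    assert(baseline_s > 0.0);
    return elapsed_s / baseline_s;
}
```

With the figures above, `slowdown(215.56, 16.87)` is roughly 12.8 for ATS on, while `slowdown(18.14, 16.87)` is roughly 1.08 for ATS on with the spinlock, so serializing the calls removes almost all of the penalty in this test.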
Fixed with the following patch in the powerpc tree:
https:/
== Comment: #1 - Michael Ranweiler <email address hidden> - 2018-08-20 14:56:52 ==
This is now in mainline, too:
https:/
It has some small fuzz to apply to 4.15.0.32-35:
diff --git a/arch/
index 6c8e168e6571.
--- a/arch/
+++ b/arch/
@@ -434,8 +434,9 @@ static int get_mmio_
int i;
for (i = 0; i < npu->mmio_
- if (!test_
- return i;
+ if (!test_bit(i, &npu->mmio_
+ if (!test_
+ return i;
}
return -ENOSPC;
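The idea visible in the diff, a plain read-only test before the atomic test-and-set, can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel code: `get_free_slot_sketch`, `put_slot_sketch`, and the single-word `usage` bitmap are hypothetical names, and the memory orderings are a plausible choice rather than a statement of what the kernel primitives provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical slot allocator: returns the index of a free slot and marks
 * it busy, or -ENOSPC when every slot is taken. */
static int get_free_slot_sketch(_Atomic unsigned long *usage, unsigned nslots)
{
    assert(nslots <= sizeof(unsigned long) * 8);
    for (unsigned i = 0; i < nslots; i++) {
        unsigned long mask = 1UL << i;

        /* Cheap read-only test first: busy slots are skipped without a store. */
        if (atomic_load_explicit(usage, memory_order_relaxed) & mask)
            continue;

        /* The slot looked free; only now attempt the atomic read-modify-write. */
        unsigned long old = atomic_fetch_or_explicit(usage, mask,
                                                     memory_order_acquire);
        if (!(old & mask))
            return (int)i;
    }
    return -ENOSPC;
}

/* Releases a slot previously returned by get_free_slot_sketch(). */
static void put_slot_sketch(_Atomic unsigned long *usage, unsigned slot)
{
    atomic_fetch_and_explicit(usage, ~(1UL << slot), memory_order_release);
}
```

Because busy slots are rejected by a load alone, threads scanning the bitmap issue far fewer stores to the shared word, which reduces the chance of invalidating each other's in-flight atomic updates.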
tags: | added: architecture-ppc64le bugnameltc-170624 severity-high targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → linux (Ubuntu) |
Changed in ubuntu-power-systems: | |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
tags: | added: triage-g |
Changed in ubuntu-power-systems: | |
importance: | Undecided → High |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
tags: | added: kernel-da-key |
Changed in linux (Ubuntu): | |
status: | New → Triaged |
Changed in linux (Ubuntu Bionic): | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury) |
Changed in ubuntu-power-systems: | |
status: | New → In Progress |
Changed in linux (Ubuntu): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | In Progress → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
tags: | added: cscc |
tags: | added: targetmilestone-inin18042 removed: targetmilestone-inin--- |
I built a test kernel with commit 9eab9901b015. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1788097
Can you test this kernel and see if it resolves this bug?
Note about installing test kernels:
* If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
* If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
Thanks in advance!