2019-03-04 09:47:05 |
Christian Ehrhardt |
description |
Hi, Christian
We've recently encountered a weird issue with Ubuntu 18.04 on a Skylake
server. I can always reproduce the crash and I have narrowed it down. I suspect
it could be a GCC issue.
[1] How to reproduce
- ConnectX-4Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02.1
- Ubuntu 18.04 on Intel Skylake server
- gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
- Testpmd crashes when it starts to forward traffic. Easy to reproduce.
- Only happens on the Skylake server.
- DPDK 18.05 and later did not seem to have this issue at first, and git-bisect
gave no clue. This was because I had enabled MEMPOOL_DEBUG and MLX5_DEBUG. As
the mempool functions and rte_memcpy() are inline functions, they should be
affected. Now I can see the crash regardless of the version: 18.02, 18.05 and
18.08.
[2] Failure point
The attached patch gives some insight into why it crashes. The following is the
output of the patch and of the GDB commands it prints.
In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(),
rte_memcpy() is used to move the array of objects to the lcore cache. If I run
memcmp() right after rte_memcpy(dst, src, n), the data in dst differs from the
data in src. It looks like some of the data got shifted by a few bytes, as you
can see below.
[GDB command]
$dst = 0x7ffff4e09ea8
$src = 0x7fffce3fb970
$n = 256
x/32gx 0x7ffff4e09ea8
x/32gx 0x7fffce3fb970
testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140: __mempool_generic_put: Assertion `0' failed.
Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffce3ff700 (LWP 69913)]
(gdb) x/32gx 0x7ffff4e09ea8
0x7ffff4e09ea8: 0x00007fffaac38ec0 0x00007fffaac38500
0x7ffff4e09eb8: 0x00007fffaac37b40 0x00007fffaac37180
0x7ffff4e09ec8: 0x850000007fffaac3 0x7b4000007fffaac3
0x7ffff4e09ed8: 0x00007fffaac35440 0x00007fffaac34a80
0x7ffff4e09ee8: 0xaac3850000007fff 0xaac37b4000007fff
0x7ffff4e09ef8: 0x00007fffaac32d40 0x00007fffaac32380
0x7ffff4e09f08: 0x7fffaac385000000 0x7fffaac37b400000
0x7ffff4e09f18: 0x00007fffaac30640 0x00007fffaac2fc80
0x7ffff4e09f28: 0x00007fffaac2f2c0 0x00007fffaac2e900
0x7ffff4e09f38: 0x00007fffaac2df40 0x00007fffaac2d580
0x7ffff4e09f48: 0x00007fffaac2cbc0 0x00007fffaac2c200
0x7ffff4e09f58: 0x00007fffaac2b840 0x00007fffaac2ae80
0x7ffff4e09f68: 0x00007fffaac2a4c0 0x00007fffaac29b00
0x7ffff4e09f78: 0x00007fffaac29140 0x00007fffaac28780
0x7ffff4e09f88: 0x00007fffaac27dc0 0x00007fffaac27400
0x7ffff4e09f98: 0x00007fffaac26a40 0x00007fffaac26080
(gdb) x/32gx 0x7fffce3fb970
0x7fffce3fb970: 0x00007fffaac38ec0 0x00007fffaac38500
0x7fffce3fb980: 0x00007fffaac37b40 0x00007fffaac37180
0x7fffce3fb990: 0x00007fffaac367c0 0x00007fffaac35e00
0x7fffce3fb9a0: 0x00007fffaac35440 0x00007fffaac34a80
0x7fffce3fb9b0: 0x00007fffaac340c0 0x00007fffaac33700
0x7fffce3fb9c0: 0x00007fffaac32d40 0x00007fffaac32380
0x7fffce3fb9d0: 0x00007fffaac319c0 0x00007fffaac31000
0x7fffce3fb9e0: 0x00007fffaac30640 0x00007fffaac2fc80
0x7fffce3fb9f0: 0x00007fffaac2f2c0 0x00007fffaac2e900
0x7fffce3fba00: 0x00007fffaac2df40 0x00007fffaac2d580
0x7fffce3fba10: 0x00007fffaac2cbc0 0x00007fffaac2c200
0x7fffce3fba20: 0x00007fffaac2b840 0x00007fffaac2ae80
0x7fffce3fba30: 0x00007fffaac2a4c0 0x00007fffaac29b00
0x7fffce3fba40: 0x00007fffaac29140 0x00007fffaac28780
0x7fffce3fba50: 0x00007fffaac27dc0 0x00007fffaac27400
0x7fffce3fba60: 0x00007fffaac26a40 0x00007fffaac26080
AFAIK, AVX512F support is disabled by default in DPDK as it is still
experimental (CONFIG_RTE_ENABLE_AVX512=n). But with gcc optimization, the AVX2
version of rte_memcpy() seems to be compiled to 512-bit instructions. If I
disable that by adding EXTRA_CFLAGS="-mno-avx512f", it works fine and doesn't
crash.
Do you have any idea regarding this issue or are you already aware of it?
Thanks,
Yongseok
$ git diff
diff --git a/config/common_base b/config/common_base
index ad03cf433..f512b5a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
#
# Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
#
-CONFIG_RTE_LIBRTE_MLX5_PMD=n
-CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
+CONFIG_RTE_LIBRTE_MLX5_PMD=y
+CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8
@@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
#
CONFIG_RTE_LIBRTE_MEMPOOL=y
CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
-CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
+CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y
#
# Compile Mempool drivers
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..9f48028d9 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -39,6 +39,7 @@
#include <errno.h>
#include <inttypes.h>
#include <sys/queue.h>
+#include <assert.h>
#include <rte_config.h>
#include <rte_spinlock.h>
@@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
/* Add elements back into the cache */
rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+ if(memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
+ printf("[GDB command] \n"
+ "$dst = %p\n"
+ "$src = %p\n"
+ "$n = %ld\n"
+ "x/%ldgx %p\n"
+ "x/%ldgx %p\n",
+ (void *)&cache_objs[0],
+ (const void *)obj_table,
+ sizeof(void *) * n,
+ sizeof(void *) * n / 8, (void *)&cache_objs[0],
+ sizeof(void *) * n / 8, (const void *)obj_table
+ );
+ assert(0);
+ }
+
cache->len += n;
if (cache->len >= cache->flushthresh) { |
[Impact]
* Crashes on certain Skylake chips
* Follow upstream in disabling one of the gcc options
[Test Case]
* Part of the MRE bug 1817675, following the MRE verification process as
defined there.
[Regression Potential]
* Rebuilds with the new code using DPDK headers will be slightly slower
(not using the feature) but will avoid the crash. The slowdown should
be negligible in most cases, and avoiding the crash outweighs it.
[Other Info]
* n/a
|
|