Ok, here is what I found so far:
the problem lies in the 'thunderx_zip' driver, which is the driver for the hw accelerated zip compressor / decompressor ip block - it kicks in once we select the deflate method for the zram device.
How to reproduce it:
# modprobe zram
# echo 1 > /sys/block/zram0/reset
# echo deflate > /sys/block/zram0/comp_algorithm
# echo 128M > /sys/block/zram0/disksize
# mkfs.ext4 -F /dev/zram0
[stuck forever here]
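Since mkfs never returns, it can be handy to run the reproducer with a deadline so the hang gets reported instead of blocking the terminal. The following is only a sketch: the function names zram_configure and run_with_deadline and the 60 second deadline are made up for illustration, not something the driver or zram provide. Note that the helper only stops waiting; it does not try to kill the stuck process, which may well not be killable if it sits in uninterruptible sleep (I'm assuming that, not asserting it).

```shell
#!/bin/sh
# Sketch of the reproducer with a deadline around mkfs.
# Function names are hypothetical helpers, not existing tools.

# Write the zram settings.
#   $1: sysfs directory of the device (normally /sys/block/zram0)
#   $2: compression algorithm, $3: disk size
zram_configure() {
    echo 1    > "$1/reset" &&
    echo "$2" > "$1/comp_algorithm" &&
    echo "$3" > "$1/disksize"
}

# Run a command in the background and stop waiting after $1 seconds.
# Returns the command's exit status, or 124 if it is still running.
run_with_deadline() {
    deadline=$1
    shift
    "$@" &
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "still running after ${deadline}s (pid $pid): $*" >&2
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid"
}

# Usage (as root, on the affected board):
#   modprobe zram
#   zram_configure /sys/block/zram0 deflate 128M
#   run_with_deadline 60 mkfs.ext4 -F /dev/zram0
```

An exit status of 124 from the last command would then mean "mkfs is still stuck", while 0 would mean the filesystem was created normally.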
Two trivial workarounds:
-blacklist the thunderx_zip kmod:
# rmmod thunderx_zip
# echo 'blacklist thunderx_zip' >> /etc/modprobe.d/blacklist.conf
or
-disable the CRYPTO_DEV_CAVIUM_ZIP kconfig (and avoid building the thunderx_zip kmod) and recompile
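To double check that the blacklist workaround is actually in place before retrying the reproducer, something along these lines could be used. It is a rough sketch: module_loaded and module_blacklisted are hypothetical helper names, and the file and directory are parameters (defaulting to the usual /proc/modules and /etc/modprobe.d) so the checks can also be pointed at copies.

```shell
#!/bin/sh
# Hypothetical helpers to confirm the blacklist workaround took effect.

# Succeeds if module $1 is listed in $2 (default: /proc/modules).
# Each line of /proc/modules starts with the module name and a space.
module_loaded() {
    grep -q "^${1} " "${2:-/proc/modules}"
}

# Succeeds if a *.conf file under $2 (default: /etc/modprobe.d)
# contains a "blacklist <module>" line for module $1.
module_blacklisted() {
    grep -qs "^[[:space:]]*blacklist[[:space:]][[:space:]]*${1}[[:space:]]*\$" \
        "${2:-/etc/modprobe.d}"/*.conf
}

# Usage:
#   module_loaded thunderx_zip      && echo "still loaded"
#   module_blacklisted thunderx_zip || echo "no blacklist entry"
```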
As to what is causing it: until yesterday it looked like the problem was connected to the arm64 kpti patchset, but now I'm sure it is not.
The problem started to appear in 4.13.0-37-generic #42, while 4.13.0-36-generic #40 was immune, and the only difference between those two kernels is the arm64 kpti patchset.
You can easily reproduce it in the upstream stable/linux-4.14.y tree too (4.14.26 is affected for example).
$ make defconfig
$ echo "CONFIG_CRYPTO_DEV_CAVIUM_ZIP=m" >> .config
$ make oldconfig
build and install as usual, and then try the reproducer above.
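One thing worth checking before building is that the option survived 'make oldconfig', since kconfig drops an appended symbol when its dependencies are not met. A small sketch for that check; kconfig_value is a hypothetical helper name, and it simply reads the .config in the current directory unless told otherwise:

```shell
#!/bin/sh
# Hypothetical helper: print the value of kconfig option $1 from the
# file $2 (default: .config). Prints nothing and returns non-zero if
# the option is not set there.
kconfig_value() {
    sed -n "s/^${1}=//p" "${2:-.config}" | grep .
}

# Usage, from the top of the kernel tree after 'make oldconfig':
#   kconfig_value CONFIG_CRYPTO_DEV_CAVIUM_ZIP
# If this prints 'm', the thunderx_zip module will be built.
```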
But what I found this morning is that even the original 4.14.0 release is affected, and that release clearly doesn't contain the kpti patches.
Now what I want to try is:
1) test it on different hardware (one thing that I noticed is that whether the thunderx_zip kmod is loaded at boot or later in the board life cycle slightly changes the error, and that smells a lot like memory corruption)
2) test it with 4.15.x and 4.16
I'll write another update when I have more data.