hio: SSD data corruption under stress test
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Kamal Mostafa |
Xenial | Fix Released | High | Kamal Mostafa |
Yakkety | Fix Released | High | Kamal Mostafa |
Zesty | Fix Released | High | Kamal Mostafa |
Bug Description
{forward from James Troup}:
Just to follow up on this with a little more information, we have now
reproduced this in the following scenarios:
* Ubuntu kernel 4.4 (i.e. 16.04) and kernel 4.8 (i.e. HWE-Y)
* With and without Bcache involved
* With both XFS and ext4
* With HIO driver versions 2.1.0-23 and 2.1.0-25
* With HIO Firmware 640 and 650
* With and without the following two patches
- https:/
- https:/
In all cases, we applied the following two patches in order to get hio
to build at all with a 4.4 or later kernel:
https:/
https:/
We've confirmed that we can reproduce the corruption on any machine in
Tele2's Vienna facility.
We've confirmed that, other than 1 machine, the 'hio_info' command
says the health is 'OK'.
Our most common reproducer is one of two scenarios:
a) http://
b) http://
In the last example, it's possible to see corruption faster by
increasing the 'count' argument to dd and avoid it by lowering it.
e.g. on the machine I'm currently testing on count=52450 doesn't
appear to show corruption, but a count of even 53000 would show it
immediately every time.
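The reproducer scripts themselves sit behind the links above and are not reproduced here. As a rough illustration of the kind of write-then-verify check that exposes this sort of corruption, the Python sketch below writes a deterministic pattern, syncs it, reads it back and compares SHA-256 digests. The function name `write_and_verify`, the block size and the target path are all assumptions made for illustration; this is not the dd-based reproducer referenced above.

```python
import hashlib
import os


def write_and_verify(path, count, block_size=1024 * 1024):
    """Write `count` blocks of a deterministic pattern to `path`, then
    read the file back and compare SHA-256 digests.

    Returns True when the data read back matches what was written.
    Hypothetical helper: name, parameters and pattern are illustrative.
    """
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for i in range(count):
            # Derive each block from its index so that a misplaced or
            # stale block changes the overall digest.
            block = hashlib.sha256(str(i).encode()).digest() * (block_size // 32)
            f.write(block)
            written.update(block)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            read_back.update(chunk)

    return written.hexdigest() == read_back.hexdigest()
```

Note that a read immediately after a write may be served from the page cache rather than from the SSD, so a real test would also need to drop caches or use direct I/O before reading back. Varying `count` plays the same role here as the 'count' argument to dd described above.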
I hope this helps - please let us know what further information we can
provide to debug this problem.
Changed in linux (Ubuntu Yakkety):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Kamal Mostafa (kamalmostafa)
importance: Undecided → High
Changed in linux (Ubuntu Yakkety):
importance: Undecided → High
tags: added: canonical-bootstack
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
tags: added: verification-done-xenial removed: verification-needed-xenial
Huawei's upstream hio driver version 2.1.0.26 introduces two changes relative to the current Ubuntu version:
#1. The bio_endio() shim macro sets the bi_error field (but note the bug in the implementation!)
#2. A blk_queue_split() call is inserted into ssd_make_request()
James' initial test results indicate that #2 appears to fix the data corruption. Both fixes should be applied (though #1 needs correction) once testing has sufficiently confirmed them.
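Change #2 is easier to picture with a toy model. In kernels of this era, blk_queue_split() divides a bio that exceeds the request queue's limits into smaller bios before the driver processes it. The Python sketch below models only that idea in user space; `split_request`, its units and the limit passed as `max_len` are hypothetical, and it is not the driver's code.

```python
def split_request(offset, length, max_len):
    """Split one (offset, length) request into consecutive pieces,
    each no longer than max_len.

    Toy model only: the real kernel function works on bios and
    considers several queue limits, not a single length cap.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    pieces = []
    end = offset + length
    while offset < end:
        size = min(max_len, end - offset)  # last piece may be shorter
        pieces.append((offset, size))
        offset += size
    return pieces
```

For example, `split_request(0, 10, 4)` returns `[(0, 4), (4, 4), (8, 2)]`, while a request already within the limit comes back as a single piece.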