EC reconstructor rebuilds an undetectably corrupt fragment if one of the other fragments is corrupt.
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Object Storage (swift) | Triaged | High | Unassigned | |
Bug Description
The reconstructor rebuilds a corrupt EC fragment, whose corruption cannot be detected, when one of the available fragments it rebuilds from is itself corrupt.
Reproduce:
1. upload file
```
# generate 10M file
$dd if=/dev/urandom of=10M bs=10M count=1
# upload
$swift --os-storage-url http://
$swift --os-storage-url http://
```
2. corrupt fragment
A healthy fragment may become corrupt due to bit-rot or other causes.
Here we simply write zeros at an arbitrary position in fragment#0 to simulate bit-rot:
```
# check md5 before bit-rot
$md5sum 1650604523.
9acf9e57969a27a
# write zero to position 1000
$dd if=/dev/zero of=1650604523.
# check md5 changed
$md5sum 1650604523.
a170f58736fac75
```
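The same bit-rot simulation can be sketched in Python. This is only an illustration: the temporary file is a hypothetical stand-in for the fragment archive, and writing exactly one zero byte at offset 1000 is an assumption, since the `dd` invocation above is truncated.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of the file at path."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def zero_byte_at(path, offset):
    """Overwrite one byte at offset with zero, without truncating the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\x00")


# Hypothetical stand-in for a fragment archive. Every byte is 0xff so that
# zeroing a single byte is guaranteed to change the content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * 4096)
    fragment_path = tmp.name

before = md5_of(fragment_path)      # md5 before bit-rot
zero_byte_at(fragment_path, 1000)   # simulate bit-rot at position 1000
after = md5_of(fragment_path)       # md5 after bit-rot
size_after = os.path.getsize(fragment_path)
os.remove(fragment_path)

print("md5 changed:", before != after)
```

Opening the file in `r+b` mode keeps its length unchanged, so only the content (and therefore the md5) differs afterwards.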
3. reconstruct
Remove fragment#1 and run the reconstructor on the primary neighbor node (#2).
```
$swift-
```
After reconstruction, fragment#0 is quarantined, but fragment#1 is rebuilt successfully.
The etag of fragment#1 matches its md5.
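To illustrate why the rebuilt fragment can look valid, here is a toy sketch. It uses a 2+1 XOR parity scheme as a stand-in for `isa_l_rs_vand` (it is not Swift's reconstructor code), and it assumes the rebuilt fragment's etag is derived from the rebuilt bytes, which is what the observation above suggests.

```python
import hashlib


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


# Toy 2+1 scheme: two data fragments and one XOR parity fragment.
frag0 = b"hello wo"
frag1 = b"rld!!!!!"
parity = xor_bytes(frag0, frag1)
original_frag1_etag = md5_hex(frag1)

# Silent bit-rot in fragment 0: its first byte becomes zero.
corrupt_frag0 = b"\x00" + frag0[1:]

# Fragment 1 is lost and rebuilt from the corrupt fragment 0 plus parity.
rebuilt_frag1 = xor_bytes(corrupt_frag0, parity)

# Assumption: the etag stored with the rebuilt fragment is computed from
# the rebuilt bytes, so the fragment is self-consistent by construction.
rebuilt_frag1_etag = md5_hex(rebuilt_frag1)

self_consistent = md5_hex(rebuilt_frag1) == rebuilt_frag1_etag
matches_original = rebuilt_frag1 == frag1

print("etag matches rebuilt data:", self_consistent)
print("rebuilt data matches original:", matches_original)
```

In this toy model the rebuilt fragment passes its own checksum while differing from the original data, so only a check against the whole-object etag (as in the download step) reveals the problem.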
4. download
The download fails with an etag mismatch:
```
$swift --os-storage-url http://
Error downloading object 'test/10M': 'Error downloading test: md5sum != etag, 1473872ce26c5e0
```
Environment:
swift version - wallaby
policy_type - erasure_coding
ec_type - isa_l_rs_vand
ec_num_
ec_num_
Maybe it's because we don't (can't?) send the md5 of the fragment being reconstructed along with the PUT to the restored node? As a result, *that* (invalid) frag would think it's correct.
It could be that the reconstructor is missing an opportunity to check the etag of the fragments it's receiving (by the end it could have noticed, same as the object-server quarantine), but since there's no two-phase EC commit in ssync I don't see how it could notify the receiver.
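A rough sketch of the kind of check described here: verify each source fragment against its recorded etag before using it for a rebuild. The `Fragment` type and `split_by_checksum` helper are hypothetical names for illustration only, not Swift APIs, and the sketch does not address how the receiver would be notified over ssync.

```python
import hashlib
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    index: int
    data: bytes
    etag: str  # md5 recorded when the fragment was originally written


def split_by_checksum(
    fragments: List[Fragment],
) -> Tuple[List[Fragment], List[Fragment]]:
    """Separate fragments whose data matches their recorded etag from
    those that do not."""
    good, bad = [], []
    for frag in fragments:
        if hashlib.md5(frag.data).hexdigest() == frag.etag:
            good.append(frag)
        else:
            bad.append(frag)
    return good, bad


def make_fragment(index: int, data: bytes) -> Fragment:
    return Fragment(index, data, hashlib.md5(data).hexdigest())


# Fragment 0 has rotted after its etag was recorded.
healthy0 = make_fragment(0, b"\xff" * 16)
rotted0 = healthy0._replace(data=b"\x00" + healthy0.data[1:])
sources = [rotted0, make_fragment(2, b"aaaa"), make_fragment(3, b"bbbb")]

good, bad = split_by_checksum(sources)
print("usable:", [f.index for f in good])
print("corrupt:", [f.index for f in bad])
```

With a check like this, the corrupt fragment would be excluded from the rebuild (or the rebuild aborted) instead of silently producing a bad fragment.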
A future roadmap could include rebuilds moving off the ssync protocol.