mdadm with RAID5 stuck in uninterruptible sleep
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | Undecided | Unassigned | |
Hardy | Fix Released | Medium | Colin Ian King | |
Bug Description
Description: Ubuntu hardy (development branch)
Release: 8.04
Linux ubuntu-beta 2.6.24-12-server #1 SMP Wed Mar 12 22:58:36 UTC 2008 x86_64 GNU/Linux
mdadm:
Installed: 2.6.3+200709292
xfsprogs:
Installed: 2.9.4-2
RAID 5 on five 1 TB drives, set up as follows:
mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b,c,d,e,f]
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/drive
md0 : active raid5 sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
3907049984 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
The drives are connected to a 5-to-1 port multiplier, which is in turn connected to a 2-port SiI 3132-based PCI Express SATA controller. The problem does not seem to be related to this hardware, as I can read from and write to the drives individually without any trouble.
Copying data to this array results in a permanent lock on several processes related to it, which get stuck in the D(+) state. This happened four times in a row, each time after 10-40 GB had been copied. I can't kill any of the processes, nor can I reboot; I have to power cycle.
There are no messages related to it in dmesg or any of the logs; as far as the system is concerned, nothing is wrong. After power cycling, the array starts rebuilding (as it should), but the rebuild also stops because of the same hang.
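The stuck processes can be listed together with the kernel function each one is waiting in, which is useful to attach to a report like this. A minimal sketch, assuming the procps version of `ps`; `filter_d_state` is a hypothetical helper name, not part of any tool:

```shell
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Expects "pid stat wchan comm" columns on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# pid, process state, kernel wait channel (32 chars wide), command name
ps -eo pid,stat,wchan:32,comm | filter_d_state
```

The wait channel column shows where in the kernel each process is blocked, which is more informative than the bare D state.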
The problem seems closely related to this:
http://
As suggested by this thread, I tried increasing stripe_cache_size. Setting it to 4096 seems to have solved my hang: at the time of writing I have copied 1.7 TB without error.
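For anyone wanting to try the same workaround, here is a rough sketch. It assumes the array is md0 and a 4 KiB page size; `stripe_cache_mib` is a hypothetical helper that only applies the memory formula from the kernel md documentation (page size × number of member disks × stripe_cache_size):

```shell
# Estimate the RAM used by the stripe cache, in MiB.
# Arguments: cache entries, member disks, page size in bytes (default 4096).
stripe_cache_mib() {
    entries=$1
    disks=$2
    page_size=${3:-4096}
    echo $(( entries * disks * page_size / 1024 / 1024 ))
}

# 4096 entries on a 5-disk array
stripe_cache_mib 4096 5

# Apply the setting if the sysfs file exists and is writable (needs root).
# The value is not persistent and falls back to the default after a reboot.
f=/sys/block/md0/md/stripe_cache_size
if [ -w "$f" ]; then
    echo 4096 > "$f"
fi
```

With these numbers the cache takes about 80 MiB, compared with about 5 MiB for the default value of 256.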
http://
If this is the same problem and I'm reading it right, it seems like it's supposed to be fixed already. Not sure though.
Changed in linux:
assignee: ubuntu-kernel-team → colin-king
status: Triaged → In Progress
Sigh. Spoke too soon. I ran mdadm -D while data was being copied to the array, and it hung again, with 2 TB transferred. I guess it's directly related to the number of processes that access the device. I won't be able to restart it until tomorrow, but I can try any suggestions on the hung system.