mdadm refuses to re-add failed member

Bug #945786 reported by iMac
This bug affects 1 person
Affects: mdadm (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I have my /home in a three-disk RAID1 configuration (/dev/md1): one member is a partition on my laptop, a second is on an external disk connected via eSATA, and a third sits on a third disk. I booted with the array degraded, two members missing (external drive not plugged in). Prior to login, I used a console to umount /home, remove and fail the active drive (the internal partition member), and stop the RAID1 device. I then plugged in my external disk, restarted /dev/md1 with the external partition member active, and remounted /home. I have executed this process many times before, and it is scripted in a couple of files in /usr/local/bin.

This time, however, after logging in with my external member active, I tried to re-add the internal drive so that /dev/md1 would sync it from the external disk, and received an error saying the add failed. I re-executed the remove, fail and re-add manually with the same outcome, as shown on my console below, and filed this bug.

When I interrogate the failed disk with mdadm -Q --examine, it still reports itself as active.

:~# mdadm /dev/md1 -r /dev/sda6
mdadm: hot remove failed for /dev/sda6: No such device or address
:~# mdadm /dev/md1 -f /dev/sda6
mdadm: set device faulty failed for /dev/sda6: No such device
:~# mdadm /dev/md1 -a /dev/sda6
mdadm: /dev/sda6 reports being an active member for /dev/md1, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sda6 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda6" first.
:~# mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdc3[0]
      86003840 blocks [3/1] [U__]

unused devices: <none>
:~# apport-bug mdadm

Here is a quick summary of what I did:
a) My disks were synced on an 11.10 system
b) I upgraded from 11.10 to 12.04 with one member failed (external)
c) After upgrade I failed the active disk (internal), stopped the array, and restarted it with the external disk
d) Attempted to re-add the failed internal disk after logging in

:~# blkid | grep raid_member
/dev/sda6: UUID="eeeb6708-d108-0847-57e9-714c01b7dbc8" TYPE="linux_raid_member"
/dev/sdc3: UUID="eeeb6708-d108-0847-57e9-714c01b7dbc8" TYPE="linux_raid_member"

:~# mdadm -D /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sun Jul 27 22:53:23 2008
     Raid Level : raid1
     Array Size : 86003840 (82.02 GiB 88.07 GB)
  Used Dev Size : 86003840 (82.02 GiB 88.07 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Mar 3 13:56:05 2012
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : eeeb6708:d1080847:57e9714c:01b7dbc8
         Events : 0.10186827

    Number Major Minor RaidDevice State
       0 8 35 0 active sync /dev/sdc3
       1 0 0 1 removed
       2 0 0 2 removed
:~# mdadm -Q /dev/sdc3
/dev/sdc3: is not an md array
/dev/sdc3: device 0 in 3 device active raid1 /dev/md1. Use mdadm --examine for more detail.

:~# mdadm -Q /dev/sda6
/dev/sda6: is not an md array
/dev/sda6: device 1 in 3 device mismatch raid1 /dev/md1. Use mdadm --examine for more detail.

:~# mdadm -Q /dev/sda6 --examine
/dev/sda6:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : eeeb6708:d1080847:57e9714c:01b7dbc8
  Creation Time : Sun Jul 27 22:53:23 2008
     Raid Level : raid1
  Used Dev Size : 86003840 (82.02 GiB 88.07 GB)
     Array Size : 86003840 (82.02 GiB 88.07 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 1

    Update Time : Sat Mar 3 13:28:57 2012
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 60f50ddb - correct
         Events : 10128612

      Number Major Minor RaidDevice State
this 1 8 6 1 active sync /dev/sda6

   0 0 0 0 0 removed
   1 1 8 6 1 active sync /dev/sda6
   2 2 0 0 2 faulty removed

Clearly /dev/sda6 is not active (per the -D output above, the only active member is number 0, major 8, minor 35, RaidDevice 0, i.e. /dev/sdc3), but its superblock says it is.

I have captured enough. Time to reboot and see what happens; hopefully an auto-rebuild. I am keeping the third disk in the array separate in case some corruption happens here.
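
One way to see which member is stale is to compare the Events counters in the outputs above: /dev/sda6 reports 10128612, while the running array reports 0.10186827. The following is a minimal sketch, not part of the original report. The helper names md_events and is_older are made up for illustration, and it assumes that the "0." prefix shown by mdadm -D for 0.90 metadata is the high 32-bit word of the counter.

```shell
# Hypothetical helpers (names are illustrative, not part of mdadm).

# Read "mdadm -D" or "mdadm --examine" output on stdin and print the
# Events counter as a plain integer.
md_events() {
    awk '/^[[:space:]]*Events[[:space:]]*:/ {
        v = $NF
        # Assumption: a value like "0.10186827" is "high.low" 32-bit words.
        n = split(v, parts, ".")
        if (n == 2)
            printf "%.0f\n", parts[1] * 4294967296 + parts[2]
        else
            printf "%.0f\n", v
        exit
    }'
}

# Succeed if the first event count is lower than the second,
# i.e. the first device has seen fewer array updates.
is_older() {
    [ "$1" -lt "$2" ]
}

# Example (needs root and real devices, so it is left commented out):
#   member=$(mdadm --examine /dev/sda6 | md_events)
#   array=$(mdadm -D /dev/md1 | md_events)
#   is_older "$member" "$array" && echo "/dev/sda6 is behind /dev/md1"
```

With the values from this report, the internal partition is behind the running array, which is consistent with it being the out-of-date member even though its own superblock says "active sync".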

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: mdadm 3.2.3-2ubuntu1
ProcVersionSignature: Ubuntu 3.2.0-17.27-generic 3.2.6
Uname: Linux 3.2.0-17-generic x86_64
NonfreeKernelModules: fglrx
ApportVersion: 1.94-0ubuntu1
Architecture: amd64
Date: Sat Mar 3 13:33:11 2012
MDadmExamine.dev.sda:
 /dev/sda:
    MBR Magic : aa55
 Partition[0] : 121660182 sectors at 63 (type 07)
 Partition[1] : 503477100 sectors at 121660245 (type 05)
MDadmExamine.dev.sda2:
 /dev/sda2:
    MBR Magic : aa55
 Partition[0] : 78124032 sectors at 63 (type 83)
 Partition[1] : 172007893 sectors at 78124095 (type 05)
MDadmExamine.dev.sda5: Error: command ['/sbin/mdadm', '-E', '/dev/sda5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda5.
MDadmExamine.dev.sda7: Error: command ['/sbin/mdadm', '-E', '/dev/sda7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda7.
MDadmExamine.dev.sdb: Error: command ['/sbin/mdadm', '-E', '/dev/sdb'] failed with exit code 1: mdadm: cannot open /dev/sdb: No medium found
MDadmExamine.dev.sdc:
 /dev/sdc:
    MBR Magic : aa55
 Partition[0] : 104438502 sectors at 63 (type 83)
 Partition[1] : 20498940 sectors at 104438565 (type 0b)
 Partition[2] : 172007893 sectors at 124937505 (type fd)
MDadmExamine.dev.sdc1: Error: command ['/sbin/mdadm', '-E', '/dev/sdc1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc1.
MDadmExamine.dev.sdc2:
 /dev/sdc2:
    MBR Magic : aa55
MachineType: Hewlett-Packard HP Pavilion dv5 Notebook PC
ProcEnviron:
 LANGUAGE=en
 TERM=xterm
 LANG=en_US.utf8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-17-generic root=UUID=10f8a2ac-5ab7-43a2-bdf8-92eee349e09d ro quiet splash vt.handoff=7
ProcMDstat:
 Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
 md1 : active raid1 sdc3[0]
       86003840 blocks [3/1] [U__]

 unused devices: <none>
SourcePackage: mdadm
UpgradeStatus: Upgraded to precise on 2012-03-03 (0 days ago)
dmi.bios.date: 08/19/2009
dmi.bios.vendor: Hewlett-Packard
dmi.bios.version: F.37
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: 30F2
dmi.board.vendor: Quanta
dmi.board.version: 98.36
dmi.chassis.type: 10
dmi.chassis.vendor: Quanta
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnHewlett-Packard:bvrF.37:bd08/19/2009:svnHewlett-Packard:pnHPPaviliondv5NotebookPC:pvrRev1:rvnQuanta:rn30F2:rvr98.36:cvnQuanta:ct10:cvrN/A:
dmi.product.name: HP Pavilion dv5 Notebook PC
dmi.product.version: Rev 1
dmi.sys.vendor: Hewlett-Packard
mtime.conffile..etc.udev.rules.d.85.mdadm.rules: 2009-01-02T11:08:01

iMac (imac-netstatz) wrote :

Still weird on reboot. mdstat seems fine, but the failed member still thinks it is active.

:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb3[0]
      86003840 blocks [3/1] [U__]

unused devices: <none>
:~# mdadm -D /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sun Jul 27 22:53:23 2008
     Raid Level : raid1
     Array Size : 86003840 (82.02 GiB 88.07 GB)
  Used Dev Size : 86003840 (82.02 GiB 88.07 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Mar 3 14:12:24 2012
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : eeeb6708:d1080847:57e9714c:01b7dbc8
         Events : 0.10187219

    Number Major Minor RaidDevice State
       0 8 19 0 active sync /dev/sdb3
       1 0 0 1 removed
       2 0 0 2 removed
:~# mdadm -Q --examine /dev/sda6
/dev/sda6:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : eeeb6708:d1080847:57e9714c:01b7dbc8
  Creation Time : Sun Jul 27 22:53:23 2008
     Raid Level : raid1
  Used Dev Size : 86003840 (82.02 GiB 88.07 GB)
     Array Size : 86003840 (82.02 GiB 88.07 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 1

    Update Time : Sat Mar 3 13:28:57 2012
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 60f50ddb - correct
         Events : 10128612

      Number Major Minor RaidDevice State
this 1 8 6 1 active sync /dev/sda6

   0 0 0 0 0 removed
   1 1 8 6 1 active sync /dev/sda6
   2 2 0 0 2 faulty removed
:~# mdadm -Q --examine /dev/sdb3
/dev/sdb3:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : eeeb6708:d1080847:57e9714c:01b7dbc8
  Creation Time : Sun Jul 27 22:53:23 2008
     Raid Level : raid1
  Used Dev Size : 86003840 (82.02 GiB 88.07 GB)
     Array Size : 86003840 (82.02 GiB 88.07 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 1

    Update Time : Sat Mar 3 14:12:34 2012
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 60f6e218 - correct
         Events : 10187225

      Number Major Minor RaidDevice State
this 0 8 19 0 active sync /dev/sdb3

   0 0 8 19 0 active sync /dev/sdb3
   1 1 0 0 1 faulty removed
   2 2 0 0 2 faulty removed

iMac (imac-netstatz) wrote :

The external disk was /dev/sdc in the first instance because it was plugged in after the CD-ROM drive had taken /dev/sdb. In the second instance it was left plugged in at boot, so it took /dev/sdb and left /dev/sdc for the CD-ROM drive.

Subsequent reboots with both disks plugged in, and with my mdadm udev override removed (just in case), have not changed the outcome.
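
Since the device names move around between boots, it can help to identify the array members by UUID rather than by name. The following is a rough sketch along those lines; the function name members_by_uuid is hypothetical. It filters blkid output of the form shown earlier in this report. Note that blkid prints the array UUID with dashes, while mdadm prints the same UUID with colons.

```shell
# Hypothetical helper: read blkid output on stdin and print the device
# names of RAID members whose UUID matches the first argument.
members_by_uuid() {
    uuid=$1
    awk -v want="UUID=\"$uuid\"" '
        # Keep only lines that carry the wanted UUID and are RAID members.
        index($0, want) && index($0, "TYPE=\"linux_raid_member\"") {
            sub(/:$/, "", $1)   # strip the trailing colon from "/dev/sdXN:"
            print $1
        }'
}

# Example (left commented out because it needs the real disks):
#   blkid | members_by_uuid eeeb6708-d108-0847-57e9-714c01b7dbc8
```

Run against the blkid output in the bug description, this would list /dev/sda6 and /dev/sdc3, whichever names the disks happened to get on that boot.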

Jools Wills (jools) wrote :

The error message suggested that you remove the old metadata from the disk before adding it again, but I didn't see that you did that. Do that, and then try to add it.

iMac (imac-netstatz) wrote :

Thanks, running mdadm --zero-superblock on the device I wanted to re-add worked. I think I understand what happened, possibly as a result of some mdadm improvements. A verbose explanation follows.
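
For anyone landing here with the same error, the sequence that worked can be sketched as below. This is only an illustration under stated assumptions: the function name readd_plan is made up, the device names are the ones from this report, and the function prints the commands rather than running them, because --zero-superblock discards the md metadata on the named device and the following resync overwrites its contents.

```shell
# Hypothetical dry-run helper: print the commands that clear a stale
# member's md superblock and add it back to a degraded array.
# Nothing is executed; review the output before running it as root.
readd_plan() {
    array=$1     # e.g. /dev/md1
    member=$2    # e.g. /dev/sda6, the out-of-date former member

    # Step 1: wipe the old superblock so the device no longer claims
    # to be an active member of the array.
    printf 'mdadm --zero-superblock %s\n' "$member"

    # Step 2: add the now-blank device; on a degraded array md should
    # then rebuild it from the active member.
    printf 'mdadm %s --add %s\n' "$array" "$member"
}

# Example:
#   readd_plan /dev/md1 /dev/sda6
```

Make sure the member being zeroed really is the stale one (for example by checking its Events counter and Update Time with mdadm --examine) before running the printed commands.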

The complexity I did not share is that my three-disk RAID1 array is actually spread across two laptops, each of which can sync to and from an eSATA disk.

In my case, both of my laptop disks had previously been operating as active disks in the array at the same time (both technically degraded, one with two members, the other with only one). The external drive had been in sync with my second laptop, running 11.10. My first laptop was last synced a few days ago and has since been upgraded to 12.04 Beta 1 while operating with one member.

As soon as I deemed 12.04 functionally great, I shut them both down and tried to do my usual pre-login swap of active disks on the first (12.04) laptop, switching to my external drive where my current home directory was residing. Normally this works by just unmounting, failing, removing and stopping the devices, then re-assembling and re-mounting with the external member.

I believe mdadm is now smart enough to know that both disks came from active, clean (albeit possibly degraded) md arrays, and so it chooses not to let me re-add one to the other as I see fit. Previously, if I booted with both attached, the most recent one would become active and a re-sync would start immediately, or I could start with one member and switch to another.

Now mdadm blocks any attempt to add one previously active member to another when they are out of sync, and waits for me to clear the metadata on the old active member. This is an improvement; I will change my process for this situation based on these assumptions.
