I'm running a server on Ubuntu 16.04. Data storage on this system uses LVM on top of a software RAID 6, whereas the OS is installed on a separate RAID 1, also with LVM. The RAID 6 is built from 7 partitions on 7 disks. After a rising number of S.M.A.R.T. errors on one of these disks, I decided to swap it for a new one before the array became degraded.
I did a sudo mdadm /dev/md2 --fail /dev/sdd4, followed by sudo mdadm /dev/md2 --remove /dev/sdd4, before I swapped the disks.
After the next startup everything seemed to be okay, so I started partitioning the new disk. I ran sudo parted --list to adapt the partitioning from the other disks.
At this moment a strange problem occurred and parted had trouble accessing one of the old disks. I noticed that another drive had dropped from the array, and some seconds later yet another one. The array failed. I was shocked and shut the system down to prevent further damage.
Later I tried to start the system again and saw strange failures like these:
ata2: irq_stat 0x00000040, connection status changed
ata2: SError: { CommWake DevExch }
I only had an emergency console at this point, so I booted a live Linux to inspect the problem further. I had read that one could safely run mdadm --assemble --scan to try to fix the array, but it remained in a curious state, so I only removed the array from mdadm.conf and fstab.
The RAID is now shown as a RAID 0 with 7 spare drives, but the drives have seemed to run fine for the last few hours in the remaining RAID 1.
I'm unsure what I should do right now. I expect to lose all the data, but I'm also hoping that there is a chance to rescue at least part of it. I have a backup, but only a partial one, because it was a 19 TB array.
chris@uranus:~$ sudo mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 3744766464 (3571.29 GiB 3834.64 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Jul 13 17:39:04 2018
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : uranus:2 (local to host uranus)
UUID : 607914eb:666e2a46:b2e43557:02cc2983
Events : 2738806
Number Major Minor RaidDevice State
0 8 20 0 active sync /dev/sdb4
1 8 36 1 active sync /dev/sdc4
2 8 52 2 active sync /dev/sdd4
6 8 1 3 active sync /dev/sda1
5 8 68 4 active sync /dev/sde4
8 8 97 5 active sync /dev/sdg1
7 8 81 6 active sync /dev/sdf1
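As a quick sanity check, the sizes in this output are internally consistent: RAID 6 keeps data on n-2 of its n members, so the Array Size should equal (7-2) times the Used Dev Size (both reported in 1 KiB blocks). A small shell check, with the numbers copied from the --detail output above:

```shell
# RAID 6 usable capacity = (n - 2) * per-device size
n=7                    # "Raid Devices"
dev_size=3744766464    # "Used Dev Size" in 1 KiB blocks
echo $(( (n - 2) * dev_size ))   # prints 18723832320, the "Array Size" above
```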
chris@uranus:/$ sudo mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Raid Level : raid0
Total Devices : 6
Persistence : Superblock is persistent
State : inactive
Name : uranus:2 (local to host uranus)
UUID : 607914eb:666e2a46:b2e43557:02cc2983
Events : 2739360
Number Major Minor RaidDevice
- 8 1 - /dev/sda1
- 8 20 - /dev/sdb4
- 8 36 - /dev/sdc4
- 8 68 - /dev/sde4
- 8 81 - /dev/sdf1
- 8 97 - /dev/sdg1
Six drives are not faulty; the 7th one had errors, but I swapped it for a new one. After the swap of this one faulty drive, the SMART data is good for all drives: no errors, no bad blocks, no pending, uncorrectable, or reallocated sectors.
My last --detail output shows only 6 drives because I did not add the new drive to the existing array yet.
The RAID 1 the OS relies on was basically 3 + 1 spare on the same 7 disks, but on their own partitions. As I removed /dev/sdd, the spare drive took its place, so it now consists of 3 partitions without a spare. There are also boot partitions on 3 of these disks, and swap partitions in a RAID 1 on 2 of them.
The problem is that mdadm now shows this array as a RAID 0 with 7 spares, as cat /proc/mdstat shows, and I have to get it back to its original RAID 6 configuration, degraded with 6 out of 7 drives. There seems to be a problem with the config, but I didn't change anything in it. After, and only if, I can restore the array, I would add the swapped 7th partition back to the array to get my original 7-drive RAID 6.
If I read the manpage right, mdadm --assemble --scan restores array information based on the config or /proc/mdstat, but these already seem to be wrong.
cat /proc/mdstat - Now
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : inactive sdg1[8](S) sdf1[7](S) sdb4[0](S) sda1[6](S) sde4[5](S) sdc4[1](S)
22468633600 blocks super 1.2
md129 : active raid1 sdb3[0] sde3[4] sdc3[1]
146353024 blocks super 1.2 [3/3] [UUU]
md128 : active raid1 sdb2[0] sde2[4](S) sdc2[1]
15616896 blocks super 1.2 [2/2] [UU]
unused devices: <none>
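One detail worth noting about the md2 line above: for an inactive array, the block count is not the usable capacity but simply the sum of the members' device sizes as recorded in their superblocks (the Avail Dev Size values from mdadm --examine, in 512-byte sectors, halved to 1 KiB blocks). A quick check, assuming those units:

```shell
# Sum the six members' "Avail Dev Size" values (sectors), convert to 1K blocks
total=0
for s in 7489544192 7489533952 7489533952 7489533952 7489560576 7489560576; do
  total=$(( total + s ))
done
echo $(( total / 2 ))   # prints 22468633600, the md2 figure in /proc/mdstat
```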
cat /etc/mdadm/mdadm.conf - Now
#DEVICE partitions containers
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
ARRAY /dev/md128 metadata=1.2 UUID=6813258b:250929d6:8a1e9d34:422a9fbd name=uranus:128
spares=1
ARRAY /dev/md129 metadata=1.2 UUID=ab06d13f:a70de5a6:c83a9383:b1beb84c name=uranus:129
ARRAY /dev/md2 metadata=1.2 UUID=607914eb:666e2a46:b2e43557:02cc2983 name=uranus:2
# This file was auto-generated on Mon, 10 Aug 2015 18:09:47 +0200
# by mkconf $Id$
#ARRAY /dev/md/128 metadata=1.2 UUID=6813258b:250929d6:8a1e9d34:422a9fbd name=uranus:128
# spares=2
#ARRAY /dev/md/129 metadata=1.2 UUID=ab06d13f:a70de5a6:c83a9383:b1beb84c name=uranus:129
# spares=1
#ARRAY /dev/md/2 metadata=1.2 UUID=607914eb:666e2a46:b2e43557:02cc2983 name=uranus:2
cat /etc/fstab - Now
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/vgSystem-vRoot / ext4 errors=remount-ro 0 1
# swap was on /dev/md128 during installation
UUID=5a5b997d-9e94-4391-955f-a2b9a3f63820 none swap sw 0 0
#/dev/vgData/vData /srv ext4 defaults 0 0
#10.10.0.15:/srv/BackupsUranusAutomatic/data /mnt/mars/uranus/automatic/data nfs clientaddr=10.10.0.10,vers=4,noatime,addr=10.10.0.15,noauto 0 0
#10.10.0.15:/srv/BackupsUranusAutomatic/media /mnt/mars/uranus/automatic/media nfs clientaddr=10.10.0.10,vers=4,noatime,addr=10.10.0.15,noauto 0 0
#/srv/shares/Videos/Ungesichert/Videorecorder /srv/vdr/video bind bind 0 0
#/dev/sdh1 /mnt/usbdisk ntfs noatime,noauto 0 0
Disks and partitions - Before the problem occurred
Disk /dev/sda: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 98C35BD3-BFBC-4A4B-AEC9-6D4AFB775AF4
Device Start End Sectors Size Type
/dev/sda1 2048 7489808383 7489806336 3.5T Linux RAID
/dev/sda2 7489808384 7791525887 301717504 143.9G Linux RAID
Disk /dev/sdb: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 49102EF7-9FA2-4990-8C30-6C5B463B917E
Device Start End Sectors Size Type
/dev/sdb1 2048 20479 18432 9M BIOS boot
/dev/sdb2 20480 31270911 31250432 14.9G Linux RAID
/dev/sdb3 31270912 324239359 292968448 139.7G Linux RAID
/dev/sdb4 324239360 7814035455 7489796096 3.5T Linux RAID
Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6A037D00-F252-4CA0-8D68-430734BCA765
Device Start End Sectors Size Type
/dev/sdc1 2048 20479 18432 9M BIOS boot
/dev/sdc2 20480 31270911 31250432 14.9G Linux RAID
/dev/sdc3 31270912 324239359 292968448 139.7G Linux RAID
/dev/sdc4 324239360 7814035455 7489796096 3.5T Linux RAID
Disk /dev/sdd: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: EADC29D6-C2E9-4AC8-B1B2-F01A5233467C
Device Start End Sectors Size Type
/dev/sdd1 2048 20479 18432 9M BIOS boot
/dev/sdd2 20480 31270911 31250432 14.9G Linux RAID
/dev/sdd3 31270912 324239359 292968448 139.7G Linux RAID
/dev/sdd4 324239360 7814035455 7489796096 3.5T Linux RAID
Disk /dev/sde: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 3D7EBBFD-C00D-4503-8BF1-A71534F643E1
Device Start End Sectors Size Type
/dev/sde1 2048 20479 18432 9M Linux filesystem
/dev/sde2 20480 31270911 31250432 14.9G Linux filesystem
/dev/sde3 31270912 324239359 292968448 139.7G Linux filesystem
/dev/sde4 324239360 7814035455 7489796096 3.5T Linux filesystem
Disk /dev/sdf: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: FCA42FC2-C5E9-45B6-9C18-F103C552993D
Device Start End Sectors Size Type
/dev/sdf1 2048 7489824767 7489822720 3.5T Linux RAID
/dev/sdf2 7489824768 7791525887 301701120 143.9G Linux RAID
Disk /dev/sdg: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 8FF8C4CC-6788-47D7-8264-8FA6EF912555
Device Start End Sectors Size Type
/dev/sdg1 2048 7489824767 7489822720 3.5T Linux RAID
/dev/sdg2 7489824768 7791525887 301701120 143.9G Linux RAID
Disk /dev/md2: 17.4 TiB, 19173204295680 bytes, 37447664640 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk /dev/md128: 14.9 GiB, 15991701504 bytes, 31233792 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/md129: 139.6 GiB, 149865496576 bytes, 292706048 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/mapper/vgSystem-vRoot: 74.5 GiB, 79997960192 bytes, 156246016 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/mapper/vgData-vData: 17.4 TiB, 19173199577088 bytes, 37447655424 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk /dev/mapper/vgSystem-testBtrfs: 5 GiB, 5368709120 bytes, 10485760 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disks, partitions, raid devices and volumes - Before the problem occurred
NAME UUID FSTYPE MOUNTPOINT LABEL SIZE
sda 3,7T
├─sda1 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
│ └─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
│ └─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
└─sda2 143,9G
sdb 3,7T
├─sdb1 9M
├─sdb2 6813258b-2509-29d6-8a1e-9d34422a9fbd linux_raid_member uranus:128 14,9G
│ └─md128 5a5b997d-9e94-4391-955f-a2b9a3f63820 swap [SWAP] 14,9G
├─sdb3 ab06d13f-a70d-e5a6-c83a-9383b1beb84c linux_raid_member uranus:129 139,7G
│ └─md129 7QXSVM-dauj-RUQ1-uoQp-IamT-TTZo-slzArT LVM2_member 139,6G
│ ├─vgSystem-vRoot fb4bfbb3-de6c-47ef-b237-27af04fa2f4c ext4 / root 74,5G
│ └─vgSystem-testBtrfs 27bbab4c-3c9f-4743-83ac-61e8b41f2bd3 btrfs 5G
└─sdb4 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
└─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
└─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
sdc 3,7T
├─sdc1 9M
├─sdc2 6813258b-2509-29d6-8a1e-9d34422a9fbd linux_raid_member uranus:128 14,9G
│ └─md128 5a5b997d-9e94-4391-955f-a2b9a3f63820 swap [SWAP] 14,9G
├─sdc3 ab06d13f-a70d-e5a6-c83a-9383b1beb84c linux_raid_member uranus:129 139,7G
│ └─md129 7QXSVM-dauj-RUQ1-uoQp-IamT-TTZo-slzArT LVM2_member 139,6G
│ ├─vgSystem-vRoot fb4bfbb3-de6c-47ef-b237-27af04fa2f4c ext4 / root 74,5G
│ └─vgSystem-testBtrfs 27bbab4c-3c9f-4743-83ac-61e8b41f2bd3 btrfs 5G
└─sdc4 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
└─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
└─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
sdd 3,7T
├─sdd1 9M
├─sdd2 6813258b-2509-29d6-8a1e-9d34422a9fbd linux_raid_member uranus:128 14,9G
│ └─md128 5a5b997d-9e94-4391-955f-a2b9a3f63820 swap [SWAP] 14,9G
├─sdd3 ab06d13f-a70d-e5a6-c83a-9383b1beb84c linux_raid_member uranus:129 139,7G
│ └─md129 7QXSVM-dauj-RUQ1-uoQp-IamT-TTZo-slzArT LVM2_member 139,6G
│ ├─vgSystem-vRoot fb4bfbb3-de6c-47ef-b237-27af04fa2f4c ext4 / root 74,5G
│ └─vgSystem-testBtrfs 27bbab4c-3c9f-4743-83ac-61e8b41f2bd3 btrfs 5G
└─sdd4 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
└─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
└─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
sde 3,7T
├─sde1 9M
├─sde2 6813258b-2509-29d6-8a1e-9d34422a9fbd linux_raid_member uranus:128 14,9G
│ └─md128 5a5b997d-9e94-4391-955f-a2b9a3f63820 swap [SWAP] 14,9G
├─sde3 ab06d13f-a70d-e5a6-c83a-9383b1beb84c linux_raid_member uranus:129 139,7G
│ └─md129 7QXSVM-dauj-RUQ1-uoQp-IamT-TTZo-slzArT LVM2_member 139,6G
│ ├─vgSystem-vRoot fb4bfbb3-de6c-47ef-b237-27af04fa2f4c ext4 / root 74,5G
│ └─vgSystem-testBtrfs 27bbab4c-3c9f-4743-83ac-61e8b41f2bd3 btrfs 5G
└─sde4 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
└─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
└─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
sdf 3,7T
├─sdf1 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
│ └─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
│ └─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
└─sdf2 143,9G
sdg 3,7T
├─sdg1 607914eb-666e-2a46-b2e4-355702cc2983 linux_raid_member uranus:2 3,5T
│ └─md2 OTNyDe-fNAP-aLzy-Uwat-yYVH-E11D-d1LyzH LVM2_member 17,4T
│ └─vgData-vData a9b3d18d-e45f-4d0f-ab3d-9fe8bfa42157 ext4 /srv data 17,4T
└─sdg2
mdadm --examine /dev/sd<array-member-harddrives> - Now
There are only 6 drives because the 7th 'new' drive wasn't added to the array yet.
chris@uranus:/$ sudo mdadm --examine /dev/sda1
[sudo] password for chris:
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489544192 (3571.29 GiB 3834.65 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=11264 sectors
State : active
Device UUID : 49c6404e:ee9509ba:c980942a:1db9cf3c
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:34:48 2018
Checksum : aae603a7 - correct
Events : 2739360
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AA.AAAA ('A' == active, '.' == missing, 'R' == replacing)
chris@uranus:/$ sudo mdadm --examine /dev/sdb4
/dev/sdb4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489533952 (3571.29 GiB 3834.64 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=1024 sectors
State : clean
Device UUID : 61d97294:3ce7cd84:7bb4d5f1:d301c842
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:42:15 2018
Checksum : 890fbe3d - correct
Events : 2739385
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
chris@uranus:/$ sudo mdadm --examine /dev/sdc4
/dev/sdc4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489533952 (3571.29 GiB 3834.64 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=1024 sectors
State : clean
Device UUID : ee70c4ab:5b65dae7:df3a78f0:e8bdcead
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:42:15 2018
Checksum : 6d171664 - correct
Events : 2739385
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
chris@uranus:/$ sudo mdadm --examine /dev/sde4
/dev/sde4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489533952 (3571.29 GiB 3834.64 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=1024 sectors
State : clean
Device UUID : 6ce5311f:084ded8e:ba3d4e06:43e38c67
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:42:15 2018
Checksum : 572b9ac7 - correct
Events : 2739385
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 4
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
chris@uranus:/$ sudo mdadm --examine /dev/sdf1
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489560576 (3571.30 GiB 3834.66 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=27648 sectors
State : clean
Device UUID : 7c4fbe19:d63eced4:1b40cf79:e759fe4b
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:36:17 2018
Checksum : ef93d641 - correct
Events : 2739381
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : AA..A.A ('A' == active, '.' == missing, 'R' == replacing)
chris@uranus:/$ sudo mdadm --examine /dev/sdg1
/dev/sdg1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489560576 (3571.30 GiB 3834.66 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=27648 sectors
State : clean
Device UUID : 36d9dffc:27699128:e84f87e7:38960357
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Jul 13 22:35:47 2018
Checksum : 9f34d651 - correct
Events : 2739377
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : AA..AAA ('A' == active, '.' == missing, 'R' == replacing)
You had one drive down (the one you were replacing) when this happened:
At this moment a strange problem occurred and parted had trouble accessing one of the old disks. I noticed that another drive had dropped from the array, and some seconds later yet another one. The array failed. I was shocked and shut the system down to prevent further damage.
That makes 3 drives down, if those failures were fatal.
You stated that the OS was on a RAID 1, I assume those were 2 disks, and the other 7 disks were on the RAID 6.
RAID 6 can withstand the loss of two disks in the array. If you had 3 failures in the RAID 6 array (assuming none of the failed disks belonged to the RAID 1), and the disks are not in a good state, then the data is most likely lost.
You could verify the state of each disk with:
sudo smartctl -a /dev/sdX
Then you can find out whether the 3 disks are really gone or whether it was a fluke. If it was a fluke, you are sure everything is OK, and your mdadm.conf and fstab are correct, then, since your array seems to be inactive, you could try to force a re-assembly with (warning: dangerous, read the disclaimer below):
sudo mdadm --stop /dev/md2
sudo mdadm --assemble --scan --force
Note: your last --detail output shows 6 disks, not 7; /dev/sdd seems to be missing.
You could paste your mdadm.conf, fstab and LVM VG, LV and partitions to help understand the configuration.
Disclaimer: Trying things with a broken RAID is dangerous. I'm recommending steps based on the information you provided; I can't guarantee they will work or that they won't destroy your data. Proceed under your own responsibility and at your own risk.
mdadm uses superblocks to determine how to assemble disks. In such a case it is always very helpful and informative to look at the actual superblock data of the physical drives before taking any action that writes to the disks (e.g. mdadm --assemble --scan --force, which will update the mdadm superblocks).
Use mdadm --examine /dev/sd<your-array-member-harddrives> to see what the superblocks contain. It should give you an impression of when something failed, how far the disks have drifted apart in terms of writes, and much more.
After having a clear picture of the current state of the physical drives, you can come up with a strategy to fix things.
But first of all, I'd also consider that the mainboard, SATA controller, SCSI controller, etc. could have a physical defect. So many disks failing within a very short period of time is unusual (unless someone had the great idea of building the RAID from disks of the same manufacturer and production batch), and could indicate a controller problem. Rebuilding or re-syncing a damaged RAID on a disk controller that is itself failing will only lead to disaster.
I'll just provide some ideas on what to analyse to get an overview of the current state:
The first section is not very interesting and should be the same for all array members:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 607914eb:666e2a46:b2e43557:02cc2983
Name : uranus:2 (local to host uranus)
Creation Time : Thu Aug 6 00:45:41 2015
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 7489544192 (3571.29 GiB 3834.65 GB)
Array Size : 18723832320 (17856.44 GiB 19173.20 GB)
Used Dev Size : 7489532928 (3571.29 GiB 3834.64 GB)
Still not really interesting; the offsets may vary if the disks are not equally sized. The Device UUID below is the per-drive UUID and is unique for each member:
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=11264 sectors
State : active
Device UUID : 49c6404e:ee9509ba:c980942a:1db9cf3c
Internal Bitmap : 8 sectors from superblock
Here it starts to get interesting; comments start with #:
Update Time : Fri Jul 13 22:34:48 2018 # last write activity on the drive
Checksum : aae603a7 - correct # superblock checksum; must read "correct", otherwise the superblock is damaged
Events : 2739360 # event counter; it stops increasing on a member once that member drops out
Layout : left-symmetric # relevant for rebuilding the array; left-symmetric is the default on x86
Chunk Size : 512K
The next section is interesting for rebuilding the array; especially when forming the command, the Device Role is important.
Device Role : Active device 3
Array State : AA.AAAA ('A' == active, '.' == missing, 'R' == replacing)
The array state is just informative, but will not help a lot.
First of all, we want to get an idea of how far the disks drifted apart during the failure.
If I recall correctly, there is a threshold of 50 events in the mdadm code when trying assemble --force. This means that with an event difference >50, assemble --force will not work any more; a difference <50 does not guarantee that forcing an assembly will work either. In that case, the only possibility is to re-create the array with exactly the same parameters it already has and instruct mdadm to --create --assume-clean. When one is in the "lucky" situation that all superblocks are available and readable, this should be possible quite "easily", but care must be taken.
The event counts suggest that the first disk dropped out first, then the last one, and then the last but one. The difference is <50, so it could turn out to be fairly easy:
Events : 2739360
Events : 2739385
Events : 2739385
Events : 2739385
Events : 2739381
Events : 2739377
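Using the counts listed above, the spread can be computed directly; this is just arithmetic on the --examine values, not a recovery step:

```shell
# Difference between the highest and lowest event count across the six members
events="2739360 2739385 2739385 2739385 2739381 2739377"
min=$(printf '%s\n' $events | sort -n | head -n1)
max=$(printf '%s\n' $events | sort -n | tail -n1)
echo $(( max - min ))   # prints 25, below the ~50 threshold mentioned above
```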
Properly interpreting the Array State is only possible while keeping an eye on the Events count and the Device Role.
Device Role : Active device 3
Device Role : Active device 0
Device Role : Active device 1
Device Role : Active device 4
Device Role : Active device 6
Device Role : Active device 5
Array State : AA.AAAA ('A' == active, '.' == missing, 'R' == replacing)
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
Array State : AA..A.. ('A' == active, '.' == missing, 'R' == replacing)
Array State : AA..A.A ('A' == active, '.' == missing, 'R' == replacing)
Array State : AA..AAA ('A' == active, '.' == missing, 'R' == replacing)
mdadm starts counting at 0. Drive 2 failed first, then drive 3, later drive 5, and finally drive 6.
Notice that drive 5 still lists drive 6 as active, and drive 3 lists drives 3, 5, and 6 as active. So most likely the drives did not update their superblocks when another drive failed.
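That inconsistency can be made visible by counting the missing ('.') slots each superblock records; a small sketch over the states quoted above:

```shell
# Count how many member slots each recorded Array State marks as missing
for s in AA.AAAA AA..A.. AA..A.. AA..A.. AA..A.A AA..AAA; do
  missing=$(( $(printf '%s' "$s" | tr -cd '.' | wc -c) ))
  echo "$s: $missing missing"
done
```

The counts range from 1 to 4 missing members, which is exactly the disagreement that stands in the way of an automatic assembly.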
After seeing the Array States, I assume that an automatic assemble --force will not play out well, as there is no consistency across 5 devices in the Array State. The array was raid6 with 7 disks, so we would need 5 disks that agree on the Array State and have less than 50 events difference, which is not the case.
Remember, mdadm/raid is built to not lose data, so there are mechanisms in the code that prevent mdadm from taking actions that could harm data. An automated reassembly, even with --force, will only trigger actions that are very likely to succeed. If there is not enough consistent information in the superblocks for mdadm to take a safe decision, it will fail.
If you really know what you are doing, you can simply re-write the superblocks with --create --assume-clean and all the information needed to set the RAID back into operation. But this is a manual task, where you as the admin have to instruct the software exactly what to do.
I'm not going to provide a copy & paste command here, because I think it is essential in such a situation that one knows what one is doing before executing a "repair-my-raid" command. To deepen that knowledge, reading the RAID-recovery articles on the Linux RAID Wiki is essential, and that is my conclusion for this answer:
https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
https://raid.wiki.kernel.org/index.php/RAID_Recovery
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
1. Would you suggest first trying --assemble --force, maybe with an overlay file?
In my point of view, this is definitely the first option to try. Whether to use an overlay file depends on your data and your risk tolerance. So far I've had backups in such situations and therefore didn't use the overlay option; if you want to be on the safe side, use it. There are a few points I'd like to highlight in that area:
Do not use mdadm versions < 4.0. Install a backport, or compile a version >= 4.0. There was a bug in 3.x that caused assemble --force actions to fail which worked fine with 4.0.
When trying assemble --force, use --verbose; it will give you a good set of information that is helpful for further steps and for understanding what happened or what failed.
2. If I use --create --assume-clean, is it a better choice to re-create the last functioning setup with 6 disks, or maybe a setup with only the 5 drives that have the highest event count? Is this even possible? My goal is restoring some important data from the array, not a permanent solution.
In your case, where the event offset is this small, I think there is nothing wrong with recreating the array with 6/7 disks. If you suspect that the HBA (SATA/IDE/SCSI controller) could have an issue, consider leaving out the suspected port(s); that depends on the hardware and wiring. And yes, it would be possible, but it depends on the RAID type. With raid6 you could try to re-build with only 5/7 disks; technically there should not be any limitation. The important point: if you re-create it with 5/7, there is definitely no margin left for another drive to fail.
3. I have details about the array before the crash occurred. Based on this, I would come up with mdadm --create --assume-clean --level=6 --raid-devices=7 --size=3744766464 /dev/sdb4 /dev/sdc4 missing /dev/sda1 /dev/sde4 /dev/sdg1 /dev/sdf1 for 6 drives, or mdadm --create --assume-clean --level=6 --raid-devices=7 --size=3744766464 /dev/sdb4 /dev/sdc4 missing missing /dev/sde4 /dev/sdg1 /dev/sdf1 for a 5-drive solution. Would you agree with this?
I've not verified the details (drive order, size, missing positions, ...), but the commands look good. Still, as mentioned on the Linux RAID Wiki, a re-creation should be considered a LAST resort. When needing to do so, I always try to be as specific as possible: go through the mdadm man page and add every option for which you know the value from the information you have (e.g. even chunk size, alignment, ...). There are a lot of defaults one can omit, but when one is sure about the values, why not be specific.
What I'd try in your situation is the following:
1. Get mdadm up to a version >= 4.0.
2. Make sure the array is stopped. Check /proc/mdstat and use mdadm --stop ... if needed.
3. Check the dmesg and smartctl logs. Try to read from a disk (e.g. dd if=/dev/hda1 of=/dev/null bs=1M count=2048), then re-check the dmesg and smartctl logs. Repeat that, adding some ibs= and skip= variations, and re-check the logs each time. If you see any resets/timeouts/failures on an ata/sata/scsi HBA, stop any procedures on the disks attached to that hardware.
4. Run mdadm --assemble --scan --verbose. This will most likely fail, but it gives you a good overview of what mdadm discovers, and an idea of what would happen when you force it.
5. If --assemble --scan --verbose fails, try to --force it.
User contributions licensed under CC BY-SA 3.0