Simple mdadm RAID 1 not activating spare


24

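I had created two 2TB HDD partitions (/dev/sdb1 and /dev/sdc1) in a RAID 1 array called /dev/md0 using mdadm on Ubuntu 12.04 LTS Precise Pangolin.

The command sudo mdadm --detail /dev/md0 used to indicate both drives as active sync.

Then, for testing, I failed /dev/sdb1, removed it, and then added it again with the command sudo mdadm /dev/md0 --add /dev/sdb1

watch cat /proc/mdstat showed a progress bar of the array rebuilding, but I wasn't going to spend hours watching it, so I assumed that the software knew what it was doing.

After the progress bar was no longer showing, cat /proc/mdstat displays: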

md0 : active raid1 sdb1[2](S) sdc1[1]
      1953511288 blocks super 1.2 [2/1] [U_]

sudo mdadm --detail /dev/md0 shows:

/dev/md0:
        Version : 1.2
  Creation Time : Sun May 27 11:26:05 2012
     Raid Level : raid1
     Array Size : 1953511288 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511288 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon May 28 11:16:49 2012
          State : clean, degraded 
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           Name : Deltique:0  (local to host Deltique)
           UUID : 49733c26:dd5f67b5:13741fb7:c568bd04
         Events : 32365

    Number   Major   Minor   RaidDevice State
       1       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed

       2       8       17        -      spare   /dev/sdb1

I've been told that mdadm automatically replaces removed drives with spares, but /dev/sdb1 isn't being moved into the expected position, RaidDevice 1.
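The symptom described here, a degraded array that still lists a member flagged (S), can be spotted mechanically. The following is only a minimal sketch and not part of the original question: the function name stuck_spares is made up for illustration, and it assumes the two-line /proc/mdstat layout shown above.

```shell
# Report md arrays that are degraded yet still hold a member flagged (S).
# Reads /proc/mdstat-style text on stdin; prints "<array> <spare members>".
# stuck_spares is a hypothetical helper name, not an mdadm feature.
stuck_spares() {
  awk '
    /^md[0-9]+ :/ {
      array = $1; spares = ""
      for (i = 4; i <= NF; i++)
        if ($i ~ /\(S\)$/) {
          member = $i
          sub(/\[.*$/, "", member)      # "sdb1[2](S)" -> "sdb1"
          spares = spares (spares == "" ? "" : ",") member
        }
      next
    }
    # The status line carries [configured/active], e.g. [2/1] when degraded.
    array != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
      if (n[1] + 0 > n[2] + 0 && spares != "")
        print array, spares
      array = ""
    }
  '
}
```

It would be run as stuck_spares < /proc/mdstat; fed the mdstat output quoted above, it prints md0 sdb1.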


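UPDATE (30 May 2012): A badblocks destructive read-write test of the entire /dev/sdb yielded no errors, as expected; both HDDs are new.

As of the latest edit, I assembled the array with this command: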

sudo mdadm --assemble --force --no-degraded /dev/md0 /dev/sdb1 /dev/sdc1

The output was:

mdadm: /dev/md0 has been started with 1 drive (out of 2) and 1 rebuilding.

The rebuild looks like it's progressing normally:

md0 : active raid1 sdc1[1] sdb1[2]
      1953511288 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.6% (13261504/1953511288) finish=2299.7min speed=14060K/sec

unused devices: <none>

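I'm now waiting on this rebuild, but I expect /dev/sdb1 to end up as a spare, just like the five or six times I have tried rebuilding before.


UPDATE (31 May 2012): Yeah, it's still a spare. Ugh!


UPDATE (01 June 2012): I'm trying Adrian Kelly's suggested command: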

sudo mdadm --assemble --update=resync /dev/md0 /dev/sdb1 /dev/sdc1

Waiting on the rebuild now...


UPDATE (02 June 2012): Nope, still a spare...


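UPDATE (04 June 2012): PB brought up a concern that I had overlooked: perhaps /dev/sdc1 is encountering I/O errors. I hadn't bothered to check /dev/sdc1 because it appeared to be working just fine and was brand new, but I/O errors towards the end of the drive are a rational possibility.

I bought these HDDs cheaply, so it would be no surprise if one of them is already failing. Plus, neither of them supports SMART, so no wonder they were so cheap...

Here is the data recovery procedure that I just made up and am following:

  1. sudo mdadm /dev/md0 --fail /dev/sdb1 so that I can take out /dev/sdb1.
  2. sudo mdadm /dev/md0 --remove /dev/sdb1 to remove /dev/sdb1 from the array.
  3. /dev/sdc1 is mounted at /media/DtkBk.
  4. Format /dev/sdb1 as ext4.
  5. Mount /dev/sdb1 at /media/DtkBkTemp.
  6. cd /media to work in that area.
  7. sudo chown deltik DtkBkTemp to give me (username deltik) rights to the partition.
  8. Copy all files and directories: sudo rsync -avzHXShP DtkBk/* DtkBkTemp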

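UPDATE (06 June 2012): I ran a badblocks destructive write-mode test on /dev/sdc, using the following procedure:

  1. sudo umount /media/DtkBk to permit tearing down the array.
  2. sudo mdadm --stop /dev/md0 to stop the array.
  3. sudo badblocks -w -p 1 /dev/sdc -s -v to wipe the suspect hard drive and, in the process, check for I/O errors. If there are I/O errors, that is not a good sign. Hopefully, I can get a refund...

I have now confirmed that there are no input/output issues on either HDD.

From all this investigating, my two original questions still stand.


My questions are:

  1. Why isn't the spare drive becoming active sync?
  2. How can I make the spare drive become active?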

Answers:


14

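Doing this simply chucks the drive into the array without actually doing anything with it, i.e. it is a member of the array but not active in it. By default, this turns it into a spare: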

sudo mdadm /dev/md0 --add /dev/sdb1

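If you have a spare, you can activate it by forcing the array's active drive count to grow. With 3 drives and 2 expected to be active, you would need to increase the active count to 3.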

mdadm --grow /dev/md0 --raid-devices=3

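The RAID array driver will notice that you are "short" a drive and then look for a spare. On finding the spare, it will integrate it into the array as an active drive. Open a spare terminal and let this rather crude command line run in it to keep tabs on the re-sync progress. Be sure to type it as one line or use the line break (\) character, and once the rebuild finishes, just type Ctrl-C in the terminal.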

while true; do sleep 60; clear; sudo mdadm --detail /dev/md0; echo; cat /proc/mdstat; done
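As a hedged alternative to the endless loop above, the sketch below stops by itself once /proc/mdstat no longer shows a recovery or resync line. The function names and the MDSTAT and INTERVAL variables are assumptions made for this example; they are not mdadm options.

```shell
# Succeeds (exit 0) while mdstat-style text on stdin shows a rebuild line,
# e.g. "[>....]  recovery =  0.6% (...)" or "resync=DELAYED".
resync_running() {
  grep -Eq '(recovery|resync) *='
}

# Poll until no rebuild line remains, then print the final state.
# MDSTAT (default /proc/mdstat) and INTERVAL (default 60 s) are
# illustrative knobs so the loop can be tried against a sample file.
wait_for_resync() {
  while resync_running < "${MDSTAT:-/proc/mdstat}"; do
    sleep "${INTERVAL:-60}"
  done
  cat "${MDSTAT:-/proc/mdstat}"
}
```

Calling wait_for_resync and then sudo mdadm --detail /dev/md0 gives the same information as the loop, but only once the rebuild has ended. Note that an aborted rebuild also makes the line disappear, so the final state still needs to be read.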

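Your array will now have two active drives that are in sync, but because there are not 3 drives, it will not be 100% clean. Remove the failed drive, then resize the array. Note that the --grow flag is a bit of a misnomer - it can mean either grow or shrink: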

sudo mdadm /dev/md0 --fail /dev/{failed drive}
sudo mdadm /dev/md0 --remove /dev/{failed drive}
sudo mdadm --grow /dev/md0 --raid-devices=2

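With regard to errors, a link problem with the drive (i.e. the PATA/SATA port, cable, or drive connector) is not enough to trigger a failover to a hot spare, as the kernel will typically switch to using the other "good" drive while it resets the link to the "bad" drive. I know this because I run a 3-drive array, 2 hot and 1 spare, and one of the drives just recently decided to barf up a bit in the logs. When I tested all the drives in the array, all 3 passed the "long" version of the SMART test, so it isn't a problem with the platters, mechanical components, or the onboard controller - which leaves a flaky link cable or a bad SATA port. Perhaps this is what you are seeing. Try switching the drive to a different motherboard port, or using a different cable, and see if it improves.


A follow-up: I completed my expansion of the mirror to 3 drives, failed and removed the flaky drive from the md array, hot-swapped the cable for a new one (the motherboard supports this), and re-added the drive. Upon re-add, it immediately started a re-sync of the drive. So far, not a single error has appeared in the log despite the drive being heavily used. So, yes, drive cables can go flaky.


Flaky link cable? I buy that explanation, but I can't test it anymore because I repurposed both drives months ago. I accept this answer as the best answer for my particular problem, but another great answer is this one.
Deltik 2013

As an update, this answer is still the most useful for most people, which is why I accepted it, but what actually happened is that one of the drives in my RAID 1 array was bad, most likely /dev/sdc1 at the time, because /dev/sdc1 was being read from while /dev/sdb1 was being written to, and any bad sectors in /dev/sdb1 would have been transparently remapped during writing.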
Deltik

1
To keep a close eye on the re-sync process, run watch -n 60 cat /proc/mdstat, where 60 is the number of seconds between refreshes.
Erk

8

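I had exactly the same problem, and in my case I found that the active RAID disk suffered read errors during synchronization. The new disk was therefore never successfully synchronized, so it stayed marked as a spare.

You might want to check /var/log/messages and other system logs for errors. Additionally, it may be a good idea to check your disk's SMART status:
1) Run the short test:

"smartctl -t short /dev/sda"

2) Display the test results:

"smartctl -l selftest /dev/sda"

In my case this returned something like this:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      7564         27134728
# 2  Short offline       Completed: read failure       90%      7467         1408449701

I had to boot a live distro and manually copy the data from the defective disk to the new (currently "spare") disk.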
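For anyone who wants to script this check, here is a small sketch that pulls the failing LBAs out of the self-test log. It assumes the column layout smartctl prints for ATA drives, as in the logs quoted on this page; the function name selftest_read_failures is invented for the example.

```shell
# Print the LBA_of_first_error of every self-test that ended in a read failure.
# Reads `smartctl -l selftest` output on stdin; exits non-zero if none is found.
# selftest_read_failures is a hypothetical helper name.
selftest_read_failures() {
  awk '/^# *[0-9]+ / && /read failure/ { print $NF; found = 1 }
       END { exit !found }'
}
```

A possible use is sudo smartctl -l selftest /dev/sdX | selftest_read_failures, run against the disk that is still active in the array as well as the one being added.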


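Aha! I didn't think to suspect the active drive of I/O errors. For some reason, SMART isn't supported on these HDDs. That, plus possible I/O errors, on two brand new HDDs? I think I made a bad purchase... Anyway, I'm currently applying a data recovery procedure to the HDD that I believe is good. I'll update soon.
Deltik 2012

+50 rep to you, PB. Nobody was able to answer my question correctly, but rather than let 50 reputation points go to waste, I'm giving them to you as a welcome gift. Welcome to Stack Exchange!
Deltik 2012

3

I had exactly the same problem and always assumed that the second disk, the one I wanted to re-add to the array, had errors. But it was my original disk that had read errors.

You can check with smartctl -t short /dev/sdX and view the result a few minutes later with smartctl -l selftest /dev/sdX. For me it looked like this: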

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     25151         734566647

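I tried to fix them with this manual. That was fun :-). I know you have checked both disks for errors, but I think your problem is that the disk still in the md array has read errors, so adding a second disk fails.

Update

You should additionally run smartctl -a /dev/sdX. If you see Current_Pending_Sector > 0, something is wrong:

197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       69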
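The raw value in the last column is the number to watch. As a rough sketch, assuming the ten-column attribute table that smartctl prints for ATA drives, the value can be extracted like this (smart_raw_value is a made-up helper name):

```shell
# Print the RAW_VALUE of one SMART attribute (default 197, Current_Pending_Sector).
# Reads `smartctl -A` or `smartctl -a` output on stdin.
# The "$3 ~ /^0x/" test skips error-log rows that also start with a number.
smart_raw_value() {
  awk -v id="${1:-197}" '$1 == id && $3 ~ /^0x/ && NF >= 10 { print $10; exit }'
}
```

For example, sudo smartctl -A /dev/sdX | smart_raw_value 197 prints 69 for the line above; anything other than 0 deserves a closer look before the disk is trusted as a rebuild source.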

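For me it was definitely a problem: I had removed a disk from the RAID just for testing, and the resync could not complete because of read failures. The sync aborted halfway. When I checked the disk that was still in the RAID array, smartctl reported problems.

I was able to fix them with the manual above and saw the number of pending sectors go down. But there were too many, and it is a long and boring procedure, so I used my backup and restored the data on a different server.

Since you didn't have the opportunity to use SMART, I guess your own tests did not reveal those broken sectors.

For me it is a lesson learned: check your disks before you remove one from your array.


The RAID 1 array had ceased to exist by the time you answered, and both drives were found to have no I/O errors. Can you verify that your answer is applicable?
Deltik 2012

Finally accepted. This answer is the most likely to help future visitors. As for me, I gave up on RAID in general. It's not like I own a datacenter.
Deltik 2012

This is no longer the accepted answer, but it is still a good answer and may help somebody else. This answer is the one most applicable to me, but this answer is probably most applicable to other people. Also, I take back what I said about RAID in this comment.
2015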

3

I had a similar problem and fixed it by increasing the number of RAID array disks from 1 to 2.

mdadm --grow --raid-devices=2 /dev/md1
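Whether this case applies can be read from mdadm --detail: the Raid Devices line gives the number of slots, and Active Devices gives how many are filled. The sketch below (the helper name unfilled_slots is an assumption for illustration) subtracts the two. A result of 0 together with a non-zero Spare Devices count suggests the spare has no slot to move into, which is the situation the --grow command above addresses.

```shell
# Print how many RAID slots are not filled by an active member.
# Reads `mdadm --detail` output on stdin.
# unfilled_slots is a hypothetical helper name.
unfilled_slots() {
  awk -F' *: *' '
    /Raid Devices :/   { raid = $2 }
    /Active Devices :/ { active = $2 }
    END { print raid - active }
  '
}
```

It could be used as sudo mdadm --detail /dev/md1 | unfilled_slots.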

3

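UPDATE (24 May 2015): After three years, I investigated the true cause of the RAID 1 array being degraded.

tl;dr: One of the drives was bad, and I didn't notice because I had only conducted a full surface test on the good drive.

Three years ago, I didn't think to check any logs for I/O issues. Had I thought to check /var/log/syslog, I would have seen something like this when mdadm gave up on rebuilding the array: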

May 24 14:08:32 node51 kernel: [51887.853786] sd 8:0:0:0: [sdi] Unhandled sense code
May 24 14:08:32 node51 kernel: [51887.853794] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853798] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 24 14:08:32 node51 kernel: [51887.853802] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853805] Sense Key : Medium Error [current]
May 24 14:08:32 node51 kernel: [51887.853812] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853815] Add. Sense: Unrecovered read error
May 24 14:08:32 node51 kernel: [51887.853819] sd 8:0:0:0: [sdi] CDB:
May 24 14:08:32 node51 kernel: [51887.853822] Read(10): 28 00 00 1b 6e 00 00 00 01 00
May 24 14:08:32 node51 kernel: [51887.853836] end_request: critical medium error, dev sdi, sector 14381056
May 24 14:08:32 node51 kernel: [51887.853849] Buffer I/O error on device sdi, logical block 1797632

To get that output in the log, I sought the first problematic LBA (14381058, in my case) with this command:

root@node51 [~]# dd if=/dev/sdi of=/dev/zero bs=512 count=1 skip=14381058
dd: error reading ‘/dev/sdi’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 7.49287 s, 0.0 kB/s

No wonder md gave up! It can't rebuild an array from a bad drive.
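The numbers in the log and the dd command line up if the "logical block" in the Buffer I/O error line is a 4096-byte block made of eight 512-byte sectors. That block size is an assumption, but it is what the figures are consistent with. A quick arithmetic sketch (sector_to_block is a made-up helper):

```shell
# Map a 512-byte sector number to the block that contains it.
# $1 = sector, $2 = sectors per block (assumed 8, i.e. 4096-byte blocks).
sector_to_block() {
  echo $(( $1 / ${2:-8} ))
}

sector_to_block 14381056   # sector from the end_request line
sector_to_block 14381058   # first bad LBA, as passed to dd with skip=
```

Both calls print 1797632, the logical block named in the Buffer I/O error line, so the log entry and the failing dd read point at the same spot on the disk.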

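New technology (better smartmontools hardware compatibility?) has allowed me to get SMART information out of the drive, including the last five errors (of 1393 errors so far):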

root@node51 [~]# smartctl -a /dev/sdi
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-43-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML2220FA040K9E
LU WWN Device Id: 5 000cca 36ac1d394
Firmware Version: ML6OA800
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5940 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun May 24 14:13:35 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (21438) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 358) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       93
  3 Spin_Up_Time            0x0007   172   172   024    Pre-fail  Always       -       277 (Average 362)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       174
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       22419
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       161
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       900
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       900
194 Temperature_Celsius     0x0002   127   127   000    Old_age   Always       -       47 (Min/Max 19/60)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       8
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       30
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2

SMART Error Log Version: 1
ATA Error Count: 1393 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1393 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 02 70 db 00  Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 00 70 db 40 00   1d+03:59:34.096  READ DMA EXT
  25 00 08 00 70 db 40 00   1d+03:59:30.334  READ DMA EXT
  b0 d5 01 09 4f c2 00 00   1d+03:57:59.057  SMART READ LOG
  b0 d5 01 06 4f c2 00 00   1d+03:57:58.766  SMART READ LOG
  b0 d5 01 01 4f c2 00 00   1d+03:57:58.476  SMART READ LOG

Error 1392 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 02 70 db 00  Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 00 70 db 40 00   1d+03:59:30.334  READ DMA EXT
  b0 d5 01 09 4f c2 00 00   1d+03:57:59.057  SMART READ LOG
  b0 d5 01 06 4f c2 00 00   1d+03:57:58.766  SMART READ LOG
  b0 d5 01 01 4f c2 00 00   1d+03:57:58.476  SMART READ LOG
  b0 d5 01 00 4f c2 00 00   1d+03:57:58.475  SMART READ LOG

Error 1391 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 02 70 db 00  Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 00 70 db 40 00   1d+03:56:28.228  READ DMA EXT
  25 00 08 00 70 db 40 00   1d+03:56:24.549  READ DMA EXT
  25 00 08 00 70 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 10 f0 71 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 f0 00 71 db 40 00   1d+03:56:06.710  READ DMA EXT

Error 1390 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 02 70 db 00  Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 00 70 db 40 00   1d+03:56:24.549  READ DMA EXT
  25 00 08 00 70 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 10 f0 71 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 f0 00 71 db 40 00   1d+03:56:06.710  READ DMA EXT
  25 00 10 f0 70 db 40 00   1d+03:56:06.687  READ DMA EXT

Error 1389 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 02 70 db 00  Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 00 70 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 10 f0 71 db 40 00   1d+03:56:06.711  READ DMA EXT
  25 00 f0 00 71 db 40 00   1d+03:56:06.710  READ DMA EXT
  25 00 10 f0 70 db 40 00   1d+03:56:06.687  READ DMA EXT
  25 00 f0 00 70 db 40 00   1d+03:56:03.026  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     21249         14381058

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

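Ahh... that would do it.

Now, I have solved this question in three easy steps:

  1. Become a system administrator in three years.
  2. Check the logs.
  3. Come back to Super User and laugh at my approach from three years ago.

UPDATE (19 July 2015): For anyone who is curious, the drive finally ran out of sectors to remap: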

root@node51 [~]# smartctl -a /dev/sdg
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-43-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML2220FA040K9E
LU WWN Device Id: 5 000cca 36ac1d394
Firmware Version: ML6OA800
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5940 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jul 19 14:00:33 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (21438) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 358) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   016    Pre-fail  Always       -       2
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       93
  3 Spin_Up_Time            0x0007   163   163   024    Pre-fail  Always       -       318 (Average 355)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       181
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1978
  7 Seek_Error_Rate         0x000b   086   086   067    Pre-fail  Always       -       1245192
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       23763
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       167
192 Power-Off_Retract_Count 0x0032   092   092   000    Old_age   Always       -       10251
193 Load_Cycle_Count        0x0012   092   092   000    Old_age   Always       -       10251
194 Temperature_Celsius     0x0002   111   111   000    Old_age   Always       -       54 (Min/Max 19/63)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2927
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       33
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2

SMART Error Log Version: 1
ATA Error Count: 2240 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2240 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 f0 18 0f 2f 00  Error: IDNF 240 sectors at LBA = 0x002f0f18 = 3084056

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 f0 18 0f 2f 40 00      00:25:01.942  WRITE DMA EXT
  35 00 f0 28 0e 2f 40 00      00:25:01.168  WRITE DMA EXT
  35 00 f0 38 0d 2f 40 00      00:25:01.157  WRITE DMA EXT
  35 00 f0 48 0c 2f 40 00      00:25:01.147  WRITE DMA EXT
  35 00 f0 58 0b 2f 40 00      00:25:01.136  WRITE DMA EXT

Error 2239 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 5a 4e f7 2e 00  Error: IDNF 90 sectors at LBA = 0x002ef74e = 3077966

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 f0 b8 f6 2e 40 00      00:24:57.967  WRITE DMA EXT
  35 00 f0 c8 f5 2e 40 00      00:24:57.956  WRITE DMA EXT
  35 00 f0 d8 f4 2e 40 00      00:24:57.945  WRITE DMA EXT
  35 00 f0 e8 f3 2e 40 00      00:24:57.934  WRITE DMA EXT
  35 00 f0 f8 f2 2e 40 00      00:24:57.924  WRITE DMA EXT

Error 2238 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 40 a8 c6 2e 00  Error: IDNF 64 sectors at LBA = 0x002ec6a8 = 3065512

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 f0 f8 c5 2e 40 00      00:24:49.444  WRITE DMA EXT
  35 00 f0 08 c5 2e 40 00      00:24:49.433  WRITE DMA EXT
  35 00 f0 18 c4 2e 40 00      00:24:49.422  WRITE DMA EXT
  35 00 f0 28 c3 2e 40 00      00:24:49.412  WRITE DMA EXT
  35 00 f0 38 c2 2e 40 00      00:24:49.401  WRITE DMA EXT

Error 2237 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 ea be ba 2e 00  Error: IDNF 234 sectors at LBA = 0x002ebabe = 3062462

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 f0 b8 ba 2e 40 00      00:24:39.263  WRITE DMA EXT
  35 00 f0 c8 b9 2e 40 00      00:24:38.885  WRITE DMA EXT
  35 00 f0 d8 b8 2e 40 00      00:24:38.874  WRITE DMA EXT
  35 00 f0 e8 b7 2e 40 00      00:24:38.862  WRITE DMA EXT
  35 00 f0 f8 b6 2e 40 00      00:24:38.852  WRITE DMA EXT

Error 2236 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 86 c2 2a 2e 00  Error: IDNF 134 sectors at LBA = 0x002e2ac2 = 3025602

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 f0 58 2a 2e 40 00      00:24:25.605  WRITE DMA EXT
  35 00 f0 68 29 2e 40 00      00:24:25.594  WRITE DMA EXT
  35 00 f0 78 28 2e 40 00      00:24:25.583  WRITE DMA EXT
  35 00 f0 88 27 2e 40 00      00:24:25.572  WRITE DMA EXT
  35 00 f0 98 26 2e 40 00      00:24:25.561  WRITE DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Completed: read failure       50%     23763         869280
# 2  Extended offline    Completed without error       00%     22451         -
# 3  Short offline       Completed without error       00%     22439         -
# 4  Extended offline    Completed: read failure       90%     21249         14381058
1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
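In a dump this long, the one line that matters is easy to miss. As a sketch under the assumption that the attribute table keeps the ten-column layout shown above, the following filter prints only attributes whose WHEN_FAILED column is set (failing_attributes is an invented name):

```shell
# List SMART attributes whose WHEN_FAILED column is not "-".
# Reads `smartctl -A` or `smartctl -a` output on stdin;
# prints "<name> <when_failed> <raw_value>" per flagged attribute.
# The "$3 ~ /^0x/" test keeps error-log rows out of the result.
failing_attributes() {
  awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 && $9 != "-" { print $2, $9, $10 }'
}
```

Fed the output above, it prints Reallocated_Sector_Ct FAILING_NOW 1978 and nothing else.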

1
Yeah, that's exactly what happened to my RAID, too! This is the actual answer to your own question! Thanks for keeping it up to date!!!
Preexo

1

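In my case, it was also a bad source disk, although at the time it did not look that way (/proc/mdstat progressed normally past 99.9% - but the rebuild actually failed at 99.97%, which coincided with when a regular sync would have finished). So you need to check the dmesg(1) output - it will tell you whether there are any read errors.

You can see the details of my case in Debian Bug #767243. I finally managed to finish the sync by forcibly overwriting a few bad sectors on the source disk (luckily they were unused in my case; otherwise there would have been data loss).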
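A hedged sketch of that dmesg check: the filter below keeps only lines that look like unreadable-sector reports. The patterns are taken from the kernel log excerpt quoted earlier on this page; wording differs between kernel versions, so treat the list as a starting point, and the function name medium_errors as made up.

```shell
# Keep only kernel log lines that point at an unreadable sector.
# Reads dmesg or syslog text on stdin; exits non-zero if nothing matches.
medium_errors() {
  grep -E 'critical medium error|Unrecovered read error|Medium Error|Buffer I/O error'
}
```

Typical use would be dmesg | medium_errors, or medium_errors < /var/log/syslog on systems that keep that file.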


0

You could try

sudo mdadm --assemble --update=resync /dev/md0 /dev/sdb1 /dev/sdc1

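to update the drives and resync them.


Trying that now... I'll report back after the rebuild supposedly finishes.
Deltik 2012

Didn't work. /dev/sdb1 still doesn't become "active" after being rebuilt as a spare.
Deltik 2012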

0

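Not sure if it will work since you have already --add'ed the disk, but --re-add seems to be the option you need.

Or maybe you need to --grow the device to 2 active disks, mdadm --grow -n 2? Not tested, so be careful.


sudo mdadm --grow -n 2 was one of the first things I did, which is why sudo mdadm --detail /dev/md0 shows two slots. Sorry, it doesn't work.
Deltik 2012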

0

I would recommend removing sdc1, zeroing the superblock on sdc1, and then re-adding it.

mdadm /dev/md0 -r /dev/sdc1
mdadm --zero-superblock /dev/sdc1
mdadm /dev/md0 -a /dev/sdc1

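I moved my data onto each HDD in turn while zeroing the superblock on the other HDD. The problem I'm having recurs even with a complete re-creation of the RAID 1 array.
Deltik 2012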
Licensed under cc by-sa 3.0 with attribution required.