我有一个由6个RAIDZ组成的ZFS池。RAIDZ中的一个降级了,这是因为单个RAIDZ中的两个磁盘之间的松动距离足够近,以至于ZFS无法在第二个磁盘发生故障之前从第一个故障中恢复。这是重启后不久“ zpool status”的输出:
pool: pod2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver in progress for 0h6m, 0.05% done, 237h17m to go
config:
NAME STATE READ WRITE CKSUM
pod2 DEGRADED 0 0 29.3K
raidz1-0 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F165XG ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F1660X ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F1678R ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F1689F ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F16AW9 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F16C6E ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F16C9F ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F16FCD ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F16JDQ ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17M6V ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17MSZ ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17MXE ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17XKB ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17XMW ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F17ZHY ONLINE 0 0 0
raidz1-3 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F18BM4 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F18BRF ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_W1F18XLP ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09880 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F098BE ONLINE 0 0 0
raidz1-4 DEGRADED 0 0 58.7K
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09B0M ONLINE 0 0 0
spare-1 DEGRADED 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BEN UNAVAIL 0 0 0 cannot open
disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F49M01 ONLINE 0 0 0 837K resilvered
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0D6LC ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWD1 ONLINE 0 0 0
spare-4 DEGRADED 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09C8G UNAVAIL 0 0 0 cannot open
disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F4A7ZE ONLINE 0 0 0 830K resilvered
raidz1-5 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-1CH_Z1F2KNQP ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BML0 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPV4 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPZP ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ78 ONLINE 0 0 0
raidz1-6 ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ9G ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQDF ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQFQ ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CW1A ONLINE 0 0 0
disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BV7M ONLINE 0 0 0
spares
disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F49M01 INUSE currently in use
disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F4A7ZE INUSE currently in use
disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F49MB1 AVAIL
disk/by-id/scsi-SATA_ST3000DM001-1ER_Z5001SS2 AVAIL
disk/by-id/scsi-SATA_ST3000DM001-1ER_Z5001R0F AVAIL
errors: 37062187 data errors, use '-v' for a list
当第一个磁盘发生故障时,我用一个热备用磁盘替换了它,然后它开始恢复银色。重新完成备份之前,第二个磁盘发生了故障,因此我用另一个热备用磁盘替换了第二个磁盘。从那时起,它将开始重新启动,完成大约50%的工作,然后开始吞噬内存,直到它耗尽所有内存并导致操作系统崩溃。
目前,升级服务器上的RAM并不是一个简单的选择,而且我不清楚这样做是否可以保证解决方案。我知道在此阶段会丢失数据,但是如果我可以牺牲这个RAIDZ的内容来保留池的其余部分,那是完全可以接受的结果。我正在将该服务器的内容备份到另一台服务器,但是内存消耗问题每48小时左右会强制重新引导(或崩溃),这会中断我的rsync备份,并且重新启动rsync会花费一些时间(它可以一旦弄清它停在哪里,就会恢复执行,但这需要很长时间。
我认为ZFS尝试处理两个备用替换操作是内存消耗问题的根源,因此我想删除其中一个热备用,以便ZFS一次可以处理一个。但是,当我尝试分离其中一个备用磁盘时,出现“无法分离/ dev / disk / by-id / scsi-SATA_ST3000DM001-1CH_W1F49M01:无有效副本”的信息。也许我可以使用-f选项来强制执行该操作,但是我不清楚这样做的结果是什么,因此我想在继续之前查看是否有人有任何输入。
如果我可以使系统进入稳定状态,并且该系统可以保持足够长的运行时间以完成备份,那么我计划将其卸下进行检修,但是在当前条件下,它会陷入恢复循环中。
zfs-fuse
。这真的是 ZFS保险丝吗?请提供操作系统详细信息。