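Alright, I'm going to try to explain my situation thoroughly, so please bear with me. Here is where I stand. I have a 500GB hard drive that used to live in a server, where it hosted virtual machines, as one of two drives in a RAID 1. I'm assuming the two drives are identical at this point, but I do have the other drive in case this one doesn't work out or I need it for some reason.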
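This drive is attached to the onboard SATA of a small Linux box (an Intel server Mini-ITX board) running Ubuntu Server 14.04 LTS, freshly installed for this purpose. I have installed vmfs-tools, which provides vmfs-fuse, and I'm using that to mount the drive: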
sudo vmfs-fuse /dev/sda1 /mnt/recovery
This mounts successfully as read-only (note that /dev/sdb is my boot drive; the two are swapped because I mixed up the SATA ports). My fdisk output is as follows:
taylor@nas:~$ sudo fdisk -l /dev/sda
WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sda1 1 975699967 487849983+ ee GPT
I can successfully read the contents of the folder containing the file I'm having trouble with, dot-flat.vmdk:
taylor@nas:~$ sudo ls /mnt/recovery/dot
dot-flat.vmdk dot.vmdk dot.vmx vmware-1.log vmware-3.log vmware.log
dot.nvram dot.vmsd dot.vmxf vmware-2.log vmware-4.log
Naturally I wanted to test that I could actually read these files, hoping the contents were not corrupted, so I tried tailing vmware.log:
taylor@nas:~$ sudo tail /mnt/recovery/dot/vmware.log
2014-12-01T09:19:46.553Z| vmx| I120: VMIOP: Exit
2014-12-01T09:19:46.696Z| vmx| I120: Vix: [35957 mainDispatch.c:849]: VMAutomation_LateShutdown()
2014-12-01T09:19:46.696Z| vmx| I120: Vix: [35957 mainDispatch.c:799]: VMAutomationCloseListenerSocket. Closing listener socket.
2014-12-01T09:19:46.715Z| vmx| I120: Flushing VMX VMDB connections
2014-12-01T09:19:46.715Z| vmx| I120: VmdbDbRemoveCnx: Removing Cnx from Db for '/db/connection/#1/'
2014-12-01T09:19:46.715Z| vmx| I120: VmdbCnxDisconnect: Disconnect: closed pipe for pub cnx '/db/connection/#1/' (0)
2014-12-01T09:19:46.721Z| vmx| I120: VMX exit (0).
2014-12-01T09:19:46.721Z| vmx| I120: AIOMGR-S : stat o=1 r=3 w=0 i=0 br=49152 bw=0
2014-12-01T09:19:46.721Z| vmx| I120: OBJLIB-LIB: ObjLib cleanup done.
2014-12-01T09:19:46.721Z| vmx| W110: VMX has left the building: 0.
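So that's good, and it's probably not a drive problem. Either way, I wanted to check the SMART data. The drives were working fine before I pulled them for an upgrade. I installed smartmontools and then ran: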
taylor@nas:~$ sudo smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE3 Serial ATA
Device Model: WDC WD5002ABYS-18B1B0
Serial Number: WD-WCASY4933732
LU WWN Device Id: 5 0014ee 202b597b2
Add. Product Id: DELL
Firmware Version: 02.03B04
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.5, 3.0 Gb/s
Local Time is: Sun Dec 7 01:55:02 2014 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 9480) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 112) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 194 185 021 Pre-fail Always - 3291
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 196
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 056 056 000 Old_age Always - 32738
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 106
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 99
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 196
194 Temperature_Celsius 0x0022 108 105 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 32447 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
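At least to me, this looks like a pretty straightforward SMART report for an older drive, especially one with 32738 power-on hours (about 1364 days!). I ran the report on the other drive (its RAID pair) earlier, and the results were very similar, if I remember correctly.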
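The drive in question held around 8 virtual machines, all of which came off without a problem. For those I used a simple cp command to copy them to a different drive. That destination drive is the same one the OS runs from and has about 700GB free. The problem appeared once cp reached the largest (by a wide margin) of the VMDK files. Most of the VMDKs are around 25-30GB, while the problematic one is about 300GB. The large VMDK was originally created as a THICK VMDK. Here is what cp does: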
taylor@nas:~$ sudo cp /mnt/recovery/dot/dot-flat.vmdk ~/dot-flat.vmdk
cp: error reading ‘/mnt/recovery/dot/dot-flat.vmdk’: Input/output error
cp: failed to extend ‘/home/taylor/dot-flat.vmdk’: Input/output error
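Everything I've read about Input/output error says it means the drive is failing. But I get the same problem with both drives, and the SMART tests look fine, so I think it may be something else. The file size could also be a factor.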
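To tell whether the error is tied to one region of the file rather than to its sheer size, one option would be to read the file in fixed-size chunks and record which chunks fail. The following is only a minimal Python sketch of that idea, not a tested recovery tool: the function name copy_skipping_errors and the 1 MiB default chunk size are illustrative choices, and any chunk that cannot be read is simply zero-filled in the output.

```python
import os


def copy_skipping_errors(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to dst chunk by chunk, zero-filling chunks that fail to read.

    Returns a list of (offset, length) tuples for the unreadable chunks.
    Hypothetical helper for illustration; relies on os.pread (Unix only).
    """
    bad_ranges = []
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        total = os.fstat(src_fd).st_size
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < total:
                want = min(chunk_size, total - offset)
                try:
                    data = os.pread(src_fd, want, offset)
                except OSError:
                    # Remember where the read failed and substitute zeros
                    # so later data keeps its original offset.
                    bad_ranges.append((offset, want))
                    data = b"\x00" * want
                if not data:
                    break  # unexpected end of file
                dst.write(data)
                offset += len(data)  # handles short reads as well
    finally:
        os.close(src_fd)
    return bad_ranges
```

Called as copy_skipping_errors("/mnt/recovery/dot/dot-flat.vmdk", "/home/taylor/dot-flat.vmdk"), the returned list would show whether the failures cluster at particular offsets. A smaller chunk size narrows the location at the cost of speed.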
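So I decided to try rsync, thinking a bit-for-bit copy might work better for me. This one is a bit stranger. At first rsync appeared to work just fine, and with ls -al I could see the temporary file in the destination directory growing steadily. However, once the destination file reached its full size, rsync showed Input/output error just as before and then started the whole process over, deleting the file it had just transferred (or at least part of it). Talk about frustrating. Here is what the output looks like.
While it is going well: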
taylor@nas:~$ sudo rsync -av --progress /mnt/recovery/dot/dot-flat.vmdk ~/dot-flat.vmdk
sending incremental file list
dot-flat.vmdk
201,769,451,520 62% 95.50MB/s 0:20:30
After it finishes:
taylor@nas:~$ sudo rsync -av --progress /mnt/recovery/dot/dot-flat.vmdk ~/dot-flat.vmdk
sending incremental file list
dot-flat.vmdk
322,122,547,200 100% 81.94MB/s 1:02:29 (xfr#1, to-chk=0/1)
rsync: read errors mapping "/mnt/recovery/dot/dot-flat.vmdk": Input/output error (5)
WARNING: dot-flat.vmdk failed verification -- update discarded (will try again).
dot-flat.vmdk
672,759,808 0% 85.96MB/s 1:00:52
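As a quick sanity check on the numbers in that output, the byte count and elapsed time can be converted by hand. This is just arithmetic on the figures rsync printed, sketched in Python; the variable names are illustrative.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

bytes_transferred = 322_122_547_200   # from the 100% progress line
elapsed_seconds = 1 * 3600 + 2 * 60 + 29  # 1:02:29

size_gib = bytes_transferred / GIB
avg_rate_mib_s = bytes_transferred / MIB / elapsed_seconds

print(f"{size_gib} GiB at {avg_rate_mib_s:.2f} MiB/s")
```

The transfer works out to exactly 300 GiB at an average of about 81.94 MiB/s, which matches the rate rsync reported. So rsync did walk the full length of the file before reporting the read error and discarding the result.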
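Really, in the end I'm just after a few files inside the VMDK. If there is a way to mount the VMDK directly I'd love to know, but nothing I've found online on the subject has worked, mostly because I have a VMFS volume rather than something more straightforward like EXT4. I imagine there may be some workaround, but I'm not sure.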
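I suppose I could try popping both drives back into the server, recreating my ESXi VM and pulling the data off that way? No. The server they came from is a Dell PowerEdge 1950 with a SAS 6i/R RAID controller, and it already holds a different array. Even if I wanted to, I couldn't put the drives back in and load them in ESXi without at least formatting them.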
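So this is where I turn to SuperUser. Any suggestions on what I can do? Any sort of alternative copy utility? A way to mount the VMDK given that the drive is formatted as VMFS? A fix for the Input/output error? Maybe even a way to pick the VMDK apart by hand? There is only one partition on the image itself, so it should be easy to guess where the virtual EXT4 partition starts, but I don't know how a VMDK file is structured.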
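On the question of where the partition starts, the guess could be checked rather than assumed. The sketch below rests on two assumptions: that dot-flat.vmdk is a raw image of the virtual disk (as is generally the case for the -flat file of a thick disk), and that the guest used a classic MBR partition table with 512-byte sectors. Under those assumptions the partition entries sit at byte 446 of the first sector. The function name mbr_partitions is hypothetical.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size of the virtual disk


def mbr_partitions(first_sector):
    """Parse the four primary MBR entries from the first 512 bytes of an image.

    Returns a list of dicts with the partition index, type byte,
    start offset in bytes, and size in bytes. Empty entries are skipped.
    """
    if len(first_sector) < 512 or first_sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for i in range(4):
        entry = first_sector[446 + 16 * i: 446 + 16 * (i + 1)]
        part_type = entry[4]
        # Start LBA and sector count are little-endian 32-bit values.
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if part_type != 0:
            partitions.append({
                "index": i + 1,
                "type": part_type,
                "start_byte": start_lba * SECTOR_SIZE,
                "size_bytes": num_sectors * SECTOR_SIZE,
            })
    return partitions
```

Only the first 512 bytes of the file are needed, for example open("/mnt/recovery/dot/dot-flat.vmdk", "rb").read(512), so this does not depend on the full copy succeeding. If a Linux partition (type 0x83) shows up, its start_byte is the kind of value a read-only loop mount would take as its offset= option.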
Thanks for taking the time to read this!