MC0,第2行和通道0很重要。尝试更换CPU0上的DIMMA1。
举例来说,我必须在Linux服务器中找出一个坏的DIMM,该服务器具有16个完全填充的DIMM插槽和两个CPU。这些是我在控制台上看到的错误:
EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac
EDAC MC1: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC1: extended error code: ECC chipkill x4 error
我服务器中的DIMM错误是CPU1上的DIMMA0。
EDAC代表错误检测和纠正,并在http://www.kernel.org/doc/Documentation/edac.txt和/usr/share/doc/kernel-doc-2.6*/Documentation/drivers/edac/edac中进行了记录我的系统(RHEL5)上的.txt。CE代表“可纠正的错误”,并且如文档所示,“ CE提供DIMM即将失效的早期迹象。”
回到上面在服务器控制台上看到的EDAC错误,MC1(内存控制器1)表示CPU1,在Linux EDAC文档中将行1称为csrow1(芯片选择行1),通道0表示内存通道0我检查了http://www.kernel.org/doc/Documentation/edac.txt上的图表,发现csrow1和Channel 0对应于DIMM_A0(我系统上的DIMMA0):
Channel 0 Channel 1
===================================
csrow0 | DIMM_A0 | DIMM_B0 |
csrow1 | DIMM_A0 | DIMM_B0 |
===================================
===================================
csrow2 | DIMM_A1 | DIMM_B1 |
csrow3 | DIMM_A1 | DIMM_B1 |
===================================
(作为另一个示例,如果我在MC0,csrow4和通道1上看到错误,则应该更换CPU0上的DIMMB2。)
当然,我的服务器上实际上有两个DIMM插槽,称为DIMMA0(每个CPU一个),但是MC1错误再次对应于CPU1,它在dmidecode输出中的“ Bank Locator”下列出:
[root@rce-8 ~]# dmidecode -t memory | grep DIMMA0 -B9 -A8
Handle 0x002E, DMI type 17, 27 bytes.
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: DIMMA0
Bank Locator: CPU0
Type: DDR2
Type Detail: Synchronous
Speed: 533 MHz (1.9 ns)
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
--
Handle 0x003E, DMI type 17, 27 bytes.
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: DIMMA0
Bank Locator: CPU1
Type: DDR2
Type Detail: Synchronous
Speed: 533 MHz (1.9 ns)
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
(在我的工作站上,dmidecode实际上显示了我的DIMM的部件号和序列号,这非常有用。)
除了查看控制台和日志中的错误外,您还可以通过检查/ sys / devices / system / edac来查看每个MC / CPU,行/ csrow和通道的错误。在我的情况下,错误仅发生在MC1,csrow1,通道0:
[root@rce-8 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow7/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:6941652
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch1_ce_count:0
我希望该示例对任何试图基于EDAC错误识别出错误的DIMM的人有所帮助。有关更多信息,我强烈建议阅读http://www.kernel.org/doc/Documentation/edac.txt上的所有Linux EDAC文档。