I have a bunch of Sun X2200-M2 servers. These servers have ECC memory.
On some of them, I get warnings in the eLOM about "correctable ECC errors detected", for example:
# ssh regress11 ipmitool sel elist
1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
...some more frequently than others.
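To compare how often each DIMM is flagged, the SEL output can be tallied per sensor. This is only a rough sketch: it assumes the pipe-separated layout shown above (id, date, time, sensor, event, direction), and the function name `count_correctable_ecc` is made up for illustration. In practice the text would come from `ipmitool sel elist` on each host.

```python
from collections import Counter

def count_correctable_ecc(sel_text):
    """Count 'Correctable ECC' SEL entries per sensor, e.g. 'Memory CPU0 DIMM2'."""
    counts = Counter()
    for line in sel_text.splitlines():
        # Assumed layout: id | date | time | sensor | event | direction
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 6 and fields[4] == "Correctable ECC":
            counts[fields[3]] += 1
    return counts

# Sample taken from the SEL entries quoted above.
sample = """\
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
"""

for sensor, n in count_correctable_ecc(sample).most_common():
    print(f"{n:4d}  {sensor}")
```

Running the same tally across all servers would show which sticks are logging events more often than the rest.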
The kernel on that particular system also reports EDAC errors, although it logs ECC events far more often than the eLOM does:
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
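The kernel messages can be grouped in a similar way, by memory controller, row and channel, to see whether the correctable errors cluster on one location. This is a sketch that assumes the `k8_edac` message format shown above; the regular expression and the name `count_ce_by_location` are my own. The "Error Overflow set" lines presumably mean some events were not recorded in detail, so the counts should be read as a lower bound.

```python
import re
from collections import Counter

# Matches lines like:
# MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, ...
CE_RE = re.compile(
    r"MC(\d+): CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain \d+, "
    r"syndrome (0x[0-9a-f]+), row (\d+), channel (\d+)"
)

def count_ce_by_location(log_text):
    """Count detailed correctable-error lines per (controller, row, channel)."""
    counts = Counter()
    for m in CE_RE.finditer(log_text):
        mc, _page, _offset, _syndrome, row, channel = m.groups()
        counts[(int(mc), int(row), int(channel))] += 1
    return counts

# Sample taken from the kernel messages quoted above.
sample = """\
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
"""

for (mc, row, channel), n in sorted(count_ce_by_location(sample).items()):
    print(f"MC{mc} row {row} channel {channel}: {n}")
```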
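Now, if a server detects an uncorrectable ECC error, the system resets, so it is obvious that something is wrong, and removing or replacing the identified stick or pair fixes the problem.
But my thinking is that if the errors are correctable, there is no immediate problem. Can I treat them as a warning and be ready to pull the stick or pair if uncorrectable errors start to appear?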