How to diagnose frequent segfaults


10

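My server logs frequent segfaults in various programs to /var/log/kern.log. So far I have seen them in Perl, PHP and rsync. All installed software comes from up-to-date Debian packages. Here is an excerpt from the log file: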

Mar  2 01:07:54 gaz kernel: [ 5316.246303] imapsync[4533]: segfault at 8b ip 00007fb448c98fe6 sp 00007ffff571dd68 error 4 in libperl.so.5.10.1[7fb448bd7000+164000]
Mar  2 01:17:42 gaz kernel: [ 5904.354307] php5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15
Mar  2 02:54:05 gaz kernel: [11687.922316] php5-cgi[4495]: segfault at 2d7acf9 ip 0000000002d7acf9 sp 00007fff60c6eb18 error 15
Mar  2 10:50:08 gaz kernel: [40250.390322] BUG: unable to handle kernel paging request at 00000000024b03f0
Mar  2 10:50:08 gaz kernel: [40250.390341] IP: [<00000000024b03f0>] 0x24b03f0
Mar  2 10:50:08 gaz kernel: [40250.390353] PGD 208c71067 PUD 21c811067 PMD 209329067 PTE 8000000211c88067
Mar  2 10:50:08 gaz kernel: [40250.390365] Oops: 0011 [#1] SMP 
Mar  2 10:50:08 gaz kernel: [40250.390373] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar  2 10:50:08 gaz kernel: [40250.390386] CPU 1 
Mar  2 10:50:08 gaz kernel: [40250.390392] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar  2 10:50:08 gaz kernel: [40250.390566] Pid: 11482, comm: munin-limits Not tainted 2.6.32-5-amd64 #1 MS-7368
Mar  2 10:50:08 gaz kernel: [40250.390576] RIP: 0010:[<00000000024b03f0>]  [<00000000024b03f0>] 0x24b03f0
Mar  2 10:50:08 gaz kernel: [40250.390586] RSP: 0018:ffff88021cc8dec0  EFLAGS: 00010286
Mar  2 10:50:08 gaz kernel: [40250.390593] RAX: 000000001ddc1000 RBX: 0000000000000010 RCX: ffffffff810f9904
Mar  2 10:50:08 gaz kernel: [40250.390600] RDX: 0000000000000000 RSI: ffffea0007688200 RDI: 0000000000000286
Mar  2 10:50:08 gaz kernel: [40250.390608] RBP: 00000000ffffffea R08: 0000000000000025 R09: 7865542f30312e35
Mar  2 10:50:08 gaz kernel: [40250.390615] R10: 000000d01cc8ddf8 R11: 0000000000000246 R12: ffff88021cc8def8
Mar  2 10:50:08 gaz kernel: [40250.390622] R13: 0000000002295010 R14: 00000000022c9db0 R15: 0000000002488d78
Mar  2 10:50:08 gaz kernel: [40250.390630] FS:  00007f3b3c8b2700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar  2 10:50:08 gaz kernel: [40250.390641] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  2 10:50:08 gaz kernel: [40250.390648] CR2: 00000000024b03f0 CR3: 000000021c5d1000 CR4: 00000000000006e0
Mar  2 10:50:08 gaz kernel: [40250.390656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar  2 10:50:08 gaz kernel: [40250.390663] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  2 10:50:08 gaz kernel: [40250.390671] Process munin-limits (pid: 11482, threadinfo ffff88021cc8c000, task ffff88021bf59530)
Mar  2 10:50:08 gaz kernel: [40250.390681] Stack:
Mar  2 10:50:08 gaz kernel: [40250.390687]  ffffffff810f1d4a ffff880208c63228 0000000000000000 00007fffc2dcecc0
Mar  2 10:50:08 gaz kernel: [40250.390697] <0> 00000000024ba2b0 0000000002295010 ffffffff810f1e3d 0000000000000004
Mar  2 10:50:08 gaz kernel: [40250.390712] <0> ffff88021bf59530 ffff88021c4edc00 ffffffff812fe0b6 ffff88021c4edc60
Mar  2 10:50:08 gaz kernel: [40250.390732] Call Trace:
Mar  2 10:50:08 gaz kernel: [40250.390742]  [<ffffffff810f1d4a>] ? vfs_fstatat+0x2c/0x57
Mar  2 10:50:08 gaz kernel: [40250.390750]  [<ffffffff810f1e3d>] ? sys_newstat+0x11/0x30
Mar  2 10:50:08 gaz kernel: [40250.390760]  [<ffffffff812fe0b6>] ? do_page_fault+0x2e0/0x2fc
Mar  2 10:50:08 gaz kernel: [40250.390768]  [<ffffffff812fbf55>] ? page_fault+0x25/0x30
Mar  2 10:50:08 gaz kernel: [40250.390777]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar  2 10:50:08 gaz kernel: [40250.390783] Code:  Bad RIP value.
Mar  2 10:50:08 gaz kernel: [40250.390791] RIP  [<00000000024b03f0>] 0x24b03f0
Mar  2 10:50:08 gaz kernel: [40250.390799]  RSP <ffff88021cc8dec0>
Mar  2 10:50:08 gaz kernel: [40250.390805] CR2: 00000000024b03f0
Mar  2 10:50:08 gaz kernel: [40250.391051] ---[ end trace 1cc1473b539c7f6e ]---
Mar  2 11:42:20 gaz kernel: [43382.242301] php5-cgi[10963]: segfault at d81160 ip 0000000000d81160 sp 00007fff3adcb058 error 15
Mar  2 21:51:14 gaz kernel: [79916.418302] php5-cgi[20089]: segfault at 1c59dc8 ip 0000000001c59dc8 sp 00007fff9b877fb8 error 15
Mar  3 03:45:01 gaz kernel: [101143.334305] munin-update[22519] general protection ip:7f516dce204c sp:7fff6049a978 error:0 in libperl.so.5.10.1[7f516dc7d000+164000]
Mar  3 11:22:37 gaz kernel: [128599.570307] php5-cgi[22888]: segfault at 36485a8 ip 00000000036485a8 sp 00007fff2d56e1c8 error 15
Mar  4 08:32:17 gaz kernel: [204779.842304] php5-cgi[22090]: segfault at 18 ip 0000000000689e5e sp 00007fff677a6a48 error 6 in php5-cgi[400000+6f9000]
Mar  4 10:01:02 gaz kernel: [210104.434706] rsync[22236] general protection ip:7f14a07137f9 sp:7fff88f940b8 error:0 in libc-2.11.2.so[7f14a069d000+158000]
Mar  4 11:32:22 gaz kernel: [215584.262316] BUG: unable to handle kernel paging request at 00000000ffffff9c
Mar  4 11:32:22 gaz kernel: [215584.262331] IP: [<00000000ffffff9c>] 0xffffff9c
Mar  4 11:32:22 gaz kernel: [215584.262343] PGD 0 
Mar  4 11:32:22 gaz kernel: [215584.262350] Oops: 0010 [#2] SMP 
Mar  4 11:32:22 gaz kernel: [215584.262359] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar  4 11:32:22 gaz kernel: [215584.262371] CPU 1 
Mar  4 11:32:22 gaz kernel: [215584.262378] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar  4 11:32:22 gaz kernel: [215584.262552] Pid: 1960, comm: proxymap Tainted: G      D    2.6.32-5-amd64 #1 MS-7368
Mar  4 11:32:22 gaz kernel: [215584.262563] RIP: 0010:[<00000000ffffff9c>]  [<00000000ffffff9c>] 0xffffff9c
Mar  4 11:32:22 gaz kernel: [215584.262573] RSP: 0018:ffff880209257e00  EFLAGS: 00010212
Mar  4 11:32:22 gaz kernel: [215584.262580] RAX: ffff8801514eb780 RBX: ffffffff810efb2d RCX: 0000000000000000
Mar  4 11:32:22 gaz kernel: [215584.262590] RDX: 0000000000000020 RSI: 0000000000000001 RDI: ffff8801514eb780
Mar  4 11:32:22 gaz kernel: [215584.262600] RBP: 00000000ffffffe9 R08: 0000000000000000 R09: 0000000000000000
Mar  4 11:32:22 gaz kernel: [215584.262611] R10: ffff880209257e78 R11: ffffffff81152c7c R12: 0000000000000001
Mar  4 11:32:22 gaz kernel: [215584.262622] R13: 0000000000008001 R14: 0000000000000024 R15: 00000000ffffff9c
Mar  4 11:32:22 gaz kernel: [215584.262633] FS:  00007fca4de35700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar  4 11:32:22 gaz kernel: [215584.262644] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  4 11:32:22 gaz kernel: [215584.262650] CR2: 00000000ffffff9c CR3: 00000001c9cbb000 CR4: 00000000000006e0
Mar  4 11:32:22 gaz kernel: [215584.262661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar  4 11:32:22 gaz kernel: [215584.262671] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  4 11:32:22 gaz kernel: [215584.262682] Process proxymap (pid: 1960, threadinfo ffff880209256000, task ffff88021c4b1c40)
Mar  4 11:32:22 gaz kernel: [215584.262693] Stack:
Mar  4 11:32:22 gaz kernel: [215584.262698]  ffffffff810f8566 ffff880209257e78 ffff88021c7bf000 ffff88021c7bf0c8
Mar  4 11:32:22 gaz kernel: [215584.262709] <0> 0000800000000000 ffff88021fc0f000 ffff880209257e78 00000000fffffffe
Mar  4 11:32:22 gaz kernel: [215584.262724] <0> ffffffff810e5881 ffff880209257f48 0000000000000286 ffff88021fc0f000
Mar  4 11:32:22 gaz kernel: [215584.262743] Call Trace:
Mar  4 11:32:22 gaz kernel: [215584.262753]  [<ffffffff810f8566>] ? do_filp_open+0xa7/0x94b
Mar  4 11:32:22 gaz kernel: [215584.262763]  [<ffffffff810e5881>] ? virt_to_head_page+0x9/0x2a
Mar  4 11:32:22 gaz kernel: [215584.262771]  [<ffffffff810f9904>] ? user_path_at+0x52/0x79
Mar  4 11:32:22 gaz kernel: [215584.262779]  [<ffffffff810cfec1>] ? get_unmapped_area+0xd7/0x139
Mar  4 11:32:22 gaz kernel: [215584.262787]  [<ffffffff811019d5>] ? alloc_fd+0x67/0x10c
Mar  4 11:32:22 gaz kernel: [215584.262795]  [<ffffffff810eceaf>] ? do_sys_open+0x55/0xfc
Mar  4 11:32:22 gaz kernel: [215584.262804]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar  4 11:32:22 gaz kernel: [215584.262811] Code:  Bad RIP value.
Mar  4 11:32:22 gaz kernel: [215584.262819] RIP  [<00000000ffffff9c>] 0xffffff9c
Mar  4 11:32:22 gaz kernel: [215584.262828]  RSP <ffff880209257e00>
Mar  4 11:32:22 gaz kernel: [215584.262833] CR2: 00000000ffffff9c
Mar  4 11:32:22 gaz kernel: [215584.263077] ---[ end trace 1cc1473b539c7f6f ]---

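As you can see, there are segfaults, general protection faults and kernel oopses. My first guess was some kind of hardware problem, so I asked the hoster (this is a rented root server) to run a full hardware check. They did, but could not find any problems.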
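The `error N` field on each segfault line is the x86 page-fault error code, and it can be decoded bit by bit. The following is a minimal sketch, assuming the kernel prints that code in hexadecimal (so `error 15` means 0x15) and using the standard x86 bit meanings; the function names `decode_error` and `explain` are made up for this illustration.

```python
import re

# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "not a write"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

# Matches lines such as:
#   php5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp ... error 15
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Return a list of human-readable meanings for a page-fault error code."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear:
            parts.append(if_clear)
    return parts


def explain(line):
    """Return (program, numeric code, meanings) for a segfault line, else None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    code = int(m.group("err"), 16)  # assumption: the code is printed in hex
    return m.group("comm"), code, decode_error(code)


if __name__ == "__main__":
    sample = ("Mar  2 01:17:42 gaz kernel: [ 5904.354307] php5-cgi[4441]: "
              "segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15")
    print(explain(sample))
```

Read this way, the `error 15` lines from php5-cgi decode to a user-mode instruction fetch that hit a protection violation, and on those lines the faulting address equals the instruction pointer, which suggests the process jumped to an address it was not allowed to execute. The `error 4` and `error 6` lines are ordinary user-mode reads and writes to unmapped pages.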

I don't know what they checked or how, but their support is usually quite good. I also ran memtester and cpuburn myself and could not find any errors either.

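Unfortunately, I have no reliable way to reproduce these segfaults; they seem to be more or less random. On a hunch, I disabled the system's firewall and ran the program that segfaulted regularly (imapsync), and it seemed to take longer to segfault than before, so the problem might be related to the network stack. Or it might just be random.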
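One way to put a number on "more or less random" is to tally where each fault landed. The sketch below, with made-up helper names, parses the two user-space fault formats that appear in the excerpt and computes the offset of the instruction pointer inside the mapped object (`ip` minus the base address printed in brackets). It assumes the log format matches the excerpt exactly.

```python
import re
from collections import Counter

# Start of a user-space fault line, in either of the two formats in the log.
HEAD_RE = re.compile(
    r"[\w.-]+\[\d+\]:? "
    r"(?:segfault at [0-9a-f]+ ip |general protection ip:)(?P<ip>[0-9a-f]+)"
)
# Optional trailer: " in <object>[<base>+<size>]".
TAIL_RE = re.compile(r" in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]")


def fault_sites(lines):
    """Count faults per (object, offset-within-object) pair."""
    sites = Counter()
    for line in lines:
        head = HEAD_RE.search(line)
        if head is None:
            continue  # not a user-space fault line
        ip = int(head.group("ip"), 16)
        tail = TAIL_RE.search(line)
        if tail is None:
            # The kernel named no mapped object for this address.
            sites[("(no mapping)", hex(ip))] += 1
            continue
        offset = ip - int(tail.group("base"), 16)
        sites[(tail.group("obj"), hex(offset))] += 1
    return sites


if __name__ == "__main__":
    with open("/var/log/kern.log") as log:
        for (obj, offset), count in fault_sites(log).most_common():
            print(f"{count:3d}  {obj}  {offset}")
```

If the same object and offset keep recurring, a software bug at that location is the more likely explanation, and the offset can be passed to a symbolizer such as `addr2line -e` on a build with debug symbols. Faults scattered across unrelated binaries and offsets, as in the excerpt, fit a hardware or kernel-level cause better, though a tally like this cannot prove either.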

Here are the kernel specs:

# uname -a
Linux gaz 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
# cat /etc/debian_version 
6.0
# lsmod
Module                  Size  Used by
cpufreq_userspace       1992  0 
cpufreq_stats           2659  0 
cpufreq_powersave        902  0 
cpufreq_conservative     5162  0 
xt_recent               5977  0 
xt_tcpudp               2319  0 
iptable_nat             4299  0 
nf_nat                 13388  1 iptable_nat
nf_conntrack_ipv4       9833  3 iptable_nat,nf_nat
nf_defrag_ipv4          1139  1 nf_conntrack_ipv4
ip6table_filter         2384  0 
ip6_tables             15075  1 ip6table_filter
xt_DSCP                 1995  0 
xt_TCPMSS               2919  0 
ipt_LOG                 4518  0 
ipt_REJECT              1953  0 
iptable_mangle          2817  0 
iptable_filter          2258  0 
xt_multiport            2267  0 
xt_state                1303  0 
xt_limit                1782  0 
xt_conntrack            2407  0 
nf_conntrack_ftp        5537  0 
nf_conntrack           46535  6 iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state,xt_conntrack,nf_conntrack_ftp
ip_tables              13899  3 iptable_nat,iptable_mangle,iptable_filter
x_tables               12845  13 xt_recent,xt_tcpudp,iptable_nat,ip6_tables,xt_DSCP,xt_TCPMSS,ipt_LOG,ipt_REJECT,xt_multiport,xt_state,xt_limit,xt_conntrack,ip_tables
loop                   11799  0 
radeon                573996  0 
ttm                    39986  1 radeon
drm_kms_helper         20065  1 radeon
snd_hda_codec_atihdmi     2251  1 
drm                   142359  3 radeon,ttm,drm_kms_helper
snd_hda_intel          20019  0 
i2c_algo_bit            4225  1 radeon
pcspkr                  1699  0 
i2c_piix4               8328  0 
snd_hda_codec          54244  2 snd_hda_codec_atihdmi,snd_hda_intel
i2c_core               15712  5 radeon,drm_kms_helper,drm,i2c_algo_bit,i2c_piix4
snd_hwdep               5380  1 snd_hda_codec
snd_pcm                60503  2 snd_hda_intel,snd_hda_codec
snd_timer              15582  1 snd_pcm
snd                    46446  5 snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
soundcore               4598  1 snd
evdev                   7352  3 
snd_page_alloc          6249  2 snd_hda_intel,snd_pcm
k8temp                  3283  0 
edac_core              29261  0 
edac_mce_amd            6433  0 
shpchp                 26264  0 
pci_hotplug            21203  1 shpchp
button                  4650  0 
ext3                  106518  2 
jbd                    37085  1 ext3
mbcache                 5050  1 ext3
dm_mod                 53754  0 
powernow_k8            10978  1 
aacraid                59779  0 
3w_9xxx                28684  0 
3w_xxxx                20569  0 
raid10                 17809  0 
raid456                44500  0 
async_raid6_recov       5170  1 raid456
async_pq                3479  2 raid456,async_raid6_recov
raid6_pq               77179  2 async_raid6_recov,async_pq
async_xor               2478  3 raid456,async_raid6_recov,async_pq
xor                     4380  1 async_xor
async_memcpy            1198  2 raid456,async_raid6_recov
async_tx                1734  5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
raid1                  18431  3 
raid0                   5517  0 
md_mod                 73824  7 raid10,raid456,raid1,raid0
sata_nv                19166  0 
sata_sil                7412  0 
sata_via                7928  0 
sd_mod                 29889  8 
crc_t10dif              1276  1 sd_mod
ata_generic             3047  0 
ahci                   32374  6 
r8169                  29229  0 
mii                     3210  1 r8169
thermal                11674  0 
pata_atiixp             3489  0 
libata                133632  6 sata_nv,sata_sil,sata_via,ata_generic,ahci,pata_atiixp
ohci_hcd               19212  0 
ehci_hcd               31151  0 
processor              29935  1 powernow_k8
thermal_sys            11942  2 thermal,processor
scsi_mod              122149  5 aacraid,3w_9xxx,3w_xxxx,sd_mod,libata
usbcore               122034  3 ohci_hcd,ehci_hcd
nls_base                6377  1 usbcore
# free 
             total       used       free     shared    buffers     cached
Mem:       8166128    1228036    6938092          0     140412     782060
-/+ buffers/cache:     305564    7860564
Swap:      2102456          0    2102456

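So, basically, my questions are:

  1. How can I diagnose this further?
  2. Is there any data in the logs above that could help me isolate the troublemaker?
  3. Are there any known issues with the hardware/software above that I failed to find when googling?
  4. Is there a way to keep the kernel from auto-loading modules? (I probably don't need all of them, and one of them might be the culprit.)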
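For the last question, a possible starting point is to list the modules that `lsmod` reports with a use count of 0 and turn the ones you recognise as unneeded into `blacklist` lines for a file under /etc/modprobe.d/. This is only a sketch: the helper names are invented here, and a use count of 0 does not mean a module is unneeded (in the listing above, r8169, the network driver, also shows 0), so the result has to be reviewed by hand.

```python
def unused_modules(lsmod_text):
    """Return the names of modules whose use count is 0 in `lsmod` output."""
    unused = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip the header line, blank lines, and anything that is not a module row.
        if len(fields) < 3 or fields[0] == "Module":
            continue
        name, size, used = fields[0], fields[1], fields[2]
        if size.isdigit() and used == "0":
            unused.append(name)
    return unused


def blacklist_lines(modules):
    """Format module names as modprobe.d `blacklist` directives."""
    return "\n".join(f"blacklist {name}" for name in modules)


if __name__ == "__main__":
    import subprocess

    listing = subprocess.run(["lsmod"], capture_output=True, text=True, check=True)
    # Review this list before saving any of it to a modprobe.d file.
    print(blacklist_lines(unused_modules(listing.stdout)))
```

Note that `blacklist` only stops a module from being loaded automatically through its hardware aliases; it can still be pulled in as a dependency or loaded explicitly. An `install <module> /bin/false` line is the stricter form if a module must never load.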

Answers:


5

Check your memory!

The most common cause of random segfaults like these is bad memory. Grab a memory checker (e.g. memtest86+) and test it.
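Besides an offline tester, the kernel's EDAC subsystem can report memory errors on a running system when the RAM is ECC and a memory-controller driver is loaded. Both are assumptions here: the module list in the question shows edac_core, but that alone does not guarantee any counters exist. A minimal sketch that reads the corrected (`ce_count`) and uncorrected (`ue_count`) error counters from sysfs, with an invented function name:

```python
from pathlib import Path


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs files."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = edac_error_counts()
    if not found:
        print("No EDAC memory controllers found (no ECC or no driver loaded?)")
    for controller, entry in found.items():
        print(controller, entry)
```

An empty result says nothing about the health of the memory; it only means this source of information is not available on the machine.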


My machine had 15 segfaults in 4 days. One of them was in systemd! I ran memtest86, and all four passes found no memory problems. It must be something else.
Chris Smith

1

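Things to start with... Check how much memory the server has. Check the size of the swap partition. Check other log files (syslog) for potential sources of information. Check whether there are known issues with the kernel version and the current hardware (or virtualization system). I run Debian 6 with that kernel in a small (VMware) VM without problems.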


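I added the memory info above (below the lsmod output). I'm not using any virtualization; it is a standard Debian running directly on the hardware (AMD Athlon(tm) 64 X2 Dual Core Processor 6000+). The other logs don't contain any useful info that I can see (except Apache complaining about the segfaults of its FastCGI processes).
Andreas Gohr, 2011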

1
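If new memory was added, it could be that the new memory doesn't get along with the existing memory, or vice versa. I know it sounds funny, but sometimes two pieces of hardware just don't like each other.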
Radek

0

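One thing I would check is whether your hosting provider uses so-called "burstable RAM". It is common for cheap hosts to give you a base amount of RAM that can be expanded temporarily. The problem with temporarily expanded RAM is that you cannot rely on it, because it may be taken away in the middle of a computation, resulting in a segfault.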


1
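First of all, there is enough information in the question to conclude that this is not what is happening. Besides, what you describe is so silly that, without any evidence, I can't believe anyone would go to the trouble of implementing something so useless.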
kasperd

I wish I were kidding, but burstable RAM is a real thing: webhostingtalk.com/… However, after re-reading the question, I can see that this is an unlikely cause.
Programagor

That link says nothing about segmentation faults.
kasperd

True, burstable RAM usually results in processes being killed by the OOM killer, not in segfaults.
Programagor
Licensed under cc by-sa 3.0 with attribution required.