How do I determine the cause of a system crash?


10

My server crashes about once a week with no hint as to why. I checked /var/log/messages: it simply stops logging at some point, and picks back up with the POST messages when I hard-reboot the machine.

Is there anything I can check, or any software I can install, that would determine the cause?

I'm running CentOS 7.

The only errors/issues I have are in /var/log/dmesg: https://paste.netcoding.net/cosisiloji.log

[    3.606936] md: Waiting for all devices to be available before autodetect
[    3.606984] md: If you don't use raid, use raid=noautodetect
[    3.607085] md: Autodetecting RAID arrays.
[    3.608309] md: Scanned 6 and added 6 devices.
[    3.608362] md: autorun ...
[    3.608412] md: considering sdc2 ...
[    3.608464] md:  adding sdc2 ...
[    3.608516] md: sdc1 has different UUID to sdc2
[    3.608570] md:  adding sdb2 ...
[    3.608620] md: sdb1 has different UUID to sdc2
[    3.608674] md:  adding sda2 ...
[    3.608726] md: sda1 has different UUID to sdc2
[    3.608944] md: created md2
[    3.608997] md: bind<sda2>
[    3.609058] md: bind<sdb2>
[    3.609116] md: bind<sdc2>
[    3.609175] md: running: <sdc2><sdb2><sda2>
[    3.609548] md/raid1:md2: active with 3 out of 3 mirrors
[    3.609623] md2: detected capacity change from 0 to 98520989696
[    3.609685] md: considering sdc1 ...
[    3.609737] md:  adding sdc1 ...
[    3.609789] md:  adding sdb1 ...
[    3.609841] md:  adding sda1 ...
[    3.610005] md: created md1
[    3.610055] md: bind<sda1>
[    3.610117] md: bind<sdb1>
[    3.610175] md: bind<sdc1>
[    3.610233] md: running: <sdc1><sdb1><sda1>
[    3.610714] md/raid1:md1: not clean -- starting background reconstruction
[    3.610773] md/raid1:md1: active with 3 out of 3 mirrors
[    3.610854] md1: detected capacity change from 0 to 20970405888
[    3.610917] md: ... autorun DONE.
[    3.610999] md: resync of RAID array md1
[    3.611054] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[    3.611119] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[    3.611180] md: using 128k window, over a total of 20478912k.
[    3.611244]  md1: unknown partition table
[    3.624786] EXT3-fs (md1): error: couldn't mount because of unsupported optional features (240)
[    3.627095] EXT2-fs (md1): error: couldn't mount because of unsupported optional features (244)
[    3.630284] EXT4-fs (md1): INFO: recovery required on readonly filesystem
[    3.630341] EXT4-fs (md1): write access will be enabled during recovery
[    3.819411] EXT4-fs (md1): orphan cleanup on readonly fs
[    3.836922] EXT4-fs (md1): 24 orphan inodes deleted
[    3.836975] EXT4-fs (md1): recovery complete
[    3.840557] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)

Answers:


6

If you have crashkernel/kdump installed and enabled, you should be able to examine the crashed kernel relatively easily with the crash utility. For example, assuming crash dumps are saved under /var/crash:

crash /var/crash/2009-07-17-10:36/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux
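On CentOS 7 (which the question mentions), enabling kdump typically looks something like the following sketch. The package names and paths are the standard CentOS 7 ones, but verify the crashkernel memory reservation against your system's RAM before relying on it:

```shell
# Install kexec-tools (provides the kdump service) and the crash analyzer
yum install -y kexec-tools crash

# Reserve memory for the capture kernel; CentOS 7 kernels support
# "crashkernel=auto". Add it to every kernel's command line:
grubby --update-kernel=ALL --args="crashkernel=auto"

# Enable the kdump service, then reboot so the reservation takes effect
systemctl enable kdump
reboot

# After the next crash, the dump lands under /var/crash/<timestamp>/vmcore.
# The matching debug symbols come from the kernel-debuginfo package:
# debuginfo-install kernel
```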

Have a look here, and here, for additional details.


I've fixed the /dev/md1 not found error when running grub2-probe, and installed and configured crashkernel/kdump, so it will report if/when it crashes again.
Brian Graham

5

You can check the dmesg file at /var/log/dmesg, which records kernel messages. The messages log only records service and application messages; if you hit a kernel error, services and applications stop running, but the kernel error is still recorded in dmesg.
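Since /var/log/dmesg only covers the current boot, one way to capture the kernel messages leading up to a crash is to make the systemd journal persistent (CentOS 7 keeps it volatile by default) and read the previous boot's log after the reboot. A minimal sketch:

```shell
# Make the journal persist across reboots (it lives in /run otherwise)
mkdir -p /var/log/journal
systemctl restart systemd-journald

# After the next crash and reboot, show the previous boot's kernel
# messages, jumping to the end where the last entries before the
# crash appear:
journalctl -k -b -1 -e
```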


I've checked dmesg and dmesg.old, and both only contain boot information (about 4.8 seconds' worth). The only "problem" I can see is that the boot disk or RAID drives seem to have an issue, but the system repairs it and runs fine. See the link in the main post.
Brian Graham

2
  • BIOS memory test
  • BIOS hard disk test
  • Check the SMART drive logs: smartctl /dev/sda -a
  • Run the SMART drive self-tests
  • Leave a dmesg -wH window open
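The SMART items in the list above can be run along these lines (the device name is an example; repeat for each member of the RAID array):

```shell
# View the drive's SMART health summary, attributes, and error logs
smartctl -a /dev/sda

# Start a long (extended) self-test; it runs in the drive's firmware
# in the background, and smartctl prints an estimated duration
smartctl -t long /dev/sda

# Once the test has finished, read the self-test results
smartctl -l selftest /dev/sda
```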

I've run the SMART drive tests on all 3 drives and they are not damaged. I now have dmesg -wH running in a window (until it crashes again, I assume; and the output should still be readable over SSH after the crash). I don't have physical access to the machine, so should I ask the host to run the BIOS memory and hard drive tests?
Brian Graham
Licensed under cc by-sa 3.0 with attribution required.