如何使用kdump / crash调查OOM问题?


12

问题

多次“内存不足”消息后,服务器崩溃了,我试图查明罪魁祸首。如果在用户区中-哪个进程。如果在内核中-哪个内核模块。

细节

我正在尝试找出如何使用崩溃实用工具来调查是什么触发了服务器上的OOM。

作为安装新服务器对的一部分,我开始了14TB DRBD设备的初始化。大约那个时候,在使用DRBD同步器速率配置并上下移动某些绑定网络接口时,其中一台服务器崩溃了。在30秒内,它产生了39条Out of memory: Kill process ####消息。然后它崩溃了:

Kernel panic - not syncing: Out of memory and no killable processes...

系统崩溃触发了一个kdump。现在,我有一个不错的vmcore.flat文件,应该可以直接使用它来调查问题,但是我很难找出所有内存的去向。

我知道的唯一资源是Dedoimedo的站点(该站点提供了很好的说明)以及Kernel Crash Book。这些也恰巧是答案中建议的唯一资源,因此我认为这crash是调查的唯一方法。

如果有另一种方法可以对事件进行事后分析,我愿意接受。正是这是crash我知道的唯一实用程序。我现在所拥有的只是vmcore.flat文件,我所需要知道的是哪个组件占用了所有内存。我怀疑内核模块有问题,更具体地说是绑定模块之一(当我关闭接口时触发了它),DRBD模块(在CentOS 6.3上从树构建的版本8.3.15)或其中一个10G以太网模块(mlnx_en从我关闭的接口树中构建,或在bnx2x保持活动状态的接口中树内构建)。我所需要知道的是,是否有办法证实我的怀疑。

到目前为止,我仅使用崩溃实用程序提取了以下信息:

检查使用了多少内存

$ crash /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux vmcore.flat
....
crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  16482587      62.9 GB         ----
      FREE    54610     213.3 MB    0% of TOTAL MEM
      USED  16427977      62.7 GB   99% of TOTAL MEM
    SHARED     4683      18.3 MB    0% of TOTAL MEM
   BUFFERS      118       472 KB    0% of TOTAL MEM
    CACHED       82       328 KB    0% of TOTAL MEM
      SLAB    46635     182.2 MB    0% of TOTAL MEM

TOTAL SWAP        0            0         ----
 SWAP USED        0            0  100% of TOTAL SWAP
 SWAP FREE        0            0    0% of TOTAL SWAP

显然,它内存不足。所有64G都消失了...但是在哪里呢?

试图查看是否有任何进程正在泄漏内存

似乎唯一相关psps命令是(这是的子命令crash)。它没有显示任何异常,但也没有显示有关内核线程的任何信息。

crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
>     0      0   1  ffff88102c456040  RU   0.0       0      0  [swapper]
>     0      0   2  ffff88082c772aa0  RU   0.0       0      0  [swapper]
>     0      0   3  ffff88102c456aa0  RU   0.0       0      0  [swapper]
      0      0   4  ffff88082c7b8ae0  RU   0.0       0      0  [swapper]
>     0      0   5  ffff88102c457500  RU   0.0       0      0  [swapper]
>     0      0   6  ffff88082c7d6aa0  RU   0.0       0      0  [swapper]
>     0      0   7  ffff88102c506080  RU   0.0       0      0  [swapper]
>     0      0   8  ffff88082c016ae0  RU   0.0       0      0  [swapper]
>     0      0   9  ffff88102c506ae0  RU   0.0       0      0  [swapper]
>     0      0  10  ffff88082c05caa0  RU   0.0       0      0  [swapper]
>     0      0  11  ffff88102c507540  RU   0.0       0      0  [swapper]
>     0      0  12  ffff88082c09cae0  RU   0.0       0      0  [swapper]
.....
   4926      1   5  ffff880828a38ae0  ??   0.0       0      0  mingetty
   4928      1   1  ffff88102a4e8040  ??   0.0       0      0  mingetty
   4930      1  19  ffff880827af4080  ??   0.0       0      0  mingetty
   4932      1   2  ffff88100f122040  ??   0.0       0      0  mingetty
   4934      1  18  ffff8810296ea080  ??   0.0       0      0  mingetty
   4936   1047   4  ffff880ff342d540  IN   0.0   11184    948  udevd
   4937   1047   5  ffff88082a240080  IN   0.0   11184    948  udevd
   5060   3772   2  ffff88082881d540  ??   0.0       0      0  sshd
   5078      1   1  ffff88100f060ae0  ??   0.0       0      0  sshd
   5079      1   1  ffff88082b882ae0  ??   0.0       0      0  bash

并且,如果我取出内核线程(对于%MEM,所有线程都显示为零),我们可以看到最后几乎没有任何运行:

crash> ps -u
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff88082c41b500  ??   0.0   19348    348  init
   1047      1   2  ffff881029524040  IN   0.0   11188    948  udevd
   3171      1   3  ffff880826ccaaa0  IN   0.0   27636    240  auditd
   3172      1  17  ffff881029d1b500  IN   0.0   27636    240  auditd
>  3772      1   0  ffff88102b257500  RU   0.0   64072    668  sshd
   4800      1   0  ffff88100f061540  ??   0.0       0      0  dsm_om_shrsvcd
   4842      1  16  ffff88100f012ae0  ??   0.0       0      0  cmcld
   4854      1  17  ffff88082a241540  ??   0.0       0      0  cmlogd
   4855      1   3  ffff88082796cae0  ??   0.0       0      0  cmfileassistd
   4856      1  18  ffff88082809d500  ??   0.0       0      0  cmnetd
   4860      1   0  ffff88082705aae0  ??   0.0       0      0  cmresourced
   4924      1   9  ffff88102a4e8aa0  ??   0.0       0      0  mingetty
   4926      1   5  ffff880828a38ae0  ??   0.0       0      0  mingetty
   4928      1   1  ffff88102a4e8040  ??   0.0       0      0  mingetty
   4930      1  19  ffff880827af4080  ??   0.0       0      0  mingetty
   4932      1   2  ffff88100f122040  ??   0.0       0      0  mingetty
   4934      1  18  ffff8810296ea080  ??   0.0       0      0  mingetty
   4936   1047   4  ffff880ff342d540  IN   0.0   11184    948  udevd
   4937   1047   5  ffff88082a240080  IN   0.0   11184    948  udevd
   5060   3772   2  ffff88082881d540  ??   0.0       0      0  sshd
   5078      1   1  ffff88100f060ae0  ??   0.0       0      0  sshd
   5079      1   1  ffff88082b882ae0  ??   0.0       0      0  bash
   5257      1   1  ffff8808279e6aa0  ??   0.0       0      0  jnx_mlnxsnmp_da

更新:

如Soham所建议的,包括更多输出。不幸的是,我无法从中得出任何进一步的结论。我能做的最好的办法就是怀疑内核泄漏内存中的某些内容,因为无论如何用户区进程几乎都死了。

的(几乎已满)输出在log -m这里

crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}'
total 0.00391006GB

请注意,此时几乎所有用户态进程都已死,因此,预期的使用率较低。

内存不足消息:

如上所述,有39条“内存不足:”消息,它们是:

crash> log -m | grep Out
<3>[  223.556616] Out of memory: Kill process 3189 (portreserve) score 1 or sacrifice child
<3>[  223.787234] Out of memory: Kill process 3196 (rsyslogd) score 1 or sacrifice child
<3>[  224.237119] Out of memory: Kill process 3728 (dbus-daemon) score 1 or sacrifice child
<3>[  228.771770] Out of memory: Kill process 3758 (snmpd) score 1 or sacrifice child
<3>[  229.033466] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.257710] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.484321] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.711169] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.934955] Out of memory: Kill process 3801 (cmproxyd) score 1 or sacrifice child
<3>[  230.159542] Out of memory: Kill process 3812 (ntpd) score 1 or sacrifice child
<3>[  230.382083] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.606613] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.829515] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.832105] Out of memory: Kill process 3961 (crond) score 1 or sacrifice child
<3>[  236.749746] Out of memory: Kill process 3974 (atd) score 1 or sacrifice child
<3>[  236.969421] Out of memory: Kill process 4272 (dsm_sa_datamgrd) score 1 or sacrifice child
<3>[  237.192102] Out of memory: Kill process 4492 (dsm_sa_datamgrd) score 1 or sacrifice child
<3>[  237.746301] Out of memory: Kill process 4552 (dsm_sa_eventmgr) score 1 or sacrifice child
<3>[  237.968308] Out of memory: Kill process 4613 (dsm_sa_snmpd) score 1 or sacrifice child
<3>[  238.190550] Out of memory: Kill process 4614 (dsm_sa_snmpd) score 1 or sacrifice child
<3>[  238.644020] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  238.865658] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  251.285450] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  251.506601] Out of memory: Kill process 4800 (dsm_om_shrsvcd) score 1 or sacrifice child
<3>[  251.727570] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child
<3>[  251.947085] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child
<3>[  252.167096] Out of memory: Kill process 4854 (cmlogd) score 1 or sacrifice child
<3>[  252.384090] Out of memory: Kill process 4855 (cmfileassistd) score 1 or sacrifice child
<3>[  252.603324] Out of memory: Kill process 4924 (mingetty) score 1 or sacrifice child
<3>[  252.820757] Out of memory: Kill process 4926 (mingetty) score 1 or sacrifice child
<3>[  253.037558] Out of memory: Kill process 4928 (mingetty) score 1 or sacrifice child
<3>[  253.254908] Out of memory: Kill process 4930 (mingetty) score 1 or sacrifice child
<3>[  253.257391] Out of memory: Kill process 4932 (mingetty) score 1 or sacrifice child
<3>[  253.259357] Out of memory: Kill process 4934 (mingetty) score 1 or sacrifice child
<3>[  253.261353] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child
<3>[  253.263365] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child
<3>[  253.264392] Out of memory: Kill process 5079 (bash) score 1 or sacrifice child
<3>[  253.266352] Out of memory: Kill process 5257 (jnx_mlnxsnmp_da) score 1 or sacrifice child
<0>[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes...

sys输出:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
    DUMPFILE: pcdata03.vmcore.flat  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Feb  6 02:11:52 2013
      UPTIME: 00:04:12
LOAD AVERAGE: 3.03, 0.95, 0.34
       TASKS: 578
    NODENAME: ....
     RELEASE: 2.6.32-279.5.2.el6.x86_64
     VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 64 GB
       PANIC: "[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."

kmem -z

crash> kmem -z
NODE: 0  ZONE: 0  ADDR: ffff88000000a0c0  NAME: "DMA"
  SIZE: 4095  PRESENT: 3839  MIN/LOW/HIGH: 5/6/7
  VM_STAT:
          NR_FREE_PAGES: 3936
       NR_INACTIVE_ANON: 0
         NR_ACTIVE_ANON: 0
       NR_INACTIVE_FILE: 0
         NR_ACTIVE_FILE: 0
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 0
         NR_FILE_MAPPED: 0
          NR_FILE_PAGES: 0
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 0
  NR_SLAB_UNRECLAIMABLE: 0
           NR_PAGETABLE: 0
        NR_KERNEL_STACK: 0
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 0
               NUMA_HIT: 0
              NUMA_MISS: 0
           NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
             NUMA_LOCAL: 0
             NUMA_OTHER: 0
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 1  ADDR: ffff880000012780  NAME: "DMA32"
  SIZE: 1044480  PRESENT: 756520  MIN/LOW/HIGH: 1030/1287/1545
  VM_STAT:
          NR_FREE_PAGES: 30117
       NR_INACTIVE_ANON: 0
         NR_ACTIVE_ANON: 0
       NR_INACTIVE_FILE: 1
         NR_ACTIVE_FILE: 0
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 0
         NR_FILE_MAPPED: 0
          NR_FILE_PAGES: 1
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 4
  NR_SLAB_UNRECLAIMABLE: 4150
           NR_PAGETABLE: 0
        NR_KERNEL_STACK: 0
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 0
               NUMA_HIT: 575606
              NUMA_MISS: 3
           NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
             NUMA_LOCAL: 575598
             NUMA_OTHER: 11
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 2  ADDR: ffff88000001ae40  NAME: "Normal"
  SIZE: 7602176  PRESENT: 7498240  MIN/LOW/HIGH: 10217/12771/15325
  VM_STAT:
          NR_FREE_PAGES: 10443
       NR_INACTIVE_ANON: 134
         NR_ACTIVE_ANON: 197
       NR_INACTIVE_FILE: -47
         NR_ACTIVE_FILE: 42
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 219
         NR_FILE_MAPPED: 115
          NR_FILE_PAGES: 45
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 908
  NR_SLAB_UNRECLAIMABLE: 18771
           NR_PAGETABLE: 91
        NR_KERNEL_STACK: 556
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 34
               NUMA_HIT: 8243991
              NUMA_MISS: 648
           NUMA_FOREIGN: 4593726
    NUMA_INTERLEAVE_HIT: 20066
             NUMA_LOCAL: 8243829
             NUMA_OTHER: 810
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 3  ADDR: ffff880000023500  NAME: "Movable"
  [unpopulated]

NODE: 1  ZONE: 0  ADDR: ffff880840000040  NAME: "DMA"
  [unpopulated]

NODE: 1  ZONE: 1  ADDR: ffff880840008700  NAME: "DMA32"
  [unpopulated]

NODE: 1  ZONE: 2  ADDR: ffff880840010dc0  NAME: "Normal"
  SIZE: 8388608  PRESENT: 8273920  MIN/LOW/HIGH: 11274/14092/16911
  VM_STAT:
          NR_FREE_PAGES: 10114
       NR_INACTIVE_ANON: 417
         NR_ACTIVE_ANON: 83
       NR_INACTIVE_FILE: 47
         NR_ACTIVE_FILE: 32
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 436
         NR_FILE_MAPPED: 22
          NR_FILE_PAGES: 154
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 863
  NR_SLAB_UNRECLAIMABLE: 21939
           NR_PAGETABLE: 134
        NR_KERNEL_STACK: 27
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 3
    NR_VMSCAN_IMMEDIATE: 5
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 23
               NR_SHMEM: 20
               NUMA_HIT: 4332488
              NUMA_MISS: 4593726
           NUMA_FOREIGN: 665
    NUMA_INTERLEAVE_HIT: 20007
             NUMA_LOCAL: 4309300
             NUMA_OTHER: 4616914
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 1  ZONE: 3  ADDR: ffff880840019480  NAME: "Movable"
  [unpopulated]

kmem -f

crash> kmem -f
NODE
  0
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA         4095    3936  ffffea0000000038      1000          0     
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880000012128       2      2
  0       4k  ffff880000012138       0      0
  0       4k  ffff880000012148       0      0
  0       4k  ffff880000012158       0      0
  0       4k  ffff880000012168       0      0
  1       8k  ffff880000012180       1      2
  1       8k  ffff880000012190       0      0
  1       8k  ffff8800000121a0       0      0
  1       8k  ffff8800000121b0       0      0
  1       8k  ffff8800000121c0       0      0
  2      16k  ffff8800000121d8       1      4
  2      16k  ffff8800000121e8       0      0
  2      16k  ffff8800000121f8       0      0
  2      16k  ffff880000012208       0      0
  2      16k  ffff880000012218       0      0
  3      32k  ffff880000012230       1      8
  3      32k  ffff880000012240       0      0
  3      32k  ffff880000012250       0      0
  3      32k  ffff880000012260       0      0
  3      32k  ffff880000012270       0      0
  4      64k  ffff880000012288       1     16
  4      64k  ffff880000012298       0      0
  4      64k  ffff8800000122a8       0      0
  4      64k  ffff8800000122b8       0      0
  4      64k  ffff8800000122c8       0      0
  5     128k  ffff8800000122e0       0      0
  5     128k  ffff8800000122f0       0      0
  5     128k  ffff880000012300       0      0
  5     128k  ffff880000012310       0      0
  5     128k  ffff880000012320       0      0
  6     256k  ffff880000012338       1     64
  6     256k  ffff880000012348       0      0
  6     256k  ffff880000012358       0      0
  6     256k  ffff880000012368       0      0
  6     256k  ffff880000012378       0      0
  7     512k  ffff880000012390       0      0
  7     512k  ffff8800000123a0       0      0
  7     512k  ffff8800000123b0       0      0
  7     512k  ffff8800000123c0       0      0
  7     512k  ffff8800000123d0       0      0
  8    1024k  ffff8800000123e8       1    256
  8    1024k  ffff8800000123f8       0      0
  8    1024k  ffff880000012408       0      0
  8    1024k  ffff880000012418       0      0
  8    1024k  ffff880000012428       0      0
  9    2048k  ffff880000012440       0      0
  9    2048k  ffff880000012450       0      0
  9    2048k  ffff880000012460       0      0
  9    2048k  ffff880000012470       1    512
  9    2048k  ffff880000012480       0      0
 10    4096k  ffff880000012498       0      0
 10    4096k  ffff8800000124a8       0      0
 10    4096k  ffff8800000124b8       3   3072
 10    4096k  ffff8800000124c8       0      0
 10    4096k  ffff8800000124d8       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   DMA32     1044480   30117  ffffea0000038000    1000000        4095   
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff88000001a7e8      24     24
  0       4k  ffff88000001a7f8       4      4
  0       4k  ffff88000001a808      13     13
  0       4k  ffff88000001a818       0      0
  0       4k  ffff88000001a828       0      0
  1       8k  ffff88000001a840       2      4
  1       8k  ffff88000001a850       2      4
  1       8k  ffff88000001a860       4      8
  1       8k  ffff88000001a870       0      0
  1       8k  ffff88000001a880       0      0
  2      16k  ffff88000001a898       0      0
  2      16k  ffff88000001a8a8       3     12
  2      16k  ffff88000001a8b8       4     16
  2      16k  ffff88000001a8c8       0      0
  2      16k  ffff88000001a8d8       0      0
  3      32k  ffff88000001a8f0       0      0
  3      32k  ffff88000001a900       3     24
  3      32k  ffff88000001a910       3     24
  3      32k  ffff88000001a920       0      0
  3      32k  ffff88000001a930       0      0
  4      64k  ffff88000001a948       1     16
  4      64k  ffff88000001a958       3     48
  4      64k  ffff88000001a968       6     96
  4      64k  ffff88000001a978       0      0
  4      64k  ffff88000001a988       0      0
  5     128k  ffff88000001a9a0       0      0
  5     128k  ffff88000001a9b0       3     96
  5     128k  ffff88000001a9c0       7    224
  5     128k  ffff88000001a9d0       0      0
  5     128k  ffff88000001a9e0       0      0
  6     256k  ffff88000001a9f8       0      0
  6     256k  ffff88000001aa08       1     64
  6     256k  ffff88000001aa18       6    384
  6     256k  ffff88000001aa28       0      0
  6     256k  ffff88000001aa38       0      0
  7     512k  ffff88000001aa50       1    128
  7     512k  ffff88000001aa60       0      0
  7     512k  ffff88000001aa70       8   1024
  7     512k  ffff88000001aa80       0      0
  7     512k  ffff88000001aa90       0      0
  8    1024k  ffff88000001aaa8       1    256
  8    1024k  ffff88000001aab8       1    256
  8    1024k  ffff88000001aac8       5   1280
  8    1024k  ffff88000001aad8       0      0
  8    1024k  ffff88000001aae8       0      0
  9    2048k  ffff88000001ab00       0      0
  9    2048k  ffff88000001ab10       1    512
  9    2048k  ffff88000001ab20       3   1536
  9    2048k  ffff88000001ab30       1    512
  9    2048k  ffff88000001ab40       0      0
 10    4096k  ffff88000001ab58       0      0
 10    4096k  ffff88000001ab68       0      0
 10    4096k  ffff88000001ab78      22  22528
 10    4096k  ffff88000001ab88       1   1024
 10    4096k  ffff88000001ab98       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   Normal    7602176   10443  ffffea0003800000   100000000     1048575  
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880000022ea8     365    365
  0       4k  ffff880000022eb8     274    274
  0       4k  ffff880000022ec8     274    274
  0       4k  ffff880000022ed8       0      0
  0       4k  ffff880000022ee8       0      0
  1       8k  ffff880000022f00      99    198
  1       8k  ffff880000022f10      94    188
  1       8k  ffff880000022f20     360    720
  1       8k  ffff880000022f30       0      0
  1       8k  ffff880000022f40       0      0
  2      16k  ffff880000022f58      30    120
  2      16k  ffff880000022f68      41    164
  2      16k  ffff880000022f78     204    816
  2      16k  ffff880000022f88       0      0
  2      16k  ffff880000022f98       0      0
  3      32k  ffff880000022fb0       9     72
  3      32k  ffff880000022fc0      19    152
  3      32k  ffff880000022fd0     138   1104
  3      32k  ffff880000022fe0       0      0
  3      32k  ffff880000022ff0       0      0
  4      64k  ffff880000023008       7    112
  4      64k  ffff880000023018       4     64
  4      64k  ffff880000023028      77   1232
  4      64k  ffff880000023038       0      0
  4      64k  ffff880000023048       0      0
  5     128k  ffff880000023060       3     96
  5     128k  ffff880000023070       3     96
  5     128k  ffff880000023080      43   1376
  5     128k  ffff880000023090       0      0
  5     128k  ffff8800000230a0       0      0
  6     256k  ffff8800000230b8       0      0
  6     256k  ffff8800000230c8       0      0
  6     256k  ffff8800000230d8      13    832
  6     256k  ffff8800000230e8       0      0
  6     256k  ffff8800000230f8       0      0
  7     512k  ffff880000023110       0      0
  7     512k  ffff880000023120       0      0
  7     512k  ffff880000023130       5    640
  7     512k  ffff880000023140       0      0
  7     512k  ffff880000023150       0      0
  8    1024k  ffff880000023168       0      0
  8    1024k  ffff880000023178       0      0
  8    1024k  ffff880000023188       0      0
  8    1024k  ffff880000023198       0      0
  8    1024k  ffff8800000231a8       0      0
  9    2048k  ffff8800000231c0       0      0
  9    2048k  ffff8800000231d0       0      0
  9    2048k  ffff8800000231e0       1    512
  9    2048k  ffff8800000231f0       0      0
  9    2048k  ffff880000023200       0      0
 10    4096k  ffff880000023218       0      0
 10    4096k  ffff880000023228       0      0
 10    4096k  ffff880000023238       0      0
 10    4096k  ffff880000023248       1   1024
 10    4096k  ffff880000023258       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  3   Movable        0       0         0               0            0     

--------------------------------------------------------------------------

NODE
  1
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA            0       0         0               0            0     

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   DMA32          0       0         0               0            0     

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   Normal    8388608   10114  ffffea001ce00000   840000000        0     
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880840018e28     405    405
  0       4k  ffff880840018e38     162    162
  0       4k  ffff880840018e48     317    317
  0       4k  ffff880840018e58       0      0
  0       4k  ffff880840018e68       0      0
  1       8k  ffff880840018e80     106    212
  1       8k  ffff880840018e90      70    140
  1       8k  ffff880840018ea0     269    538
  1       8k  ffff880840018eb0       0      0
  1       8k  ffff880840018ec0       0      0
  2      16k  ffff880840018ed8      24     96
  2      16k  ffff880840018ee8      18     72
  2      16k  ffff880840018ef8     207    828
  2      16k  ffff880840018f08       0      0
  2      16k  ffff880840018f18       0      0
  3      32k  ffff880840018f30      20    160
  3      32k  ffff880840018f40       4     32
  3      32k  ffff880840018f50     148   1184
  3      32k  ffff880840018f60       0      0
  3      32k  ffff880840018f70       0      0
  4      64k  ffff880840018f88      17    272
  4      64k  ffff880840018f98       2     32
  4      64k  ffff880840018fa8      95   1520
  4      64k  ffff880840018fb8       0      0
  4      64k  ffff880840018fc8       0      0
  5     128k  ffff880840018fe0       4    128
  5     128k  ffff880840018ff0       1     32
  5     128k  ffff880840019000      37   1184
  5     128k  ffff880840019010       0      0
  5     128k  ffff880840019020       0      0
  6     256k  ffff880840019038       0      0
  6     256k  ffff880840019048       0      0
  6     256k  ffff880840019058       8    512
  6     256k  ffff880840019068       0      0
  6     256k  ffff880840019078       0      0
  7     512k  ffff880840019090       0      0
  7     512k  ffff8808400190a0       0      0
  7     512k  ffff8808400190b0       1    128
  7     512k  ffff8808400190c0       0      0
  7     512k  ffff8808400190d0       0      0
  8    1024k  ffff8808400190e8       0      0
  8    1024k  ffff8808400190f8       0      0
  8    1024k  ffff880840019108       1    256
  8    1024k  ffff880840019118       0      0
  8    1024k  ffff880840019128       0      0
  9    2048k  ffff880840019140       0      0
  9    2048k  ffff880840019150       0      0
  9    2048k  ffff880840019160       1    512
  9    2048k  ffff880840019170       1    512
  9    2048k  ffff880840019180       0      0
 10    4096k  ffff880840019198       0      0
 10    4096k  ffff8808400191a8       0      0
 10    4096k  ffff8808400191b8       0      0
 10    4096k  ffff8808400191c8       1   1024
 10    4096k  ffff8808400191d8       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  3   Movable        0       0         0               0            0     

nr_free_pages: 54610  (found 54742)

一点运动 foreach bt

crash> foreach bt | awk '$1 == "#0" { $2 = ""; print }' | sort | uniq -c
     31 #0  crash_nmi_callback at ffffffff81029df6
      1 #0  machine_kexec at ffffffff8103281b
    546 #0  schedule at ffffffff814fda62

实际上,它们要么崩溃要么正在等待内存(或者我没有正确读取它)。

Answers:


8

检查前20位最大的物理内存使用者(驻留集大小)。

crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20

检查大页数。

crash> p -d nr_huge_pages

更新:

A)故障转储是从以下内核版本捕获的。

$ crash --osrelease vmcore.flat 
2.6.32-279.5.2.el6.x86_64                    

B)让我们从kernel-debug-debuginfo包中提取vmlinux文件。

$ rpm2cpio kernel-debug-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm | \
  cpio -idv ./usr/lib/debug/lib/modules/*/vmlinux

C)使用崩溃实用程序打开vmcore文件。

$ bunzip2 vmcore.flat.bz2 
$ crash vmcore.flat ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux

D)系统信息。

crash> sys
      KERNEL: ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
    DUMPFILE: vmcore.flat  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Feb  5 12:11:52 2013
      UPTIME: 00:04:12
LOAD AVERAGE: 3.03, 0.95, 0.34
       TASKS: 578
    NODENAME: ...
     RELEASE: 2.6.32-279.5.2.el6.x86_64
     VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 64 GB
       PANIC: "[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."

a)由于内存不足而发生了紧急情况,但是系统上禁用了“ panic_on_oom”参数。 

crash> p -d sysctl_panic_on_oom
sysctl_panic_on_oom = $6 = 0

此参数启用或禁用内存不足功能时的紧急情况。如果将其设置为0,则内核将杀死一些流氓进程,称为oom_killer。通常,oom_killer可以杀死流氓进程,并且系统可以生存。如果将其设置为1,则发生内存不足时内核会发生混乱。

b)那么,我们如何在oom事件发生时捕获vmcore?

好吧,让我们检查一下mm / oom_kill.c源代码。它说,如果系统上没有任何东西要杀死,则只需挂起或死机。

++++++
499         /* Found nothing?!?! Either we hang forever, or we panic. */   
500         if (!p) {                                                  
501             read_unlock(&tasklist_lock);                                     
502             cpuset_unlock();                                        
503             panic("Out of memory and no killable processes...\n");  <<<------  
504         }                                                       
505 
++++++

因此,我们进入了紧急状态,并且在此系统上启用了kdump服务后,捕获了vmcore。

E)让我们检查内核环形缓冲区, 

crash> log
[..]
[  253.351427] Node 0 DMA free:15744kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  253.352234] lowmem_reserve[]: 0 2955 32245 32245
[  253.352812] Node 0 DMA32 free:120436kB min:4120kB low:5148kB high:6180kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3026080kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:16600kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? no
[  253.353637] lowmem_reserve[]: 0 0 29290 29290
[  253.354216] Node 0 Normal free:40580kB min:40868kB low:51084kB high:61300kB active_anon:956kB inactive_anon:536kB active_file:260kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:0kB writeback:0kB mapped:460kB shmem:136kB slab_reclaimable:3640kB slab_unreclaimable:75128kB kernel_stack:4448kB pagetables:428kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  253.355047] lowmem_reserve[]: 0 0 0 0
[  253.355624] Node 1 Normal free:39896kB min:45096kB low:56368kB high:67644kB active_anon:412kB inactive_anon:1668kB active_file:288kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):220kB present:33095680kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:80kB slab_reclaimable:3496kB slab_unreclaimable:87864kB kernel_stack:216kB pagetables:564kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  253.356457] lowmem_reserve[]: 0 0 0 0
[  253.357034] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15744kB
[  253.358351] Node 0 DMA32: 41*4kB 8*8kB 7*16kB 6*32kB 10*64kB 10*128kB 7*256kB 9*512kB 7*1024kB 5*2048kB 23*4096kB = 120468kB
[  253.359674] Node 0 Normal: 718*4kB 558*8kB 278*16kB 169*32kB 88*64kB 47*128kB 13*256kB 5*512kB 0*1024kB 1*2048kB 1*4096kB = 40872kB
[  253.360995] Node 1 Normal: 876*4kB 447*8kB 249*16kB 174*32kB 116*64kB 40*128kB 8*256kB 1*512kB 1*1024kB 2*2048kB 1*4096kB = 40952kB
[  253.362319] 154 total pagecache pages
[  253.362502] 0 pages in swap cache
[  253.362684] Swap cache stats: add 0, delete 0, find 0/0
[  253.362869] Free swap  = 0kB
[  253.363050] Total swap = 0kB
[  253.526814] 16777215 pages RAM
[  253.526999] 294628 pages reserved
[  253.527190] 114911 pages shared
[  253.527372] 16392561 pages non-shared
[..]

F)让我们在崩溃时检查系统上的内存状态。

crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  16482587      62.9 GB         ----        -------------------------------+
      FREE    54610     213.3 MB    0% of TOTAL MEM                                  |
      USED  16427977      62.7 GB   99% of TOTAL MEM                                 |
    SHARED     4683      18.3 MB    0% of TOTAL MEM                                  |
   BUFFERS      118       472 KB    0% of TOTAL MEM                                  |
    CACHED       82       328 KB    0% of TOTAL MEM                                  |
      SLAB    46635     182.2 MB    0% of TOTAL MEM                                  |
                                                                                     |
TOTAL SWAP        0            0         ----         ----------------------+        |
 SWAP USED        0            0  100% of TOTAL SWAP                        |        |
 SWAP FREE        0            0    0% of TOTAL SWAP                        |        |
                                                                            |        | 
                                                                            |        |
crash> p -d totalram_pages                                                  |        |
totalram_pages = $5 = 16482587                                              |        |
                                                                            |        |
crash> !echo "scale=5;(16482587*4096)/2^30"|bc -q                           |        |
62.87607                  <<<-----[ Total physical memory is 62.9 GB ] <<<--|--------+
                                                                            |
crash> p -d total_swap_pages                                                |
total_swap_pages = $6 = 0 <<<------[ No Swap on the system ]  <<<-----------+
  • 我们共有〜63GiB的物理内存。
  • 交换分区或文件未在系统上创建,因此我们在此服务器上没有交换。
  • 用于缓存的内存少得多,只有328KB,用于缓冲区的内存只有472KB。
  • 平板中使用的内存也非常少,只有182.2 MB。

G)分配给进程的总内存为0.00391006GiB。

crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}'
total 0.00391006GB

H)应用程序进程未利用系统上的内存。 

crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20
965 2 21 ffff8808292f1500 IN 0.0 0 0 MB [ext4-dio-unwrit]
966 2 22 ffff8808292d4080 IN 0.0 0 0 MB [ext4-dio-unwrit]
967 2 23 ffff8808292ce040 IN 0.0 0 0 MB [ext4-dio-unwrit]
968 2 24 ffff8808299b5540 IN 0.0 0 0 MB [ext4-dio-unwrit]
969 2 25 ffff880829aa6040 IN 0.0 0 0 MB [ext4-dio-unwrit]
970 2 26 ffff880827367500 IN 0.0 0 0 MB [ext4-dio-unwrit]
971 2 27 ffff880827366aa0 IN 0.0 0 0 MB [ext4-dio-unwrit]
972 2 28 ffff880827366040 IN 0.0 0 0 MB [ext4-dio-unwrit]
97 2 23 ffff88082c1ac080 IN 0.0 0 0 MB [ksoftirqd/23]
973 2 29 ffff880827371540 IN 0.0 0 0 MB [ext4-dio-unwrit]
974 2 30 ffff880827370ae0 IN 0.0 0 0 MB [ext4-dio-unwrit]
975 2 31 ffff880827370080 IN 0.0 0 0 MB [ext4-dio-unwrit]
98 2 23 ffff88082c1bb500 IN 0.0 0 0 MB [watchdog/23]
99 2 24 ffff88082c1baaa0 IN 0.0 0 0 MB [migration/24]
3171 1 3 ffff880826ccaaa0 IN 0.0 27636 0.234375 MB auditd
1 0 1 ffff88082c41b500 UN 0.0 19348 0.339844 MB init
3772 1 0 ffff88102b257500 RU 0.0 64072 0.652344 MB sshd
1047 1 2 ffff881029524040 IN 0.0 11188 0.925781 MB udevd
4936 1047 4 ffff880ff342d540 IN 0.0 11184 0.925781 MB udevd
4937 1047 5 ffff88082a240080 IN 0.0 11184 0.925781 MB udevd

I)让我们验证系统上的内存调整参数。

crash> p -d sysctl_overcommit_memory
sysctl_overcommit_memory = $7 = 0

该值包含一个标志,用于启用内存过量使用。当此标志为0时,内核将尝试估计用户空间请求更多内存时剩余的可用内存量。

crash> p -d sysctl_overcommit_ratio
sysctl_overcommit_ratio = $8 = 50

当overcommit_memory设置为2时,不允许已提交的地址空间超过swap加上该物理RAM的百分比。 

crash> p -d zone_reclaim_mode 
zone_reclaim_mode = $4 = 0

Zone_reclaim_mode允许某人在区域内存不足时设置或多或少积极的方法来回收内存。如果将其设置为零,则不会发生区域回收。

crash> p -d min_free_kbytes
min_free_kbytes = $3 = 90112  <<<--------[ 88 MB ]

在整个系统中保持空闲状态的最小千字节数。此值用于计算每个低内存区域的水印值,然后为这些低水印区域分配与它们的大小成比例的许多保留的空闲页。设置此参数时,太低和太高的值都可能会损坏。

换句话说,设置min_free_kbytes得太低会阻止系统回收内存。这可能导致系统挂起并杀死多个进程的OOM。但是,将此参数设置为太高的值(占系统总内存的5-10%)将导致您的系统立即变为内存不足。Linux被设计为使用所有可用的RAM来缓存文件系统数据。设置高的min_free_kbytes值会导致系统花费太多时间来回收内存。 

以上参数的值看起来还可以,所以我的内存在哪里?

假设:

  1. 主要犯罪者不在用户空间中。根据我的经验,无法解释的内存是由于Mellanox和DRBD模块引起的,但是我不确定您的情况。 
  2. 由于为了减少vmcore文件(core_collector makedumpfile -d 31 -c)的大小,从vmcore文件中丢弃了大多数页面。我无法检查巨大的页面大小。  

ps显示几乎没有进程占用任何内存。实际上,ps -G仅显示RSS非零的五个条目,这些数字的总和为4100(请参阅ps -u问题中的输出)。第二条命令说No symbol "nr_huge_pages" in current context.
chutz

如果可以,您可以共享vmcore文件,以便我进行深入探讨吗?我将在这里发布我的发现和确切步骤.. :)
BKS

tmpfs是匿名支持的-是否可以在崩溃时检查tmpfs的页面分配?
Matthew Ife

对于那些热爱并想要玩崩溃的人:people.redhat.com/anderson/.crash_whitepaper
BKS

3

您可以运行此命令吗?

ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {Print "total " mem/1048576 "GB"}'

另外,kmem -z和kmem -f可能会有所帮助。

但是,请注意整个交换空间已被消耗。

您必须在log命令中有一些OOM消息。您可以粘贴OOM消息吗?我也可以看到模式和比赛条件。

另一方面,sys输出将极大地帮助您。您知道,内存泄漏和内核升级几乎是同义词;)

编辑:自己尝试一下。foreach bt

它将显示每个PID的跟踪。寻找任何常见的模式,可能是所有人都在等待该schedule_at功能。含义等待内存分配。

现在,注意显示foreach bt中每个PID的部分中的TASK值。

运行这个。

tasK_struct.tgid <TASK>

如果它们来自相同的PID,则会看到您正在查看的进程。

我忘记了如何检查过量使用的值,我将看看是否可以找到。


谢谢。我用您认为可能有用的一些信息更新了问题。请注意有关交换空间的信息-它并没有全部消耗掉,根本没有配置交换空间。还应注意,崩溃时几乎所有用户区进程都已死掉(OOM杀手负责此事)。我什至粘贴了“ log -m” -pastebin.com/BzvqdyyZ几乎完整的输出(pastebin只允许我上传500K)。
chutz

1

我对崩溃转储的分析没有太多经验,因此无法为您提供一些具体建议,但是这里有一些我不时收集和使用的链接。也许您会发现一些有用的东西:


好吧,我已经做了所有这些。我能够运行crash,并且Crash书中提到了该bt命令(已经运行过),该log命令向我显示了Out of memory消息以及该ps命令没有告诉我我的内存在哪里:(
chutz

0

您可能知道这一点,但根本没有指出这是您所做的,但是ps有很多选项。

# ps aux

要么

# ps -edf 

因为root会吐出更多详细信息。

此页面上有很多有用的建议,可以帮助您查找内存问题:

http://www.linuxnix.com/2011/05/find-ram-utilization-user-linux.html

我会检查

# free

定期查看实际使用了多少内存。(或者更好的方法是将其绘制成图表;石墨或穆宁非常适合此操作),以便您可以直观地看到何时/如何使用内存。

64公制的公羊相当多;主持人做什么工作?


1
我发布的是的ps子命令的输出crash。不是psshell命令。我正在对该事件进行事后分析,而我所拥有的只是vmcore映像。我只能假设我应该能够看到所有内存的分配位置。
chutz
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.