我在同一个vm上的VMWare服务器上运行了2个进程。 Centos 6.x
我在两个进程上运行了strace并保存了输出
6970 14:04:09.643295 futex(0x7f47d8027754, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f47d8027750, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
6969 14:04:09.643304 futex(0x7f47d8027754, FUTEX_WAIT_PRIVATE, 54869, NULL <unfinished ...>
6971 14:04:09.643353 futex(0x7f47d802b128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
6970 14:04:09.643363 <... futex resumed> ) = 0 <0.000063>
6969 14:04:09.643372 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000063>
6971 14:04:09.643411 <... futex resumed> ) = 0 <0.000052>
6969 14:04:09.643420 futex(0x7f47d8027728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
6971 14:04:09.643459 futex(0x7f47d8030854, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f47d8030850, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
6969 14:04:09.643469 <... futex resumed> ) = 0 <0.000044>
6971 14:04:09.643501 <... futex resumed> ) = 1 <0.000037>
6974 14:04:09.650775 <... futex resumed> ) = 0 <0.511421>
17035 14:04:12.446009 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.902876>
17009 14:04:12.446112 <... poll resumed> ) = 1 ([\{fd=271, revents=POLLIN}]) <3.742208>
17008 14:04:12.446131 <... poll resumed> ) = 1 ([\{fd=193, revents=POLLIN}]) <3.747483>
24085 14:04:12.446144 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.822784>
24082 14:04:12.446157 <... poll resumed> ) = 1 ([\{fd=300, revents=POLLIN}]) <4.404341>
6416 14:04:12.446172 <... poll resumed> ) = 1 ([\{fd=187, revents=POLLIN}]) <3.557929>
18296 14:04:12.446189 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.822459>
18295 14:04:12.446201 <... poll resumed> ) = 1 ([\{fd=290, revents=POLLIN}]) <3.518658>
关键部分是
6974 14:04:09.650775 <... futex resumed> ) = 0 <0.511421>
17035 14:04:12.446009 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.902876>
另一个过程是
7510 14:04:09.622601 futex(0x7f8874033728, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000024>
7510 14:04:09.622698 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={72, 763938}, ru_stime={3, 512466}, ...}) = 0 <0.000023>
7510 14:04:09.622761 futex(0x7f8874033754, FUTEX_WAIT_BITSET_PRIVATE, 1, {162879, 332257449}, ffffffff <unfinished ...>
7543 14:04:09.644930 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.050089>
7543 14:04:09.644991 futex(0x7f88757c9128, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000039>
7543 14:04:09.645091 futex(0x7f88757c9154, FUTEX_WAIT_BITSET_PRIVATE, 1, {162879, 104576419}, ffffffff <unfinished ...>
7766 14:04:09.671201 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100101>
17254 14:04:12.445858 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.859094>
24083 14:04:12.446056 <... poll resumed> ) = 1 ([\{fd=277, revents=POLLIN}]) <4.404263>
关键位是
7766 14:04:09.671201 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100101>
17254 14:04:12.445858 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <2.859094>
大约在同一时间java进程停止并且它们超时。
现在我有其他日志显示那段时间发生的事情 - 我相信。
我该如何追踪futex是什么?或者我如何才能找到更多信息/我可以获取更多数据来诊断此问题?
可能是一个Linux内核错误。看到 这里 。
—
xenoid
@xenoid这是2015年 - 从那时起的任何内核今天都会非常不安全(前幽灵等)。
—
Eugen Rieck
我看到了内核错误,但指出它已经相当陈旧了
—
Keyzer Suze
我正在运行最新的centos 6.x
—
Keyzer Suze