NGINX timeout after +200 concurrent connections

12

This is my nginx.conf (I've updated the config to make sure that no PHP or any other bottleneck is involved):

user                nginx;
worker_processes    4;
worker_rlimit_nofile 10240;

pid                 /var/run/nginx.pid;

events
{
    worker_connections  1024;
}

http
{
    include             /etc/nginx/mime.types;

    error_log           /var/www/log/nginx_errors.log warn;

    port_in_redirect    off;
    server_tokens       off;
    sendfile            on;
    gzip                on;

    client_max_body_size 200M;

    map $scheme $php_https { default off; https on; }

    index index.php;

    client_body_timeout   60;
    client_header_timeout 60;
    keepalive_timeout     60 60;
    send_timeout          60;

    server
    {
        server_name dev.anuary.com;

        root        "/var/www/virtualhosts/dev.anuary.com";
    }
}

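I am using http://blitz.io/play to test my server (I've purchased the 10,000 concurrent connections plan). In a 30-second run, I get 964 hits and 5,587 timeouts. The first timeout happened at 40.77 seconds into the test, when the number of concurrent users was at 200.

During the test, the server load was (top output):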

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20225 nginx     20   0 48140 6248 1672 S 16.0  0.0   0:21.68 nginx
    1 root      20   0 19112 1444 1180 S  0.0  0.0   0:02.37 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.03 migration/0

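Therefore it is not a server resource issue. What is it then?

UPDATE 2011 12 09 GMT 17:36.

So far I have made the following changes to make sure that the bottleneck is not TCP/IP. Added to /etc/sysctl.conf: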

# These ensure that TIME_WAIT ports either get reused or closed fast.
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
# TCP memory
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2

Some more debug info:

[root@server node]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 126767
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Note that worker_rlimit_nofile is set to 10240 in the nginx config.

UPDATE 2011 12 09 GMT 19:02.

It looks like the more changes I make, the worse it gets, but here is the new config file.

user                nginx;
worker_processes    4;
worker_rlimit_nofile 10240;

pid                 /var/run/nginx.pid;

events
{
    worker_connections  2048;
    #1,353 hits, 2,751 timeouts, 72 errors - Bummer. Try again?
    #1,408 hits, 2,727 timeouts - Maybe you should increase the timeout?
}

http
{
    include             /etc/nginx/mime.types;

    error_log           /var/www/log/nginx_errors.log warn; 

    # http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/
    access_log              off;

    open_file_cache         max=1000;
    open_file_cache_valid   30s;

    client_body_buffer_size 10M;
    client_max_body_size    200M;

    proxy_buffers           256 4k;
    fastcgi_buffers         256 4k;

    keepalive_timeout       15 15;

    client_body_timeout     60;
    client_header_timeout   60;

    send_timeout            60;

    port_in_redirect        off;
    server_tokens           off;
    sendfile                on;

    gzip                    on;
    gzip_buffers            256 4k;
    gzip_comp_level         5;
    gzip_disable            "msie6";



    map $scheme $php_https { default off; https on; }

    index index.php;



    server
    {
        server_name ~^www\.(?P<domain>.+);
        rewrite     ^ $scheme://$domain$request_uri? permanent;
    }

    include /etc/nginx/conf.d/virtual.conf;
}

UPDATE 2011 12 11 GMT 20:11.

This is the output of netstat -ntla during the test.

https://gist.github.com/d74750cceba4d08668ea

UPDATE 2011 12 12 GMT 10:54.

To clarify, iptables (the firewall) is off while testing.

UPDATE 2011 12 12 GMT 22:47.

This is the sysctl -p | grep mem dump.

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.route.flush = 1
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 65536
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2

UPDATE 2011 12 12 GMT 22:49.

I am using blitz.io to run all the tests. The URL I am testing is http://dev.anuary.com/test.txt, using the following command: --region ireland --pattern 200-250:30 -T 1000 http://dev.anuary.com/test.txt

UPDATE 2011 12 13 GMT 13:33.

nginx user limits (set in /etc/security/limits.conf).

nginx       hard nofile 40000
nginx       soft nofile 40000

— Gajus

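Are you hosting this yourself? Is there no load balancer or anything similar in front of the server? Could the ISP be detecting the test as a DDoS attack and throttling it?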

— 巴特·西尔弗斯

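Yes, this is my server. ovh.co.uk/dedicated_servers/eg_ssd.xml There is nothing that would throttle a DDoS attack. I have also increased worker_processes to 4.

— Gajus 2011

I just contacted OVH to double-check that no network-level security is implemented on my server. No, there is none.

— Gajus 2011

What kind of data are you serving from this? HTML, images, etc.?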

— pablo

1

I think running a local benchmark would help to rule out the nginx config. Have you done that?

— 3molo 2011

2

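You will need to dump your network connections during the test. While the server may have near-zero load, your TCP/IP stack could be filling up. Look for TIME_WAIT connections in the netstat output.

If this is the case, then you will want to look into tuning the TCP/IP kernel parameters relating to TCP wait states, TCP recycling, and similar metrics.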
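To make the TIME_WAIT check concrete, here is a minimal Python sketch that tallies the State column of `netstat -ntla` style output. The function name `count_tcp_states` and the sample rows are hypothetical (the rows are not taken from the gist in the question), and the parser assumes the usual six-column Linux layout.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally the State column of `netstat -ntla` style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample rows, used only to illustrate the parsing.
sample = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:80           198.51.100.7:51514      TIME_WAIT
tcp        0      0 192.0.2.10:80           198.51.100.7:51515      TIME_WAIT
tcp        0      0 192.0.2.10:80           198.51.100.8:40022      ESTABLISHED
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(sample).most_common():
        print(state, count)
```

On a live box the same function could be fed the captured output of `netstat -ntla` taken while the test is running; a TIME_WAIT count that dwarfs ESTABLISHED would support the theory above.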

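Also, you have not described what is being tested.

I always test:

static content (an image or text file)
a simple php page (phpinfo, for example)
an application page

This may not apply in your case, but it is something I do when performance testing. Testing different types of files can help you pinpoint the bottleneck.

Even with static content, testing different file sizes is important for getting a handle on timeouts and other metrics.

We have some static-content Nginx boxes handling 3000+ active connections, so Nginx can certainly do it.

Update: Your netstat shows a lot of open connections. You may want to try tuning your TCP/IP stack. Also, what file are you requesting? Nginx should close the port quickly.

Here is a suggestion for sysctl.conf: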

net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

These values are very low, but I have had success with them on high-concurrency Nginx boxes.

— jeffatrackaid

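See UPDATE 2011 12 09 GMT 17:36.

— Gajus 2011

I added the update to the main answer because of the code.

— jeffatrackaid 2011

Please add the full top output taken during the test; you should not check only how much CPU nginx is using.

— Giovanni Toraldo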

1

Be careful with net.ipv4.tcp_tw_recycle = 1. Generally speaking, it is not a good idea. Reuse is fine.

— 匿名一人

Why not use a Linux socket instead of localhost?

— BigSack 2013

1

Another hypothesis. You have increased worker_rlimit_nofile, but the documentation defines the maximum number of clients as

max_clients = worker_processes * worker_connections

What if you try raising worker_connections to, say, 8192? Or, if there are enough CPU cores, increasing worker_processes?
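As a quick worked example of the formula above, the sketch below plugs in the values from the question (4 worker processes, worker_connections of 1024 and then 2048, worker_rlimit_nofile of 10240) plus the suggested 8192. The helper names `max_clients` and `fits_fd_limit` are made up for illustration; the second check relies on the assumption that every connection consumes at least one file descriptor in its worker.

```python
def max_clients(worker_processes: int, worker_connections: int) -> int:
    """Upper bound on simultaneous clients, per the formula quoted above."""
    return worker_processes * worker_connections


def fits_fd_limit(worker_connections: int, nofile_limit: int) -> bool:
    """True if the per-worker file descriptor limit can cover the connections.

    Assumes at least one descriptor per connection.
    """
    return worker_connections <= nofile_limit


if __name__ == "__main__":
    for conns in (1024, 2048, 8192):
        print(conns, max_clients(4, conns), fits_fd_limit(conns, 10240))
```

Even the original setting gives a theoretical ceiling of 4096 clients, well above the 200 concurrent users at which the timeouts started, so raising these values may not be sufficient on its own.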

— 米纳耶夫

1

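I was having a very similar issue with an nginx box serving as a load balancer in front of upstream apache servers.

In my case I was able to isolate the problem as network-related, occurring when the upstream apache servers became overloaded. I could recreate it with simple bash scripts while the overall system was under load. According to a trace of one of the hung processes, the connect call was getting ETIMEDOUT.

These settings (on both the nginx and the upstream servers) eliminated the problem for me. Before making these changes I was getting 1 or 2 timeouts per minute (boxes handling ~100 reqs/s); now I get 0.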

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 20480
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 400000
net.core.somaxconn = 4096

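I would not recommend using net.ipv4.tcp_tw_recycle or net.ipv4.tcp_tw_reuse, but if you want to use one, go with the latter. They can cause bizarre issues if there is any latency at all, and the latter is at least the safer of the two.

I think having tcp_fin_timeout set to 1, as in your config above, may be causing some trouble as well. Try setting it to 20 or 30, which is still far below the default.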
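The advice above can be turned into a small check. This is only a sketch: `parse_sysctl` and `review` are hypothetical helper names, and the threshold of 20 comes from this answer's suggestion, not from any kernel rule. The sample is the sysctl.conf fragment from the question's first update.

```python
def parse_sysctl(text: str) -> dict:
    """Parse `key = value` lines; later duplicates override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    return settings


def review(settings: dict) -> list:
    """Return warnings for the settings this answer cautions against."""
    warnings = []
    for key in ("net.ipv4.tcp_tw_recycle", "net.ipv4.tcp_tw_reuse"):
        if settings.get(key) == "1":
            warnings.append(f"{key} is enabled")
    fin = settings.get("net.ipv4.tcp_fin_timeout")
    if fin is not None and int(fin) < 20:
        warnings.append(f"net.ipv4.tcp_fin_timeout = {fin} is below 20")
    return warnings


# Fragment of the sysctl.conf posted in the question's first update.
sample = """\
# These ensure that TIME_WAIT ports either get reused or closed fast.
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
net.core.somaxconn = 4096
"""

if __name__ == "__main__":
    for warning in review(parse_sysctl(sample)):
        print(warning)
```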

— Gtuhl

0

Maybe it is not an nginx problem. While you are running the test on blitz.io, do this:

tail -f /var/log/php5-fpm.log

(That is what I am using to handle PHP.)

It shows this error being triggered as the timeouts start to rise:

WARNING: [pool www] server reached pm.max_children setting (5), consider raising it

So, raise max_children in the fpm conf and you are done! ;D

— jipipayo

I have the same problem if I use return 200 "test" in NGINX. This means that NGINX does not even get as far as calling PHP-FPM.

— Gajus 2012

0

Your max open files limit is too low (1024). Try changing it and restarting nginx. (Use cat /proc/<nginx>/limits to confirm.)

ulimit -n 10240

Also increase worker_connections to 10240 or higher.
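To confirm the limit that a running process actually has, the 'Max open files' row of /proc/<pid>/limits can be parsed as in the sketch below. The function names are hypothetical, the sample text is illustrative only (the soft value mirrors the 1024 shown by ulimit in the question, the hard value is made up), and the pid to inspect would be that of an nginx worker.

```python
from pathlib import Path


def parse_open_files_limit(limits_text: str):
    """Return (soft, hard) from the 'Max open files' row, or None if absent.

    Values stay as strings because a limit can also read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def read_open_files_limit(pid: int):
    """Read the limit for a live process (Linux only)."""
    return parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())


# Illustrative sample in the /proc/<pid>/limits layout; values are assumptions.
sample = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
"""

if __name__ == "__main__":
    print(parse_open_files_limit(sample))
```

If the soft value reported for the worker is still 1024 after a restart, the worker_rlimit_nofile and limits.conf changes have not taken effect for that process.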

— user3368344

I am not sure why this was downvoted. It sounds like the right answer to me.

— Ryan Angilly '16