9

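I have an EXTREME bot problem on some of the websites in my hosting account. The bots use more than 98% of the CPU resources and 99% of the bandwidth of my entire hosting account. They generate more than 1 GB of traffic per hour for my sites. Real human traffic across all of these sites is under 100 MB per month.

I have done extensive research on both robots.txt and .htaccess files to block these bots, but every method has failed.

I have also put rules in the robots.txt files to block access to the scripts directory, but these bots (Google, MS Bing, and Yahoo) ignore the rules and run the scripts anyway.

I do not want to block the Google, MS Bing, and Yahoo bots completely, but I do want to limit their crawl rate. Adding a Crawl-delay statement to the robots.txt file does not slow the bots down either. The robots.txt and .htaccess code I currently use for all sites is listed below.

I have set up both the Microsoft and Google webmaster tools to slow the crawl rate to the absolute minimum, but the bots still hit these sites at a rate of 10 hits/second.

In addition, every time I upload a file that causes an error, the entire VPS web server goes down within seconds. The onslaught of hits from these bots means I cannot even reach the site to correct the problem.

What can I do to stop the onslaught of traffic to my websites?

I have asked my web hosting company (site5.com) about this problem many times over the past months, and they have not been able to help me.

What I really need is to prevent the bots from running the rss2html.php script. I tried both sessions and cookies, and both failed.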

robots.txt

User-agent: Mediapartners-Google
Disallow: 
User-agent: Googlebot
Disallow: 
User-agent: Adsbot-Google
Disallow: 
User-agent: Googlebot-Image
Disallow: 
User-agent: Googlebot-Mobile
Disallow: 
User-agent: MSNBot
Disallow: 
User-agent: bingbot
Disallow: 
User-agent: Slurp
Disallow: 
User-Agent: Yahoo! Slurp
Disallow: 
# Directories
User-agent: *
Disallow: /
Disallow: /cgi-bin/
Disallow: /ads/
Disallow: /assets/
Disallow: /cgi-bin/
Disallow: /phone/
Disallow: /scripts/
# Files
Disallow: /ads/random_ads.php
Disallow: /scripts/rss2html.php
Disallow: /scripts/search_terms.php
Disallow: /scripts/template.html
Disallow: /scripts/template_mobile.html

.htaccess

ErrorDocument 400 http://english-1329329990.spampoison.com
ErrorDocument 401 http://english-1329329990.spampoison.com
ErrorDocument 403 http://english-1329329990.spampoison.com
ErrorDocument 404 /index.php
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
SetEnvIfNoCase User-Agent "^baidu*" bad_bot
Order Deny,Allow
Deny from env=bad_bot
RewriteEngine on
RewriteCond %{HTTP_user_agent} bot\* [OR]
RewriteCond %{HTTP_user_agent} \*bot
RewriteRule ^.*$ http://english-1329329990.spampoison.com [R,L]
RewriteCond %{QUERY_STRING} mosConfig_[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to base64_encode crap to send via URL
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [OR]
# Block out any script that includes a <script> tag in URL
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
# Block out any script trying to set a PHP GLOBALS variable via URL
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
# Block out any script trying to modify a _REQUEST variable via URL
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
# Send all blocked request to homepage with 403 Forbidden error!
RewriteRule ^(.*)$ index.php [F,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !^/index.php
RewriteCond %{REQUEST_URI} (/|\.php|\.html|\.htm|\.feed|\.pdf|\.raw|/[^.]*)$  [NC]
RewriteRule (.*) index.php
RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
# Don't show directory listings for directories that do not contain an index file (index.php, default.asp etc.)
Options -Indexes
<Files http://english-1329329990.spampoison.com>
order allow,deny
allow from all
</Files>
deny from 108.
deny from 123.
deny from 180.
deny from 100.43.83.132

Updated to show the added user-agent bot check code

<?php
function botcheck(){
 $spiders = array(
   array('AdsBot-Google','google.com'),
   array('Googlebot','google.com'),
   array('Googlebot-Image','google.com'),
   array('Googlebot-Mobile','google.com'),
   array('Mediapartners','google.com'),
   array('Mediapartners-Google','google.com'),
   array('msnbot','search.msn.com'),
   array('bingbot','bing.com'),
   array('Slurp','help.yahoo.com'),
   array('Yahoo! Slurp','help.yahoo.com')
 );
 $useragent = strtolower($_SERVER['HTTP_USER_AGENT']);
 foreach($spiders as $bot) {
   if(preg_match("/$bot[0]/i",$useragent)){
     $ipaddress = $_SERVER['REMOTE_ADDR']; 
     $hostname = gethostbyaddr($ipaddress);
     $iphostname = gethostbyname($hostname);
     if (preg_match("/$bot[1]/i",$hostname) && $ipaddress == $iphostname){return true;}
   }
 }
}
if(botcheck() == false) {
  // User Login - Read Cookie values
     $username = $_COOKIE['username'];
     $password = $_COOKIE['password'];
     $radio_1 = $_COOKIE['radio_1'];
     $radio_2 = $_COOKIE['radio_2'];
     if (($username == 'm3s36G6S9v' && $password == 'S4er5h8QN2') || ($radio_1 == '2' && $radio_2 == '5')) {
     } else {
       $selected_username = $_POST['username'];
       $selected_password = $_POST['password'];
       $selected_radio_1 = $_POST['group1'];
       $selected_radio_2 = $_POST['group2'];
       if (($selected_username == 'm3s36G6S9v' && $selected_password == 'S4er5h8QN2') || ($selected_radio_1 == '2' && $selected_radio_2 == '5')) {
         setcookie("username", $selected_username, time()+3600, "/");
         setcookie("password", $selected_password, time()+3600, "/");
         setcookie("radio_1", $selected_radio_1, time()+3600, "/");
         setcookie("radio_2", $selected_radio_2, time()+3600, "/");
       } else {
        header("Location: login.html");
       }
     }
}
?>

I also added the following to the top of the rss2html.php script

// Checks to see if this script was called by the main site pages, (i.e. index.php or mobile.php) and if not, then sends to main page
   session_start();  
   if(isset($_SESSION['views'])){$_SESSION['views'] = $_SESSION['views']+ 1;} else {$_SESSION['views'] = 1;}
   if($_SESSION['views'] > 1) {header("Location: http://website.com/index.php");}

php htaccess robots.txt

— Sammy

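When did you update your robots.txt? It can take a while for the robots to read the updated version.

— ilanco, 2012

A few days ago. What I really need is to prevent the bots from running the rss2html.php script. I tried both sessions and cookies, and both failed.

How is rss2html.php used by your site? Through a PHP include, a redirect, Ajax...?

— cHao, 2012

The rss2html.php file is called via a file_get_contents() command.

file_get_contents...? That seems odd. Is the file on another server or something?

— cHao, 2012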

3

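If rss2html.php is not used directly by the client (that is, if it is always PHP that uses it, rather than it being a link or something), then forget about trying to block bots. All you really have to do is define a constant or something in the main page, then include the other script. In the other script, check whether the constant is defined, and spit out a 403 error or a blank page or whatever if it is not.

Now, for this to work, you will have to use include rather than file_get_contents, as the latter will either just read in the file (if you are using a local path) or run it in a whole other process (if you are using a URL). But this is the method that software like Joomla! uses to prevent a script from being requested directly. And use a file path rather than a URL, so that the PHP code has not already been parsed before you try to run it.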
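
A minimal sketch of the constant check described above. The constant name SITE_ENTRY and both helper function names are made up for illustration, and the return type declarations assume PHP 7.1 or later. index.php would define the constant and then include the script by filesystem path.

```php
<?php
// --- top of scripts/rss2html.php (sketch) ---
// SITE_ENTRY is a hypothetical marker constant that index.php defines
// before it includes this file.

/** Returns true when a site page defined the marker before including us. */
function called_from_site_page(): bool
{
    return defined('SITE_ENTRY');
}

/** Sends a 403 and stops when the script was requested directly. */
function deny_direct_access(): void
{
    if (!called_from_site_page()) {
        http_response_code(403);
        exit('Forbidden');
    }
}

// --- in index.php (sketch) ---
// define('SITE_ENTRY', true);
// include __DIR__ . '/scripts/rss2html.php';
```

rss2html.php would call deny_direct_access() as its first statement. A direct request never passes through index.php, so the constant is missing and the request ends with a 403 before any feed work is done.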

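Even better would be to move rss2html.php out from under the document root, but some hosts make that difficult to do. Whether that is an option depends on your server/host setup.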

— H

1

cHao, thanks. I am currently rewriting my code to change the file_get_contents calls to include.

— Sammy, 2012

4

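You can set your script to throw a 404 error based on the user agent string provided by the bots. They will quickly get the hint and leave you alone.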

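// Default to an empty string so $agent is always defined
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Googlebot usually identifies itself as "Mozilla/5.0 (compatible; Googlebot/2.1; ...)",
// so look for the token anywhere in the string, not only at the start
if (preg_match('/Googlebot/i', $agent)) {
    // Pass the status code to header() instead of setting it twice
    header('Location: http://www.google.com/', true, 301);
    exit;
}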

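Go through your logs and reject Bingbot and the like in a similar way. It won't stop the requests, but it might save some bandwidth. Give Googlebot a taste of its own medicine. Mwhahahahaha!

Update

Looking at your code, I think your problem is here: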

if (preg_match("/$bot[1]/i",$hostname) && $ipaddress == $iphostname)

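If they are malicious bots, then they could be coming from anywhere. Take that $ipaddress clause out and throw a 301 or 404 response at them.

Thinking alongside the box

Googlebot never accepts cookies, so it cannot store them. In fact, if you require cookies for all users, that will probably keep the bot from accessing your page.
Googlebot doesn't understand forms or JavaScript, so you could generate your links dynamically, or have users click a button to reach your code (with a suitable token attached).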

<a href="#" onclick="document.location='rss2html.php?validated=29e0-27fa12-fca4-cae3';">Rss2html.php</a>
- rss2html.php?validated=29e0-27fa12-fca4-cae3 - human
- rss2html.php - bot
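
On the server side, rss2html.php would then have to check the validated parameter. Here is a rough sketch, assuming a token derived from a secret and the current hour. The constant FEED_SECRET, the function names, and the hourly rotation are all assumptions for illustration and go beyond the fixed token shown in the link above. It needs PHP 7 for intdiv().

```php
<?php
// Hypothetical secret; keep the real one outside the web root.
const FEED_SECRET = 'change-me';

/** Builds the token for the hour that contains $now (a Unix timestamp). */
function feed_token(int $now): string
{
    return hash_hmac('sha256', (string) intdiv($now, 3600), FEED_SECRET);
}

/** True when $supplied equals the token for the hour that contains $now. */
function is_valid_feed_token(string $supplied, int $now): bool
{
    // hash_equals() compares the two strings in constant time
    return hash_equals(feed_token($now), $supplied);
}
```

index.php would print feed_token(time()) into the link it generates, and rss2html.php would answer 403 unless is_valid_feed_token($_GET['validated'] ?? '', time()) returns true. A token issued just before the hour rolls over stops working at the rollover, so a real version might also accept the previous hour's token.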

— web_bod

1

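Not as quickly as you would think. I have seen bots hit the same page for months, occasionally even years after the page was deleted. It depends on how well-behaved the bot is, and what it is after.

— cHao, 2012

What works for human visitors is that the index.php file calls the rss2html.php script. The bots avoid the index.php script and run rss2html.php directly. How can I protect the rss2html.php file if it was not accessed via the index.php script?

Try renaming rss2html.php to something else and updating your index.php to reference the new name.

— BluesRockAddict, 2012

I tried renaming the file, but it failed after a few days. How can I add the code I am using to this thread? I want to show you what I have tried.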

1

OK, I know a trick you can pull :) Put the rss2html.php script code outside of your website (will update the answer).

2

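PHP Limit/Block Website Requests for Spiders/Bots/Clients etc.

Here I have written a PHP function that can block unwanted requests to reduce your website traffic. Good for spiders, bots, and annoying clients.

Client/Bots Blocker

Demo: http://szczepan.info/9-webdesign/php/1-php-limit-block-website-requests-for-spiders-bots-clients-etc.html

Code: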

/* Function which can Block unwanted Requests
 * @return array of error messages
 */
function requestBlocker()
{
        /*
        Version 1.0 11 Jan 2013
        Author: Szczepan K
        http://www.szczepan.info
        me[@] szczepan [dot] info
        ###Description###
        A PHP function which can Block unwanted Requests to reduce your Website-Traffic.
        Good for Spiders, Bots and annoying Clients.

        */

        # Before using this function you must 
        # create & set this directory as writeable!!!!
        $dir = 'requestBlocker/';

        $rules   = array(
                #You can add multiple Rules in a array like this one here
                #Notice that large "sec definitions" (like 60*60*60) will blow up your client File
                array(
                        //if >5 requests in 5 Seconds then Block client 15 Seconds
                        'requests' => 5, //5 requests
                        'sek' => 5, //5 requests in 5 Seconds
                        'blockTime' => 15 // Block client 15 Seconds
                ),
                array(
                        //if >10 requests in 30 Seconds then Block client 20 Seconds
                        'requests' => 10, //10 requests
                        'sek' => 30, //10 requests in 30 Seconds
                        'blockTime' => 20 // Block client 20 Seconds
                ),
                array(
                        //if >200 requests in 1 Hour then Block client 10 Minutes
                        'requests' => 200, //200 requests
                        'sek' => 60 * 60, //200 requests in 1 Hour
                        'blockTime' => 60 * 10 // Block client 10 Minutes
                )
        );
        $time    = time();
        $blockIt = array();
        $user    = array();

        #Set Unique Name for each Client-File 
        $user[] = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'IP_unknown';
        $user[] = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        $user[] = strtolower(gethostbyaddr($user[0]));

        # Notice that I use files because bots do not accept Sessions
        $botFile = $dir . substr($user[0], 0, 8) . '_' . substr(md5(join('', $user)), 0, 5) . '.txt';


        if (file_exists($botFile)) {
                $file   = file_get_contents($botFile);
                $client = unserialize($file);

        } else {
                $client                = array();
                $client['time'][$time] = 0;
        }

        # Set/Unset Blocktime for blocked Clients
        if (isset($client['block'])) {
                foreach ($client['block'] as $ruleNr => $timestampPast) {
                        $elapsed = $time - $timestampPast;
                        if (($elapsed ) > $rules[$ruleNr]['blockTime']) {
                                unset($client['block'][$ruleNr]);
                                continue;
                        }
                        $blockIt[] = 'Block active for Rule: ' . $ruleNr . ' - unlock in ' . ($rules[$ruleNr]['blockTime'] - $elapsed) . ' Sec.';
                }
                if (!empty($blockIt)) {
                        return $blockIt;
                }
        }

        # log/count each access
        if (!isset($client['time'][$time])) {
                $client['time'][$time] = 1;
        } else {
                $client['time'][$time]++;

        }

        #check the Rules for Client
        $min = array(
                0
        );
        foreach ($rules as $ruleNr => $v) {
                $i            = 0;
                $tr           = false;
                $sum[$ruleNr] = 0;
                $requests     = $v['requests'];
                $sek          = $v['sek'];
                foreach ($client['time'] as $timestampPast => $count) {
                        if (($time - $timestampPast) < $sek) {
                                $sum[$ruleNr] += $count;
                                if ($tr == false) {
                                        #register non-use Timestamps for File 
                                        $min[] = $i;
                                        unset($min[0]);
                                        $tr = true;
                                }
                        }
                        $i++;
                }

                if ($sum[$ruleNr] > $requests) {
                        $blockIt[]                = 'Limit : ' . $ruleNr . '=' . $requests . ' requests in ' . $sek . ' seconds!';
                        $client['block'][$ruleNr] = $time;
                }
        }
        # index of the oldest timestamp that any rule still needs
        $firstNeeded = min($min);
        #drop non-use Timestamps in File 
        $i = 0;
        foreach ($client['time'] as $k => $v) {
                if ($i < $firstNeeded) {
                        unset($client['time'][$k]);
                }
                $i++;
        }
        $file = file_put_contents($botFile, serialize($client));


        return $blockIt;

}


if ($t = requestBlocker()) {
        echo 'dont pass here!';
        print_R($t);
} else {
        echo "go on!";
}
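
The demo above only echoes a message. One way to turn the blocker's result into a proper HTTP answer is sketched below; the function name throttle_response and the choice of a 429 status with a Retry-After header are assumptions of this sketch and are not part of the original function.

```php
<?php
/**
 * Turns the blocker's message list into an HTTP status and headers (sketch).
 * An empty list means the request may proceed.
 */
function throttle_response(array $messages, int $retryAfterSeconds): array
{
    if (empty($messages)) {
        return ['status' => 200, 'headers' => []];
    }
    return [
        'status'  => 429, // Too Many Requests
        'headers' => ['Retry-After: ' . $retryAfterSeconds],
    ];
}

// Usage sketch at the top of a page:
// $r = throttle_response(requestBlocker(), 15);
// if ($r['status'] !== 200) {
//     http_response_code($r['status']);
//     foreach ($r['headers'] as $h) { header($h); }
//     exit;
// }
```

Exiting right after the headers keeps the blocked request cheap, which is the point when the goal is to save CPU and bandwidth.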

— 事实

1

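Your site is most likely being indexed by fake Google bots. You can try adding a check and serving a 404 for all fake Google bot requests.

Here is an article that explains how to verify Googlebot: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

Also, you could check your records against known fake bots: http://stopmalvertising.com/security/fake-google-bots.html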
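
The verification in the first article is a reverse DNS lookup followed by a forward lookup that must return the original IP. A sketch of that check follows, assuming PHP 7.1 or later. The function name and the two resolver parameters are made up for this example, and the accepted domains (googlebot.com and google.com) are an assumption to confirm against Google's current guidance. Note the anchored suffix match: an unanchored match on "google.com" would also accept a host such as google.com.evil.example.

```php
<?php
/**
 * Forward-confirmed reverse DNS check (sketch).
 * $reverse and $forward default to PHP's resolver functions; they are
 * parameters only so the logic can be exercised without network access.
 */
function is_verified_googlebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel'; // IPv4 only

    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false; // lookup failed or no PTR record
    }
    // The host name must end in .googlebot.com or .google.com
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // The forward lookup must lead back to the requesting IP
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

A request whose user agent claims to be Googlebot but fails this check can be answered with a 404 or 403. Caching the result per IP would avoid two DNS lookups on every hit.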

— BluesRockAddict

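Thanks, but I actually tried that as well: determining the user agent and sending the bots to a login page. That failed too.

It sounds like you are missing the point. Relying on the user agent to determine a bot's authenticity is not enough.

— BluesRockAddict, 2012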

1

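First, you should really make sure that any page requested by the user agent of whichever abusive crawler you have is served as a static page.

Use Apache mod_rewrite with a condition, or the equivalent for your HTTP server. For Apache, something like this: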

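# Match the bot token anywhere in the User-Agent, case-insensitively
RewriteCond  %{HTTP_USER_AGENT}  Googlebot [NC,OR]
RewriteCond  %{HTTP_USER_AGENT}  OtherAbusiveBot [NC]
# ^/?$ matches the site root in both .htaccess and server config context
RewriteRule  ^/?$                /static_page_for_bots.html  [L]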

— smassey

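Thanks, but I do not want to block the Google, MS Bing, and Yahoo bots completely. I want to limit direct hits on the rss2html.php script file. I just need to add something at the beginning of the rss2html.php script that will prevent it from running if it was not accessed through the index.php script. The bots are currently running the rss2html.php script while bypassing the index.php file.

— Sammy

This doesn't block them. You simply serve a cached version of your PHP. That is very easy on the server: it is one less PHP instance and one less Apache child process. => Cost(static file) < Cost(PHP instance).

— smassey, 2012

How would I cache the pages? Since the pages are RSS, will the cached pages be refreshed often enough to provide fresh data?

— Sammy, 2012

Of course... Write a cron job that does it for you. If, as you say, they hit the server at 10 req/s, then by caching the pages for 1 minute you have saved your server 599 extra PHP instances (which certainly include DB connections/queries). And once a minute is far more often than what I would vote for: 10/15 minutes.

— smassey, 2012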
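
For illustration, here is a sketch of the caching idea done lazily inside PHP instead of with a cron job: the first request after the cache expires regenerates the file, and every other request is served from disk. The function name cached_output and its parameters are made up for this example. At 10 req/s with a 60 second lifetime, roughly one request in 600 would do the expensive rendering.

```php
<?php
/**
 * Returns the cached output when it is younger than $ttl seconds;
 * otherwise calls $render, stores the result, and returns it.
 */
function cached_output(string $cacheFile, int $ttl, callable $render): string
{
    clearstatcache(true, $cacheFile); // do not trust a stale mtime
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $html = $render();
    // Write to a temp file and rename it so readers never see a partial file
    $tmp = $cacheFile . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $html);
    rename($tmp, $cacheFile);
    return $html;
}

// Usage sketch (render_feed() is a hypothetical function that builds the page):
// echo cached_output(__DIR__ . '/cache/feed.html', 600, 'render_feed');
```

This still starts a PHP process per request, so it saves the feed fetching and rendering but not the PHP instance itself. The cron job plus static file approach from the comments avoids PHP entirely for bot traffic.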

1

To continue smassey's post, you can set several conditions:

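# Match the bot token anywhere in the User-Agent, case-insensitively
RewriteCond  %{HTTP_USER_AGENT}  Googlebot [NC,OR]
RewriteCond  %{HTTP_USER_AGENT}  OtherAbusiveBot [NC]
# Matches rss2html.php at the site root or in a subdirectory such as scripts/
RewriteRule  (^|/)rss2html\.php$  /static.html  [L]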

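This way, the bots can still access your pages, just not that one. It is strange that the (legitimate) bots are not following the rules. Do you have any referrers that push bots to your page from other sources (domain name forwarding, etc.)?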

— 恩德里克斯

1

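I have solved the same issue with the script available at http://perishablepress.com/blackhole-bad-bots/. With this blackhole approach I collected a list of malicious IPs and then denied them using .htaccess. (This is not mandatory, since the script itself does the banning. But I needed to reduce server load by avoiding PHP parsing for known unwanted IPs.) Within three days my traffic came down from 5 GB per day to 300 MB, which is quite expected.

Also check this page for a full list of .htaccess rules to block many known junk bots: http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html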

— Nishad TR

0

// Checks to see if this script was called by the main site pages,
// (i.e. index.php or mobile.php) and if not, then sends to main page
session_start();  
if (isset($_SESSION['views'])) {$_SESSION['views'] = $_SESSION['views']+ 1;} else {$_SESSION['views'] = 1;}
if ($_SESSION['views'] > 1) {header("Location: http://website.com/index.php");}

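This script does not do what the comment says; in fact it does the complete opposite. It will always let the bots through, since the session variable will never be set when a bot requests your script. All it will potentially do is prevent legitimate requests (from index.php or mobile.php) from calling the script more than once.

To prevent a bot from accessing your script, you should allow access only if a session variable (or cookie) is actually set. That assumes, of course, that the (malicious) bot does not accept cookies. (We know that the real Googlebot does not.)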
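
A sketch of the inverted check: the site pages mark the session, and rss2html.php refuses to run unless the mark is present. The session key from_site_page and both function names are hypothetical, and the void return type assumes PHP 7.1 or later.

```php
<?php
// 'from_site_page' is a hypothetical session key chosen for this sketch.

/** Called by index.php or mobile.php after session_start(). */
function mark_site_visit(array &$session): void
{
    $session['from_site_page'] = true;
}

/** Called at the top of rss2html.php: allow only sessions a site page has marked. */
function feed_access_allowed(array $session): bool
{
    return isset($session['from_site_page']) && $session['from_site_page'] === true;
}

// Top of rss2html.php (sketch):
// session_start();
// if (!feed_access_allowed($_SESSION)) {
//     http_response_code(403);
//     exit;
// }
```

This only works when the visitor's browser requests rss2html.php itself. If index.php fetches the script over HTTP with file_get_contents, as described in the question, that internal request carries no session cookie and would be rejected too.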

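As has already been mentioned, placing rss2html.php above the web root (outside of the public web space) would prevent a bot from accessing the script directly. But you say this causes other problems? Alternatively, place it in a directory and protect that directory with .htaccess. Or you might even be able to protect just the file itself in .htaccess from direct requests?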

— Mr. White

0

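Set up your domain on Cloudflare (there is a free service for this). They block malicious bots at the domain level before they hit your server. It takes about 20 minutes, and you never have to monkey with the code for it.

I use this service on all my sites and all client sites. They identify malicious bots based on a number of techniques, including leveraging Project Honey Pot.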

— 布雷特·布米特

0

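What you need to do is install an SSL certificate on your server for apache/nginx/email/ftp. Enable HSTS, and also edit your ssl.conf file so that SSLv2, SSLv3, and TLSv1 are disabled and incoming connections using them are not allowed. Harden your server the right way and you won't have any problems from bots.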

— Robert

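It is not clear to me what problem SSL/TLS is solving in this case. It looks like you are begging the question and arriving at the result. Please explain how this solution controls the problem.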

Need to stop bots from killing my web server
