How do I detect search engine bots with PHP?


Answers:


74

Here's a Search Engine Directory of spider names.

Then you use $_SERVER['HTTP_USER_AGENT']; to check if the agent is one of those spiders.

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}

if ((eregi("yahoo", $this->USER_AGENT)) && (eregi("slurp", $this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } will this work fine?
terrific

3
Because strpos can return 0 (the position), and strstr returns FALSE on failure, you can use strpos if you add a !== false check at the end.
Ólafur Waage

2
Um, strpos also returns FALSE on failure. It's faster and more efficient, though (no preprocessing and no O(m) storage).
Damon, 2014
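
For illustration, the strpos-based variant these comments describe might look like this (a sketch, not part of the original answer):

if (strpos(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot") !== false)
{
    // what to do
}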

6
What about fake user agents?

2
And what if someone changes their user agent to a fake name and calls it "Googlebot"? I think checking the IP range is more trustworthy!
Mojtaba Rezaeian, 2015

235

I use the following code, which seems to be working fine:

function _bot_detected() {

  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

Update June 6, 2017: https://support.google.com/webmasters/answer/1061943?hl=zh_CN

Added mediapartners.
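
For reference, a minimal usage sketch of the function above (the branch bodies are just placeholders):

if (_bot_detected()) {
    // e.g. skip analytics or session handling for crawlers
} else {
    // normal visitor handling
}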


2
Does this assume that bots reveal themselves?
Jeromie Devera, 2014

2
Vote down; the user agent can be changed in Chrome settings, Firefox, ...
barwnikk

24
Yes, the user agent can be changed, but anyone who changes it to contain "bot", "crawl", "slurp", or "spider" knows what's coming to them. It also depends on the use case. I wouldn't use this to strip out all the CSS, but I would use it to not store cookies, ignore location logging, or skip a landing page.
JonShipman

2
Does no one agree with me that this is far too broad a match?
Daan, 2015

I've been using your function for more than a day and it seems to work fine. But I'm not sure. How can I send test bots through to test it?
FarrisFahad

19

Check $_SERVER['HTTP_USER_AGENT'] against some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or, more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits from the most common search engine crawlers, you could use:

$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/';
$matches = array();
$numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches);
if ($numMatches > 0) // Found a match
{
  // $matches[1] contains the text that matched, either 'google' or 'yahoo'
}

16

You can check whether it is a search engine with the following function:

<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );

    // look for each crawler token inside the user agent string
    // (checking the whole user agent against the joined list would almost never match)
    foreach ($crawlers as $crawler) {
        if (stripos($USER_AGENT, $crawler) !== false) {
            return true;
        }
    }
    return false;
}
?>

Then you can use it like this:

<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
  if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>

2
I think this list is outdated; for example, I don't see "slurp", which is Yahoo's spider: help.yahoo.com/kb/SLN22600.html
Daan

11

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition, I use a whitelist to block unwanted bots:

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

A blocked bot (= a potential false-positive user) is then able to solve a captcha to unblock itself for 24 hours. And since no one ever solves that captcha, I know it doesn't produce false positives. So the bot detection seems to work perfectly.

Note: my whitelist is based on Facebook's robots.txt.
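
To make the flow concrete, here is a rough sketch of how the two regex checks above might be combined; is_unblocked_by_captcha() and show_captcha_block() are hypothetical helpers, not part of the original answer:

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$looksLikeBot = preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $ua);
$isAllowedBot = preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $ua);

if ($looksLikeBot && !$isAllowedBot && !is_unblocked_by_captcha($_SERVER['REMOTE_ADDR'])) {
    // unwanted bot (or a false-positive user): show a captcha that lifts the block for 24 hours
    // (both helpers are hypothetical placeholders)
    show_captcha_block();
    exit;
}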


You forgot a ) at the end of your first piece of code.
Ludo

10

Because any client can set the user agent to whatever they want, looking for "Googlebot", "bingbot" and so on is only half the work.

The second part is verifying the client's IP. In the old days this required maintaining IP lists. All the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26

First, perform a reverse DNS lookup of the client IP. For Google the hostname must be under googlebot.com; for Bing it must be under search.msn.com. Then, because someone could set up such a reverse DNS entry on an IP they own, you need to verify with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it is a crawler from that search engine.

I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
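
For anyone who wants the idea in PHP rather than Java, a minimal sketch of the two-step check described above could look like this (simplified, no caching; the crawler domains are the ones named in this answer):

function isVerifiedCrawler($ip) {
    // Step 1: reverse DNS lookup of the client IP
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false; // no usable reverse DNS record
    }

    // the hostname must belong to a known crawler domain
    if (!preg_match('/\.(googlebot\.com|search\.msn\.com)$/i', $host)) {
        return false;
    }

    // Step 2: a forward DNS lookup on that hostname must point back to the same IP
    $ips = gethostbynamel($host);
    return $ips !== false && in_array($ip, $ips, true);
}

In practice you would cache the result per IP, as the comments below point out.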


1
All the other answers here that use user-agent strings are only halfway there. Wow.
mlissner, 2017

1
There are a lot of comments about the user-agent check being only half of the check. This is true, but keep in mind that the full DNS and reverse-DNS lookup has a big performance impact. It all depends on the level of certainty you need for your use case. This approach gives 100% certainty at the expense of performance. You have to decide what the right balance (and therefore the best solution) is for your situation.
Brady Emerson

There is no "big performance impact". First, the reverse DNS lookup is only performed on visitors that identify as a search engine; humans are not affected at all. Second, the lookup is performed only once per IP and the result is cached; search engines keep using the same IP ranges for very long periods and usually hit one site from only one or a few IPs. Also, you can perform the validation deferred: let the first request through and verify in the background; if it fails, block subsequent requests. (I wouldn't recommend this, because harvesters now have large IP pools...)
Fabian Kessler

Is there an analogous library written in PHP?
userlond

8

I use this function... part of the regex comes from PrestaShop, but I added some more bots to it.

public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}

Anyway, be aware that some bots use a browser-like user agent to fake their identity
(I get many Russian IPs with this behaviour on my site).

One distinctive feature of most bots is that they don't carry any cookies, so no session gets attached to them.
(I'm not sure how, but this is surely the best way to track them.)
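
A rough sketch of that cookie heuristic (session_name() is the name of PHP's session cookie; the per-IP counter is a hypothetical helper, not part of the original answer):

session_start();

// Real browsers send the session cookie back on their second and later requests;
// most bots never do, so repeated cookie-less hits from one client are a bot signal.
if (!isset($_COOKIE[session_name()])) {
    // hypothetical helper: count cookie-less hits per IP and flag the IP after some threshold
    register_cookieless_hit($_SERVER['REMOTE_ADDR']);
}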




4
<?php // IPCLOACK HOOK
if (CLOAKING_LEVEL != 4) {
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
            'http://labs.getyacg.com/spiders/google.txt',
            'http://labs.getyacg.com/spiders/inktomi.txt',
            'http://labs.getyacg.com/spiders/lycos.txt',
            'http://labs.getyacg.com/spiders/msn.txt',
            'http://labs.getyacg.com/spiders/altavista.txt',
            'http://labs.getyacg.com/spiders/askjeeves.txt',
            'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        // fetch(), FILE_BOTS and CLOAKING_LEVEL come from the YACG script this snippet belongs to
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list);
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp = fopen(FILE_BOTS, "w");
        fwrite($fp, $opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // a reverse-DNS hostname can only ever match one of these, so the check must be OR, not AND
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    if (strlen($ref) > 0) {
        $cloak = 0;
    }

    if ($cloak >= $threshold) {
        $cloakdirective = 1;
    } else {
        $cloakdirective = 0;
    }
}
?>

That would be the ideal way to cloak for spiders. It's from an open-source script called [YACG] - http://getyacg.com

It needs a bit of work, but it's definitely the way to go.


2

I made a good and fast function for this:

function is_bot(){

    if(isset($_SERVER['HTTP_USER_AGENT']))
    {
        return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
    }

    return false;
}

This covers 99% of all possible bots, search engines, etc.


1

I'm using this code and it works quite well. It makes it very easy to know which user agents have visited your site. The code opens a file and writes the user_agent into it. You can check yourdomain.com/useragent.txt every day, learn about new user agents, and add them to the condition of your if clause.

$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
    // if the user agent does not match the known bots above,
    // do what you need

    // then open a file and append the user_agent to it. Check useragent.txt each day,
    // learn about new user agents and add them to the condition of the if clause
    if ($user_agent != "") {
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent . "\n");
        fclose($myfile);
    }
}

Here is the content of useragent.txt:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174

What would your if-clause string be for this one? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
Average Joe

1

A 100% working bot detector. It works successfully on my website.

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // 'Above given bots detected'
    }

    return false;

} // End :: isBotDetected()

1

If you really need to detect GOOGLE's bots, you should never rely on the "user_agent" or the "IP" address alone, because the "user_agent" can be changed. Instead, follow what Google says in "Verifying Googlebot":

To verify Googlebot as the caller:

1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.

2. Verify that the domain name is either googlebot.com or google.com.

3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command on that domain name. Verify that it is the same as the original accessing IP address from your logs.

Here is my tested code:

<?php
$remote_add=$_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($remote_add);
$googlebot = 'googlebot.com';
$google = 'google.com';
if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname),strrev($google)) === 0 ) 
{
//add your code
}

?>

In this code we check the "hostname", which must have "googlebot.com" or "google.com" at the end of the "hostname"; this is really important to check the exact domain and not a subdomain. I hope you like it ;)


0

For Google I'm using this method.

function is_google() {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr( $ip );
    if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {

        $forward_lookup = gethostbyname( $host );

        if ( $forward_lookup == $ip ) {
            return true;
        }

        return false;
    } else {
        return false;
    }

}

var_dump( is_google() );

Credits: https://support.google.com/webmasters/answer/80553


-1
function bot_detected() {

  if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
    return true;
  }
  else {
    return false;
  }
}
Licensed under cc by-sa 3.0 with attribution required.