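How can I detect crawlers/spiders with PHP?
I'm currently working on a project where I need to track every crawler visit.
I know you're supposed to use HTTP_USER_AGENT, but I'm not sure how to structure the code for this purpose. I also know the user agent can easily be changed, so I'd like to know whether I can add more checks to guard against spoofing.
Sample code of what I'm trying to do: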
<?php
$user_agent = $_SERVER['HTTP_USER_AGENT'];
if (strpos($user_agent, 'Google') !== false) {
    echo "Googlebot is here";
}
?>
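Thanks
According to Verifying Googlebot:
You can verify that a bot accessing your server really is Googlebot (or another Google user agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example: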
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
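Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
You can perform the reverse DNS lookup like this: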
function validateGoogleBotIP($ip) {
    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"
    // The hostname must end in .google.com or .googlebot.com
    return preg_match('/\.google(bot)?\.com$/i', $hostname) === 1;
}
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}
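The snippet above only does the reverse lookup; the quoted Google guidance also asks for a forward lookup on the returned name. A minimal sketch of that full round trip follows. The function name `isVerifiedGooglebot` and the two optional callables are hypothetical and not part of the answer above; they exist only so the logic can be exercised without real DNS, and they default to PHP's `gethostbyaddr()` and `gethostbynamel()`. Note that `gethostbynamel()` only returns IPv4 addresses, so this sketch assumes an IPv4 client.

```php
<?php
/**
 * Forward-confirmed reverse DNS check for a client that claims to be Googlebot.
 * $reverse and $forward are hypothetical injection points (assumptions for
 * illustration); by default the real DNS functions are used.
 */
function isVerifiedGooglebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the unmodified IP when
    // the lookup fails and false on malformed input.
    $hostname = $reverse($ip);
    if (!is_string($hostname) || $hostname === $ip) {
        return false;
    }

    // Step 2: the name must end in .google.com or .googlebot.com.
    if (preg_match('/\.google(bot)?\.com$/i', $hostname) !== 1) {
        return false;
    }

    // Step 3: forward lookup on that name must include the original IP.
    $addresses = $forward($hostname);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

In a request handler this would be called as `isVerifiedGooglebot($_SERVER['REMOTE_ADDR'])`. DNS lookups can be slow, so caching the result per IP is probably worth considering.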
This works on my website to detect bots, crawlers, spiders, and site copiers:
function isBotDetected() {
    // Case-insensitive match of the user agent against a list of known bot tokens
    if ( !empty($_SERVER['HTTP_USER_AGENT']) && preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // One of the bots listed above was detected
    }
    return false;
} // End :: isBotDetected()
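Since the question is about tracking each crawler's visit, it may help to know which token matched rather than only getting true or false. The sketch below is one way to do that, assuming a deliberately short, illustrative token list; `detectBotToken` is a hypothetical name, not part of the answer above. Specific names are listed before generic ones so that `googlebot` wins over `bot`. Keep in mind that very short tokens (the long pattern above includes `get`, `find`, and `nil`, for instance) can also match ordinary browser user agents.

```php
<?php
/**
 * Return the first known bot token found in the user agent, or null.
 * The token list is an illustrative subset (an assumption), ordered from
 * specific to generic so the most informative match is returned.
 */
function detectBotToken(string $userAgent): ?string
{
    $tokens = [
        'googlebot', 'bingbot', 'baiduspider', 'yandexbot', 'duckduckbot',
        'crawler', 'spider', 'bot',
    ];
    foreach ($tokens as $token) {
        // stripos() is case-insensitive, like the /i flag on a regex
        if (stripos($userAgent, $token) !== false) {
            return $token;
        }
    }
    return null;
}
```

A call such as `detectBotToken($_SERVER['HTTP_USER_AGENT'] ?? '')` could then be logged alongside the IP and timestamp. This still relies on the user agent alone, so it is only as trustworthy as that header.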
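To properly verify that a visitor really comes from a search engine, you need more than a check of the easily spoofed user agent.
The correct approach is to look up the hostname for the IP and check whether it matches any of the hostnames that search engine crawlers are known to use.
If the hostname matches one of the known crawlers, look up the IP for that hostname and check that the two match. If either step fails, the visitor is a fake search engine crawler.
The following function takes an IP and follows the steps described above. It recognizes Baidu, Bing, Google, Yahoo, and Yandex.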
/**
 * Validate a crawler's IP against its hostname
 * Warning - str_ends_with() requires PHP 8
 *
 * @param string $testip
 * @return boolean
 */
function validate_crawler_ip( $testip ) {
    $hostname = strtolower( gethostbyaddr( $testip ) );
    $valid_host_names = array(
        '.crawl.baidu.com',
        '.crawl.baidu.jp',
        '.google.com',
        '.googlebot.com',
        '.crawl.yahoo.net',
        '.yandex.ru',
        '.yandex.net',
        '.yandex.com',
        '.search.msn.com',
    );
    foreach ( $valid_host_names as $valid_host ) {
        // Using str_ends_with() to make sure the match is at the -end- of the hostname (to prevent fake matches)
        if ( str_ends_with( $hostname, $valid_host ) ) { // PHP 8 function
            $returned_ip = gethostbyname( $hostname );
            if ( $returned_ip === $testip ) {
                // The looked up IP from the host matches the incoming IP - we have validated!
                return true;
            }
        }
    }
    // No match - not valid crawler
    return false;
}
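For servers still on PHP 7, the `str_ends_with()` call above will fail with an undefined function error. One possible workaround, sketched here as an assumption rather than as part of the answer, is to define a fallback only when the native function is missing. It aims to mirror the PHP 8 behavior, including returning true for an empty needle.

```php
<?php
// Fallback for PHP < 8: define str_ends_with() only if it does not exist yet.
if (!function_exists('str_ends_with')) {
    function str_ends_with(string $haystack, string $needle): bool
    {
        // PHP 8 treats the empty string as a suffix of every string
        if ($needle === '') {
            return true;
        }
        // Compare the last strlen($needle) bytes of the haystack to the needle
        return substr($haystack, -strlen($needle)) === $needle;
    }
}
```

With this loaded before `validate_crawler_ip()`, a hostname such as `googlebot.com.evil.example` is still rejected, because the trusted suffix has to sit at the very end of the name.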