HTTP 500 error in simple PHP web crawler


Keywords: HTTP, error, PHP, web, simple | Updated: 2023-09-27

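I'm trying to run a web crawler pointed at a single URL whose page contains no links. The code looks fine to me, but I get an HTTP 500 error.

All it does with the crawled content is echo it.

Any idea why?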

<?php
error_reporting( E_ERROR );
define( "CRAWL_LIMIT_PER_DOMAIN", 50 );
$domains = array();
$urls = array();
function crawl( $url )
{
    global $domains, $urls;
    $parse = parse_url( $url );
    $domains[ $parse['host'] ]++;
    $urls[] = $url;
    $content = file_get_contents( $url );
    if ( $content === FALSE ){
        echo "Error: No content";
        return;
    }
    $content = stristr( $content, "body" );
    preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );
    // do something with content.
    echo $content;
    foreach( $matches[0] as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        if ( count( $domains[ $parse['host'] ] ) < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $crawled_url );
        }
    }
}
crawl(http://the-irf.com/hello/hello6.html);
?>

Replace:

crawl(http://the-irf.com/hello/hello6.html);

with:

crawl('http://the-irf.com/hello/hello6.html');

A URL is a string literal, so it must be enclosed in quotes.
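To see why the unquoted call breaks the whole script, here is a minimal sketch, assuming PHP 7 or later (where a syntax error in `eval()` raises `ParseError`). The helper name `parses` is hypothetical and only for illustration. Without quotes, PHP reads everything after `//` as a comment, so what remains, `crawl(http:`, is not a valid statement. A parse error like this stops the script before any of it runs, and with error display turned off the server will typically answer with a bare HTTP 500.

```php
<?php
// Hypothetical helper: reports whether a snippet of PHP source parses.
// Assumes PHP 7+, where eval() throws ParseError on a syntax error.
function parses(string $code): bool
{
    try {
        // The leading "return;" keeps the snippet from actually running;
        // the whole string is still parsed before anything executes.
        eval('return; ' . $code);
        return true;
    } catch (ParseError $e) {
        return false;
    }
}

// Unquoted URL: "//the-irf.com/..." is treated as a comment, leaving "crawl(http:".
var_dump(parses("crawl(http://the-irf.com/hello/hello6.html);"));   // bool(false)

// Quoted URL: an ordinary function call with a string argument.
var_dump(parses("crawl('http://the-irf.com/hello/hello6.html');")); // bool(true)
```

Running `php -l` on the script file from the command line is another way to catch this kind of syntax error before deploying.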

Regarding your question about stristr:

Returns all of haystack starting from and including the first occurrence of needle to the end.

So, your code:

$content = stristr( $content, "body" );

will return everything in $content starting from, and including, the first occurrence of "body".
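As a quick illustration of that behavior, here is a minimal sketch using a made-up HTML string (`$html` and `$tail` are illustrative names, not part of the question's code). Note that the match is case-insensitive, that the result starts at the word `body` itself (so the leading `<` of the tag is dropped), and that `stristr` returns `false` when the needle is not found.

```php
<?php
// Made-up sample input, for illustration only.
$html = '<html><head><title>Hi</title></head><BODY class="x"><p>Hello</p></body></html>';

// Everything from the first case-insensitive match of "body" to the end.
$tail = stristr($html, 'body');
echo $tail, "\n"; // BODY class="x"><p>Hello</p></body></html>

// No match: stristr() returns false rather than an empty string.
var_dump(stristr('no match here', 'body')); // bool(false)
```

Because the return value can be `false`, it is worth checking it before passing it on to `preg_match_all`.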