PHP web crawler doesn't crawl .php files

This is a simple web crawler I'm trying to build:

<?php
    $to_crawl = "http://samplewebsite.com/about.php";
    function get_links($url)
    {
        $input = @file_get_contents($url);
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        preg_match_all("/$regexp/siU", $input, $matches);
        $l = $matches[2];
        foreach ($l as $link) {
            echo $link."<br>";
        }
    }

    get_links($to_crawl);

?>

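When I run the script with the $to_crawl variable set to a URL that ends in a name without an extension, e.g. "facebook.com/about", it works. But for some reason it echoes nothing when the link ends with a '.php' filename. Can anyone help me?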

To get all links and their inner text, you can use DOMDocument like this:

$dom = new DOMDocument;
@$dom->loadHTML($input);                    // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]');          // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
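
The snippet above assumes `$input` already holds the page's HTML. As a minimal sketch, the same approach can be wrapped in a function that takes an HTML string and returns the pairs. The function name `extract_links` and the sample markup are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns [href, text] pairs for every <a href> in $html.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];                       // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $result = [];
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = [$link->getAttribute('href'), trim($link->nodeValue)];
    }
    return $result;
}

// Sample markup, made up for illustration
$sample = '<p><a href="about.php">About</a> <a name="top">no href</a> '
        . '<a href=\'/contact\'>Contact <b>us</b></a></p>';
print_r(extract_links($sample));
```

Using `libxml_use_internal_errors()` instead of `@` keeps malformed-markup warnings out of the output without hiding unrelated errors. The anchor without an `href` is skipped by the XPath expression, and nested tags inside a link contribute only their text.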

See the IDEONE demo.
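
As for the symptom in the question: the `@` in front of `file_get_contents()` hides any warning, and the function returns `false` on failure. If the request for the `.php` URL fails, the pattern is matched against an empty input and nothing is echoed. The following is a minimal sketch for surfacing the actual failure. The helper name `fetch_html`, the User-Agent string, and the timeout are assumptions for illustration; whether the server really rejects the request is something only the reported error can tell.

```php
<?php
// Hypothetical helper: fetch a URL and throw instead of failing silently.
function fetch_html(string $url): string
{
    // Assumed values; some servers refuse requests that send no User-Agent.
    $context = stream_context_create([
        'http' => [
            'user_agent' => 'ExampleCrawler/0.1',
            'timeout'    => 5,
        ],
    ]);

    error_clear_last();
    // Warnings stay suppressed, but error_get_last() still records them
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        $error  = error_get_last();
        $reason = $error['message'] ?? 'unknown error';
        throw new RuntimeException("Could not fetch $url: $reason");
    }
    return $html;
}

try {
    $html = fetch_html('http://samplewebsite.com/about.php');
    echo strlen($html), " bytes fetched\n";
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";   // shows why the fetch failed
}
```

The string returned by `fetch_html()` can then be passed to the DOMDocument code as `$input`.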