PHP Web Crawler doesn't crawl .php files

Keywords: PHP, file, crawl, crawler, web | Updated: 2023-09-27

This is the simple web crawler I'm trying to build:

<?php
    $to_crawl = "http://samplewebsite.com/about.php";
    function get_links($url)
    {
        $input = @file_get_contents($url);
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        preg_match_all("/$regexp/siU", $input, $matches);
        $l = $matches[2];
        foreach ($l as $link) {
            echo $link."</br>";
        }
    }

    get_links($to_crawl);

?>

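When I run the script with the $to_crawl variable set to a URL ending in a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends in a '.php' file name. Can anyone help me?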
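
Because the `@` in front of `file_get_contents` suppresses any warning, a failed request and a page with no matching links both produce empty output in the script above. One way to tell them apart is to check the return value explicitly. This is a minimal sketch; the helper name `fetch_html` is made up for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper (not in the original script): fetches a URL and
// reports why the fetch failed instead of silently returning nothing.
function fetch_html(string $url): ?string
{
    error_clear_last();                      // forget any earlier error
    $input = @file_get_contents($url);       // returns false on failure
    if ($input === false) {
        $err = error_get_last();             // still recorded despite the @
        error_log("Fetch failed for $url: " . ($err['message'] ?? 'unknown error'));
        return null;
    }
    return $input;
}

// Usage with the URL from the question:
// $html = fetch_html("http://samplewebsite.com/about.php");
// if ($html !== null) { echo strlen($html) . " bytes fetched\n"; }
```

If the helper returns a non-empty string for the `.php` URL, the fetch itself is fine and the link extraction is the place to look next.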

To get all links and their inner text, you can use DOMDocument like this:

$dom = new DOMDocument;
@$dom->loadHTML($input);                    // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]');          // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);

See the IDEONE demo.
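
For reference, the DOMDocument approach can be wrapped in a function that takes an HTML string, so it can be tried without a network request. This is only a sketch: the function name `extract_links` and the sample markup are assumptions for illustration, and `libxml_use_internal_errors` is used here in place of the `@` operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns a list of [href, text] pairs for every
// <a> element that has an href attribute.
function extract_links(string $html): array
{
    if ($html === '') {
        return [];                            // loadHTML() rejects empty input
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $result = [];
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = [$link->getAttribute('href'), trim($link->nodeValue)];
    }
    return $result;
}

// Sample markup standing in for a fetched page (made up for illustration).
$sample = '<p><a href="about.php">About us</a> '
        . "<a href='/contact'>Contact</a> "
        . '<a name="top">anchor without href</a></p>';
print_r(extract_links($sample));
```

Unlike the regular expression in the question, the XPath query does not depend on how the `href` attribute is quoted or where it appears in the tag.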