This is a simple web crawler I'm trying to build:
<?php
$to_crawl = "http://samplewebsite.com/about.php";
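function get_links($url)
{
    // Note: @ suppresses any warning if the request fails
    $input = @file_get_contents($url);
    // Capture the href value (group 2) and the inner text (group 3) of each <a> tag
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $input, $matches);
    $l = $matches[2];
    foreach ($l as $link) {
        echo $link."<br>";
    }
}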
get_links($to_crawl);
?>
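When I run the script with the $to_crawl variable set to a URL ending with a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends with a '.php' file name. Can anyone help me?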
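One way to narrow this down, sketched under an assumption the question does not confirm: the empty output may come from file_get_contents() failing silently behind the @ operator, rather than from the regex. The helper name fetch_html below is hypothetical; it only uses the standard error_clear_last(), error_get_last() and error_log() functions to surface the hidden warning.

```php
<?php
// Hypothetical helper (not part of the original script): fetch a URL or file
// and log the reason when the read fails, instead of discarding the warning.
function fetch_html(string $url): ?string
{
    error_clear_last();                 // forget any earlier error
    $body = @file_get_contents($url);   // still suppressed, but inspected below
    if ($body === false) {
        $err = error_get_last();        // the warning that @ hid, if any
        error_log("Fetch failed for $url: " . ($err['message'] ?? 'unknown error'));
        return null;
    }
    return $body;
}
```

Calling fetch_html($to_crawl) and checking for null before running the regex would show whether the '.php' URL is failing at the download step or returning HTML that simply contains no matching links.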
To get all links and their inner text, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See the IDEONE demo.
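For anyone who wants to try the DOMDocument approach without fetching a page, here is a minimal self-contained sketch. The function name extract_links and the $sample markup are made up for illustration; the sketch also swaps the @ operator for libxml_use_internal_errors(), which is one common way to keep parser warnings about imperfect HTML quiet.

```php
<?php
// Hypothetical wrapper around the DOMDocument approach:
// returns a list of [href, inner text] pairs for every <a> tag with an href.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];   // DOMDocument::loadHTML() does not accept an empty string
    }

    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $result = [];
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = [$link->getAttribute('href'), trim($link->nodeValue)];
    }
    return $result;
}

// Illustrative input only; the second anchor has no href and is skipped.
$sample = '<p><a href="/about.php">About us</a> <a name="top">no href</a> '
        . '<a href="http://samplewebsite.com/">Home</a></p>';
print_r(extract_links($sample));
```

In a crawler, the $html argument would be whatever the download step returned, so an empty result here points at the page content rather than at the fetch.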