How to scrape content between all link tags like <a href="">SCRAPE THIS</a> on a page?

I'm trying to scrape the link text from a website, i.e. the SCRAPE THIS part. I want to do this for every link on the page. So far I have this:

<?php
$target_url = "SITE I WANT TO SCRAPE";
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" .curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab the text of all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/text()");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    echo "<br />Link stored: $href";
}
?>

I'm pretty new to this and can't figure out what I'm doing wrong.

Thanks!

In the for loop, $href is not a string. It is actually a DOMText node. To use it as a string, you need to access its nodeValue property.

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    echo "<br />Link stored: $href->nodeValue";
}
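
To try this without making a cURL request, here is a minimal self-contained sketch. It assumes a small inline HTML string in place of the fetched page; the sample markup and the linkTexts helper are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical helper: returns the string content of every text node
// that is a direct child of an <a> element.
function linkTexts(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings about imperfect markup, as in the question.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('/html/body//a/text()');

    $texts = [];
    foreach ($nodes as $node) {
        // $node is a DOMText object; nodeValue holds its string content.
        $texts[] = $node->nodeValue;
    }
    return $texts;
}

// Sample markup standing in for the page fetched with cURL (an assumption).
$html = '<html><body><a href="/one">First</a> <a href="/two">Second</a></body></html>';

foreach (linkTexts($html) as $text) {
    echo "<br />Link stored: " . htmlspecialchars($text);
}
```

DOMNodeList can be iterated with foreach, so the index-based for loop from the question is optional here.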

Try:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/text()");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i)->textContent;
    echo "<br />Link stored: $href";
}
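
One caveat: a/text() only matches text nodes that are direct children of the link, so a link such as <a href=""><b>SCRAPE</b> THIS</a> would yield only " THIS". A possible variation, sketched here on the assumption that you want one string per link, selects the <a> elements themselves and reads textContent, which concatenates all descendant text. The sample markup and the linkLabels name are illustrative, not part of the original code.

```php
<?php
// Hypothetical helper: returns one trimmed label per <a> element,
// including text nested inside child tags such as <b> or <span>.
function linkLabels(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings about imperfect markup, as in the question.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

    $labels = [];
    foreach ($xpath->query('//a') as $anchor) {
        // textContent on an element joins the text of all its descendants.
        $labels[] = trim($anchor->textContent);
    }
    return $labels;
}

// Sample markup standing in for the fetched page (an assumption).
$html = '<html><body><a href="/one"><b>SCRAPE</b> THIS</a><a href="/two">Plain</a></body></html>';

print_r(linkLabels($html));
```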