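I am trying to write a PHP script that scrapes a website and stores some of its elements in a database.
Here is my problem. The web page is written like this: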
<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
I only want the h2 elements and the p elements with the interesting text, not the p class="one_class" elements.
I tried this PHP code:
<?php
$numberP = 0;
foreach($html->find('p') as $p)
{
    $pIsOneClass = PIsOneClass($html, $p);
    if($pIsOneClass == false)
    {
        echo $p->outertext;
        $h2 = $html->find("h2", $numberP);
        echo $h2->outertext;
        $numberP++;
    }
}
?>
The function PIsOneClass($html, $p) is:
<?php
function PIsOneClass($html, $p)
{
    foreach($html->find("p.one_class") as $p_one_class)
    {
        if($p == $p_one_class)
        {
            return true;
        }
    }
    return false;
}
?>
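It doesn't work, and I understand why, but I don't know how to fix it.
How can I express "I want every p that sits between two h2 elements"?
Thanks a lot!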
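This task is easier with XPath, since you are scraping more than one kind of element and want to keep the results in source order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you want: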
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';
# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);
# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');
# iterate through the DOMNodeList of search results (h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
    echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}
Output:
node h2, value: The title 1
node p, value: Some interesting text
node h2, value: The title 2
node p, value: Some interesting text
node p, value: Some other interesting text
node h2, value: The title 3
node p, value: Some interesting text
As you can see, the results stay in source order, and it is easy to filter out the nodes you don't want.
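Because the nodes come back in source order, the "every p between two h2" part of the question can be handled with a single pass that starts a new group at each h2. The following is a minimal sketch under that assumption, not the only way to do it; `$sections` and its `title` / `paragraphs` keys are names made up for illustration, and the markup is the sample from the question.

```php
<?php
// Sample markup copied from the question.
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// $sections (hypothetical name): one entry per h2, each holding the
// wanted paragraphs that appear before the next h2.
$sections = [];

foreach ($xp->query('//h2 | //p[not(@class="one_class")]') as $node) {
    $text = trim($node->nodeValue);
    if ($node->nodeName === 'h2') {
        // A heading opens a new group.
        $sections[] = ['title' => $text, 'paragraphs' => []];
    } elseif ($sections) {
        // A paragraph belongs to the most recent heading, if there is one.
        $sections[count($sections) - 1]['paragraphs'][] = $text;
    }
}

print_r($sections);
```

Each entry of `$sections` then holds one title together with its paragraphs, which should map fairly directly onto database rows. Paragraphs that appear before the first h2 are skipped in this sketch.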
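From the Simple HTML DOM manual:
[attribute=value]
Matches elements that have the specified attribute with a certain value.
Or
[!attribute]
Matches elements that don't have the specified attribute.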
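Going by the manual text quoted above, a selector such as `p[!class]` passed to `find()` should pick out the paragraphs that have no class attribute. For readers using PHP's built-in DOM extension instead, the sketch below shows what I believe are the closest XPath counterparts of the two selectors. The shortened markup and the variable names `$withClass` and `$withoutClass` are assumptions for illustration only.

```php
<?php
// Shortened sample based on the markup in the question.
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Counterpart of [attribute=value]: p elements whose class is exactly "one_class".
$withClass = $xp->query('//p[@class="one_class"]');

// Counterpart of [!attribute]: p elements with no class attribute at all.
$withoutClass = $xp->query('//p[not(@class)]');

foreach ($withoutClass as $p) {
    echo trim($p->nodeValue) . PHP_EOL;
}
```

Note that `not(@class)` is stricter than `not(@class="one_class")`: it also drops paragraphs that carry any other class, which may or may not be what you want on the real page.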