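I want to get the content of the main div without any of the nested tags. For example, from the code below I want to scrape "Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter." I am using XPath with Simple HTML DOM. This is my code: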
foreach($dom->find('//*[@id="main"]/text()[1]') as $element){
$details=$element;
}
But it does not return any element and never enters the foreach. Can you suggest a solution?
<div id="main">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter.<br />
<br />
A fun and affordable activity for parents with children, Winter Skate is also an ideal lunch break getaway and a romantic addition to a dinner date at Patriot Place. <br />
<br />
The 60-by-140-foot, refrigerated ice surface is designed specifically for recreational skating and the professional surface is large enough to accommodate beginners and experts alike.<br />
<br />
On-site skate rentals, concessions and bathrooms are available and parking is free.<br />
<br />
<br />
<b>Concessions</b><br />
Dunkin Donuts will be on-site with coffee, hot chocolate and donuts available for purchase. Additionally, Patriot Place features 16 dining and quick service restaurants including: Bar Louie, Baskin Robbins, Blue Fin Lounge, CBS Scene, Davio’s, Five Guys Burgers, Godiva, Olive Garden, Qdoba, Red Robin, Skipjack’s, Studio 3, Tastings Wine Bar & Bistro, Tavolino Pizza Gourmet, Twenty8 Food & Spirits.<br />
<br />
NOTE: Hours may occasionally vary due to inclement weather, Patriots home games, or pre-scheduled private events – please check back or call 508-203-2100<br><br>
<a name='hours' class='ranchor'></a>
</div>
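SimpleHtmlDom does not implement the official W3C DOM API. It uses CSS selectors, not XPath. CSS selectors cannot select text nodes; they only match element nodes.
You can use PHP's standard, native DOM extension instead: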
$dom = new DOMDocument();
@$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
var_dump(
$xpath->evaluate('string(//*[@id="main"]/text()[normalize-space() != ""][1])')
);
Output:
string(149) "Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter."
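[normalize-space() != ""] is a condition that filters out nodes containing only whitespace.
string() casts the first node of the result list to a string, which avoids the need for a loop.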
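If you want every non-empty text node rather than just the first, the same expression works with a loop. This is a minimal sketch, assuming a shortened stand-in for the question's HTML (the sample markup and the $paragraphs variable are mine, not from the question):

```php
<?php
// Hypothetical sample markup, shortened from the HTML in the question.
$html = <<<'HTML'
<div id="main">
<div>a</div>
<a name="abc"></a>First paragraph.<br />
<br />
Second paragraph.<br />
</div>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every text node that is a direct child of #main and is not whitespace-only.
$paragraphs = [];
foreach ($xpath->query('//*[@id="main"]/text()[normalize-space() != ""]') as $textNode) {
    $paragraphs[] = trim($textNode->nodeValue);
}

print_r($paragraphs);
```

DOMXPath::query() returns a node list, so it suits the loop, while evaluate() with string() is the shorter choice when a single value is enough.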
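DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() try to repair invalid HTML source. For example, they add the html and body elements if they are missing. This can change the HTML, so it is a good idea to save the HTML back to a string to see the new structure: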
$html = <<<'HTML'
<div id="main" class="one" class="two">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by ...
HTML;
$dom = new DOMDocument();
@$dom->loadHtml($html);
echo $dom->saveHtml();
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div id="main" class="one">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by ...</div></body></html>
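If only the repaired markup of one element is of interest, DOMDocument::saveHTML() also accepts a node argument (PHP 5.3.6 and later) and then serializes just that node. A rough sketch, assuming a shortened version of the markup above; the $fragment variable is only for illustration:

```php
<?php
// Shortened, deliberately invalid markup: duplicate class attribute, unclosed div.
$html = '<div id="main" class="one" class="two"><div>a</div>Winter Skate brought to you by ...';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Locate the element and serialize only that node and its descendants.
$xpath = new DOMXPath($dom);
$main = $xpath->query('//*[@id="main"]')->item(0);
$fragment = $main !== null ? $dom->saveHTML($main) : '';

echo $fragment, "\n";
```

This leaves out the doctype and the html and body wrappers that the parser added, which makes the repaired element easier to compare with the original source.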
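In addition, the @ suppresses errors and warnings from the HTML parser. That works in most cases, but a better approach is to use the libxml functions and handle or log the errors: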
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHtml($html);
var_dump(libxml_get_errors());
Output:
array(1) {
[0]=>
object(LibXMLError)#2 (6) {
["level"]=>
int(2)
["code"]=>
int(42)
["column"]=>
int(39)
["message"]=>
string(26) "Attribute class redefined
"
["file"]=>
string(0) ""
["line"]=>
int(1)
}
}
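Instead of dumping the error objects, you could turn them into log lines. A minimal sketch, assuming the same duplicate-attribute markup; the message format and the $messages variable are my own choice, not something libxml prescribes:

```php
<?php
// Hypothetical invalid markup: the class attribute is defined twice.
$html = '<div id="main" class="one" class="two">text</div>';

// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Convert each LibXMLError into a readable one-line message.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf(
        'line %d, column %d: [%d] %s',
        $error->line,
        $error->column,
        $error->code,
        trim($error->message)
    );
}

// Empty the internal error buffer and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($messages as $message) {
    echo $message, "\n";   // replace with error_log() or your own logger
}
```

Calling libxml_clear_errors() matters when several documents are parsed in one script, because otherwise the errors of earlier documents stay in the buffer.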
If it reports an empty source, check whether DOMDocument::loadHTMLFile() can actually fetch the document. Try fetching it with file_get_contents() instead.
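One way to do that check is to fetch the source yourself and only parse it if something came back. This is just a sketch: loadHtmlDocument() is a hypothetical helper I made up for illustration, and the temporary file stands in for the real page or URL:

```php
<?php
// Hypothetical helper, not part of any library: fetch the source first,
// then pass the string to DOMDocument::loadHTML().
function loadHtmlDocument(string $source): ?DOMDocument
{
    $html = file_get_contents($source);
    if ($html === false || trim($html) === '') {
        return null;   // the fetch failed or the source is empty
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Demo: a temporary file stands in for the real page.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, '<div id="main">Hello</div>');

$dom = loadHtmlDocument($path);
$text = '';
if ($dom !== null) {
    $xpath = new DOMXPath($dom);
    $text = $xpath->evaluate('string(//*[@id="main"])');
}
echo $text, "\n";

unlink($path);
```

If the helper returns null for a URL that opens fine in a browser, the problem is in fetching the page (for example, URL access being disabled in the PHP configuration), not in the DOM code.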