I have the following HTML document:
<div>
<span>Line 1</span>
<p>
<span class='inline'>This</span>
text should
<span class='inline'>be in</span>
one
<span class='inline'>line</span>
<span class='inline'>all together</span>
</p>
<em>
<span class='inline'>This</span>
line
<span class='inline'>too</span>
</em>
<a href="#">Line 4</a>
<div>
<p>
<span class='inline'>This fourth</span>
line
<span class='inline'>too</span>
</p>
</div>
<script type="text/javascript">//...</script>
<b></b>
</div>
The text that should be extracted:
Line 1
This text should be in one line all together
This line too
Line 4
This fourth line too
I'm currently using the following XPath to extract the text:
//div//descendant::*[not(self::script)]/text()[string-length() > 0]
This produces the following result:
Line 1
This
text should
be in
one
line
all together
This
line
too
Line 4
This fourth
line
too
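How can I combine the text when the class "inline" is used? Or, how can I use the parent node's text if a child node with the class "inline" is found?
Note that this is just an example: the p and em tags may vary.
You may be looking in the wrong place. As I read it, you want the text content of every child element of the div element (which is also the root here), excluding script tags and elements that are empty: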
/div/*[name() != "script" and string-length(normalize-space())]
My XPath example also normalizes whitespace. For example, if <b></b> were <b> </b>, or contained only line breaks, it would be treated as empty as well.
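To illustrate the whitespace point, here is a minimal sketch comparing the predicate with and without normalize-space(). The two-element input is made up for this illustration and is not part of the question's document.

```php
<?php
// Hypothetical input: a whitespace-only <b> and an <i> that carries real text.
$xml = simplexml_load_string("<div><b> \n </b><i> kept </i></div>");

// Without normalization, the whitespace-only <b> has a non-zero string length,
// so the predicate keeps it.
$raw = $xml->xpath('/div/*[string-length()]');

// With normalize-space(), the whitespace collapses to an empty string,
// so <b> is filtered out and only <i> remains.
$normalized = $xml->xpath('/div/*[string-length(normalize-space())]');

var_dump(count($raw));
var_dump(count($normalized));
```

The first query should match both elements and the second only the <i> element, which is why the empty-check in the expression above is written with normalize-space().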
Reading DOMNode::$textContent and normalizing its whitespace gives the following result:
string(6) "Line 1"
string(44) "This text should be in one line all together"
string(13) "This line too"
string(6) "Line 4"
string(20) "This fourth line too"
Here is a quick PHP example that demonstrates this:
<?php
$buffer = <<<XML
<div>
<span>Line 1</span>
<p>
<span class='inline'>This</span>
text should
<span class='inline'>be in</span>
one
<span class='inline'>line</span>
<span class='inline'>all together</span>
</p>
<em>
<span class='inline'>This</span>
line
<span class='inline'>too</span>
</em>
<a href="#">Line 4</a>
<div>
<p>
<span class='inline'>This fourth</span>
line
<span class='inline'>too</span>
</p>
</div>
<script type="text/javascript">//...</script>
<b></b>
</div>
XML;
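// Parse the markup; SimpleXML is enough for running the XPath query.
$xml = simplexml_load_string($buffer);
// Select every non-empty child of the root div except script elements.
$result = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');
foreach ($result as $node) {
    // textContent concatenates all descendant text nodes of the element.
    $text = dom_import_simplexml($node)->textContent;
    // Collapse whitespace runs to one space, then trim a leading/trailing space.
    $text = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);
    var_dump($text);
}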
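As a possible variation, the whitespace cleanup can be left to XPath itself instead of a regular expression. The sketch below assumes the markup is well-formed XML and uses DOMDocument with DOMXPath; the shortened input and the $lines variable are made up for this illustration.

```php
<?php
// Hypothetical, shortened input modelled on the document in the question.
$buffer = <<<XML
<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    line
    <span class='inline'>too</span>
  </p>
  <script type="text/javascript">//...</script>
  <b></b>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($buffer);
$xpath = new DOMXPath($doc);

$lines = [];
$query = '/div/*[name() != "script" and string-length(normalize-space())]';
foreach ($xpath->query($query) as $node) {
    // normalize-space(.) is a string expression, so evaluate() returns a string
    // with leading/trailing whitespace removed and inner runs collapsed.
    $lines[] = $xpath->evaluate('normalize-space(.)', $node);
}

print_r($lines);
```

Note that XPath's normalize-space() only treats space, tab, carriage return and line feed as whitespace, so the preg_replace approach may still be preferable if other whitespace characters can occur in the input.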