XPath Combine Node Texts Based On Condition

I have the following HTML document:

<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should 
    <span class='inline'>be in</span>
    one 
    <span class='inline'>line</span>
    <span class='inline'>all together</span>
  </p>
  <em>
    <span class='inline'>This</span>
    line
    <span class='inline'>too</span>
  </em>
  <a href="#">Line 4</a>
  <div>
    <p>
      <span class='inline'>This fourth</span>
      line
      <span class='inline'>too</span>
    </p>
  </div>
  <script type="text/javascript">//...</script>
  <b></b>
</div>

The text that should be extracted:

Line 1
This text should be in one line all together
This line too
Line 4
This fourth line too

Currently I am using //div//descendant::*[not(self::script)]/text()[string-length() > 0] to extract the text.

This produces the following result:

Line 1
This
text should
be in
one
line
all together
This
line
too
Line 4
This fourth
line
too

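How can I combine the text when the class "inline" is used? Or, if the class "inline" is found on a child node, how can I use the text of the parent node instead?

Please note that this is just an example: the p and em tags may vary.

Perhaps you're looking in the wrong place. It seems to me that you want the text content of every child element of the div element (which is also the root here), except for script elements and elements that are empty: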

/div/*[name() != "script" and string-length(normalize-space())]

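My XPath example also normalizes whitespace. For example, if <b></b> were <b> </b>, or contained only line breaks, it would still count as empty.

Reading DOMNode::$textContent and normalizing its whitespace gives the following result: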

string(6) "Line 1"
string(44) "This text should be in one line all together"
string(13) "This line too"
string(6) "Line 4"
string(20) "This fourth line too"

Here is a quick PHP example that demonstrates this:

<?php
$buffer = <<<XML
<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should
    <span class='inline'>be in</span>
    one
    <span class='inline'>line</span>
    <span class='inline'>all together</span>
  </p>
  <em>
    <span class='inline'>This</span>
    line
    <span class='inline'>too</span>
  </em>
  <a href="#">Line 4</a>
  <div>
    <p>
      <span class='inline'>This fourth</span>
      line
      <span class='inline'>too</span>
    </p>
  </div>
  <script type="text/javascript">//...</script>
  <b></b>
</div>
XML;
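$xml = simplexml_load_string($buffer);
// Select every child of the root div except script elements and elements
// whose text is empty after whitespace normalization.
$result = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');
foreach ($result as $node) {
    $text = dom_import_simplexml($node)->textContent;
    // Collapse runs of whitespace to one space, then trim both ends.
    $text = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);
    var_dump($text);
}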
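
As a possible variation, the whitespace cleanup could be left to XPath itself instead of preg_replace. The following is only a sketch under a few assumptions: it uses DOMDocument and DOMXPath rather than SimpleXML, works on a shortened copy of the sample markup, and the variable names ($html, $lines) are made up for illustration. DOMXPath::evaluate() returns a string when the expression is normalize-space(.), so each selected element yields one already-normalized line.

```php
<?php
// Shortened copy of the sample document, with a whitespace-only <b> added.
$html = <<<XML
<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should
    <span class='inline'>be in</span>
  </p>
  <script type="text/javascript">//...</script>
  <b> </b>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

// Same selection as above: skip script elements and effectively empty elements.
$query = '/div/*[name() != "script" and string-length(normalize-space())]';

$lines = [];
foreach ($xpath->query($query) as $node) {
    // normalize-space(.) collapses inner whitespace and trims both ends.
    $lines[] = $xpath->evaluate('normalize-space(.)', $node);
}

print_r($lines);
```

With this input, $lines should hold "Line 1" and "This text should be in"; the script element and the whitespace-only b element are filtered out by the predicate.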