Looping through all children of an element with DOMDocument and extracting the text content

This is the structure of the XML file (an odt file) I am trying to parse:

<office:body>
    <office:text>
        <text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>
            <text:p text:style-name="Standard">Lorem ipsum. </text:p>
            <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 2</text:h>
                <text:p text:style-name="Standard"><text:span text:style-name="T5">10</text:span><text:span text:style-name="T6">:</text:span><text:s/>Text (100%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T9">9.1:</text:span><text:s/>Text (91%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                    <text:p text:style-name="Explanation">More furter informations.</text:p>
    </office:text>
</office:body>

With XMLReader, I am doing it like this:

$html = '';
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'text:h') {
        if ($reader->getAttribute('text:outline-level') == "2") $html .= '<h2>'.$reader->expand()->textContent.'</h2>';
    }
    elseif ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'text:p') {
        if ($reader->getAttribute('text:style-name') == "Standard") {
            $html .= '<p>'.$reader->readInnerXml().'</p>';
        }
        else {
            // Doing something different
        }
    }
}
echo $html;

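Now I would like to do the same with DOMDocument, but I need some help with the syntax. How do I loop through all children of office:text? While iterating over the nodes, I would check via if/else what to do (text:h vs. text:p).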

I also need to replace every text:s (if a text:p contains such an element) with a whitespace character...

$reader = new DOMDocument();
$reader->preserveWhiteSpace  = false;
$reader->load('zip://content.odt#content.xml');
$body = $reader->getElementsByTagName( 'office:text' )->item( 0 );
foreach( $body->childNodes as $node ) echo $node->nodeName . PHP_EOL;

Or would it be smarter to loop through all text elements? If so, the question remains how to do that.

$elements = $reader->getElementsByTagName('text');
foreach($elements as $node){
    foreach($node->childNodes as $child) {
        echo $child->nodeName.': ';
        echo $child->nodeValue.'<br>';
        // check for type...
    }
}

One of the easiest ways to do this with DOMDocument is with the help of DOMXPath.

Taking your question literally:

How do I loop through all children of office:text?

This can be expressed as an XPath expression:

//office:text/child::node()

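However, your wording here is slightly off. You want not only the children, but also the children's children and so on. Those are all the descendants: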

//office:text/descendant::node()

Or with the abbreviated syntax:

//office:text//node()

Compare: XPath to Get All ChildNodes and not the Parent Node
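To make the difference between children and descendants concrete, here is a minimal sketch that runs both expressions against a small inline document. The inline XML is an illustrative stand-in for content.xml, and the constants NS_OFFICE and NS_TEXT are names chosen for this example; their values are the namespace URIs from the OpenDocument specification, which a real content.xml declares on its root element.

```php
<?php
// Namespace URIs as defined by the OpenDocument specification.
const NS_OFFICE = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
const NS_TEXT   = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

// Illustrative stand-in for content.xml, written without whitespace
// between elements so that no whitespace-only text nodes appear.
$xml = '<office:body xmlns:office="' . NS_OFFICE . '" xmlns:text="' . NS_TEXT . '">'
     . '<office:text>'
     . '<text:h text:outline-level="2">Chapter 1</text:h>'
     . '<text:p><text:span>10</text:span>:<text:s/>Text</text:p>'
     . '</office:text>'
     . '</office:body>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('office', NS_OFFICE);

// Direct children only: text:h and text:p.
$children = $xpath->query('//office:text/child::node()');

// All descendants: elements and text nodes at every depth.
$descendants = $xpath->query('//office:text//node()');

echo $children->length, PHP_EOL;    // 2
echo $descendants->length, PHP_EOL; // 8
```

If you only want to decide between text:h and text:p at the top level, the child axis is probably enough; the descendant axis also hands you every span, text:s and text node inside them.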

To loop in PHP, you need to register the namespace for the office prefix and then loop over the XPath result with foreach:

$xpath = new DOMXPath($reader);
$xpath->registerNamespace('office', $xml_namespace_uri_of_office_namespace);

$descendants = $xpath->query('//office:text//node()');
foreach ($descendants as $node) {
    // $node is a DOMNode, e.g. a DOMElement, DOMText, ...
}

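XPath in general does not guarantee it, but PHP's libxml-based library does return the nodes in document order. That is the order you are looking for.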

Compare: XPath query result order
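Putting it together, the following is a rough sketch of how the XMLReader loop from the question might look with DOMDocument and DOMXPath, including the text:s replacement. It is one possible approach, not the only one. The function name odtBodyToHtml and the decision to emit non-Standard style names as a class attribute are assumptions made for this example, and the inline XML again stands in for content.xml. Because it uses textContent, inline markup such as text:span is flattened to plain text.

```php
<?php
// Namespace URIs as defined by the OpenDocument specification.
const NS_OFFICE = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
const NS_TEXT   = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

/**
 * Hypothetical helper: turns the direct children of office:text into HTML.
 */
function odtBodyToHtml(DOMDocument $doc): string
{
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('office', NS_OFFICE);
    $xpath->registerNamespace('text', NS_TEXT);

    // Replace every text:s inside a text:p with a text node holding one space.
    // iterator_to_array() takes a snapshot before the tree is modified.
    foreach (iterator_to_array($xpath->query('//text:p//text:s')) as $space) {
        $space->parentNode->replaceChild($doc->createTextNode(' '), $space);
    }

    $html = '';
    // "/*" selects element children only, in document order.
    foreach ($xpath->query('//office:text/*') as $node) {
        $text = htmlspecialchars($node->textContent);
        if ($node->localName === 'h') {
            // Clamp to h1..h6 in case the attribute is missing or out of range.
            $level = max(1, min(6, (int) $node->getAttributeNS(NS_TEXT, 'outline-level')));
            $html .= "<h$level>$text</h$level>";
        } elseif ($node->localName === 'p') {
            $style = $node->getAttributeNS(NS_TEXT, 'style-name');
            // Assumption: any style other than "Standard" becomes a CSS class.
            $class = $style === 'Standard' ? '' : ' class="' . htmlspecialchars($style) . '"';
            $html .= "<p$class>$text</p>";
        }
    }
    return $html;
}

// Usage with a small stand-in document.
$xml = '<office:body xmlns:office="' . NS_OFFICE . '" xmlns:text="' . NS_TEXT . '">'
     . '<office:text>'
     . '<text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>'
     . '<text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>'
     . '<text:p text:style-name="Explanation">Further informations.</text:p>'
     . '</office:text>'
     . '</office:body>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$html = odtBodyToHtml($doc);
echo $html, PHP_EOL;
// <h2>Chapter 1</h2><p>9.7: Text (97%)</p><p class="Explanation">Further informations.</p>
```

For the real file you would load it as in the question, with $doc->load('zip://content.odt#content.xml'), and pass that document to the function instead.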