I have the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
<li>Bulleted style text
<ul>
<li>
<paragraph>1.Sub Bulleted style text</paragraph>
</li>
</ul>
</li>
</ul>
<ul>
<li>Bulleted style text <strong>bold</strong>
<ul>
<li>
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
</li>
</ul>
</li>
</ul>
I need to remove the numbers in front of the sub-bulleted text, i.e. "1." and "2." in the given example.
This is the code I have so far:
<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';
    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';
    protected $dom;

    public function processListsText( $loop = null ){
        $this->dom = new DomDocument('1.0', 'UTF-8');
        $this->dom->loadXML($this->xml_string);
        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }
        foreach($li_set as $li){
            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }
            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') || ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }
                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);
                //set node to empty
                $child->nodeValue = '';
                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));
                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);
            }
        }
    }
}
$importer = new MyDocumentImporter();
$importer->processListsText();
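The problem I can see is that $child->textContent returns the plain text content of the node and strips out the nested child tags. So:
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
becomes
<paragraph>Sub Bulleted bold</paragraph>
and the <strong> tag is gone.
I'm a bit stumped. Can anyone see a way to strip the unwanted characters while keeping the "inner child" <strong> tag?
The tag will not always be <strong>; it can also be a hyperlink <a href="#"> or <emphasize>.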
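To make the problem concrete, here is a minimal sketch (the one-line sample is taken from the document above; the variable names are only for illustration) that compares what textContent returns with the child nodes the paragraph actually has:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>');
$para = $dom->documentElement;

// textContent concatenates all descendant text and drops the markup.
echo $para->textContent, "\n";

// The element itself has two child nodes: a text node and a <strong> element.
foreach ($para->childNodes as $node) {
    echo get_class($node), ': ', $dom->saveXML($node), "\n";
}
```

The number I want to remove lives only in the first of those child nodes, while my code rebuilds the paragraph from the flattened string.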
Assuming your XML parses, you can use XPath to make the query easier:
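$xp = new DOMXPath($this->dom);
foreach ($xp->query('//li/paragraph') as $para) {
    // Only touch the leading text node; skip paragraphs that start with an element.
    if ($para->firstChild instanceof DOMText) {
        $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
    }
}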
This does the text replacement on the first text node only, rather than on the whole content of the tag.
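As a self-contained sketch of this approach (the function name stripLeadingNumbers and the sample string are assumptions for illustration, not part of the original class), the following can be run on its own to check that inline tags survive:

```php
<?php
// Hypothetical helper: strips a leading "N." from the first text node of each
// <paragraph> inside an <li>, leaving inline child elements untouched.
function stripLeadingNumbers(string $xml): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadXML($xml);

    $xp = new DOMXPath($dom);
    foreach ($xp->query('//li/paragraph') as $para) {
        $first = $para->firstChild;
        // Skip empty paragraphs and paragraphs that start with an element.
        if (!($first instanceof DOMText)) {
            continue;
        }
        $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
    }

    // Serialize only the root element, without the XML declaration.
    return $dom->saveXML($dom->documentElement);
}

$xml = '<some_tag><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></some_tag>';
echo stripLeadingNumbers($xml), "\n";
```

Because only the text node's value changes, the <strong> sibling is never removed from the tree in the first place.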
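You are resetting the element's entire content, but all you want is to change the first text node (remember that text nodes are nodes too). You probably want to query the XPath //li/paragraph/text()[position()=1] and process/replace that DOMText node rather than the whole paragraph content.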
$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach ($p->query('//li/paragraph/text()[position()=1]') as $text) {
    // Replace only the first text node; inline elements such as <strong> are untouched.
    $text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text);
}
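A minimal runnable sketch of the same idea outside the class, assuming a single-root input string. The function name removeBulletNumbers, the sample markup and the local $regex (standing in for AWKWARD_BULLET_REGEX) are illustrative, and it uses createTextNode on the document instead of new DOMText, which should behave the same here:

```php
<?php
// Hypothetical stand-alone version; $regex stands in for the class constant.
function removeBulletNumbers(string $xml): string
{
    $regex = '/^\s?\d+\./';

    $d = new DOMDocument();
    $d->loadXML($xml);

    $p = new DOMXPath($d);
    // text()[1] selects the first text-node child of each paragraph.
    foreach ($p->query('//li/paragraph/text()[1]') as $text) {
        $clean = preg_replace($regex, '', $text->nodeValue);
        // Swap only this text node; sibling elements such as <a> stay in place.
        $text->parentNode->replaceChild($d->createTextNode($clean), $text);
    }

    // Serialize only the root element, without the XML declaration.
    return $d->saveXML($d->documentElement);
}

echo removeBulletNumbers(
    '<some_tag><ul><li><paragraph>1.See <a href="#">link</a> and <emphasize>this</emphasize></paragraph></li></ul></some_tag>'
), "\n";
```

Since the query works on text nodes rather than on a specific tag name, it does not matter whether the inline markup is <strong>, <a href="#"> or <emphasize>.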