Regular expression to remove tags and contents from RSS file

Here is a sample of the structure of my RSS file:

<item>
 <title>My Title</title>
 <link>http://www.link.com</link>
 <description>The description</description>
 <author>Blah Blah</author>
 <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
 <media:content url="myimage.jpg">
  <media:title>sdafsd</media:title>
 </media:content>
 <position>1</position>
</item>

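How can I use a PHP regular expression to completely remove the author tag and its contents, the entire media:content tag and its contents, and the position tag and its contents from the file?

Thanks!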

Don't use regex to parse HTML/XML; there are very good parsers for that:

<?php
$xml = <<<XML
<item>
    <title>My Title</title>
    <link>http://www.link.com</link>
    <description>The description</description>
    <author>Blah Blah</author>
    <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
    <media:content url="myimage.jpg">
        <media:title>sdafsd</media:title>
    </media:content>
    <position>1</position>
</item>
XML;
$dom = new DOMDocument();
//DOMDocument throws warnings when the XML is invalid, we don't care.
//Though in this case, the media: namespace would be ignored because it's not defined.
@$dom->loadXML($xml);
$document = $dom->documentElement;
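//Find the elements you want to remove...
$author = $document->getElementsByTagName("author")->item(0);
$content = $document->getElementsByTagName("content")->item(0);
$position = $document->getElementsByTagName("position")->item(0);
//...and remove each one that was actually found.
foreach (array($author, $content, $position) as $node) {
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }
}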
//Output the resulting XML.
echo $dom->saveXML();
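
When the feed declares the media namespace and contains several items, an XPath query can select every unwanted element in one pass. The following is only a sketch under those assumptions: the namespace URI is borrowed from the second example in this thread, and the sample feed is trimmed for brevity.

```php
<?php
// Sketch: remove unwanted elements from every <item> using DOMXPath.
// Assumption: the feed declares xmlns:media with the URI shown below.
$xml = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>My Title</title>
      <author>Blah Blah</author>
      <media:content url="myimage.jpg">
        <media:title>sdafsd</media:title>
      </media:content>
      <position>1</position>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Register the prefix so it can be used inside the XPath expression.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('media', 'http://search.yahoo.com/mrss/');

// Select all three kinds of element under any <item>, then detach each one.
$query = '//item/author | //item/media:content | //item/position';
foreach ($xpath->query($query) as $node) {
    $node->parentNode->removeChild($node);
}

$result = $dom->saveXML();
echo $result;
```

Detaching through `$node->parentNode` avoids having to know which element is the direct parent of each match.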

My previous answer was deleted; I should have added it as a comment. Here is another way to do what you want with DOMDocument:

<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>bla</title>
    <link>bla</link>
    <description>A description</description>
    <language>en-us</language>
    <item xmlns:media="http://search.yahoo.com/mrss/">
     <title>My Title</title>
     <link>http://www.link.com</link>
     <description>The description</description>
     <author>Blah Blah</author>
     <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
     <media:content url="myimage.jpg">
      <media:title>sdafsd</media:title>
     </media:content>
     <position>1</position>
    </item>
  </channel>
</rss>
XML;
$doc = new DOMDocument();
$doc->loadXml( $xml );
foreach( $doc->getElementsByTagName( 'item' ) as $item ) {
    $item->removeChild( $item->getElementsByTagName( 'author' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'position' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'content' )->item( 0 ) );
}
var_dump( $doc->saveXml( ) );

// $file_name is the path to your RSS file.
$content = file_get_contents($file_name);

// Each pattern matches the opening tag (with optional attributes),
// everything up to the first matching closing tag, and the closing tag.
$xmlElem = 'author';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

$xmlElem = 'media:content';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

$xmlElem = 'position';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

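Disclaimer: for flexibility and reliability, you should always use a proper parser (such as DOMDocument) to manipulate XML/HTML. That said, if you are certain that your markup is well-formed, is not subject to structural change, and will not contain nested duplicate tags, a regular expression can solve this kind of problem. But you should only use one if you know what you are doing.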
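
To illustrate the nested-tag caveat, here is a small sketch using hypothetical markup (a `<div>` inside another `<div>`, which is not part of the question's feed). The lazy match stops at the first closing tag, so the outer element is only partially removed.

```php
<?php
// Hypothetical markup: the same tag nested inside itself.
$markup = '<div>outer <div>inner</div> tail</div><p>keep</p>';

// The lazy quantifier ends the match at the FIRST </div>, not the matching one.
$result = preg_replace('#<div>(.*?)</div>#s', '', $markup);

// The leftover text and the stray closing tag show the markup is now broken.
echo $result; // prints " tail</div><p>keep</p>"
```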


You need to use preg_replace() to replace each match with an empty string (""). Here is how to do that for the <author>...</author> block:

$markup = preg_replace('#<author>(.*?)</author>#is', '', $markup);

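Basically, this matches the opening <author> tag, anything (or nothing) between the opening and closing tags, and the closing </author> tag. The ? after .* makes the match lazy so that it stops at the first closing tag, the s modifier lets . match newlines, and the i modifier makes the match case-insensitive.

Other tags can be removed in a similar way.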
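
As a rough sketch of what "in a similar way" could look like, the helper below loops over a list of tag names. `stripElements` is a hypothetical name, not something from the answer above, and the pattern assumes that each element has an explicit closing tag and is never nested inside itself.

```php
<?php
// Hypothetical helper: strips each named element, with or without
// attributes, together with its content and any trailing whitespace.
function stripElements(string $markup, array $tagNames): string
{
    foreach ($tagNames as $tag) {
        // Escape the tag name so characters such as ':' are taken literally.
        $name = preg_quote($tag, '#');
        $pattern = '#<' . $name . '(?:\s[^>]*)?>.*?</' . $name . '>\s*#is';
        $markup = preg_replace($pattern, '', $markup);
    }
    return $markup;
}

$item = '<item><title>My Title</title><author>Blah Blah</author>'
      . '<media:content url="myimage.jpg"><media:title>sdafsd</media:title></media:content>'
      . '<position>1</position></item>';

$result = stripElements($item, ['author', 'media:content', 'position']);
echo $result; // prints "<item><title>My Title</title></item>"
```

A self-closing element such as `<media:content url="..."/>` would not be handled correctly by this pattern, which is one more reason to prefer a parser when the feed's structure is not under your control.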