I have the following data:
<description><div dir="ltr" style="text-align: left;" trbidi="on"><div class="MsoNormal"><i><span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;">By Marina Correa</span></i></div><div class="MsoNormal"><i><span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;">Photography: Courtesy the architect</span><span style="font-family: Georgia, serif; font-size: 9pt;"><o:p></o:p></span></i></div><div class="MsoNormal"><br></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Prost Beer House in Bengaluru, India,by AH design." border="0" src="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" title=""></a></td></tr><tr><td class="tr-caption" style="text-align: right;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;">.</span></td></tr></tbody></table><div class="MsoNormal"><br></div><div class="MsoNormal"></div><div style="text-align: justify;"><span style="font-family: Georgia, &#39;Times New Roman&#39;, serif;">Evolving from carnage of shipwrecked metal, the interiors of Prost Beer House in Bengaluru, India, make it an attention-grabbing drinking hole…</span></div></div><a href="http://inditerrain.indiaartndesign.com/2013/11/beerhouse-rock.html#more">Read more »</a><img src="http://feeds.feedburner.com/~r/IndiaArtNDesign/~4/jGC75D3KB0o" height="1" width="1"/></description>
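However, I do not have "<" or ">" in my data; I have the escaped entities "&lt;" and "&gt;" instead.
I need a regular expression that finds the data that is not inside the HTML tags, i.e. the actual text rather than tag names, class names and so on.
For parsing HTML with "<" and ">", I found: (?<=^|>)[^><]+?(?=<|$)
I don't know how to adapt it to my needs, though. Any help is much appreciated.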
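This looks like an HTML fragment inside XML, more specifically the description element of an RSS feed. If that is the case, you should parse the RSS with DOM, which decodes the entities along the way: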
$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);
Iterate over the items:
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
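The title of an item is just a text value and can be used directly:
echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
The description in your example contains an HTML fragment in a text node with escaped entities; I have seen other examples that use CDATA. For the outer XML document this does not matter. It is text, and if you read it as text, the entities are converted back to their respective characters.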
$description = $xpath->evaluate('string(description)', $rssItem);
So now $description contains < and > again. It can be loaded into a DOM with loadHtml() or cleaned up with strip_tags().
echo 'Description: ', strip_tags($description), "\n\n";
Complete example (RSS adapted from Wikipedia):
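The loadHtml() route is only mentioned, so here is a minimal sketch of it. The sample string, the wrapping <div> and the variable names are assumptions made for illustration, not part of the feed.

```php
<?php
// Hypothetical sample standing in for the decoded $description from the feed.
$description = 'Here is some <b>text</b> containing an interesting '
    . '<i>description</i> with <span class="important">html</span>.';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$html = new DOMDocument();
// The wrapping <div> is an assumption so the fragment has a single container.
// This sample is plain ASCII; other input may need a charset declaration.
$html->loadHTML('<div>' . $description . '</div>');
libxml_clear_errors();

// textContent concatenates all text nodes and leaves the markup behind.
$text = trim($html->documentElement->textContent);
echo 'Description: ', $text, "\n";
```

Compared with strip_tags(), this keeps a DOM around, which may help if you later want specific elements (links, images) rather than only the text.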
$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<item>
<title>Example entry</title>
<description>Here is some &lt;b&gt;text&lt;/b&gt; containing an interesting &lt;i&gt;description&lt;/i&gt; with &lt;span class="important"&gt;html&lt;/span&gt;.</description>
</item>
</channel>
</rss>
RSS;
$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
    echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
    $description = $xpath->evaluate('string(description)', $rssItem);
    echo 'Description: ', strip_tags($description), "\n\n";
}
Output:
Title: Example entry
Description: Here is some text containing an interesting description with html.
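The CDATA case mentioned earlier can be checked the same way. The sketch below uses a made-up two-item feed (both items are hypothetical) where one description is entity-escaped and the other is wrapped in CDATA, and compares what string(description) returns for each.

```php
<?php
// Hypothetical feed: the same fragment once escaped, once inside CDATA.
$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<item>
<title>Escaped entry</title>
<description>Some &lt;b&gt;bold&lt;/b&gt; text.</description>
</item>
<item>
<title>CDATA entry</title>
<description><![CDATA[Some <b>bold</b> text.]]></description>
</item>
</channel>
</rss>
RSS;

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);

// Read each description as a plain string, exactly as in the example above.
$descriptions = [];
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
    $descriptions[] = $xpath->evaluate('string(description)', $rssItem);
}

// Both variants should come back as the same decoded HTML fragment.
var_dump($descriptions[0] === $descriptions[1]);
echo strip_tags($descriptions[1]), "\n";
```

Either way the XML parser hands back the fragment with real angle brackets, so the later strip_tags() or loadHtml() step does not need to know how the feed encoded it.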
For decoding you can use htmlspecialchars_decode.
For details, see http://php.net/manual/en/function.htmlspecialchars-decode.php
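As a small sketch of that suggestion, assuming the escaped fragment is already available as a string (the sample value and variable names below are made up for illustration):

```php
<?php
// Hypothetical escaped fragment, as it would appear in the raw feed text.
$escaped = '&lt;div&gt;&lt;i&gt;By Marina Correa&lt;/i&gt;&lt;/div&gt;';

// Turn &lt; and &gt; (and &amp;, &quot;) back into real characters.
$decoded = htmlspecialchars_decode($escaped);

// With real tags restored, strip_tags() can remove them.
$text = strip_tags($decoded);
echo $text, "\n";
```

Note that htmlspecialchars_decode() only handles the few special characters; other named entities would need html_entity_decode() instead.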
To quickly get the raw text (without the tags), you can do the following replacement:
$result = preg_replace('~<.*?>~s', ' ', $source);
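Because every tag is replaced by a space, the result tends to contain runs of spaces. A possible follow-up, shown here only as a sketch on a made-up $source string, is to collapse the whitespace afterwards:

```php
<?php
// Hypothetical, already-decoded input with real angle brackets.
$source = '<div><i>By Marina Correa</i></div><a href="#">Read more</a>';

// Replace every tag with a single space (the s flag lets tags span lines).
$result = preg_replace('~<.*?>~s', ' ', $source);

// Extra step, not part of the replacement above: collapse whitespace and trim.
$clean = trim(preg_replace('~\s+~', ' ', $result));
echo $clean, "\n";
```

If the input still contains &lt; and &gt; entities, it would have to be decoded first, since the pattern looks for literal angle brackets.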
This gives you all the text you are looking for as an array:
preg_match_all("/(?<=>)(?!<).*?(?=<)/", $source, $result);
See a live demo of this regex against the sample input.
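To make the shape of the result concrete, here is a sketch that runs the pattern on a short made-up string; $source is an assumption and uses literal angle brackets, which is what the pattern as written expects.

```php
<?php
// Hypothetical input with literal < and > characters.
$source = '<div><i>By Marina Correa</i></div><a href="#">Read more</a>';

// Match text that follows a ">" and runs up to the next "<",
// skipping positions where a new tag starts immediately.
preg_match_all("/(?<=>)(?!<).*?(?=<)/", $source, $result);

// $result[0] holds the full matches, one entry per text run.
print_r($result[0]);
```

Since the pattern has no s modifier, "." does not match newlines, so a text run that spans several lines would not be captured in full.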