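I know that parsing nested strings or HTML is better left to a real parser, but in my case I have simple templates and want to extract the content of the wiki parameter "title" from them. It took me a while to get there, but thanks to Lars Olav Torvik's regex tool (http://regex.larsolavtorvik.com/) and this user forum I got it working. Maybe someone finds it useful. (We all want to contribute, eh, don't we? ;-) The commented code below does the trick. I had to use lookaround assertions so that two templates do not get mixed up when one of them has no title.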
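I am still unsure about two questions noted in the regex comments (see (?# Question: …)). First, have I understood the recursive part (?R) correctly? Does it take what to check from the outermost level defined, i.e. from the second regex line \{\{ to the last regex line \}\}? Second, what is the difference between ++ and + before the (?R)? Both seem to work equally well.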
- Original wiki template on the page (simplest case):

    $wikiTemplate = "
    {{Templ1
     | title = (1. template) title
    }}
    {{Templ2
     | any parameter = something {{template}}
    }}
    {{Templ1
     | title = (3. template) title
    }}
    ";
- Replacement:

    $wikiTemplate = preg_replace(
      array(
        // tag all templates with START … END and add a TITLE-placeholder before
        // and take care of balanced {{ … }} recursiveness
        "@(?s) (?# switch to dotall match, i.e. also linebreaks )
          \{\{ (?# find two {{ )
          (?: (?# group 1 as a non-backreferenced match )
            (?: (?# group 2 as a non-backreferenced match )
              (?! (?# in group 1 anything but not {{ or }} )
                \{\{
                | (?# or )
                \}\}
              )
              .
            )++ (?# Question: what is the difference between ++ and + here? )
            | (?# or )
            (?R) (?# is it recursive of what is defined in the outermost, i.e. 2nd regexp line with \{\{ and last line with \}\} Question: is that here understood correctly? )
          )
          * (?# zero or many times of the inner regexp definitions )
          \}\} (?# find two }} )
        @x", // x-extended → ignore white space in the pattern
        // replace TITLE by single line content of title parameter
        "@
          (?<=TITLE) (?# TITLE must precede the following linebreak but is not backreferenced within \\0, i.e. the whole returned match)
          ([\n\r]+) (?#linebr in 1 may also described as . because of s-modifier dotall)
          (?: (?# start non-backreferenced match )
            . (?# any character but not followed by START)
            (?!START)
          )+ (?# multiple times)
          (?: (?# start non-backreferenced match )
            \|\s*title\s*=\s* (?#find the parameter '| title = ')
          )
          ([^\r\n]+) (?#get title now to \\2 but exclude the line break. Note it is buggy when there is no line break )
          (?: (?# start non-backreferenced match )
            . (?# any character but not followed by END)
            (?!END)
          ) + (?# multiple times)
          . (?# any single character, e.g. the last because as all stuff before captures anything not followed by END)
          (?:END) (?#a not backreferenced END)
        @msx", // m-multiline, s-dotall match also linebreaks,
        // x-extended → ignore white space in the pattern
      ),
      array(
        "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
        # replace the TITLE to TITLEtitle contentTITLE…
        "\\2TITLE\\0",
      ),
      $wikiTemplate
    );
    print_r($wikiTemplate);
- The output then has the title marked with TITLE above each template, but only if the template has a title:

    TITLE(1. template) titleTITLE
    START{{Templ1
     | title = (1. template) title
    }}END
    TITLE
    START{{Templ2
     | any parameter = something {{template}}
    }}END
    TITLE(3. template) titleTITLE
    START{{Templ1
     | title = (3. template) title
    }}END
Any comments on my understanding of the regexp, or suggestions for improvement? Thank you, Andreas.
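++ is a possessive quantifier. If you append a + to any repetition quantifier (+, *, {...}), it becomes possessive. This means that once the regex engine has left the repetition for the first time, it will not backtrack into it to try fewer repetitions. So a possessive quantifier basically turns the repetition into an atomic group. Sometimes this is merely an optimization, and sometimes it actually changes what matches. You can do some very good reading on this here.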
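To make the difference concrete, here is a minimal sketch using a toy pattern of my own (not the pattern from the question). With the greedy `+` the engine gives one digit back so that the trailing `\d` can match; with the possessive `++` it refuses to, and the match fails.

```php
<?php
// Greedy: \d+ first consumes "123", then backtracks by one digit
// so that the final \d has something left to match.
$greedy = preg_match('/^\d+\d$/', '123');

// Possessive: \d++ consumes "123" and never gives a digit back,
// so the final \d finds nothing and the whole match fails.
$possessive = preg_match('/^\d++\d$/', '123');

// The atomic group (?>...) behaves the same way as the possessive form.
$atomic = preg_match('/^(?>\d+)\d$/', '123');

var_dump($greedy, $possessive, $atomic); // int(1), int(0), int(0)
```

In the pattern from the question, the repeated unit can never consume the start of a `{{` or `}}`, so giving characters back would not open up any new match. That is probably why `+` and `++` behave identically there, and `++` acts purely as an optimization.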
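As for your second question: yes, (?R) will simply try to match the complete pattern again. There is a good article on this in PHP's documentation for PCRE.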
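As a small illustration of that, the sketch below uses the first pattern from the question in compact form (comments and extended-mode whitespace removed) on a sample string of my own. Each time the engine reaches `(?R)` it restarts the whole pattern at the current position, which is how the nested `{{inner}}` gets consumed as part of the outer template.

```php
<?php
// \{\{ ... \}\} where the body is any run of characters that does not
// start a {{ or }}, or a complete nested {{ ... }} matched via (?R).
$pattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';

// Hypothetical sample input, not taken from the question.
$text = 'a {{outer {{inner}} tail}} b {{second}}';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
// $matches[0] holds two entries:
//   {{outer {{inner}} tail}}
//   {{second}}
```

The nested template is not reported as a separate match because it has already been consumed by the outer one.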
For your other questions, it would be better to ask on Code Review.