Parsing balanced nested wiki templates and extracting a single-line parameter's content with a regexp

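I know that parsing nested strings or HTML is better done by a real parser, but in my case I have simple templates and wanted to extract the content of the wiki parameter "title" from them. It took me a while to get there, but thanks to Lars Olav Torvik's regex tool (http://regex.larsolavtorvik.com/) and this user forum I got it working. Maybe someone finds it useful. (We all want to contribute, eh, don't we? ;-) The commented code below does the job. I had to use lookaround assertions so that two templates do not get mixed up when one of them has no title.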

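I am still unsure about the two questions in the regex comments, see (?# Question: …), in particular whether I have understood the recursive part in (?R). Does it take what to check from the outermost level, i.e. the second regexp line with \{\{ and the last regexp line with \}\}? Is that correct? And what is the difference between ++ and + before the (?R), given that both work equally?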

  1. Original wiki templates on the page (simplest case):

    $wikiTemplate = "
    {{Templ1
    | title = (1. template) title
    }}
    {{Templ2
    | any parameter = something {{template}}
    }}
    {{Templ1
    | title = (3. template) title
    }}
    ";
    
  2. Replacement:

    $wikiTemplate = preg_replace(
      array(
      // tag all templates with START … END and add a TITLE-placeholder before
      // and take care of balanced {{ …  }} recursiveness 
        "@(?s)   (?# switch to dotall match, i.e. also linebreaks )
          \{\{ (?# find two {{ )
          (?: (?# group 1 as a non-backreferenced match  )
            (?:  (?# group 2 as a non-backreferenced match  )
              (?! (?# in group 1 anything but not {{ or }} )
                \{\{ 
                |   (?# or )
                \}\}
              )
              .
            )++  (?# Question: what is the difference between ++ and + here? )
            |    (?# or )
            (?R) (?# is it recursive of what is defined in the outermost,
                  i.e. 2nd regexp line with \{\{ and last line with \}\}
                  Question: is that here understood correctly? ) 
          )
          * (?# zero or many times of the inner regexp definitions )
          \}\} (?# find two }} )
        @x",// x-extended → ignore white space in the pattern
      // replace TITLE by single line content of title parameter 
        "@
          (?<=TITLE) (?# TITLE must precede the following linebreak but is not
                      backreferenced within \\0, i.e. the whole returned match)
          ([\n\r]+)  (?# linebreak in group 1 may also be described as . because of
                      s-modifier dotall)
          (?:        (?# start non-backreferenced match )
            .        (?# any character but not followed by START)
            (?!START)
          )+      (?# multiple times)
          (?:     (?# start non-backreferenced match )
            \|\s*title\s*=\s* (?#find the parameter '| title = ')
          )
          ([^\r\n]+)  (?#get title now to \\2 but exclude the line break. 
                       Note it is buggy when there is no line break )
          (?:     (?# start non-backreferenced match )
            .     (?# any character but not followed by END)
            (?!END)
          )
          +       (?# multiple times)
          .       (?# any single character, i.e. the last one, because everything
                   before matches only characters not followed by END)
          (?:END) (?#a not backreferenced END)
        @msx", // m-multiline, s-dotall match also linebreaks,
               // x-extended → ignore white space in the pattern
      ), 
      array(
        "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
      # replace the TITLE by TITLEtitle contentTITLE…
        "\\2TITLE\\0",
      ),
      $wikiTemplate
    );
    print_r($wikiTemplate);
    
  3. The output then has the title marked with TITLE above each template, but only if there is a title:

    TITLE(1. template) titleTITLE
    START{{Templ1
     | title = (1. template) title
    }}END
    TITLE
    START{{Templ2
     | any parameter = something {{template}}
    }}END
    TITLE(3. template) titleTITLE
    START{{Templ1
     | title = (3. template) title
    }}END
    

Any remarks on my understanding of the regexp, or suggestions for improvement? Thank you, Andreas.

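++ is a possessive quantifier. If you append a + to any repetition quantifier (+, *, {...}), it becomes possessive. This means that once the regex engine has left the repetition for the first time, it will not backtrack into it and try fewer repetitions. So it basically turns the repetition into an atomic group. Sometimes this is just an optimization, and sometimes it actually changes what matches. You can do some very good reading on this here.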
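
To make the difference visible, here is a minimal sketch that is not taken from your code: the patterns and the input '123' are made up purely for illustration. It compares a greedy +, a possessive ++ and the equivalent atomic group on a string where backtracking is needed for a match.

```php
<?php
// Greedy: \d+ first grabs "123", then backtracks to "12" so the final \d can match "3".
$greedy = preg_match('/^\d+\d$/', '123');

// Possessive: \d++ grabs "123" and never gives a digit back, so the final \d cannot match.
$possessive = preg_match('/^\d++\d$/', '123');

// Atomic group: (?>\d+) is another spelling of the same no-backtracking behaviour.
$atomic = preg_match('/^(?>\d+)\d$/', '123');

var_dump($greedy, $possessive, $atomic); // int(1), int(0), int(0)
```

In your pattern the character after the repetition can never be one the repetition itself would accept (it stops exactly at {{ or }}), so as far as I can tell + and ++ should match the same text there, and ++ only saves useless backtracking when a template is unbalanced.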

As for your second question, (?R) will simply try to match the complete pattern again. There is a good article on this in PHP's documentation of PCRE.
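
As a rough sketch of what that means in practice, the snippet below uses a stripped-down version of your first pattern. The sample text and the variable names are my own assumptions, chosen only to show that (?R) re-enters the whole pattern at a nested {{ and that an unbalanced opener is not matched.

```php
<?php
// Hypothetical sample text: one nested template and one unbalanced opener.
$text = 'a {{outer | x = {{inner}} }} b {{broken';

$pattern = '@
    \{\{                              # opening braces
    (?:
        (?: (?! \{\{ | \}\} ) . )++   # a run of characters that starts neither braces pair
      |
        (?R)                          # recurse into the whole pattern: a nested template
    )*
    \}\}                              # closing braces
@sx';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // one element: {{outer | x = {{inner}} }}
```

Only the outermost template is reported, because the nested one is consumed by the recursion and preg_match_all continues after the end of the outer match.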

For your other questions, it is probably best to ask on Code Review.