PHP regex crashing apache

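I have a regex that does the matching for a template system, and unfortunately it seems to crash Apache (it's running on Windows) on some fairly trivial lookups. I've researched the issue and found a few suggestions about increasing the stack size and so on, none of which seem to work. I don't really like dealing with this kind of problem by raising limits anyway, since that generally just pushes the bug into the future.

Anyway, any ideas on how to change the regex to make it less likely to fall foul?

The idea is to capture the innermost block (in this case {block:test}This should be caught first!{/block:test}). I then str_replace out the start/end tags and re-run the whole thing through the regex until there are no blocks left.

Regex: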

~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)})(?P<contents>(?:(?!{/?block:[0-9a-z-_]+}).)*)(?P<closing>{/block:\3})~ism

Example template:

<div class="f_sponsors s_banners">
    <div class="s_previous">&laquo;</div>
    <div class="s_sponsors">
        <ul>
            {block:sponsors}
            <li>
                <a href="{var:url}" target="_blank">
                    <img src="image/160x126/{var:image}" alt="{var:name}" title="{var:name}" />
                </a>
            {block:test}This should be caught first!{/block:test}
            </li>
            {/block:sponsors}
        </ul>
    </div>
    <div class="s_next">&raquo;</div>
</div>

I guess this is a long shot. :(

Try this:

'~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)(?P<closing>\{/block:(?P=name)\})~i'

Or, in readable form:

'~(?P<opening>
  \{
  (?P<inverse>[!])?
  block:
  (?P<name>[a-z0-9\s_-]+)
  \}
)
(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)
(?P<closing>
  \{
  /block:(?P=name)
  \}
)~ix'

The most important part is in the (?P<contents>...) group:

[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*

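To start with, the only character we're interested in is the opening brace, so we can slurp up every other character with [^{]*. Only after we see a { do we check whether it's the start of a {/block} tag. If it isn't, we go ahead and consume it, start scanning for the next one, and repeat as necessary.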
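As a rough illustration of that unrolled pattern in use, here is a minimal sketch. The function name match_first_block and the shortened template are my own inventions for the example; only the pattern comes from the answer above.

```php
<?php
// Hypothetical helper: returns the named groups of the first block the
// pattern can complete, or null when nothing matches.
function match_first_block(string $template): ?array
{
    $pattern = '~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})'
             . '(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)'
             . '(?P<closing>\{/block:(?P=name)\})~i';

    return preg_match($pattern, $template, $m) === 1 ? $m : null;
}

// Shortened version of the sample template (an assumption for brevity).
$template = '<ul>{block:sponsors}<li>{var:name}'
          . '{block:test}This should be caught first!{/block:test}'
          . '</li>{/block:sponsors}</ul>';

$m = match_first_block($template);
echo $m['name'], "\n";      // the outer block wins with this version
echo $m['contents'], "\n";  // everything between its start and end tags
```

Note that this version matches the outer {block:sponsors} block, because the contents group only refuses the closing tag that carries the same name.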

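Using RegexBuddy, I tested each regex by placing the cursor at the beginning of the {block:sponsors} tag and debugging. Then I removed the closing brace from the closing {/block:sponsors} tag to force a failed match and debugged it again. Your regex took 940 steps to succeed and 2265 steps to fail. Mine took 57 steps to succeed and 83 steps to fail.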

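On a side note, I removed the s modifier because I'm not using the dot (.), and the m modifier because it was never needed. I also used the named backreference (?P=name) instead of \3, as per @DaveRandom's excellent suggestion. And I escaped all the curly braces ({ and }) because I find it easier to read that way.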

EDIT: If you want to match the innermost named block, change the middle portion of the regex from this:

(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)

...to this (as suggested by @Kobi in his comment):

(?P<contents>
  [^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*
)

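Originally, the (?P<opening>...) group would grab the first opening tag it saw, then the (?P<contents>...) group would consume anything (including other tags) as long as it wasn't the closing tag matching the one found by the (?P<opening>...) group. (Then the (?P<closing>...) group would go ahead and consume that.)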

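Now, the (?P<contents>...) group refuses to match any tag, opening or closing (note the /? at the beginning), no matter what the name is. So the regex initially starts to match the {block:sponsors} tag, but when it encounters the {block:test} tag it abandons that match and goes back to searching for an opening tag. It starts again at the {block:test} tag, and this time it completes the match when it finds the {/block:test} closing tag.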

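It sounds inefficient when described like that, but it really isn't. The trick I described earlier, slurping up the non-braces, drowns out the effect of these false starts. Where you were doing a negative lookahead at almost every position, now you're doing one only when you encounter a {. You could even use possessive quantifiers, as @godspeedlee suggested:

(?P<contents>
  [^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+
)

...because you know it will never consume anything that it has to give back later. That would speed things up a little, but it isn't really necessary.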
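To tie this back to the "re-run until there are no blocks" idea from the question, here is a minimal sketch built on the possessive innermost-block pattern. The names INNERMOST_BLOCK and unwrap_blocks are hypothetical, and replacing each block with its bare contents is only a stand-in for whatever rendering your template system actually does.

```php
<?php
// Innermost-block pattern: the contents may not contain ANY block tag,
// opening or closing, so only blocks with no nested blocks can match.
const INNERMOST_BLOCK = '~\{(?P<inverse>!)?block:(?P<name>[a-z0-9\s_-]+)\}'
    . '(?P<contents>[^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+)'
    . '\{/block:(?P=name)\}~i';

// Hypothetical helper: repeatedly replaces innermost blocks with their
// contents until a pass finds nothing left to replace.
function unwrap_blocks(string $template): string
{
    do {
        $template = preg_replace_callback(
            INNERMOST_BLOCK,
            fn (array $m): string => $m['contents'], // placeholder "rendering"
            $template,
            -1,
            $count
        );
        if ($template === null) {
            // preg_* functions return null when PCRE hits an error or limit
            throw new RuntimeException('PCRE error code ' . preg_last_error());
        }
    } while ($count > 0);

    return $template;
}

echo unwrap_blocks('a{block:x}b{block:y}c{/block:y}d{/block:x}e'), "\n";
```

Each pass handles every block that currently has no nested blocks, so the number of passes follows the nesting depth rather than the number of blocks.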

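Does the solution have to be a single regex? A more efficient approach might be simply to look for the first occurrence of {/block: (which could be a simple string search or a regex), then search backward from that point to find its matching opening tag, replace the span appropriately, and repeat until there are no more blocks. If you look for the first closing tag from the top of the template each time, that gives you an innermost block.

The mirror-image algorithm works just as well: look for the last opening tag, then search forward from there for the corresponding closing tag: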

<?php
$template = //...
while(true) {
  $last_open_tag = strrpos($template, '{block:');
  $last_inverted_tag = strrpos($template, '{!block:');
  // $block_start is the index of the '{' of the last opening block tag in the
  // template, or false if there are no more block tags left
  $block_start = max($last_open_tag, $last_inverted_tag);
  if($block_start === false) {
    // all done
    break;
  } else {
    // extract the block name (the foo in {block:foo}) - from the character
    // after the next : to the character before the next }, inclusive
    $block_name_start = strpos($template, ':', $block_start) + 1;
    $block_name = substr($template, $block_name_start,
        strcspn($template, '}', $block_name_start));
    // we now have the start tag and the block name, next find the end tag.
    // $block_end is the index of the '{' of the next closing block tag after
    // $block_start.  If this doesn't match the opening tag something is wrong.
    $block_end = strpos($template, '{/block:', $block_start);
    if(strpos($template, $block_name.'}', $block_end + 8) !== $block_end + 8) {
      // non-matching tag
      print("Non-matching tag found\n");
      break;
    } else {
      // now we have found the innermost block
      // - its start tag begins at $block_start
      // - its content begins at
      //   (strpos($template, '}', $block_start) + 1)
      // - its content ends at $block_end
      // - its end tag ends at ($block_end + strlen($block_name) + 9)
      //   [9 being the length of '{/block:' plus '}']
      // - the start tag was inverted iff $block_start === $last_inverted_tag
      $template = // do whatever you need to do to replace the template
    }
  }
}
echo $template;
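The listing above leaves the template and the replacement step open. For something you can run as-is, here is a minimal sketch of the same last-opening-tag search. The names find_last_block and strip_block_tags are hypothetical, the tags are assumed to be well formed, and replacing each block with its bare content is just a placeholder for real rendering.

```php
<?php
// Hypothetical helper: locates the last opening block tag and its closing
// tag. Returns null when no block tags are left.
function find_last_block(string $template): ?array
{
    $open     = strrpos($template, '{block:');
    $inverted = strrpos($template, '{!block:');
    if ($open === false && $inverted === false) {
        return null;
    }
    // strrpos() returns false on a miss, so map that to -1 before comparing
    $start = max($open === false ? -1 : $open, $inverted === false ? -1 : $inverted);

    // the block name runs from just after the ':' up to the next '}'
    $nameStart    = strpos($template, ':', $start) + 1;
    $name         = substr($template, $nameStart, strcspn($template, '}', $nameStart));
    $contentStart = $nameStart + strlen($name) + 1; // skip the '}'
    if ($contentStart > strlen($template)) {
        throw new UnexpectedValueException('Unterminated opening tag');
    }

    // the next closing tag after the last opening tag must carry the same name
    $closeTag = '{/block:' . $name . '}';
    $end      = strpos($template, '{/block:', $contentStart);
    if ($end === false || substr($template, $end, strlen($closeTag)) !== $closeTag) {
        throw new UnexpectedValueException("Non-matching tag for block '$name'");
    }

    return [
        'name'     => $name,
        'inverted' => $start === $inverted,
        'content'  => substr($template, $contentStart, $end - $contentStart),
        'start'    => $start,                             // index of the opening '{'
        'length'   => $end + strlen($closeTag) - $start,  // whole block, tags included
    ];
}

// Placeholder processing: swap each block for its content until none remain.
function strip_block_tags(string $template): string
{
    while (($block = find_last_block($template)) !== null) {
        $template = substr_replace($template, $block['content'], $block['start'], $block['length']);
    }
    return $template;
}

echo strip_block_tags('a{block:x}b{!block:y}c{/block:y}d{/block:x}e'), "\n";
```

Because no opening tag follows the last one, the block it opens cannot contain another block, which is why no recursion or backtracking is needed here.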

You could use

an atomic group, (?>...), or possessive quantifiers, ?+ *+ ++, to suppress/limit backtracking, and speed up matching with the unrolling-the-loop technique. My solution:

\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}

I have tested it at http://regexr.com?31p03.

To match {block:sponsors}...{/block:sponsors}:

\{block:(sponsors)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb3

To match {block:test}...{/block:test}:

\{block:(test)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31RB6

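Another solution:
In the PCRE source code, you can uncomment this line in config.h:

/* #undef NO_RECURSE */

The following text is copied from config.h:
PCRE uses recursive function calls to handle backtracking while matching. This can sometimes be a problem on systems that have stacks of limited size. Define NO_RECURSE to get a version that doesn't use recursion in the match() function; instead it creates its own stack by steam using pcre_recurse_malloc() to obtain memory from the heap.

Or you can change pcre.backtrack_limit and pcre.recursion_limit in php.ini (http://www.php.net/manual/en/pcre.configuration.php)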
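If you would rather experiment per script than edit php.ini, both settings can also be changed at runtime with ini_set(). Here is a minimal sketch. The helper name checked_match is hypothetical and the limit values are purely illustrative, not recommendations.

```php
<?php
// Hypothetical helper: run preg_match() and surface PCRE limit errors
// instead of silently treating them as "no match".
function checked_match(string $pattern, string $subject): int
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        $names = [
            PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
        ];
        $code = preg_last_error();
        throw new RuntimeException('PCRE error: ' . ($names[$code] ?? "code $code"));
    }
    return $result;
}

// Illustrative values only; tune them for your own server.
ini_set('pcre.backtrack_limit', '100000');
ini_set('pcre.recursion_limit', '10000');

var_dump(checked_match('~\{block:\w+\}~', '{block:test}'));
```

Checking preg_last_error() matters either way: when a limit is hit, preg_match() returns false rather than 0, and a loose comparison can easily mistake that for "no match".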