Regular expression for matching between text

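I have a file containing statistics generated automatically from Apache HTTP logs.

I'm really struggling to match the lines between two pieces of text. This is part of the stats file I have: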

jpg 6476 224523785 0 0
Unknown 31200 248731421 0 0
gif 197 408771 0 0
END_FILETYPES
# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS
# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
winlong 752
winxp 1320

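What I want to do is write a regular expression that searches only between the markers BEGIN_OS 12 and END_OS.

For example, I want to build a PHP array containing the operating systems and their hit counts (I know the real array won't look exactly like this, but as long as I have this data in it):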

array(
   [0] => array(
      [0] => linuxandroid
      [1] => winlong
      [2] => winxp
      [3] => win2008
   )
   [1] => array(
      [0] => 1034
      [1] => 752
      [2] => 1320
      [3] => 204250
   )
)

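I've been testing regular expressions in the gskinner regex tester for a good few hours now, but regex is far from my strong point.

I would post what I have so far, but I've tried a lot of things, and the closest I've come is:

^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)

Which is awful!

Any help would be appreciated, even if the answer is "this can't be done".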

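A regular expression alone may not be the best tool for this job. You can use a regex to extract the substring you need, then process it further with PHP's string functions.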

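// Keep only the text between "BEGIN_OS <count>" and "END_OS".
$string = preg_replace('/^.*BEGIN_OS \d+\s*(.*?)\s*END_OS.*/s', '$1', $text);
$result = array();
// Split on any newline style, so the file's line endings need not match PHP_EOL.
foreach (preg_split('/\R/', $string) as $line) {
    list($key, $value) = explode(' ', $line);
    $result[$key] = $value;
}
print_r($result);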

This should give you the following output:

Array
(
    [linuxandroid] => 1034
    [winlong] => 752
    [winxp] => 1320
    [win2008] => 204250
)

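You could try something like this:

/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/gm

You would still need to parse the match results. You could also simplify it with something like:

/BEGIN_OS 12([\s\S]*)END_OS/gm

Then just parse the first group (the text between the markers): split it on '\n', then split each line on ' ', to get the parts you need.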
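The capture-then-split approach could look roughly like the PHP sketch below. It is only an illustration: `$text`, `$names`, and `$hits` are assumed variable names, the sample input is abbreviated from the question, and `\d+` stands in for the hard-coded `12` on the assumption that the count can change between files.

```php
<?php
// Sample input abbreviated from the question; $text is an assumed variable name.
$text = "END_FILETYPES\n# OS ID - Hits\nBEGIN_OS 12\nlinuxandroid 1034\n"
      . "winlong 752\nwinxp 1320\nwin2008 204250\nEND_OS\n"
      . "# Browser ID - Hits\nBEGIN_BROWSER 79\n";

$names = array();
$hits = array();

// Capture everything between the markers (lazy, so it stops at the first END_OS).
if (preg_match('/BEGIN_OS \d+\s+([\s\S]*?)\s*END_OS/', $text, $m)) {
    // Split the captured block into lines, then each line on whitespace.
    foreach (preg_split('/\R/', trim($m[1])) as $line) {
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) === 2) {
            $names[] = $parts[0];
            $hits[] = (int) $parts[1];
        }
    }
}

// Two parallel arrays, in the shape the question asks for.
$result = array($names, $hits);
print_r($result);
```

The `count($parts) === 2` check simply skips any line that is not a name followed by a number, which seemed a safer default than assuming every line is well formed.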

Edit

The regexes with comments:

/BEGIN_OS 12          // Match "BEGIN_OS 12" exactly
 \s                   // Match a whitespace character after it
 (?:                  // Begin a non-capturing group
   ([\w\d]+)          // Match one or more word or digit characters
   \s                 // Match a whitespace character
   ([\d]+\s)          // Match one or more digits, then a whitespace character
 )*                   // End non-capturing group, repeat the group 0 or more times
 END_OS               // Match "END_OS" exactly
/gm                   // global search (g) and multiline (m)
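One caveat when trying the repeated-group pattern in PHP: a capturing group inside a repeated group keeps only its last repetition, and PCRE has no `g` modifier (`preg_match_all` plays that role). The sketch below shows the effect and one possible workaround; the second pattern is my own variation rather than something from the answer, and `$text`, `$block`, and `$pairs` are assumed names.

```php
<?php
// Two-line sample in the same shape as the question's OS block.
$text = "BEGIN_OS 12\nlinuxandroid 1034\nwinlong 752\nEND_OS\n";

// Pattern from the answer: groups 1 and 2 sit inside the repeated group.
preg_match('/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/', $text, $m);
$lastName = $m[1];       // only the final repetition is kept
$lastHits = trim($m[2]); // trim drops the whitespace captured after the digits

// Variation (an assumption, not from the answer): capture the whole run of lines once...
preg_match('/BEGIN_OS 12\s((?:\w+\s\d+\s)*)END_OS/', $text, $block);
// ...then pull every name/hits pair out of that captured run.
preg_match_all('/(\w+)\s(\d+)/', $block[1], $pairs);
$names = $pairs[1];
$hits = $pairs[2];

print_r(array($names, $hits));
```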

The simple version:

/BEGIN_OS 12          // Match "BEGIN_OS 12" exactly
  (                   // Begin group
    [\s\S]*           // Match any whitespace/non-whitespace character (works like '.' but also matches newlines)
  )                   // End group
  END_OS              // Match "END_OS" exactly
/gm                   // global search (g) and multiline (m)

Second edit

Your attempt:

^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)

won't give you the expected result. If you break it apart:

^                     // Match the start of a line; without 'm' this means the beginning of the string.
[BEGIN_OS\s12]+       // This is a character class: match one or more characters, each being any of
                      // [B, E, G, I, N, _, O, S, \s, 1, 2]. While this matches "BEGIN_OS 12",
                      // it also matches any other line made up of a combination of those
                      // characters, or just a line of whitespace (thanks to \s).
([a-zA-Z0-9]+)        // This should match the part you expect, but potentially not with the previous rule in place.
\s                    // Match a single whitespace character
([0-9]+)              // This is the same as [\d]+ or \d+ and should match what you expect (again, potentially not with the first rule)