Parse email header from text using PCRE regular expression

I need to parse (split) a text file containing emails exported from Outlook. I am splitting it with preg_split, using the flags PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE.

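My goal is to capture the message header section with a regular expression: it starts at the "From:" line and ends at the blank line before the message body.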

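Constraints:

  • Field names are expected to be multilingual
  • The number of header fields varies (Cc, Bcc, Attachments)
  • Some fields may span multiple lines (To, Cc, Bcc, Subject, Attachments)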

The text file is preprocessed: runs of spaces and tabs are replaced with a single space, and leading and trailing whitespace is stripped.
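The preprocessing step could be sketched roughly as follows. The function name `normalize_whitespace` is hypothetical, and the sketch assumes the trimming is done per line rather than once for the whole file, which the post does not state.

```php
<?php
// Hypothetical helper (not from the original post): collapses runs of
// spaces and tabs and trims every line, leaving line breaks untouched.
function normalize_whitespace(string $text): string
{
    // Replace each run of spaces and tabs with a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Remove spaces at the start of a line, or at the end of a line
    // (optionally just before the \r of a CRLF line ending).
    return preg_replace('/^ +| +(?=\r?$)/m', '', $text);
}

$text_clean = normalize_whitespace("  From:\t Black,   Jack (LA)  \n To: X\t\n");
echo $text_clean;
```

Line breaks are deliberately left alone here, because the header regex relies on them to find the end of each field.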

I have been working on this all day and cannot get the last part to work. It does work on the [gskinner regex test page]: http://regexr.com?36v27 , but not in PHP.

Subject text:

From: Black, Jack (LA)
Sent: Monday, October 28, 2013 6:36 PM
To: George, Jackson (London); DCS.CC.DARWIN (Australia)
Cc: Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo,
John J. (Gaziantep); Matrix, John (Phuket)
Subject: RE: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR
Dear George,
venenatis imperdiet quam. Proin a egestas nunc, et mattis elit. In hac habitasse platea dictumst. Nulla dolor nibh, tempus ut neque eu, tempus fermentum mauris. Mauris nec ipsum nec sapien commodo scelerisque ut eu urna. Pellentesque eu neque in enim adipiscing faucibus. Sed interdum arcu et sem mollis iaculis. Duis euismod laoreet ligula lacinia dapibus. Vestibulum ullamcorper malesuada metus at malesuada. 
 Nullam enim elit, auctor vehicula orci eget, imperdiet feugiat odio. Etiam dapibus sagittis sem a varius. Nulla sit amet convallis mi, sit amet rutrum ipsum. In libero lectus, mattis at dui eu.
Thank you and best regards,
Jack B. Black (Mr)
Operations Manager (GGD)
FU Supervisor (R34, R57)
Phone: +1112212212 (local 1111)
Mobile: +12 121.111.11.12
From: George, Jackson (UK)
Sent: Monday, October 28, 2013 5:57 PM
To: DCS.CC.DARWIN (Australia)
Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo,
John J. (Gaziantep); Matrix, John (Phuket)
Subject: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR
Dear Colleagues,
ermentum. Duis ipsum quam, bibendum a risus nec, tincidunt fringilla lectus. Nunc vel dictum massa, et cursus nunc. Mauris tincidunt felis eget justo congue volutpat. Nulla condimentum accumsan elementum. Integer commodo, lorem eu pharetra suscipit, ligula.
Best Regards.
SDFD srfgGD
Field coordinator (GGD)
Customer Representative
sds dfsd sdfgsef sdfsd
sgzdfgdfg fgfg gdfg
Footer text etc
sdfdfdf dfgsdfgsdfgsdfg
Phone : +90 212 368 40 00 (ext:3814)

Regex:

preg_match(
                 '/                         # delimiter
                (                           # capturing group start
                [\ A-Z][a-z]+:.+\(.+\)\R    # From: field
                [A-Z][a-z]+:.+\R            # Sent: fields
                [A-Z][a-z]+:.+\R            # To: field (1st line)
                (?:.+\R)+              # any additional header lines, before blank line (To, CC, BCC, Subject, Attachments)
                )                           # capturing group end
                # delimiter + modifiers /x', $text_clean, $matches);
        echo '<b>Matches: '.count($matches).'</b>';
        print_r($matches);

I am having trouble capturing the additional header lines:

(?:.+\R)+              # any additional header lines...

Any help is much appreciated.

The shortest way is to use preg_match_all with a lazy quantifier:

preg_match_all('/^From.*?\R\R/ims', $mails, $matches);
print_r($matches);
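As a rough illustration of how that one-liner behaves, here is a minimal sketch run on a shortened sample. The sample string is made up for this example, and it assumes every header block is followed by a blank line, which the lazy `.*?\R\R` depends on.

```php
<?php
// Shortened, made-up sample: two messages, each header block followed
// by a blank line before the body.
$mails = "From: Black, Jack (LA)\n"
       . "Sent: Monday, October 28, 2013 6:36 PM\n"
       . "To: George, Jackson (London)\n"
       . "Subject: RE: FOOBAR\n"
       . "\n"
       . "Dear George,\nbody text\n\n"
       . "From: George, Jackson (UK)\n"
       . "Sent: Monday, October 28, 2013 5:57 PM\n"
       . "To: DCS.CC.DARWIN (Australia)\n"
       . "Subject: FOOBAR\n"
       . "\n"
       . "Dear Colleagues,\nmore body text\n";

// m: ^ matches at every line start; s: the dot also matches newlines;
// .*? is lazy, so each match stops at the first pair of line breaks.
$count = preg_match_all('/^From.*?\R\R/ims', $mails, $matches);

foreach ($matches[0] as $header) {
    echo trim($header), "\n---\n";
}
```

Note that any body line beginning with "From" would also start a match, so this pattern is only as reliable as that assumption about the input.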

Thanks everyone for the input, but I found it using my own approach. A few points still confuse me; the working solution is further below.

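  1. Why does preg_match return the first result twice instead of two matches? (http://www.ideone.com/Xj6aaF)

  2. In (?:.+\R)+ the dot seems to match any character, even no character, which is why it kept running past the blank line. I find that odd: shouldn't + be a "1 or more" quantifier?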

Anyway, when I changed the regex pattern to (?:\S.+\R)+, it does what I want with preg_split.
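One plausible explanation for that difference, assuming the exported file uses Windows CRLF line endings and PCRE is using its usual default newline convention (LF): the dot refuses to match `\n` but happily matches `\r`, so on a "blank" line `.+` can consume the `\r` and `\R` the `\n`. The sample string below is made up for illustration.

```php
<?php
// Made-up sample with CRLF line endings: two header lines, a blank
// line, then one body line.
$text = "To: a\r\nSubject: b\r\n\r\nBody line\r\n";

// Original sub-pattern: on the blank line, ".+" can match the lone \r
// and "\R" the following \n, so the repetition runs into the body.
preg_match('/(?:.+\R)+/', $text, $greedy);

// Revised sub-pattern: \S cannot match \r (it is whitespace), so the
// repetition stops at the blank line.
preg_match('/(?:\S.+\R)+/', $text, $fixed);

var_dump($greedy[0] === $text);
var_dump($fixed[0]);
```

If the file had plain LF line endings instead, `.+` could not match an empty line at all, so the CRLF assumption is what makes this explanation fit the behaviour described.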

Demo
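The original demo link is not reproduced here. A minimal sketch of what such a demo might look like, using the revised pattern with preg_split on a made-up two-message sample (CRLF line endings and a blank line after each header block are assumptions):

```php
<?php
// Made-up sample: two messages with CRLF line endings.
$text = "From: A (LA)\r\nSent: Monday\r\nTo: B (London)\r\nSubject: Hi\r\n"
      . "\r\nDear B,\r\nbody one\r\n\r\n"
      . "From: B (UK)\r\nSent: Monday\r\nTo: A (LA)\r\nSubject: Hello\r\n"
      . "\r\nDear A,\r\nbody two\r\n";

$pattern = '/
    (                          # capture the header so preg_split returns it
    [A-Z][a-z]+:.+\(.+\)\R     # From: line, ending in a parenthesised location
    [A-Z][a-z]+:.+\R           # Sent: line
    [A-Z][a-z]+:.+\R           # To: line (first line)
    (?:\S.+\R)+                # further header lines, up to the blank line
    )
/x';

// DELIM_CAPTURE keeps the captured headers in the result;
// NO_EMPTY drops the empty piece before the first header.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
```

With this input the result alternates between header blocks and message bodies, which is the shape the question was after.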

Although my problem is technically solved, I would appreciate it if someone could explain the two points above.
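On the first point, a likely reading (not checked against the linked ideone code) is that preg_match stops after the first match, and `$matches` then holds the whole match at index 0 and capture group 1 at index 1. When the group wraps the entire pattern, the two entries are identical, which looks like "the first result twice". A minimal sketch with a made-up sample and a simplified pattern:

```php
<?php
// Made-up sample with two header blocks.
$text = "From: A\nTo: B\n\nbody\nFrom: C\nTo: D\n\nbody2\n";

// Simplified pattern: the capture group wraps the whole expression.
$pattern = '/(From:.+\R(?:\S.+\R)+)/';

// preg_match finds only the first match:
// $first[0] is the full match, $first[1] is capture group 1.
preg_match($pattern, $text, $first);
echo count($first), "\n";

// preg_match_all finds every match; $all[0] lists the full matches.
$total = preg_match_all($pattern, $text, $all);
echo $total, "\n";
```

So `count($matches)` after preg_match counts the full match plus its groups, not the number of header blocks found. Use preg_match_all for that.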