Extracting info from plain text using regexps or another, more efficient method

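I need to extract data from the plain text that is left after stripping the HTML tags from a web page. The markup is stripped because the page consists of tabular data, but with tables nested in tables nested in tables, and so on (very ugly HTML). After cleaning the code (with HTML Tidy) and stripping the tags, the site returns information like this: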

Visitor ID :   123456789   HostName: 127.0.01     IP :  127.0.0.1  First Visit -> Entry Page :   First   Visit    Entry    Page    Title    Example    First Visit -> Referrer: http://somepage.com   First Visit :  302 Day(s)    Last Visit :   09/23/2011    ISP: Initech   Country:  Some country Country:  Some  country    Browser: Chrome   Screen Res: Unknow 4 Billion colors (32 bit)   Javascript: Enabled   Page Views: 1     File Downloaded: 0  Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title   Referring URL: No

(As you can see, it is long and messy.)

I would like to turn it into this:

Visitor ID: 123456789
HostName: 127.0.01
IP: 127.0.0.1
First Visit: 302 Day(s)
First Visit -> Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage.com
Last Visit: 09/23/2011
ISP: Initech
Country: Some country
Country: Some country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No

I am currently using regexps to remove the extra whitespace and to try to sort out the data. So far, the following code almost works:

$patterns       = array("/HostName\s*:/",
                        "/IP\s*:/",
                        "/First\s+Visit\s+->\s+Entry\s+Page\s*:/",
                        "/First\s+Visit\s+->\s+Referrer\s*:/",
                        "/First\s+Visit\s*:/",
                        "/\bLast\s+Visit\s*:/",
                        "/\bISP\s*:/",
                        "/\bCountry\s*:/",
                        "/\bBrowser\s*:/",
                        "/\bScreen\s*Res\s*:/",
                        "/\bJavascript\s*:/",
                        "/\bPage\s+Views\s*:/",
                        "/\bFile\s+Downloaded\s*:/",
                        "/\bDaily\s+Visits\s*:/",
                        "/\bVisit\s+Length\s*:/",
                        "/\bEntry\s+Page\s*:/",
                        "/\bExit\s+Page\s*:/",
                        "/\bReferring\s+URL\s*:/",
                        "/\bFrom\s+Campaign\s*:/"   );
$replacements   = array("\nHostName:",
                        "\nIP:",
                        "\nFirst Visit -> Entry Page:",
                        "\nFirst Visit -> Referrer:",
                        "\nFirst Visit:",
                        "\nLast Visit:",
                        "\nISP:",
                        "\nCountry:",
                        "\nBrowser:",
                        "\nScreen Res:",
                        "\nJavascript:",
                        "\nPage Views:",
                        "\nFile Downloaded:",
                        "\nDaily Visits:",
                        "\nVisit Length:",
                        "\nEntry Page:",
                        "\nExit Page:",
                        "\nReferring URL:",
                        "\nFrom Campaign:"  );
ksort( $patterns );
ksort( $replacements );
$fixed_text      = preg_replace ( $patterns, $replacements, $ugly_mess );

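However, this does not work as expected. Note that some field names are similar to each other, and the regexps trip over them, producing output like this: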

Visitor ID: 123456789 
HostName: 127.0.0.1 
IP: 127.0.0.1 
Last Visit: 302 Day(s) 
First Visit: 10 June 2010 
First Visit -> 
Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage
.com
ISP: Initech 
Country: Some Country 
Country: Some Country 
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled  
Page Views: 1
File Downloaded: 0 
Daily Visits: 1
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No  

I may be going about this the wrong way, which is why I am asking for suggestions or for fixes to the current code. Any ideas?

Instead of using a replace pattern, use a match. I used JavaScript, but you can easily convert it back to PHP.

  // Each alternative matches one "Label: value" pair; the value is its only capture.
  var pattern = "(?:";
  pattern += "(?:Visitor\\s*ID\\s*:\\s*(\\d+)\\s*)";
  pattern += "|(?:HostName\\s*:\\s*([^ ]+)\\s*)";
  pattern += "|(?:IP\\s*:\\s*([^ ]+)\\s*)";
  pattern += "|(?:First\\s*Visit\\s*->\\s*Entry Page\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*->))";
  pattern += "|(?:First\\s*Visit\\s*->\\s*Referrer\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*:))";
  pattern += "|(?:First\\s*Visit\\s*:\\s*(\\d+)\\s*Day\\(s\\)\\s*)";
  pattern += "|(?:Last\\s*Visit\\s*:\\s*(\\d+/\\d+/\\d+)\\s*)";
  pattern += "|(?:ISP\\s*:\\s*(.+?)\\s*(?=Country\\s*:))";
  pattern += "|(?:Country\\s*:\\s*(.+?)\\s*(?=(?:Country|Browser)\\s*:))";
  pattern += "|(?:Browser\\s*:\\s*(.+?)\\s*(?=Screen\\s*Res\\s*:))";
  pattern += "|(?:Screen\\s*Res\\s*:\\s*(.+?)\\s*(?=Javascript\\s*:))";
  pattern += "|(?:Javascript\\s*:\\s*(.+?)\\s*(?=Page\\s*Views\\s*:))";
  pattern += "|(?:Page\\s*Views\\s*:\\s*(\\d+)\\s*)";
  pattern += "|(?:File\\s*Downloaded\\s*:\\s*(\\d+)\\s*)";
  pattern += "|(?:Daily\\s*Visits\\s*:\\s*(\\d+)\\s*)";
  pattern += "|(?:Visit\\s*Length\\s*:\\s*((?:\\d+ (?:hours|minutes|seconds)\\s*)+))";
  pattern += ")";
  // Match one field at a time with the global flag. Standard JavaScript resets the
  // captures of a repeated group on every iteration, so "^(?:...)+" keeps only the last field.
  var regex = new RegExp(pattern, "g");
  // readData() stands for whatever returns the stripped text.
  // Assumption: non-breaking spaces are normalized to plain spaces here.
  var content = readData().replace(/\u00a0/g, " ");
  var match = [];
  var m;
  while ((m = regex.exec(content)) !== null) {
    for (var i = 1; i < m.length; i++) {
      if (m[i] !== undefined) match[i] = m[i];  // a repeated field (Country) keeps its last value
    }
  }
  console.log("Visitor Id: " + match[1]);
  console.log("Hostname: " + match[2]);
  console.log("IP: " + match[3]);
  // continue on...
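
As for why the replace version breaks: preg_replace applies the patterns one after another, so the later Entry Page pattern matches again inside the already rewritten "First Visit -> Entry Page:" and pushes a newline into the middle of it. If all you want is the one-field-per-line layout from the question, a single pass over the text avoids that. Below is a rough sketch, again in JavaScript. LABELS, labelToPattern and splitFields are names I made up for illustration, the label list is copied from the question, and it assumes that no value ever contains one of the labels followed by a colon.

```javascript
// Labels from the question. The two "First Visit -> ..." labels come first so they
// are tried before the shorter "First Visit" at the same position.
const LABELS = [
  "First Visit -> Entry Page", "First Visit -> Referrer",
  "Visitor ID", "HostName", "IP", "First Visit", "Last Visit", "ISP",
  "Country", "Browser", "Screen Res", "Javascript", "Page Views",
  "File Downloaded", "Daily Visits", "Visit Length",
  "Entry Page", "Exit Page", "Referring URL", "From Campaign",
];

// Escape regex metacharacters, then allow any run of whitespace between words.
function labelToPattern(label) {
  return label
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    .replace(/ +/g, "\\s+");
}

// Hypothetical helper: returns an array of "Label: value" strings in input order.
function splitFields(text) {
  const alternation = LABELS.map(labelToPattern).join("|");
  // One global regex finds every label in a single pass, so text that has
  // already been matched is never scanned again by another pattern.
  const labelRe = new RegExp("\\b(" + alternation + ")\\s*:", "g");
  const hits = [...text.matchAll(labelRe)];
  return hits.map((hit, i) => {
    const start = hit.index + hit[0].length;
    const end = i + 1 < hits.length ? hits[i + 1].index : text.length;
    const label = hit[1].replace(/\s+/g, " ");
    const value = text.slice(start, end).trim().replace(/\s+/g, " ");
    return label + ": " + value;
  });
}

// Usage with a fragment of the sample text from the question.
const sample =
  "First Visit -> Entry Page :   First   Visit    Entry    Page    Title    Example    " +
  "First Visit -> Referrer: http://somepage.com   First Visit :  302 Day(s)    " +
  "Last Visit :   09/23/2011";
console.log(splitFields(sample).join("\n"));
// First Visit -> Entry Page: First Visit Entry Page Title Example
// First Visit -> Referrer: http://somepage.com
// First Visit: 302 Day(s)
// Last Visit: 09/23/2011
```

Because the value is simply everything up to the next label, there are no per-field lookaheads to maintain, and repeated fields such as Country come out once per occurrence. The trade-off is that values are not validated the way the per-field patterns above validate them.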