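I need to extract data from the plain text that is left after stripping the HTML tags from a web page. The markup was stripped because the page consists of tabular data, but with tables nested inside tables nested inside tables, and so on (very ugly HTML). After cleaning the code (with HTML Tidy) and stripping the tags, the site returns information like this: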
Visitor ID : 123456789 HostName: 127.0.01 IP : 127.0.0.1 First Visit -> Entry Page : First Visit Entry Page Title Example First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s) Last Visit : 09/23/2011 ISP: Initech Country: Some country Country: Some country Browser: Chrome Screen Res: Unknow 4 Billion colors (32 bit) Javascript: Enabled Page Views: 1 File Downloaded: 0 Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title Referring URL: No
(As you can see, it is long and messy.)
I want to turn it into this:
Visitor ID: 123456789
HostName: 127.0.01
IP: 127.0.0.1
First Visit: 302 Day(s)
First Visit -> Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage.com
Last Visit: 09/23/2011
ISP: Initech
Country: Some country
Country: Some country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
I am currently using regexps to remove the extra whitespace and to try to sort the data. So far, the following code almost works:
$patterns = array("/HostName\s*:/",
"/IP\s*:/",
"/First\s+Visit\s+->\s+Entry\s+Page\s*:/",
"/First\s+Visit\s+->\s+Referrer\s*:/",
"/First\s+Visit\s*:/",
"/\bLast\s+Visit\s*:/",
"/\bISP\s*:/",
"/\bCountry\s*:/",
"/\bBrowser\s*:/",
"/\bScreen\s*Res\s*:/",
"/\bJavascript\s*:/",
"/\bPage\s+Views\s*:/",
"/\bFile\s+Downloaded\s*:/",
"/\bDaily\s+Visits\s*:/",
"/\bVisit\s+Length\s*:/",
"/\bEntry\s+Page\s*:/",
"/\bExit\s+Page\s*:/",
"/\bReferring\s+URL\s*:/",
"/\bFrom\s+Campaign\s*:/" );
$replacements = array("\nHostName:",
"\nIP:",
"\nFirst Visit -> Entry Page:",
"\nFirst Visit -> Referrer:",
"\nFirst Visit:",
"\nLast Visit:",
"\nISP:",
"\nCountry:",
"\nBrowser:",
"\nScreen Res:",
"\nJavascript:",
"\nPage Views:",
"\nFile Downloaded:",
"\nDaily Visits:",
"\nVisit Length:",
"\nEntry Page:",
"\nExit Page:",
"\nReferring URL:",
"\nFrom Campaign:" );
ksort( $patterns );
ksort( $replacements );
$fixed_text = preg_replace ( $patterns, $replacements, $ugly_mess );
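However, this does not work as expected. Note that some field names are similar (for example, "Entry Page" also appears inside "First Visit -> Entry Page"), so the regexps match in the wrong places and produce output like this: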
Visitor ID: 123456789
HostName: 127.0.0.1
IP: 127.0.0.1
Last Visit: 302 Day(s)
First Visit: 10 June 2010
First Visit ->
Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage
.com
ISP: Initech
Country: Some Country
Country: Some Country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
I may be going about this the wrong way, which is why I am asking for suggestions or a fix for the current code. Any ideas?
Instead of using a replace pattern, use a match. I used JavaScript, but you can easily port it back to PHP.
// One optional group per field, chained in the order the fields appear in the text.
// (Captures inside a repeated (?:a|b)+ group are reset on every iteration in
// standard JavaScript engines, so the groups are chained rather than alternated.)
var pattern = "^\\s*";
pattern += "(?:Visitor\\s*ID\\s*:\\s*(\\d+)\\s*)?";
pattern += "(?:HostName\\s*:\\s*([^ ]+)\\s*)?";
pattern += "(?:IP\\s*:\\s*([^ ]+)\\s*)?";
pattern += "(?:First\\s*Visit\\s*->\\s*Entry Page\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*->))?";
pattern += "(?:First\\s*Visit\\s*->\\s*Referrer\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*:))?";
pattern += "(?:First\\s*Visit\\s*:\\s*(\\d+)\\s*Day\\(s\\)\\s*)?";
pattern += "(?:Last\\s*Visit\\s*:\\s*(\\d+/\\d+/\\d+)\\s*)?";
pattern += "(?:ISP\\s*:\\s*(.+?)\\s*(?=Country\\s*:))?";
pattern += "(?:Country\\s*:\\s*(.+?)\\s*(?=(?:Country|Browser)\\s*:))*";
pattern += "(?:Browser\\s*:\\s*(.+?)\\s*(?=Screen\\s*Res\\s*:))?";
pattern += "(?:Screen\\s*Res\\s*:\\s*(.+?)\\s*(?=Javascript\\s*:))?";
pattern += "(?:Javascript\\s*:\\s*(.+?)\\s*(?=Page\\s*Views\\s*:))?";
pattern += "(?:Page\\s*Views\\s*:\\s*(\\d+)\\s*)?";
pattern += "(?:File\\s*Downloaded\\s*:\\s*(\\d+)\\s*)?";
pattern += "(?:Daily\\s*Visits\\s*:\\s*(\\d+)\\s*)?";
pattern += "(?:Visit\\s*Length\\s*:\\s*((?:\\d+ (?:hours|minutes|seconds)\\s*)+))?";
var regex = new RegExp(pattern);
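// readData() stands for whatever returns the tag-stripped text;
// leftover &nbsp; entities are dropped before matching.
var content = readData().replace(/&nbsp;/g, "");
var match = content.match(regex);
console.log("Visitor Id: " + match[1]);
console.log("Hostname: " + match[2]);
console.log("IP: " + match[3]);
// continue on...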
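If all you need is the one-field-per-line layout from the question, a single replace pass over one alternation of the known labels may be enough. What follows is only a sketch under assumptions, not the code from the answer above: `LABELS`, `labelToPattern`, `LABEL_REGEX` and `formatVisitorText` are made-up names, and the label list is simply copied from the question, so it assumes the labels are always spelled that way. Because the text is scanned once, a label such as "Entry Page" can no longer be matched a second time inside "First Visit -> Entry Page", which seems to be what splits that line in the sequential replacements.

```javascript
// Known field labels (hypothetical list, copied from the sample in the question).
// Longer labels come before the shorter labels they contain.
const LABELS = [
  "Visitor ID",
  "HostName",
  "IP",
  "First Visit -> Entry Page",
  "First Visit -> Referrer",
  "First Visit",
  "Last Visit",
  "ISP",
  "Country",
  "Browser",
  "Screen Res",
  "Javascript",
  "Page Views",
  "File Downloaded",
  "Daily Visits",
  "Visit Length",
  "Entry Page",
  "Exit Page",
  "Referring URL",
  "From Campaign",
];

// Escape regex metacharacters, then allow any run of whitespace between words.
function labelToPattern(label) {
  return label
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    .replace(/ +/g, "\\s+");
}

// One alternation, applied in a single left-to-right pass: text consumed by a
// long label is never re-examined, so its inner words cannot match again.
const LABEL_REGEX = new RegExp(
  "\\s*\\b(" + LABELS.map(labelToPattern).join("|") + ")\\s*:\\s*",
  "g"
);

// Put every "Label: value" pair on its own line with normalized spacing.
function formatVisitorText(text) {
  return text
    .replace(LABEL_REGEX, (match, label) => "\n" + label.replace(/\s+/g, " ") + ": ")
    .trim();
}
```

Calling `formatVisitorText(content)` on the stripped text returns a string with one field per line; splitting that string on "\n" and then on the first ": " would give label/value pairs if you need them separately.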