我有100个文件,每个文件包含x数量的新闻文章。文章通过以下缩写的部分进行结构:
HD BY WC PD SN SC PG LA CY LP TD CO IN NS RE IPC PUB AN
其中CCD_ 1和CCD_。
典型的消息如下所示:
HD Corporate News: Alcoa Earnings Soar; Outlook Stays Upbeat
BY By James R. Hagerty and Matthew Day
WC 421 words
PD 12 July 2011
SN The Wall Street Journal
SC J
PG B7
LA English
CY (Copyright (c) 2011, Dow Jones & Company, Inc.)
LP
Alcoa Inc.'s profit more than doubled in the second quarter, but the giant
aluminum producer managed only to meet analysts' recently lowered forecasts.
Alcoa serves as a bellwether for U.S. corporate earnings because it is the
first major company to report and draws demand from a wide range of
industries.
TD
The results marked an early test of how corporate optimism is holding up
in the face of bleak economic news.
License this article from Dow Jones Reprint
Service[http://www.djreprints.com/link/link.html?FACTIVA=wjco20110712000115]
CO
almam : ALCOA Inc
IN
i2245 : Aluminum | i22 : Primary Metals | i224 : Non-ferrous Metals | imet
: Metals/Mining
NS
c15 : Performance | c151 : Earnings | c1521 : Analyst
Comment/Recommendation | ccat : Corporate/Industrial News | c152 :
Earnings Projections | ncat : Content Types | nfact : Factiva Filters |
nfce : FC&E Exclusion Filter | nfcpin : FC&E Industry News Filter
RE
usa : United States | use : Northeast U.S. | uspa : Pennsylvania | namz :
North America
IPC
DJCS | EWR | BSC | NND | CNS | LMJ | TPT
PUB
Dow Jones & Company, Inc.
AN
Document J000000020110712e77c00035
在每篇文章之后,在一篇新文章开始之前有4行换行符。我需要将这些文章放入一个数组中,如下所示:
$articles = array(
[0] = array (
[HD] => Corporate News: Alcoa earnings Soar; Outlook...
[BY] => By James R. Hagerty...
...
[AN] => Document J000000020110712e77c00035
)
)
一种使用分解来分离每个块和正则表达式来提取字段的方法:
$pattern = <<<'LOD'
~
# definition
(?<fieldname> (?:HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)$ ){0}
# pattern
'G(?<key>'g<fieldname>) 's+
(?<value>
.+
(?: 'R{1,2} (?!'g<fieldname>) .+ )*+
)
(?:'R{1,3}|'z)
~xm
LOD;
$subjects = explode("'r'n'r'n'r'n'r'n", $text);
$result = array();
foreach($subjects as $i => $subject) {
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
$result[$i][$match['key']]=$match['value'];
}
}
}
echo '<pre>', print_r($result, true);
图案细节:
图案分为两部分:
在定义部分,我编写了一个名为fieldname的子模式,以便稍后在主模式中使用它。此模式还检查每个字段名以$
锚定符结束的行。
主要模式:
'G # this forces the match to be contiguous to the
# precedent match or the start of the string (no gap)
(?<key> 'g<fieldname> ) # a capturing group named "key" for the fieldname
's+ # one or more white characters
(?<value> # open a capturing group named "value" for the
# field content
.+ # all characters except newlines 1 or more times
(?: # open an atomic group
'R'R?+ # one or two newlines to allow paragraphs (LP & TD)
(?!'g<fieldname>) # but not followed by a fieldname (only a check)
.+ #
)*+ # close the atomic group and repeat 0 or more times
) # close the capture group "value"
(?:'R{1,3}|'z) # between 1 or 3 newlines max. or the end of the
# string (necessary if i want contigous matches)
全局修改器:
- x(扩展模式):在模式中忽略空白和以#开头的内联注释
- m(多行模式):
^
匹配行的开始,$
匹配行的结束