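A sample of the text that needs to be processed is below:
GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X
From this text I have to extract the weight, the unit, and the pack size, such as "10X" or "6X". I tried to solve it with a regular expression, but it does not work in all cases.
The code I tried is: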
$weight_unit = explode(" ", $title_string);
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
for ($m = 0; $m < sizeof($weight_unit); $m++) {
    foreach ($units as $unit) {
        if (preg_match('/^[0-9A-Z.]*([0-9][A-Z]|[A-Z][0-9])[0-9A-Z]*$/',
                $weight_unit[$m]) && strpos($weight_unit[$m], $unit) !== FALSE) {
            $product["weight"] = preg_replace("/[A-Za-z]/", '', $weight_unit[$m]);
            $product["unit"] = $unit;
            break;
        }
    }
}
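You could try:
(\d+X\s?)?\d+\s?(LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L)(\d+X\s?)?
if you only want these units to match. The regex breaks down as follows:
- (\d+X\s?)? - optionally, one or more digits followed by X (10X etc.) and at most one whitespace character,
- \d+\s? - one or more digits followed by at most one whitespace character,
- (LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L) - an alternation of your units,
- (\d+X\s?)? - optionally, one or more digits followed by X, placed after the unit.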
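To see what that pattern picks up, here is a minimal sketch that runs it with preg_match_all over a shortened version of the sample text. The `/` delimiters, the shortened `$txt`, and the variable names are my own assumptions for illustration; the pattern itself is unchanged.

```php
<?php
// Sketch only: the pattern from above wrapped in "/" delimiters.
$pattern = '/(\d+X\s?)?\d+\s?(LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L)(\d+X\s?)?/';

// Shortened, hypothetical sample covering the cases from the question.
$txt = "CHEDDAR 200 GMS, ENGLISH 130G, JUICE 10X200ML, JUICE 10X 200ML, JUICE 500ML10X";

// $matches[0] holds every full match, in order of appearance.
preg_match_all($pattern, $txt, $matches);
$found = $matches[0];

print_r($found);
// Full matches: "200 GMS", "130G", "10X200ML", "10X 200ML", "500ML10X"
```

The pack size ends up in capture group 1 or group 3 depending on whether it comes before or after the weight, so you would still have to check both groups. Also note that the pattern is not anchored to word boundaries, so a short unit such as G or L can match at the start of a longer word that directly follows a number.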
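It might not be worth it to do all of this with a single regex. You could probably get it to work, but the next person to work on it would have a hard time unless she's used to whistling into modems. :-) Let's try a series of nested loops instead.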
$txt = "GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X";
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
/* break up your string at the commas, so you handle each item by itself */
$items = preg_split("/\s*,\s*/", $txt);
/* work through the items one by one */
foreach ($items as $item) {
    $amtnum = 1;
    $amtunit = "";
    $packnum = "1";
    /* break up the item description into tokens, where
     * each number string and letter string gets its own token.
     * deal with (123) parenthesized number strings as well.
     * e.g. "FRUITS JUICE" "10" "X" "200" "ML"
     * and "COLA" "330" "ML" "(016)" "PHOENIX ORGANIC"
     */
    $toks = preg_split("/(\(\d+\)|\d+|[^\d\(\)]+)/", $item, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    /* work backward through array of tokens, using array_pop.
     * compare against null so that a "0" token does not end the loop early */
    while (($tok = array_pop($toks)) !== null) {
        /* tokens keep their surrounding spaces (" GMS", "X "), so trim first */
        $tok = trim($tok);
        /* is the present token in your array of units? */
        if (in_array(strtoupper($tok), $units)) {
            /* yes. grab next token as the number of units */
            $amtunit = $tok;
            $amtnum = array_pop($toks);
        }
        /* is this an X (for a 16X pack or some such thing)? */
        if ($tok == 'X') {
            /* yes, grab next token as the number of items in the pack */
            $packnum = array_pop($toks);
        }
    }
    /* all tokens handled: do what you will with the result for this item */
    echo $amtnum, " ", $amtunit, " x", $packnum, "\n";
}
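This line is the key to solving your problem.
$toks = preg_split(
    "/(\(\d+\)|\d+|[^\d\(\)]+)/",
    $item, -1,
    PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
preg_split splits a string into an array. The PREG_SPLIT_DELIM_CAPTURE flag means the text matched by the parenthesized part of the regex is included in the result array. The PREG_SPLIT_NO_EMPTY flag means empty strings are left out of the result array.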
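If the two flags are unfamiliar, this small sketch shows what each one changes. It uses a deliberately simpler pattern and a made-up input string ("12ML"), both chosen only for illustration.

```php
<?php
// Sketch only: a simplified pattern, not the one used for the product titles.
$pattern = '/(\d+)/';
$subject = "12ML";

// No flags: the digits act as a separator and are thrown away,
// leaving an empty string in front of them.
$plain = preg_split($pattern, $subject);
// $plain is ["", "ML"]

// PREG_SPLIT_DELIM_CAPTURE: the captured separator is kept in the result.
$captured = preg_split($pattern, $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
// $captured is ["", "12", "ML"]

// Adding PREG_SPLIT_NO_EMPTY drops the empty leading piece.
$both = preg_split($pattern, $subject, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// $both is ["12", "ML"]

var_export([$plain, $captured, $both]);
```

With both flags set, a pattern whose alternatives cover every character of the input effectively turns preg_split into a tokenizer.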
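Let's look at the regex itself. I'll add spaces to make it easier to read.
( \(\d+\) | \d+ | [^\d\(\)]+ )
It begins and ends with parentheses (). That goes with PREG_SPLIT_DELIM_CAPTURE.
It then contains three alternative match expressions, separated by |.
The first is an opening parenthesis, a number, and a closing parenthesis. It matches the string (016) in your test data set.
The second is a plain number. It matches things like "300".
The third is a string of letters, spaces, and anything else except digits and parentheses. It matches, for example, "GMS" and "FRUITS JUICE".
This is probably a reasonably robust way to do your parsing job using regular expressions.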
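As a quick check of the three alternatives, here is a sketch that tokenizes one item from the sample data. The helper name tokenize_item is hypothetical and only wraps the preg_split call discussed above.

```php
<?php
// Hypothetical helper: wraps the preg_split call so it can be tried on one item.
function tokenize_item(string $item): array
{
    return preg_split(
        '/(\(\d+\)|\d+|[^\d\(\)]+)/',
        $item,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$toks = tokenize_item("COLA 330ML(016) PHOENIX ORGANIC");
var_export($toks);
// Tokens: "COLA ", "330", "ML", "(016)", " PHOENIX ORGANIC"
```

Notice that the text tokens keep their leading and trailing spaces ("COLA " and " PHOENIX ORGANIC"), so anything that compares a token against a unit name or against "X" would need to trim() it first.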