How to extract weight and other metadata from a text block?

The text to be processed looks like this:

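GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X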

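From this text I have to extract the weight, the unit, and the pack size, such as "10X" or "6X". I tried to solve it with a regular expression, but it does not work in all cases.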

The code I tried is:

$weight_unit = explode(" ", $title_string);
 $units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
 for ($m = 0; $m < sizeof($weight_unit); $m++) {
   foreach ($units as $unit) {
     if (preg_match('/^[0-9A-Z.]*([0-9][A-Z]|[A-Z][0-9])[0-9A-Z]*$/',
          $weight_unit[$m]) && strpos($weight_unit[$m], $unit) !== FALSE) {
          $product["weight"] = preg_replace("/[A-Za-z]/", '', $weight_unit[$m]);
          $product["unit"] = $unit;
          break;
      }
   }
 }

You can try:

(\d+X\s?)?\d+\s?(LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L)(\d+X\s?)?

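if you only want these units to match. The regex:

  • (\d+X\s?)? - optionally matches one or more digits followed by X (10X, etc.) and an optional whitespace character,
  • \d+\s? - one or more digits and an optional whitespace character,
  • (LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L) - an alternation of your units,
  • (\d+X\s?)? - optionally matches one or more digits followed by X after the unit.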
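
The pattern above can be tried out with preg_match_all. What follows is only a sketch under a few assumptions: the named groups (pack_before, weight, unit, pack_after) are added here for readability and are not part of the original pattern, the sample string is a shortened version of the text in the question, and a missing pack size is reported as "1".

```php
<?php
// Shortened sample built from the items in the question.
$text = 'ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, '
      . 'FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X, CHEDDAR 200 GMS';

// Longer unit names come first so that "GMS" is tried before "GM" and "G".
$units = 'LITRE|LTRS|LTR|LIT|GMS|LBS|KG|GM|GR|ML|OZ|LB|G|L';

// Variant of the pattern above; the group names are illustrative additions.
$pattern = '/(?:(?<pack_before>\d+)X\s?)?'
         . '(?<weight>\d+)\s?'
         . '(?<unit>' . $units . ')'
         . '(?:(?<pack_after>\d+)X\s?)?/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$results = [];
foreach ($matches as $m) {
    // An unmatched group is either an empty string or absent from $m.
    $before = $m['pack_before'] ?? '';
    $after  = $m['pack_after'] ?? '';
    $pack   = $before !== '' ? $before : ($after !== '' ? $after : '1');
    $results[] = ['weight' => $m['weight'], 'unit' => $m['unit'], 'pack' => $pack];
}

print_r($results);
```

Note that the pattern does not require a word boundary after the unit, so a title such as "2 LARGE" would be reported as 2 L. Whether that matters depends on your data.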

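It may not be worth trying to do all of this with a single regular expression. You might get it to work, but the next person who has to maintain it will have a hard time unless she is used to whistling into a modem. :-) Let's try a series of nested loops instead.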

$txt = "GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X";
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
/* break up your string at the commas, so you handle each item by itself */
$items = preg_split('/\s*,\s*/', $txt);
/* work through the items one by one */
foreach ($items as $item) {
    $amtnum = "1";
    $amtunit = "";
    $packnum = "1";
    /* break up the item description into tokens, where 
     * each number string and letter string gets its own token.
     * deal with (123) parenthesized number strings as well.
     *   e.g.   "FRUITS JUICE " "10" "X" "200" "ML"
     *   and    "COLA " "330" "ML" "(016)" " PHOENIX ORGANIC"
     */
    $toks = preg_split('/(\(\d+\)|\d+|[^\d\(\)]+)/', $item, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    /* work backward through array of tokens, using array_pop.
     * compare against null so that a "0" token does not end the loop early */
    while (($tok = array_pop($toks)) !== null) {
        /* letter tokens keep their surrounding spaces, e.g. " GMS" or "X " */
        $tok = strtoupper(trim($tok));
        /* is the present token in your array of units? */
        if (in_array($tok, $units)) {
            /* yes. grab next token as the number of units */
            $amtunit = $tok;
            $amtnum = array_pop($toks);
        }
        /* is this an X (for a 16X pack or some such thing)? */
        if ($tok == 'X') {
            /* yes, grab next token as the number of items in the pack */
            $packnum = array_pop($toks);
        }
    }
    /* do what you will with the result, once per item */
    echo $item, ": ", $amtnum, " ", $amtunit, " x ", $packnum, "\n";
}

This line is the key to solving your problem.

    $toks = preg_split(
            '/(\(\d+\)|\d+|[^\d\(\)]+)/', 
            $item, -1, 
            PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

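preg_split splits a string into an array. The PREG_SPLIT_DELIM_CAPTURE flag means that the text matched by the parenthesized part of the regular expression is included in the result array. PREG_SPLIT_NO_EMPTY means that empty strings are left out of the result array.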
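
To see what the two flags contribute, here is a small sketch that runs the same pattern on one item from the test data, once with both flags and once with PREG_SPLIT_NO_EMPTY alone. The variable names are only illustrative.

```php
<?php
$item = 'COLA 330ML(016) PHOENIX ORGANIC';
$pattern = '/(\(\d+\)|\d+|[^\d\(\)]+)/';

// Both flags: the captured delimiters themselves become the tokens.
$withCapture = preg_split(
    $pattern, $item, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Without PREG_SPLIT_DELIM_CAPTURE the delimiters are thrown away.
// The pattern matches the whole string piece by piece, so only empty
// strings are left between the matches, and NO_EMPTY drops those too.
$withoutCapture = preg_split($pattern, $item, -1, PREG_SPLIT_NO_EMPTY);

var_export($withCapture);
echo "\n";
var_export($withoutCapture);
echo "\n";
```

In other words, this use of preg_split is really a tokenizer: the "delimiters" are the data, and the text between them is always empty.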

Let's look at the regular expression itself. I'll add whitespace to make it easier to read.

(  \(\d+\)  |  \d+  |  [^\d\(\)]+  )  

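It begins and ends with parentheses (). This goes along with PREG_SPLIT_DELIM_CAPTURE.

Then it contains three alternative match expressions, separated by |.

The first is an opening parenthesis, a number, and a closing parenthesis. It matches the string (016) in your test data set.

The second is a plain number. It matches things like "300".

The third is a string of letters, spaces, and anything else except digits and parentheses. It matches, for example, "GMS" and "FRUITS JUICE".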
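
As a rough illustration of which alternative picks up which token, the sketch below rewrites the three alternatives as named groups and labels every match in one item from the test data. The group names paren, number and text are additions for this example only.

```php
<?php
$item = 'FRUITS JUICE 10X 200ML';

// The same three alternatives, each wrapped in a named group.
$pattern = '/(?<paren>\(\d+\))|(?<number>\d+)|(?<text>[^\d\(\)]+)/';

preg_match_all($pattern, $item, $matches, PREG_SET_ORDER);

$labelled = [];
foreach ($matches as $m) {
    foreach (['paren', 'number', 'text'] as $kind) {
        // An unmatched group is either absent or an empty string.
        if (isset($m[$kind]) && $m[$kind] !== '') {
            $labelled[] = [$kind, $m[$kind]];
            break;
        }
    }
}

foreach ($labelled as [$kind, $token]) {
    printf("%-6s '%s'\n", $kind, $token);
}
```

The text tokens keep their surrounding whitespace ('FRUITS JUICE ' and 'X ' here), so they need to be trimmed before they are compared with the unit list or with 'X'.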

This is probably a fairly robust way to do the parsing job with regular expressions.