PHP - Keyword matching in text strings - How to enhance the accuracy of returned keywords?

I have the following PHP code:

$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
    );
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    if (strpos(strtolower($text), strtolower($word)) !== false) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
    }
}
echo $keywords_list = implode(',', $all_keywords) ."<br>";
echo $keys_list = implode(',', $all_keys) . "<br>";

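The code echoes Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. However, the code is very simple and not accurate enough to return the right keywords. For example, because of the word Shakespeare in $text, it returns 'Shakes' => '7', but as you can see, "Shakes" is not a suitable keyword for "Shakespeare". Basically I want to get Art,Sport,World Cup,William Shakespeare and 1,2,4,8 instead of Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. So, can you help me develop better code that extracts keywords without this kind of problem? Thanks for your help.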

You may want to look at regular expressions to weed out partial matches:

// create regular expression by using alternation
// of all given words
$re = '/\b(?:' . join('|', array_map(function($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';
preg_match_all($re, $text, $matches);
foreach ($matches[0] as $keyword) {
    echo $keyword, " ", $words[$keyword], "\n";
}

The expression uses the \b assertion to match word boundaries, i.e. the word must stand on its own. Output:

World Cup 4
William Shakespeare 8
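
One caveat with the /i flag: a match that differs in case from the keyword (say "world cup") would not be found as a key in $words, because PHP array keys are case-sensitive. Below is a minimal sketch of one way around that, using a lowercased lookup table. The helper name find_keywords and the lookup table are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): returns keyword => id
// for every keyword that occurs in $text as a whole word, ignoring case.
function find_keywords(array $words, string $text): array
{
    if ($words === []) {
        return [];
    }

    // Lowercased lookup table so a match maps back to the original keyword.
    $lookup = [];
    foreach ($words as $keyword => $id) {
        $lookup[strtolower($keyword)] = [$keyword, $id];
    }

    // Alternation of all keywords, each escaped for use inside /.../.
    $alternatives = array_map(function ($keyword) {
        return preg_quote($keyword, '/');
    }, array_keys($words));
    $re = '/\b(?:' . implode('|', $alternatives) . ')\b/i';

    $found = [];
    preg_match_all($re, $text, $matches);
    foreach ($matches[0] as $match) {
        [$keyword, $id] = $lookup[strtolower($match)];
        $found[$keyword] = $id; // repeated matches collapse into one entry
    }
    return $found;
}
```

Called as find_keywords($words, $text), it returns an array keyed by the original keyword spelling, so the ids can be read with array_values() and the keywords with array_keys().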

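If you need exact matches, regular expressions are the better tool. I modified your original code to use them instead of strpos(), because strpos() produces partial matches, as it did in your code. There is still room for improvement, but hopefully you get the gist.

Let me know if you have any questions.

The code was modified to run as a shell script, so save it as demo.php and run chmod +x demo.php && ./demo.php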


#!/usr/bin/php
<?php
// array of regular expressions to match your words/phrases
$words = array(
    '/\b[Aa]rt\b/',
    '/\bI\b/',
    '/\bSport\b/',
    '/\bBig Animals\b/',
    '/\bWorld Cup\b/',
    '/\bDavid Fincher\b/',
    '/\bTorrentino\b/',
    '/\bShakes\b/',
    '/\b[sS]port[s]{0,1}\b/',
    '/\bWilliam Shakespeare\b/',
);
$text = "I like artists and art, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = array();  //changed formatting for clarity
$all_keys     = array();
foreach ($words as $regex) {
    $m = array();
    if (preg_match_all($regex, $text, $m, PREG_OFFSET_CAPTURE) >= 1) {
        // $m[0] holds every full-pattern match as array(matched text, offset)
        foreach ($m[0] as $mm) {
            $key  = $mm[1];         //key is the offset in $text where the match begins
            $word = $mm[0];         //the matched word/phrase
            $all_keywords[] = $word;
            $all_keys[]     = $key;
        }
    }
}
echo "\$text = \"$text\"\n";
echo $keywords_list = implode(',', $all_keywords) . "<br>\n";
echo $keys_list = implode(',', $all_keys) . "<br>\n";

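Replace

strpos(strtolower($text), strtolower($word)) !== false

with

preg_match('/\b'.$word.'\b/', $text)

(drop the !== false comparison as well, since preg_match() returns 0, not false, when there is no match).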

Or, since you don't seem to care about capitalization:

preg_match('/\b'.strtolower($word).'\b/', strtolower($text))

In that case, I suggest calling strtolower($text) ahead of time, e.g. before the foreach starts.
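
A sketch of how the question's loop might look with whole-word matching and the lowercasing hoisted out of the loop. The function name match_whole_words is hypothetical, and preg_quote() is an addition that is not in the answer above; it is included on the assumption that a keyword could contain regex metacharacters.

```php
<?php
// Hypothetical rewrite of the question's loop using \b instead of strpos().
function match_whole_words(array $words, string $text): array
{
    $lower_text = strtolower($text); // lowercase once, outside the loop
    $all_keywords = $all_keys = [];
    foreach ($words as $word => $key) {
        // preg_quote() escapes characters that are special in a regex.
        $pattern = '/\b' . preg_quote(strtolower($word), '/') . '\b/';
        if (preg_match($pattern, $lower_text) === 1) {
            $all_keywords[] = $word;
            $all_keys[] = $key;
        }
    }
    return [$all_keywords, $all_keys];
}

$words = array('Art' => '1', 'Sport' => '2', 'World Cup' => '4',
               'Shakes' => '7', 'William Shakespeare' => '8');
$text = "I like artists, and I like sports. Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
list($keywords, $keys) = match_whole_words($words, $text);
echo implode(',', $keywords), "\n"; // World Cup,William Shakespeare
echo implode(',', $keys), "\n";     // 4,8
```

Note that whole-word matching also stops Art and Sport from matching "artists" and "sports". If plurals should count, the pattern needs an optional suffix, for example s? after the keyword.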

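Off the top of my head, I think two additional steps would make this function more robust.

  • If we sort the $words array by strlen (descending, longer words at the top, shorter words at the bottom), there is a better chance of getting the desired "match".
  • Inside the loop, when a word matches, we can remove the matched word from the string to avoid further unnecessary matches. (For example, Shakes will always match wherever William Shakespeare matches.)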
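
The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the function name match_longest_first is hypothetical, matching stays substring-based as in the question's code, and matched text is replaced with a space so that neighbouring words do not run together.

```php
<?php
// Hypothetical sketch: try longer keywords first, and blank out each match
// so that shorter keywords cannot match inside it.
function match_longest_first(array $words, string $text): array
{
    // Step 1: sort keywords by length, longest first.
    uksort($words, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });

    $remaining = strtolower($text);
    $found = [];
    foreach ($words as $word => $key) {
        $needle = strtolower($word);
        if (strpos($remaining, $needle) !== false) {
            $found[$word] = $key;
            // Step 2: remove every occurrence of the matched text.
            $remaining = str_replace($needle, ' ', $remaining);
        }
    }
    return $found;
}
```

With the $words and $text from the question, this should return William Shakespeare, World Cup, Sport and Art (ids 8, 4, 2, 1) and skip Shakes, because "william shakespeare" has already been removed by the time Shakes is tried. Because it still matches substrings, Art continues to match "artists", which is what the question asked for.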

Note: the iOS app is great! But it's still not easy to write code on it (damn autocorrect!)