I have a piece of PHP code as follows:
$words = array(
'Art' => '1',
'Sport' => '2',
'Big Animals' => '3',
'World Cup' => '4',
'David Fincher' => '5',
'Torrentino' => '6',
'Shakes' => '7',
'William Shakespeare' => '8'
);
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
if (strpos(strtolower($text), strtolower($word)) !== false) {
$all_keywords[] = $word;
$all_keys[] = $key;
}
}
echo $keywords_list = implode(',', $all_keywords) ."<br>";
echo $keys_list = implode(',', $all_keys) . "<br>";
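The code echoes Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. However, the code is very simple and not accurate enough to return the correct keywords. For example, because of the word Shakespeare in $text, the code returns 'Shakes' => '7', but as you can see, "Shakes" is not a suitable keyword for "Shakespeare". Basically I want to get Art,Sport,World Cup,William Shakespeare and 1,2,4,8 instead of Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. So, can you help me develop better code that extracts keywords without this kind of problem? Thanks for your help.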
You may want to look at regular expressions to rule out partial matches:
// create regular expression by using alternation
// of all given words
$re = '/\b(?:' . join('|', array_map(function($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';
preg_match_all($re, $text, $matches);
foreach ($matches[0] as $keyword) {
    echo $keyword, " ", $words[$keyword], "\n";
}
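The expression uses the \b assertion to match word boundaries, i.e. each keyword has to stand on its own as a whole word. Output: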
World Cup 4
William Shakespeare 8
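One thing to watch with the /i flag: the matched text can differ in letter case from the array key, in which case $words[$keyword] would not find an entry. Below is a minimal sketch of one way around that, assuming a lowercased lookup table ($lookup and $found are names introduced here for illustration, and the sample $text is made up to show the case difference):

```php
<?php
$words = array(
    'Art' => '1',
    'World Cup' => '4',
    'Shakes' => '7',
    'William Shakespeare' => '8',
);
$text = "Brazil world cup matches are good. WILLIAM SHAKESPEARE is famous.";

// Lowercased lookup table so a match in any letter case maps back to its entry.
$lookup = array();
foreach ($words as $word => $key) {
    $lookup[strtolower($word)] = array($word, $key);
}

// Same alternation as above: every keyword quoted and wrapped in \b...\b.
$re = '/\b(?:' . implode('|', array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';

$found = array();
preg_match_all($re, $text, $matches);
foreach ($matches[0] as $match) {
    list($word, $key) = $lookup[strtolower($match)];
    $found[$word] = $key; // keyed by keyword, so repeated matches collapse
}

foreach ($found as $word => $key) {
    echo $word, " ", $key, "\n";
}
```

With this sample text the loop prints World Cup 4 and William Shakespeare 8, even though neither appears in the text with the same capitalization as the array key.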
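If you need exact matches, regular expressions are the better tool. I modified your original code to use them instead of strpos(), because strpos() produces partial matches like the ones you are seeing. There is still room for improvement, but hopefully you get the basic idea.
Let me know if you have any questions.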
The code has been modified to run as a shell script, so save it as demo.php and run chmod +x demo.php && ./demo.php
#!/usr/bin/php
<?php
//array of regular expressions to match your words/phrases
$words = array(
'/\b[Aa]rt\b/',
'/\bI\b/',
'/\bSport\b/',
'/\bBig Animals\b/' ,
'/\bWorld Cup\b/' ,
'/\bDavid Fincher\b/',
'/\bTorrentino\b/' ,
'/\bShakes\b/' ,
'/\b[sS]port[s]{0,1}\b/' ,
'/\bWilliam Shakespeare\b/',
);
$text = "I like artists and art, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = array(); //changed formatting for clarity
$all_keys = array();
foreach ($words as $regex) {
    $m = array();
    // PREG_OFFSET_CAPTURE turns each match into array(matched text, offset in $text)
    if (preg_match_all($regex, $text, $m, PREG_OFFSET_CAPTURE) >= 1) {
        // the patterns have no capture groups, so $m[0] holds every match
        foreach ($m[0] as $mm) {
            $key = $mm[1]; //key is the offset in $text where the match begins
            $word = $mm[0]; //the matched word/phrase
            $all_keywords[] = $word;
            $all_keys[] = $key;
        }
    }
}
echo "\$text = \"$text\"\n";
echo $keywords_list = implode(',', $all_keywords) ."<br>\n";
echo $keys_list = implode(',', $all_keys) . "<br>\n";
Replace
strpos(strtolower($text), strtolower($word)) !== false
with
preg_match('/\b'.$word.'\b/',$text)
Or, since you don't seem to care about capitalization:
preg_match('/\b'.strtolower($word).'\b/', strtolower($text))
In that case I suggest you do the strtolower($text) ahead of time, e.g. before the foreach starts.
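Put together, the replacement could look like the sketch below. It reuses the question's $words and $text; $lower_text is a name introduced here, and the preg_quote() call is an addition of mine so that a keyword containing regex metacharacters cannot break the pattern.

```php
<?php
$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
);
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";

// Lowercase the text once, before the loop, instead of on every iteration.
$lower_text = strtolower($text);

$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    // \b on both sides: the keyword only counts as a whole word or phrase.
    $pattern = '/\b' . preg_quote(strtolower($word), '/') . '\b/';
    if (preg_match($pattern, $lower_text) === 1) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
    }
}

echo implode(',', $all_keywords), "\n";
echo implode(',', $all_keys), "\n";
```

For the question's text this prints World Cup,William Shakespeare and 4,8. "Shakes" is gone, but so are "Art" and "Sport", because "artists" and "sports" are not whole-word matches either. If those should count, the keyword patterns have to allow for the extra letters explicitly.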
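Off the top of my head, I think two additional steps would make this function more robust.
- If we sort the $words array by strlen (descending, longer words at the top, shorter words at the bottom), there is a better chance of getting the desired "match".
- In the for loop, when a word "matches" or strcmp returns true, we can remove the matched word from the string to avoid further unnecessary matches. (For example, shake would always match wherever William Shakespeare matches.)
Note: the iOS app is awesome! But it is still not easy to write code on it (damn autocorrect!)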
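The two steps listed above can be sketched roughly as follows. This is only one possible reading of the suggestion: it keeps plain substring matching via stripos(), sorts with uksort(), and blanks out matched text with str_ireplace(). $remaining is a name introduced here, and the spaceship operator requires PHP 7 or later.

```php
<?php
$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
);
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";

// Step 1: sort keywords by length, longest first, keeping their values.
uksort($words, function ($a, $b) {
    return strlen($b) <=> strlen($a);
});

$remaining = $text; // working copy that matched keywords are removed from
$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    if (stripos($remaining, $word) !== false) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
        // Step 2: replace the matched text with a space so that shorter
        // keywords cannot match inside it later.
        $remaining = str_ireplace($word, ' ', $remaining);
    }
}

echo implode(',', $all_keywords), "\n";
echo implode(',', $all_keys), "\n";
```

For the question's text this prints William Shakespeare,World Cup,Sport,Art and 8,4,2,1: "Shakes" no longer matches because "William Shakespeare" was consumed first. Two caveats: the results come out in length order rather than the original array order, and because the matching is still substring-based, a short keyword such as "Art" would also match inside an unrelated word like "party".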