How to identify similar strings via keywords

Keyword: any word longer than 3 characters

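I want to compare the keywords of two strings under the following conditions:

  1. Word order does not matter (Example 1 covers this case)
  2. Words shorter than 3 characters are not counted (Example 2 covers this case)
  3. The shorter sentence (by character count) goes in str1 (Example 3 covers this case)
  4. I only want the words in str1 that differ from str2 (Example 4 covers this case)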

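In fact, I have a bot that crawls two news websites every day and copies the news into my database. I then need an algorithm that compares news titles and identifies duplicate news. (As you know, the same story gets a different title on each news website, but the titles of the same story usually contain the same keywords.)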

Example 1: word order does not matter

str1= 'hello petter'
str2= 'petter hello'
result: 0 

Example 2: words shorter than 3 characters are not counted

str1= 'hello !!'
str2= 'petter hello'
result: 0 // '!!' is shorter than 3 characters, so str1 is treated as 'hello'; result: 0

str1= 'hello petter how are u?'
str2= 'petter hello how are you'
result: 0 // str1 is treated as 'hello petter how are'

Example 3: the variables must be swapped

str1= 'hello petter how are you ?'
str2= 'petter hello how are you?'
// Then, so that str1 is the shorter string
str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'
result: 1 // 1 is for 'you?' (in str1)

Example 4: differing words in str2 do not matter

str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'
result: 1 // str2 is 'petter hello how are you', then 1 is for: 'you?' (in str1)

Note: 'you' (in str2) does not matter to me, because it does not match any word in str1.

Another example (for more detail):

str1= 'petter hello how are you pal?'
str2= 'petter hello how are... !!'
// First, swap str1 and str2
str1= 'petter hello how are... !!'
str2= 'petter hello how are you pal?'
// Then remove '!!' (in str1)
str1= 'petter hello how are...'
str2= 'petter hello how are you pal?'
result: 1 // 1 for 'are...' (in str1); 'are', 'you', 'pal?' (in str2) do not matter
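
The four conditions and the examples above could be sketched roughly as follows. This is only an illustration, not code from the question: the function name `count_unmatched_keywords` and the `$minLen` parameter are hypothetical, words are assumed to be separated by whitespace, matching is exact and case-sensitive, and lengths are measured in bytes with `strlen`.

```php
<?php
// Hypothetical helper: counts the keywords of the shorter string
// that do not appear in the longer string.
function count_unmatched_keywords(string $a, string $b, int $minLen = 3): int
{
    // Condition 3: the shorter string (by byte length) plays the role of str1.
    if (strlen($a) > strlen($b)) {
        [$a, $b] = [$b, $a];
    }

    // Split on runs of whitespace and drop empty pieces.
    $words1 = preg_split('/\s+/', $a, -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split('/\s+/', $b, -1, PREG_SPLIT_NO_EMPTY);

    $unmatched = 0;
    foreach ($words1 as $word) {
        // Condition 2: words shorter than $minLen characters are ignored.
        if (strlen($word) < $minLen) {
            continue;
        }
        // Conditions 1 and 4: order is irrelevant, and only words of
        // str1 that are missing from str2 are counted.
        if (!in_array($word, $words2, true)) {
            $unmatched++;
        }
    }
    return $unmatched;
}

echo count_unmatched_keywords('hello petter', 'petter hello'), "\n"; // 0
echo count_unmatched_keywords('petter hello how are you pal?', 'petter hello how are... !!'), "\n"; // 1
```

Because punctuation is not stripped, 'you?' and 'you' count as different words here, which matches Examples 3 and 4.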

Finally, I need a function that identifies duplicate news from the result and the number of keywords (all words longer than 3 characters).

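$keywords_numb = 7;
$result = 2;
function identify_duplicate($keywords_numb, $result){
    // Compare one third of the keyword count with the number of unmatched words
    if($keywords_numb / 3 >= $result){
        $Specified = 'this is a new news';
    }
    else $Specified = 'this is a duplicate news';
    return $Specified;
}
// $Specified only exists inside the function, so print the return value instead
echo identify_duplicate($keywords_numb, $result);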

Output:

this is a new news

Does anyone know how I can write this program? Regards

You don't need regex. You can use the following function and pass the strings in any order:

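function identify_duplicate($var1, $var2){
    // Put the shorter string in $str1, as condition 3 of the question requires
    if(strlen($var1) <= strlen($var2)){
        $str1 = $var1;
        $str2 = $var2;
    }
    else{
        $str1 = $var2;
        $str2 = $var1;
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);
    // Start from the word count of $str1 and subtract every word that
    // also appears in $str2 or is at most 3 characters long
    $return = sizeof($str1);
    foreach($str1 as $val){
        if(in_array($val, $str2) || strlen($val) <= 3){
            $return = $return - 1;
        }
    }
    return $return;
}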

With @karthik manchala's help, I got it working...

$str1 = 'this news is about a player named Ronaldo';
$str2 = 'The player who called Ronaldo';

function identify_duplicate($str1, $str2){
    // Make sure $str1 holds the shorter string
    if(strlen($str1) > strlen($str2)){
        list($str1, $str2) = array($str2, $str1); // swap two variables
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);
    $words_numb = sizeof($str1);
    $result = $words_numb;
    // Subtract every word that also appears in $str2 or is at most 3 characters long
    foreach($str1 as $val){
        if(in_array($val, $str2) || strlen($val) <= 3){
            $result--;
        }
    }
    // Few unmatched words relative to the word count means a duplicate
    if($words_numb / 3 >= $result){
        $Specified = 'this is a duplicate news';
    }
    else $Specified = 'this is a new news';
    return $Specified;
}

$out = identify_duplicate($str1, $str2);
echo $out;

Output:

this is a duplicate news
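
For use in the crawler, it may be more convenient to return a boolean rather than a message string. The following is a minimal sketch under that assumption. The name `is_duplicate_title` is hypothetical and not from the posts above. It keeps the same rule (a word is ignored when it is at most 3 characters long, and a title is a duplicate when the word count of the shorter title divided by 3 is at least the number of unmatched words) and stays case-sensitive, so 'The' and 'the' would still count as different words.

```php
<?php
// Hypothetical helper, not from the original posts.
// Returns true when the two titles are considered duplicates.
function is_duplicate_title(string $a, string $b): bool
{
    // The shorter title (by byte length) is the one whose words are checked.
    if (strlen($a) > strlen($b)) {
        [$a, $b] = [$b, $a];
    }
    $words1 = explode(' ', $a);
    $words2 = explode(' ', $b);

    // Count words longer than 3 characters that are missing from the other title.
    $unmatched = 0;
    foreach ($words1 as $word) {
        if (strlen($word) > 3 && !in_array($word, $words2, true)) {
            $unmatched++;
        }
    }
    return count($words1) / 3 >= $unmatched;
}

var_dump(is_duplicate_title(
    'this news is about a player named Ronaldo',
    'The player who called Ronaldo'
)); // bool(true)
```

With these two titles only 'called' is unmatched, so the check is 5 / 3 >= 1 and the titles are reported as duplicates.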