
Filtering words from text with exploits

I have a filter that blocks bad words such as 'ass' and 'fuck'. Now I am trying to handle exploits such as "f*ck" and "sh/t".


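One thing I could do is use the Levenshtein distance: any word with a Levenshtein distance of 1 from a bad word would be blocked. But this approach is also prone to false positives.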

if (!ctype_alpha($text) && levenshtein('shit', $text) === 1)
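To make the false-positive concern concrete, here is a minimal sketch that wraps the condition above in a function. The name `looksObfuscated` is hypothetical and used only for illustration; `ctype_alpha` and `levenshtein` are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of the original filter): flags $word when it
// contains at least one non-letter and is exactly one edit away from $badword.
function looksObfuscated(string $word, string $badword): bool
{
    return !ctype_alpha($word) && levenshtein($badword, $word) === 1;
}

var_dump(looksObfuscated('sh/t', 'shit')); // bool(true): one substituted character
var_dump(looksObfuscated('shot', 'shit')); // bool(false): letters only, so it is skipped
var_dump(looksObfuscated('as,', 'ass'));   // bool(true): false positive, "as" followed by a comma
```

The last call shows the problem: ordinary punctuation attached to an innocent word is enough to trigger the check, so the input would at least need trailing punctuation stripped before comparing.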

I am looking for an approach that uses regular expressions. Maybe I could combine the Levenshtein distance with a regex, but I can't figure out how.


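This may give you a general idea of how to solve the problem, although you will need more logic if you want it to be smarter. For example, the filter below does not catch "f**ck", "fck", ".fuck" (with a leading dot) or "f ck", while it would censor "++++" and replace it with "beep". But it does catch "f*ck", "f**k", "f**king" and "sh1t", so it could do worse. :)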


$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);
// Loop through all words.
foreach ($words as $word) {
  $naughty = false;
  // Match each bad word against each word.
  foreach ($badwords as $badword) {
    // If the word is shorter than the bad word, it's okay.
    // It may be longer. I've done this mainly because, in the example given,
    // 'f*ck,' contains the trailing comma. This could easily be solved by
    // splitting the string a bit smarter. The added benefit is that it also
    // matches derivatives, like 'f*cking' or 'f*cker', although that could also
    // result in more false positives.
    if (strlen($word) >= strlen($badword)) {
      $wordOk = false;
      // Check each character in the string.
      for ($i = 0; $i < strlen($badword); $i++) {
        // If the characters don't match, and the character in the word is an
        // actual letter, this is not a bad word.
        if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i])) {
          $wordOk = true;
          break;
        }
      }
      // If the word is not okay, break the loop.
      if (!$wordOk) {
        $naughty = true;
        break;
      }
    }
  }
  // Echo the censored word.
  echo $naughty ? 'beep ' : ($word . ' ');
}
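Since the question asks for a regex-based approach, the same per-character idea can be sketched with `preg_replace`. This is only one possible translation under stated assumptions, not a definitive implementation: the function names `buildPattern` and `censor` are hypothetical, each letter of a bad word may be replaced by exactly one non-letter, non-whitespace character, and matching is case-sensitive like the loop above.

```php
<?php
// Hypothetical helper: builds a regex for one bad word in which every letter
// may also be replaced by a single non-letter, non-whitespace character.
function buildPattern(string $badword): string
{
    $parts = array();
    foreach (str_split($badword) as $char) {
        $parts[] = '(?:' . preg_quote($char, '/') . '|[^a-zA-Z\s])';
    }
    // (?<!\S) anchors the match at the start of a whitespace-separated token;
    // \S* swallows the rest of the token (punctuation, 'ing', 'er', ...).
    return '/(?<!\S)' . implode('', $parts) . '\S*/';
}

// Hypothetical helper: replaces every matching token with 'beep'.
function censor(string $text, array $badwords): string
{
    foreach ($badwords as $badword) {
        $text = preg_replace(buildPattern($badword), 'beep', $text);
    }
    return $text;
}

echo censor(
    'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)',
    array('shit', 'fuck')
);
// Prints: Man, I shot this beep beep beep beep beep fukk. I love this. ;)
```

Like the loop, this sketch swallows trailing punctuation along with the word and shares the same blind spots ("fck", ".fuck", "f ck"), so it is a starting point rather than a complete filter.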