Search for matching words without false positives

Keywords: word search | Updated: 2023-09-27

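I found this question and have been working from it, but I need to take it further: "Check if a string contains a word from an array".

I am trying to write a script that checks whether a web page contains any known bad words. I have an array holding a list of bad words, and the script compares it against the string returned by file_get_contents.

This works at a basic level, but it returns false positives. For example, if I load a web page that contains the word "title", it reports that the word "tit" was found.

Would it be best to strip out all the HTML and punctuation, split the text on spaces, and put each word into an array? I am hoping there is a more efficient way.

Here is my code so far: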

$url = 'http://somewebsite.com/';
$content = strip_tags(file_get_contents($url));
// List of bad words separated by commas
$badwords = 'tit,butt,etc'; // this will eventually come from a db
$badwordList = explode(',', $badwords);
$foundWords = [];
foreach ($badwordList as $bad) {
    // strpos() returns 0 for a match at the very start of the string,
    // so compare against false instead of using empty()
    if (strpos($content, $bad) !== false) {
        $foundWords[] = $bad;
    }
}
print_r($foundWords);

Thanks in advance!

You can use a regular expression with preg_match_all():

$badwords = 'tit,butt,etc';
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));
if (preg_match_all($regex, $content, $matches)) {
    print_r($matches[1]);
}

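The second statement builds the regular expression that we use to match and capture the wanted words in the web page. First, it splits the $badwords string on commas and joins the pieces with |. The resulting string is then used in a pattern like this: /\b(tit|butt|etc)\b/. The \b (a word boundary) ensures that only whole words are matched, so "title" no longer triggers a match for "tit".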
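
To make the construction concrete, here is a small sketch that prints the generated pattern and runs it against a sample sentence. The $content string is made up for illustration and stands in for the stripped page text.

```php
<?php
// Hypothetical sample text standing in for the stripped page content.
$content = 'The title of this page mentions a tit and a butt, etc.';
$badwords = 'tit,butt,etc';

// Split on commas, join with "|", and wrap the group in word boundaries.
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));
echo $regex, PHP_EOL; // /\b(tit|butt|etc)\b/

// "title" is skipped because no word boundary follows its "tit" prefix.
preg_match_all($regex, $content, $matches);
print_r($matches[1]); // tit, butt, etc
```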

This pattern matches any one of these words, and the words found in the web page are stored in the array $matches[1].
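
Since the word list will eventually come from a database, it may be worth hardening the pattern a little. The sketch below is one possible variation and not part of the original answer: the helper name findBadWords is hypothetical, and it assumes you want case-insensitive matching and a de-duplicated result. It escapes each word with preg_quote() so that entries containing regex metacharacters cannot break the pattern.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): returns the distinct
 * bad words that occur as whole words in $content, ignoring case.
 */
function findBadWords(string $content, array $badwords): array
{
    // Trim entries and drop empty ones so the alternation never
    // contains an empty branch.
    $words = array_filter(array_map('trim', $badwords), 'strlen');
    if ($words === []) {
        return [];
    }

    // Escape regex metacharacters, including the "/" delimiter.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $words);

    // The "i" modifier makes the match case-insensitive.
    $regex = '/\b(' . implode('|', $quoted) . ')\b/i';

    if (!preg_match_all($regex, $content, $matches)) {
        return [];
    }

    // Normalise case and de-duplicate the hits.
    return array_values(array_unique(array_map('strtolower', $matches[1])));
}

$found = findBadWords('Title, TIT, butt and Butt.', explode(',', 'tit,butt,etc'));
print_r($found); // tit, butt
```

Note that \b only marks a transition between word characters (letters, digits, underscore) and anything else, so a list entry that starts or ends with punctuation would need different handling.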