Fastest way to search for whole words in 20mb flat file database (PHP)

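I have a 20 MB flat file database with about 500k lines. Only the characters [a-z0-9-] are allowed, each line averages 7 words, and there are no empty or duplicate lines:

Flat file database: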

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

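I am searching for whole words only and extracting the first 10k results from this database.

So far this code works fine if the 10k matches are found within the first 20k lines of the database, but if the word is rare the script has to search all 500k lines, which is 10 times slower.

Setup: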

$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;

Search:

foreach($cats as $cat) {
    if(preg_match("/\b$search\b/", $cat)) {
        $cats_found[] = $cat;
        if(isset($cats_found[$limit])) break;
    }
}

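My PHP skills and knowledge are limited, and I can't use SQL and don't know how to, so this is the best I could do. I need some advice:

Is this the right code? Is there a problem with foreach and preg_match?
Should I split the big file into smaller ones? If so, what size?
Finally, would SQL be faster? How much faster? (future option)

Thanks for reading this, and sorry for the bad English; it is my third language.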

If most lines do not contain the searched word, you can run preg_match() less often, like this:

foreach ($lines as $line) {
    // fast prefilter...
    if (strpos($line, $word) === false) {
        continue;
    }
    // ... then proper search if the line passed the prefilter
    if (preg_match("/\b{$word}\b/", $line)) {
        // found
    }
}

It needs to be benchmarked under real-world conditions, though.
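One way to run such a benchmark, as a rough sketch. The helper names `searchRegexOnly()` and `searchWithPrefilter()` and the synthetic data are made up for illustration, and `preg_quote()` is added on the assumption that the search term should be treated as literal text. Timings will vary with the PHP version and the real data, so swap in the actual file before drawing conclusions.

```php
<?php
// Hypothetical helpers for comparing the two approaches; not from the original posts.

/** Whole-word search that runs preg_match() on every line. */
function searchRegexOnly(array $lines, string $word, int $limit): array
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    foreach ($lines as $line) {
        if (preg_match($pattern, $line)) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break;
            }
        }
    }
    return $found;
}

/** Same search, but strpos() rejects lines without the substring first. */
function searchWithPrefilter(array $lines, string $word, int $limit): array
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    foreach ($lines as $line) {
        if (strpos($line, $word) === false) {
            continue; // cheap rejection, no regex needed
        }
        if (preg_match($pattern, $line)) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break;
            }
        }
    }
    return $found;
}

// Synthetic stand-in for cats.txt: 200k lines in which the word "end" is rare.
$lines = [];
for ($i = 0; $i < 200000; $i++) {
    if ($i % 5000 === 0) {
        $lines[] = "for-linebreak-add-2-spaces-at-end-$i"; // whole-word match
    } elseif ($i % 5000 === 1) {
        $lines[] = "append-only-$i"; // substring only, must be rejected by the regex
    } else {
        $lines[] = "put-returns-between-paragraphs-$i";
    }
}

foreach (['searchRegexOnly', 'searchWithPrefilter'] as $fn) {
    $start = microtime(true);
    $result = $fn($lines, 'end', 10000);
    $elapsedMs = (microtime(true) - $start) * 1000;
    printf("%-20s %d matches in %.1f ms\n", $fn, count($result), $elapsedMs);
}
```

Both functions should return the same lines; only the elapsed time should differ.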

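This works by reading the file line by line, since you may be running out of memory:

(You may need to adjust memory_limit and max_execution_time in php.ini, or run it via the CLI.)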

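$rFile = fopen( 'inputfile.txt', 'r' );
if( $rFile === false )
{
    exit( "Cannot open inputfile.txt\n" );
}
$iLineNumber = 0;
$sSearch = '123';
$iLimit  = 5000; // caps the number of lines read, not the number of matches
$aCats   = [];   // initialised so var_dump() works even with no matches
// fgets() returns false at end of file, so no separate feof() check is needed
while( ( $sLine = fgets( $rFile ) ) !== false )
{
    if( $iLineNumber > $iLimit )
    {
        break;
    }
    if( preg_match( "/\b$sSearch\b/", $sLine, $aMatches ) )
    {
        // $aMatches[0] holds the text matched by the whole pattern
        $aCats[] = $aMatches[ 0 ];
    }
    ++$iLineNumber;
}
fclose( $rFile );
var_dump( $aCats );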
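The question asks for whole lines and for a cap on the number of matches rather than on the number of lines read. A variant along those lines might look like the sketch below. The function name `findLinesWithWord()` is hypothetical, `preg_quote()` is an added assumption that the term is literal text, and the `strpos()` prefilter is borrowed from the other answer.

```php
<?php
/**
 * Hypothetical helper: returns up to $limit lines of $path that contain
 * $word as a whole word, reading the file one line at a time.
 */
function findLinesWithWord(string $path, string $word, int $limit): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    try {
        // Stop as soon as enough matches are collected or the file ends.
        while (count($found) < $limit && ($line = fgets($handle)) !== false) {
            $line = rtrim($line, "\r\n");
            if (strpos($line, $word) !== false && preg_match($pattern, $line)) {
                $found[] = $line;
            }
        }
    } finally {
        fclose($handle);
    }
    return $found;
}

// Demo with a small temporary file standing in for cats.txt.
$path = tempnam(sys_get_temp_dir(), 'cats');
file_put_contents($path, implode("\n", [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'append-text-to-file',
    'end-of-line',
]) . "\n");

print_r(findLinesWithWord($path, 'end', 10000));
unlink($path);
```

Only one line is held in memory at a time, so memory use stays flat regardless of file size, at the cost of re-reading the file on every search.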

My suggestion would be to reformat the file for SQL import and use a database. Flat file searches are significantly slower.

Infile:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
123
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
123
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

Output:

array(2) {
  [0]=>
  string(3) "123"
  [1]=>
  string(3) "123"
}

preg_match() wraps the match in an extra array, so we have to use [0].
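As for the suggestion above to move the data into a database, here is a minimal sketch of what that could look like. It assumes the `pdo_sqlite` extension is available and uses an in-memory SQLite database. The table names `cats` and `words`, the function `findCats()`, and the one-row-per-word layout are illustrative choices, not something from the original posts. Splitting on `-` relies on the question's guarantee that lines contain only [a-z0-9-], so every hyphen-delimited token is a whole word.

```php
<?php
// Sample lines standing in for cats.txt.
$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'append-text-to-file',
];

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE cats (id INTEGER PRIMARY KEY, line TEXT NOT NULL)');
$db->exec('CREATE TABLE words (word TEXT NOT NULL, cat_id INTEGER NOT NULL)');

// Import: one row per line, plus one row per distinct word on that line.
$insertCat  = $db->prepare('INSERT INTO cats (id, line) VALUES (?, ?)');
$insertWord = $db->prepare('INSERT INTO words (word, cat_id) VALUES (?, ?)');
$db->beginTransaction();
foreach ($lines as $index => $line) {
    $id = $index + 1;
    $insertCat->execute([$id, $line]);
    foreach (array_unique(explode('-', $line)) as $word) {
        if ($word !== '') {
            $insertWord->execute([$word, $id]);
        }
    }
}
$db->commit();

// The index lets a lookup by word avoid scanning every row.
$db->exec('CREATE INDEX words_word ON words (word, cat_id)');

/** Returns up to $limit lines that contain $word as a whole word. */
function findCats(PDO $db, string $word, int $limit): array
{
    $stmt = $db->prepare(
        'SELECT cats.line FROM words
         JOIN cats ON cats.id = words.cat_id
         WHERE words.word = ?
         ORDER BY cats.id
         LIMIT ?'
    );
    $stmt->bindValue(1, $word, PDO::PARAM_STR);
    $stmt->bindValue(2, $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

print_r(findCats($db, 'end', 10000));
```

With an index on the word column, a lookup no longer has to scan all 500k lines. How much faster that is in practice would need to be measured on the real data.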