Count total and unique words from thousands of files

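I have a large collection of more than 5,000 text files containing over 200,000 words. The problem is that when I try to combine the whole collection into a single array to find its unique words, no output is shown (because the array becomes very large). The code below works fine for a small number of files, for example 30, but fails on a very large collection. Please help me fix this. Thanks.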

<?php
ini_set('memory_limit', '1024M');
$directory = "archive/";
$dir = opendir($directory);
$file_array = array(); 
while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {
    $contents = file_get_contents($filename);
    $text = preg_replace('/\s+/', ' ',  $contents);
    $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
    $text = explode(" ", $text);
    $text = array_map('strtolower', $text);
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
    $text = (array_diff($text,$stopwords));
    $file_array = array_merge($file_array,  $text);
  }
}
closedir($dir); 
$total_word_count = count($file_array);
$unique_array = array_unique($file_array);
$unique_word_count = count($unique_array);
echo "Total Words: " . $total_word_count."<br>";
echo "Unique Words: " . $unique_word_count;
?>

The dataset of text files can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip

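There is no need to juggle multiple arrays. Build a single array keyed by word and count each word as you insert it. This is faster, and you even get a count for every word.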

By the way, you also need to add the empty string to the stopword list, or adjust your logic so that the empty string is never included.

<?php
$directory = "archive/";
$dir = opendir($directory);
$wordcounter = array();
while (($file = readdir($dir)) !== false) {
  if (filetype($directory . $file) == 'file') {
    $contents = file_get_contents($directory . $file);
    $text = preg_replace('/\s+/', ' ',  $contents);
    $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
    $text = explode(" ", $text);
    $text = array_map('strtolower', $text);
    foreach ($text as $word)
        if (!isset($wordcounter[$word]))
            $wordcounter[$word] = 1;
        else
            $wordcounter[$word]++;
  }
}
closedir($dir); 
$stopwords = array("", "a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
foreach($stopwords as $stopword)
    unset($wordcounter[$stopword]);
$total_word_count = array_sum($wordcounter);
$unique_word_count = count($wordcounter);
echo "Total Words: " . $total_word_count."<br>";
echo "Unique Words: " . $unique_word_count."<br>";
// bonus:
$max = max($wordcounter);
echo "Most used word is used $max times: " . implode(", ", array_keys($wordcounter, $max))."<br>";
?>

Why combine all the arrays into one uselessly large array?

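You can use the array_unique function to get the unique values from one file's array, then merge the result with the array from the next file and apply the same function again.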
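The per-file array_unique idea could be sketched roughly as follows. The function names `tokenize` and `countWordsPerFile` are hypothetical, the cleanup regexes are borrowed from the question, and the running total is an addition so that the overall word count is not lost when duplicates are dropped.

```php
<?php
// Hypothetical helper: turn one file's contents into a list of lowercase words.
function tokenize(string $contents): array
{
    $text = preg_replace('/\s+/', ' ', $contents);
    $text = preg_replace('/[^A-Za-z0-9\- ]/', '', $text);
    // PREG_SPLIT_NO_EMPTY drops the empty strings that explode(" ", ...) would keep.
    return preg_split('/ +/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: $texts is any iterable of file contents (array or generator).
function countWordsPerFile(iterable $texts, array $stopwords): array
{
    $total = 0;
    $unique = array();
    foreach ($texts as $contents) {
        $words = array_diff(tokenize($contents), $stopwords);
        $total += count($words);
        // Only the set of distinct words seen so far is kept between files.
        $unique = array_unique(array_merge($unique, $words));
    }
    return array('total' => $total, 'unique' => count($unique));
}
```

To run this over the `archive/` directory from the question, the texts could be supplied by a generator that yields `file_get_contents()` for each regular file, so that only one file's contents are in memory at a time.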

Don't set the memory limit that high. It is usually not the best solution.

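What you should do is load each file line by line (this is easy in PHP when dealing with formats such as CSV), process a single line (or a small batch of lines), and write the result to an output file. That way you can handle a large amount of input data with very little memory.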

In any case, try to find a way to split the full input into smaller chunks that can be processed without raising the memory limit.
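A minimal sketch of the line-by-line approach, assuming per-word tallies kept in memory are acceptable (this sketch does not write intermediate results to an output file). The function name `countWordsInStream` is hypothetical; the cleanup regexes follow the question.

```php
<?php
// Hypothetical helper: read an open stream one line at a time, so only the
// current line and the running tallies are held in memory.
function countWordsInStream($handle, array &$wordcounter): void
{
    while (($line = fgets($handle)) !== false) {
        $line = strtolower(preg_replace('/\s+/', ' ', $line));
        $line = preg_replace('/[^a-z0-9\- ]/', '', $line);
        foreach (preg_split('/ +/', $line, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $wordcounter[$word] = ($wordcounter[$word] ?? 0) + 1;
        }
    }
}
```

Each file would be opened with `fopen($filename, 'r')`, passed to the function together with one shared `$wordcounter` array, and closed with `fclose()`. Afterwards `array_sum($wordcounter)` gives the total and `count($wordcounter)` the number of unique words.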

Another approach is to load everything into a database table and let the database server do most of the work.
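As a rough illustration of the database route, the sketch below uses PDO with SQLite. This assumes the pdo_sqlite extension is installed; the table name `words` and the functions `storeWords` and `wordStats` are hypothetical, and any SQL database would serve equally well.

```php
<?php
// Hypothetical helper: insert one file's words into a single-column table.
function storeWords(PDO $db, array $words): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT NOT NULL)');
    $insert = $db->prepare('INSERT INTO words (word) VALUES (?)');
    // One transaction per batch keeps the inserts reasonably fast.
    $db->beginTransaction();
    foreach ($words as $word) {
        $insert->execute(array($word));
    }
    $db->commit();
}

// Hypothetical helper: let the database compute both counts.
function wordStats(PDO $db): array
{
    $row = $db->query('SELECT COUNT(*) AS total, COUNT(DISTINCT word) AS uniq FROM words')
              ->fetch(PDO::FETCH_ASSOC);
    return array('total' => (int) $row['total'], 'unique' => (int) $row['uniq']);
}
```

With a file-backed database (for example `new PDO('sqlite:words.db')`), the word list lives on disk rather than in PHP's memory, at the cost of slower inserts.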