Iterate over Mongo results without running out of memory

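I need to find keywords in each document's name, description, tags and so on, and delete the documents in which they are found. I'm new to Mongo, so I'm working from a similar script in our existing codebase. First, get the MongoCursor, fetching only the fields we are going to check: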

    /** @var MongoCursor $products */
    $products = $collection->find(
        ['type' => ['$in' => ['PHONES', 'TABLETS']], 'supplier.is_awful' => ['$exists' => true]],
        ['details.name' => true, 'details.description' => true]
    );

Then iterate over each document and check each property we are interested in:

/** @var \Doctrine\ODM\MongoDB\DocumentManager $manager */
$manager = new Manager();
foreach ($products as $product) {
    // Find objectionable words in the content and remove these documents
    foreach (["suckysucky's", "deuce", "a z z"] as $word) {
        if (false !== strpos(mb_strtolower($product['details']['name']), $word)
          || false !== strpos(mb_strtolower($product['details']['description']), $word)) {
                $object = $manager->find(\App\Product::class, $product['_id']);
                $manager->remove($object);
        }
    }
}
// Persist to DB
$manager->flush();

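The problem is that the database has hundreds of thousands of records, and memory usage appears to keep growing as I iterate over the MongoCursor, until it runs out: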

Now at (0) 20035632
Now at (100) 24446048
Now at (200) 32190312
Now at (300) 36098208
Now at (400) 42433656
Now at (500) 45204376
Now at (600) 50664808
Now at (700) 54916888
Now at (800) 59847312
Now at (900) 65145808
Now at (1000) 70764408

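Is there a way to iterate over the MongoCursor without running out of memory (I've tried unsetting various objects at different points, with no luck)? Alternatively, is this a query that could be run directly in Mongo? I've looked through the docs and saw some hope in $text, but it looks like I'd need an index for that (which I don't have), and each collection can only have one text index.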

You don't need a full-text index to find substrings: the right way is to use a regular expression and return only the "_id" values, like this:

$mongore = new MongoRegex("/suckysucky's|deuce|a z z/i");
$products = $collection->find(
    ['type' => ['$in' => ['PHONES', 'TABLETS']],
     'supplier.is_awful' => ['$exists' => true],
     '$or' => [['details.name' => $mongore],
               ['details.description' => $mongore]]],
    ['_id' => true]
);

I'm not sure of the exact PHP syntax, but the key point is a filter containing an $or that applies the same MongoDB regex to both fields.
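
Since the question also asks whether this could run directly in Mongo, here is a rough sketch of that idea, not a tested solution. It assumes the legacy `mongo` PHP extension (the one that provides `MongoCursor` and `MongoRegex`) and that `$collection` is the same `MongoCollection` as in the question. The helpers `buildKeywordPattern` and `buildKeywordFilter` are hypothetical names made up for this example. Note that calling `remove()` on the collection bypasses Doctrine entirely, so any Doctrine lifecycle events or cascades would not fire.

```php
<?php
// Hypothetical helper: turns a list of plain words into one
// case-insensitive alternation pattern such as "/foo|bar/i".
function buildKeywordPattern(array $words): string
{
    // Escape regex metacharacters (and the "/" delimiter) so each
    // word is matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);

    return '/' . implode('|', $escaped) . '/i';
}

// Hypothetical helper: the same filter as above, with the regex
// object passed in so the function does not depend on the driver.
function buildKeywordFilter($regex): array
{
    return [
        'type' => ['$in' => ['PHONES', 'TABLETS']],
        'supplier.is_awful' => ['$exists' => true],
        '$or' => [
            ['details.name' => $regex],
            ['details.description' => $regex],
        ],
    ];
}

$pattern = buildKeywordPattern(["suckysucky's", "deuce", "a z z"]);

// Only runs when the legacy driver is loaded and $collection exists.
if (class_exists('MongoRegex') && isset($collection)) {
    // The server does the matching and deleting; no documents are
    // pulled into PHP, so memory use should stay flat.
    $collection->remove(buildKeywordFilter(new MongoRegex($pattern)));
}
```

If the deletes have to go through Doctrine, the same filter can still be used with `find()` and an `_id`-only projection, so that only matching documents are ever loaded.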