Search an 80 GB text file using strings in another and save the results for each string to separate files

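I have a huge 80 GB file that I need to search using strings from another, smaller text file, and then (here's the kicker) I need to save the matching lines for each string to a separate file named after the search string.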

What is the most efficient way to handle this task with PHP or AWK?

Sample lines:

Original 80 GB text file:

line1 "value001","value002","Value003"
line2 "Value004","Value005","Value006","Value007"
line3 "value001","value002","Value003"
line4 "value001","value002","Value003"
line5 "value001","value002","Value003"
line6 "Value004","Value005","Value006","Value007"
line7 "value010","value022","Value009"

The search-string text file, search.txt, contains the following values:

Value003
Value007
Value009

Three text files would then contain all matching lines for each search string:

Value003.txt would contain lines 1, 3, 4, 5
Value007.txt would contain lines 2 and 6
Value009.txt would contain line 7

Additional note: to be precise, the strings are a list of domains and phone numbers, such as:

joes.com
brick.net
moes.com
sams.net 
2125551212 
2025551212
(202)555-1212

Currently, I search in TextPad using one long regex string, like this:

brick.net|joes.com|moes.com|sams.net|2125551212|2025551212|(202)555-1212

This search is both tedious and slow, and it produces quite a few false positives, such as "sams net" and "yellow brick net".

I'm trying to catch things like sam@sam.net but not "sams net".

Bash and grep

Loop over the search file and grep for each of its lines, redirecting the results to an appropriately named file:

while IFS= read -r str; do grep -F -- "$str" infile > "$str.txt"; done < search.txt

where infile is your large file. This produces the following files:

==> Value003.txt <==
line1"value001","value002","Value003"
line3"value001","value002","Value003"
line4"value001","value002","Value003"
line5"value001","value002","Value003"
==> Value007.txt <==
line2"Value004","Value005","Value006","Value007"
line6"Value004","Value005","Value006","Value007"
==> Value009.txt <==
line7"value010","value022","Value009"

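Note that this processes the very large file once per search string. grep itself is fast, but looping over a file in Bash is slow, so this is only feasible if search.txt is relatively small.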
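One way to soften the repeated passes, sketched here under the assumption that only a small fraction of the big file matches anything at all: prefilter once with grep -F -f, then run the per-string loop over the much smaller result. The sample data and the name candidates.tmp are placeholders for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: tiny sample files stand in for the real 80 GB input.
# Run in an empty scratch directory, since files are written to the current directory.
set -eu

cat > infile <<'EOF'
line1 "value001","value002","Value003"
line2 "Value004","Value005","Value006","Value007"
line3 "value001","value002","Value003"
line4 "no match here"
line7 "value010","value022","Value009"
EOF
printf '%s\n' Value003 Value007 Value009 > search.txt

# Pass 1: a single scan of the big file, keeping only lines that
# contain at least one of the fixed strings listed in search.txt.
grep -F -f search.txt infile > candidates.tmp

# Pass 2: the per-string loop now reads only the candidate file.
while IFS= read -r str; do
    # grep exits with status 1 when nothing matches; do not abort on that.
    grep -F -- "$str" candidates.tmp > "$str.txt" || true
done < search.txt
```

Because -F treats every pattern as a fixed string, the dot in something like sams.net only matches a literal dot, which avoids the "sams net" kind of false positive. If most lines of the big file match something, the prefilter buys little.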

Awk

To process the large file only once, you can iterate over it with awk and check, for every line, whether any of the search strings match:

#!/usr/bin/awk -f
# Read search file into array
NR == FNR {
    searchstr[$0]
    next
}
{
    # Iterate over search strings
    for (str in searchstr) {
        # Print to file if matches
        if (index($0, str)) {
            print $0 > (str ".txt")
            # close(str ".txt")  # Uncomment if there are too many open files; then also change > to >>, because reopening with > truncates
            # next  # Uncomment if only one search string can occur per line
        }
    }
}

This must be called as follows:

awk -f script.awk search.txt infile

Or, as a less readable one-liner:

awk 'NR==FNR{ss[$0];next}{for(s in ss)if(index($0,s))print$0>s".txt"}' search.txt infile

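Note that some awks limit the number of open file handles¹, while others (GNU awk) can manage more but slow down once that limit is exceeded; whether this becomes a problem depends on the size of search.txt. If it does, we can add close(str ".txt") to the if clause so that the file is closed after each write. In that case the redirection must also change from > to >>, because a file that is reopened with > is truncated.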

If only one search string can occur per line, we can uncomment the next statement in the loop.
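The close-after-write variant described above could look roughly like this. It is a sketch on sample data, not a tuned solution: it assumes that appending with >> is acceptable, so the output files are removed first, and it trades speed for a bounded number of open files.

```shell
#!/usr/bin/env bash
# Sketch only: tiny sample files stand in for the real data.
# Run in an empty scratch directory, since files are written to the current directory.
set -eu

cat > infile <<'EOF'
line1 "value001","value002","Value003"
line2 "Value004","Value005","Value006","Value007"
line3 "value001","value002","Value003"
line4 "value001","value002","Value003"
line5 "value001","value002","Value003"
line6 "Value004","Value005","Value006","Value007"
line7 "value010","value022","Value009"
EOF
printf '%s\n' Value003 Value007 Value009 > search.txt

# Start from a clean slate, because >> appends to whatever already exists.
while IFS= read -r str; do rm -f -- "$str.txt"; done < search.txt

awk '
NR == FNR { searchstr[$0]; next }   # first file: collect the search strings
{
    for (str in searchstr) {
        if (index($0, str)) {
            out = str ".txt"
            print $0 >> out         # append, so reopening does not truncate
            close(out)              # keep at most one output file open
        }
    }
}' search.txt infile
```

Opening and closing a file for every matching line is slow, so this is probably only worth it when the plain version actually runs into the open-file limit.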


¹ The original awk has a limit of 15 open files!

If your input really is as shown, then with GNU awk all you need is:

NR==FNR{s=(s ? s "|" : "") $0; next} match($0,s,a){print > (a[0] ".txt")}

For example:

$ awk 'NR==FNR{s=(s ? s "|" : "") $0; next} match($0,s,a){print $0 "\t> " (a[0] ".txt")}' search.txt bigfile
line1"value001","value002","Value003"   > Value003.txt
line2"Value004","Value005","Value006","Value007"        > Value007.txt
line3"value001","value002","Value003"   > Value003.txt
line4"value001","value002","Value003"   > Value003.txt
line5"value001","value002","Value003"   > Value003.txt
line6"Value004","Value005","Value006","Value007"        > Value007.txt
line7"value010","value022","Value009"   > Value009.txt

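If that doesn't work because your input isn't really as shown in your question, then, obviously, edit your question to show more accurately representative sample input and output.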
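With search strings like joes.com or (202)555-1212, joining them into one regex as-is would treat the dot and the parentheses as regex operators. A possible workaround, sketched below, is to backslash-escape the metacharacters before joining. The escape() helper and the sample records are hypothetical, and the script uses the POSIX RSTART/RLENGTH variables instead of GNU awk's three-argument match().

```shell
#!/usr/bin/env bash
# Sketch only: sample records and search strings are made up for illustration.
# Run in an empty scratch directory, since files are written to the current directory.
set -eu

cat > bigfile <<'EOF'
rec1 "sam@sams.net","2125551212"
rec2 "call sams net today"
rec3 "(202)555-1212","yellowbrick.net"
EOF
printf '%s\n' sams.net brick.net '(202)555-1212' > search.txt

awk '
# Hypothetical helper: put a backslash in front of every regex metacharacter
function escape(str,    i, c, out) {
    out = ""
    for (i = 1; i <= length(str); i++) {
        c = substr(str, i, 1)
        if (index(".[]\\^$*+?(){}|", c)) out = out "\\"
        out = out c
    }
    return out
}
# First file: build one alternation out of the escaped search strings
NR == FNR { s = (s == "" ? "" : s "|") escape($0); next }
# Second file: the matched text is the search string, so it names the output file
match($0, s) { print > (substr($0, RSTART, RLENGTH) ".txt") }
' search.txt bigfile
```

Like the one-liner above, this writes each line to the file of the leftmost match only: rec3 lands in "(202)555-1212.txt" and not in brick.net.txt, and rec2 matches nothing because the escaped dot no longer matches a space.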