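I have a huge 80GB file that I need to search using strings from another, smaller text file, and then (here's the kicker) I need to save the matching lines for each search string to a separate file named after that search string.
What is the most efficient way to handle this task with PHP or AWK?
Sample lines:
Original 80GB text file: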
line1 "value001","value002","Value003"
line2 "Value004","Value005","Value006","Value007"
line3 "value001","value002","Value003"
line4 "value001","value002","Value003"
line5 "value001","value002","Value003"
line6 "Value004","Value005","Value006","Value007"
line7 "value010","value022","Value009"
The search strings text file, search.txt, contains the following values:
Value003
Value007
Value009
Three text files would then contain all matching lines for each search string:
Value003.txt would contain lines 1, 3, 4, 5
Value007.txt would contain lines 2 and 6
Value009.txt would contain line 7
Additional note: to be precise, the strings are a list of domains and phone numbers, such as:
joes.com
brick.net
moes.com
sams.net
2125551212
2025551212
(202)555-1212
Currently, I'm searching in TextPad with one long regex string, like this:
brick.net|joes.com|moes.com|sams.net|2125551212|2025551212|(202)555-1212
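This search is both tedious and slow, and it produces quite a few false positives, such as "sams network" and "yellowbrick.net".
I'm trying to catch things like sam@sam.net but not "sams network".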
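Bash and grep
Loop over the search file and grep for each line, redirecting the results to an appropriately named file:
while IFS= read -r str; do grep -F -- "$str" infile > "$str".txt; done < search.txt
where infile is your big file. This produces the following files: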
==> Value003.txt <==
line1 "value001","value002","Value003"
line3 "value001","value002","Value003"
line4 "value001","value002","Value003"
line5 "value001","value002","Value003"
==> Value007.txt <==
line2 "Value004","Value005","Value006","Value007"
line6 "Value004","Value005","Value006","Value007"
==> Value009.txt <==
line7 "value010","value022","Value009"
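Note that this reads the very large file once per search string. Even though grep is fast, looping over a file in Bash is slow, so this is only feasible if search.txt is relatively small.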
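Since the question mentions false positives, here is a minimal sketch of the same loop with grep's -w option added, which only accepts a match that has non-word characters (or the line edges) on both sides. The sample data is made up for illustration, and whether word boundaries are the right rule for your real data is an assumption you would need to check.

```shell
#!/usr/bin/env bash
# Tiny stand-ins for the real files (hypothetical sample data).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > infile <<'EOF'
1 "joe","sam@sams.net","2125551212"
2 "ann","info@yellowbrick.net","(202)555-1212"
3 "bob","bob@brick.net","12125551212"
EOF

cat > search.txt <<'EOF'
sams.net
brick.net
2125551212
EOF

# -F: treat each search string literally, not as a regex
# -w: the match must be delimited by non-word characters or line edges
# --: end of options, in case a search string starts with a dash
while IFS= read -r str; do
    [ -n "$str" ] || continue          # skip blank lines in search.txt
    grep -F -w -- "$str" infile > "$str.txt"
done < search.txt
```

With this sample, sams.net.txt and 2125551212.txt each hold line 1 and brick.net.txt holds line 3; "yellowbrick.net" and "12125551212" are rejected because a letter or digit directly precedes the match. It still makes one pass over infile per search string.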
Awk
To process the big file only once, you can use awk to iterate over it and check each line against every search string:
#!/usr/bin/awk -f

# Read search file into array
NR == FNR {
    searchstr[$0]
    next
}

{
    # Iterate over search strings
    for (str in searchstr) {
        # Print to file if matches
        if (index($0, str)) {
            print $0 > (str ".txt")
            # close(str ".txt") # Uncomment if there are too many open files,
            #                   # and change > to >> so reopening does not truncate
            # next # Uncomment if only one search string can occur per line
        }
    }
}
This has to be called as follows:
awk -f script.awk search.txt infile
Or, as a less readable one-liner:
awk 'NR==FNR{ss[$0];next}{for(s in ss)if(index($0,s))print $0>(s".txt")}' search.txt infile
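Note that some awks limit the number of open file handles [1], while others (GNU awk) can manage more but slow down once that limit is exceeded; whether this matters depends on the size of search.txt. If it becomes a problem, we can add close(str ".txt") to the if clause so that the file is closed after each write. In that case the redirection has to be >> instead of >, because a file reopened with > is truncated and would end up holding only its last matching line.
If only one search string can occur per line, we can uncomment the next statement in the loop.
[1] The original awk had an open file limit of 15!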
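As a rough sketch of the close-after-each-write variant, here is a self-contained run against the sample data from the question. It appends with >> and closes the file after every write, so at most one output file is open at a time. The temporary directory is only there to keep the demo tidy; on real data you would need to remove stale output files first, since >> appends to whatever already exists.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the demo starts without stale output files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > infile <<'EOF'
line1 "value001","value002","Value003"
line2 "Value004","Value005","Value006","Value007"
line3 "value001","value002","Value003"
line4 "value001","value002","Value003"
line5 "value001","value002","Value003"
line6 "Value004","Value005","Value006","Value007"
line7 "value010","value022","Value009"
EOF

printf '%s\n' Value003 Value007 Value009 > search.txt

awk '
NR == FNR {                       # first file: collect the search strings
    if ($0 != "") searchstr[$0]   # ignore blank lines
    next
}
{
    for (str in searchstr) {
        if (index($0, str)) {     # literal substring test, no regex
            out = str ".txt"
            print $0 >> out       # append, so reopening does not truncate
            close(out)            # keep at most one output file open
        }
    }
}' search.txt infile
```

Opening and closing a file for every matching line costs time, so this trades speed for staying under the open-file limit.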
If your input really looks like what you've shown, then with GNU awk all you need is:
NR==FNR{s=(s ? s "|" : "") $0; next} match($0,s,a){print > (a[0] ".txt")}
For example:
$ awk 'NR==FNR{s=(s ? s "|" : "") $0; next} match($0,s,a){print $0 "\t> " (a[0] ".txt")}' search.txt bigfile
line1 "value001","value002","Value003"	> Value003.txt
line2 "Value004","Value005","Value006","Value007"	> Value007.txt
line3 "value001","value002","Value003"	> Value003.txt
line4 "value001","value002","Value003"	> Value003.txt
line5 "value001","value002","Value003"	> Value003.txt
line6 "Value004","Value005","Value006","Value007"	> Value007.txt
line7 "value010","value022","Value009"	> Value009.txt
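If that doesn't work because your input isn't really what you've shown in your question, then, obviously, edit your question to show more truly representative sample input and output.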
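One case where the alternation above would likely misbehave is the domain and phone number list from the question, because "." and "(" are regex metacharacters. Below is a sketch that quotes each search string before joining them with "|". The helper name quote_re and the sample data are hypothetical, ASCII input is assumed, and it uses the POSIX two-argument match() with RSTART and RLENGTH in place of GNU awk's three-argument form.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample data: line b must NOT match "brick.net".
cat > bigfile <<'EOF'
a "x@brick.net"
b "x@brickXnet"
c "(202)555-1212"
d "2025551212"
EOF

cat > search.txt <<'EOF'
brick.net
(202)555-1212
EOF

awk '
# Turn a literal string into a regex that matches exactly that string:
# letters and digits stay as they are, everything else becomes a
# one-character bracket expression ("^" and backslash are escaped instead).
function quote_re(str,    i, c, out) {
    out = ""
    for (i = 1; i <= length(str); i++) {
        c = substr(str, i, 1)
        if (c ~ /[A-Za-z0-9]/)  out = out c
        else if (c == "^")      out = out "\\^"
        else if (c == "\\")     out = out "\\\\"
        else                    out = out "[" c "]"
    }
    return out
}
# First file: build one alternation from all quoted search strings.
NR == FNR { if ($0 != "") re = (re == "" ? "" : re "|") quote_re($0); next }
# Second file: name the output file after the text that actually matched.
match($0, re) { print > (substr($0, RSTART, RLENGTH) ".txt") }
' search.txt bigfile
```

As with the original one-liner, each line goes only to the file of the first search string found on it, and one output file per search string stays open until awk exits.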