How to run sed on over 10 million files in a directory?

16

I have a directory containing 10144911 files. So far I have tried the following:

for f in ls; do sed -i -e 's/blah/blee/g' $f; done

This crashed my shell. (The ls is meant to be in backticks, but I can't figure out how to type one here.)

ls | xargs -0 sed -i -e 's/blah/blee/g'

Too many args for sed.

find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork any more: no more memory.

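Any other ideas on how to construct this kind of command? The files don't need to communicate with each other. ls | wc -l seems to work (very slowly), so it must be possible.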

bash find xargs

— Sandro

1

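It would be faster if you could avoid invoking sed once per file. I'm not sure whether sed has a way to open, edit, save, and close a series of files. If speed is essential, you may want to use a different program, such as perl or python.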

— intuited

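@intuited: It would be even faster to do nothing to the files at all. Seriously? If you want to change a pattern in a set of files, you have to look into each file to see whether the pattern is there. If you know in advance that you can skip "some" files, then it is obviously faster not to touch those files at all. And the startup time of sed is probably shorter than that of python or perl, unless you do everything inside that interpreter.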

— akira

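@akira: Are you saying that launching perl or python once for as many files as will fit on a command line is more expensive than launching sed once for each of those files? I would be really surprised if that were the case. I guess you didn't understand that my suggestion is to invoke (start) the editing program once (or at least fewer times, see my answer) and have it open, modify, and resave each file in turn, rather than invoking the editing program separately for each file.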
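
The "start the editing program once and let it handle each file in turn" idea could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `replace_in_directory`, the `.txt` suffix filter, and the choice of `os.scandir` (Python 3.5+) are made up for this example and do not come from the thread. It also assumes each file is small enough to read into memory.

```python
import os

def replace_in_directory(directory, old, new, suffix=".txt"):
    """Replace `old` with `new` in every matching file, using one process.

    Hypothetical helper for illustration. os.scandir yields entries lazily,
    so the full list of names is never held in memory at once.
    Returns the number of files that were rewritten.
    """
    old_b, new_b = old.encode(), new.encode()
    changed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.name.endswith(suffix) or not entry.is_file():
                continue
            with open(entry.path, "rb") as fh:
                data = fh.read()
            if old_b not in data:
                continue  # nothing to change, so skip the write entirely
            with open(entry.path, "wb") as fh:
                fh.write(data.replace(old_b, new_b))
            changed += 1
    return changed
```

A call such as `replace_in_directory(".", "blah", "blee")` would then pay the interpreter startup cost once, regardless of how many files the directory holds.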

— intuited

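Your first comment does not reflect what you really wanted to say: "replace sed with python/perl". Given only that, and looking at the command line the OP posted, an innocent reader could assume that "find . -exec python" is faster than "find . -exec sed", which is obviously not the case. In your own answer you call python far more often than is actually needed.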

— akira

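I think akira misinterpreted your (intuited's) suggestion. I believe you were suggesting bunching files together. I tried that with my xargs attempt; time to try it again :)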

— Sandro

19

Give this a try:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

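This feeds only one filename to each invocation of sed, which solves the "too many args for sed" problem. The -P option should allow multiple processes to be forked at the same time. If 0 doesn't work (it is supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the count.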
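
For comparison, the "several workers at once" idea behind -P can be sketched in Python. This is a loose analogy, not a reimplementation of xargs: the names `replace_in_file` and `replace_parallel` are hypothetical, a thread pool stands in for separate processes, and the default of 4 workers is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def replace_in_file(path, old=b"blah", new=b"blee"):
    """Rewrite one file in place; return True if its contents changed."""
    with open(path, "rb") as fh:
        data = fh.read()
    if old not in data:
        return False
    with open(path, "wb") as fh:
        fh.write(data.replace(old, new))
    return True

def replace_parallel(paths, workers=4):
    """Process paths with a bounded pool of workers (loosely like xargs -P 4).

    Returns how many files were changed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(replace_in_file, paths))
```

Note that `Executor.map` submits every task up front, so with millions of paths it would be wiser to feed the pool in batches rather than all at once.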

— Paused until further notice.

3

It probably needs to be find . -name \*.txt -print0, to keep the shell from expanding the glob and trying to allocate space for 10 million arguments to find.

— Chris Johnsen

@ChrisJohnsen: Yes, that's correct. I rushed to post my answer and left out those essential parts. I've edited the answer with those corrections. Thanks.

— Paused until further notice.

Trying it now... fingers crossed.

— Sandro

7

I've tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I've now included a quad-core run of the 'find | xargs' method (still without 'sed'; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done

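Here is a summary of how the provided answers fared when run against the test data described above. These results cover only the basic overhead; that is, 'sed' was not called. The sed processing will almost certainly be the most time-consuming part, but I thought it would be interesting to see how the bare methods compare.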

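Dennis's 'find | xargs' method, using a single core, took 4 hours 21 minutes longer than the bash array method on a no-sed run. However, the multi-core advantage offered by 'find | xargs' should outweigh the time difference shown here once sed is actually called to process the files.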
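
The "bare overhead" idea (visit every name, do nothing with it) can also be sketched in Python. This is an illustrative assumption, not one of the methods benchmarked here: `time_bare_iteration` is a hypothetical helper, and its timings are not comparable to the figures in the table below.

```python
import os
import time

def time_bare_iteration(directory, prefix="hello "):
    """Visit every matching name and discard it, like echo >/dev/null.

    Returns (count, elapsed_seconds).
    """
    start = time.perf_counter()
    count = 0
    with os.scandir(directory) as entries:  # lazy, unsorted iteration
        for entry in entries:
            if entry.name.startswith(prefix):
                count += 1
    return count, time.perf_counter() - start
```

Running it once before and once after adding the real per-file work gives a rough idea of how much of the total time is pure iteration.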

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet
                               |       It started processing almost immediately (because of xargs I suppose),  
                               |       but it runs **significantly slower** than the only other working answer  
                               |       (again, probably because of xargs) , but if the multi-core feature works  
                               |       and I would think that it does, then it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
             cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 GiB
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+-----------------------------------------------------

— Peter.O

2

Another opportunity for the completely safe find:

while IFS= read -rd $'\0' path
do
    file_path="$(readlink -fn -- "$path"; echo x)"
    file_path="${file_path%x}"
    sed -i -e 's/blah/blee/g' -- "$file_path"
done < <( find "$absolute_dir_path" -type f -print0 )
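
The loop above is safe because NUL is the one byte that cannot appear in a path, so it separates names unambiguously even when they contain spaces or newlines. As a hedged sketch of the same idea outside the shell, the hypothetical helper `iter_nul_delimited` below splits a binary stream (for example, the output of find ... -print0) into names; the chunk size is an arbitrary assumption.

```python
def iter_nul_delimited(stream, chunk_size=65536):
    """Yield byte-string records from a binary stream of NUL-terminated names."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last NUL is complete; keep the tail for later.
        *records, pending = pending.split(b"\0")
        yield from records
    if pending:
        yield pending  # tolerate a missing final NUL
```

The stream could be `sys.stdin.buffer` when the script sits at the end of a pipe, so names are consumed as find produces them instead of being collected first.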

— l0b0

1

This is mostly off-topic, but you could use

find . -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 python3 -c '
import fileinput
for line in fileinput.input(inplace=True):
    # With inplace=True, print() writes into the file being edited.
    print(line.replace("blah", "blee"), end="")
'

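The main benefit here (over ... xargs ... -I {} ... sed ...) is speed: you avoid invoking sed 10 million times. It would be faster still if you could avoid Python (since python is relatively slow), so perl might be a better choice for this task. I'm not sure how to do the equivalent conveniently with perl.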

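The way this works is that xargs invokes Python with as many arguments as it can fit on a single command line, and keeps doing so until it runs out of arguments (which are supplied by the find command). The number of arguments per invocation depends on the length of the filenames and a few other things. The fileinput.input function yields successive lines from the files named in each invocation's arguments, and the inplace option tells it to magically "catch" the output and use it to replace each line.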
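
To see the inplace mechanism in isolation, here is a minimal Python 3 sketch that passes the file list explicitly instead of through the command line. The function name `replace_in_files` is hypothetical; the `files` and `inplace` parameters are the documented ones of `fileinput.input`.

```python
import fileinput

def replace_in_files(paths, old="blah", new="blee"):
    """Edit the given files in place, line by line, via fileinput."""
    paths = list(paths)
    if not paths:
        return  # an empty list would make fileinput fall back to stdin
    with fileinput.input(files=paths, inplace=True) as lines:
        for line in lines:
            # While inplace=True, standard output is redirected into the
            # file currently being read, so print() rewrites that file.
            print(line.replace(old, new), end="")
```

Each line already ends with its own newline, which is why `end=""` is passed to `print`.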

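Note that Python's string replace method doesn't use regexps. If you need those, you have to import re and print the result of re.sub("blah", "blee", line) instead. They are Perl-style regexps, which are sort of heavily fortified versions of the ones you get with sed -r.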
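
As a small illustration of the regexp variant, the sketch below precompiles a pattern once and applies it per line. The pattern `bl(ah|ech)` and the helper name `replace_line` are invented for this example; the point to note is the argument order of `re.sub`: pattern, replacement, then the string.

```python
import re

# Hypothetical pattern, for illustration only; compiled once and reused.
PATTERN = re.compile(r"bl(ah|ech)")

def replace_line(line, replacement="blee"):
    """Return `line` with every match of PATTERN replaced."""
    return PATTERN.sub(replacement, line)
```

Compiling the pattern outside the loop avoids repeating that work for every line of every file.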

EDIT

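As akira mentions in the comments, the original version, which used a glob (ls -f *.txt) in place of the find command, would not work, because globs are processed by the shell (bash) itself. This means that before the command is even run, 10 million filenames are substituted into the command line. That is pretty much guaranteed to exceed the maximum size of a command's argument list. You can use xargs --show-limits for system-specific information on this.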

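xargs also takes the maximum size of the argument list into account, and limits the number of arguments it passes to each invocation of python accordingly. Since xargs will still have to invoke python quite a few times, akira's suggestion to use os.path.walk to get the file listing will probably save you some time.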
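
The batching behaviour can be modelled with a short sketch. This is a simplification under stated assumptions, not how xargs is implemented: real xargs also accounts for the environment and other overhead, the helper name `batch_arguments` is hypothetical, and using a quarter of `SC_ARG_MAX` as the default budget is an arbitrary safety margin.

```python
import os

def batch_arguments(names, max_bytes=None):
    """Group names into lists whose combined size stays within max_bytes.

    Simplified model of xargs batching, for illustration only.
    """
    if max_bytes is None:
        # Kernel limit on the argument list (Unix only); keep some headroom.
        max_bytes = os.sysconf("SC_ARG_MAX") // 4
    batch, size = [], 0
    for name in names:
        cost = len(os.fsencode(name)) + 1  # +1 for the terminating NUL
        if batch and size + cost > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(name)
        size += cost
    if batch:
        yield batch
```

With 14-byte names, each name costs 15 bytes in this model, so a 45-byte budget yields batches of three names each.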

— intuited

1

What's the point of using the glob operator (which will fail for that many files anyway) and then feeding the files to python, which has os.path.walk()?

— akira

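@akira: The glob operator is there to avoid trying to replace the contents of . and ... Certainly there are other ways to do that (e.g. find), but I'm trying to stick as closely as possible to what the OP understands. That is also the reason for not using os.path.walk.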

— intuited

@akira: Good suggestion, though; that would probably be considerably faster.

— intuited

I think the OP would understand os.path.walk quite easily.

— akira

0

Try:

ls | while read file; do (something to $file); done

— Reuben L.

2

ls -f would be better; do you really want to wait around for it to stat() and sort that many files?

— geekosaur, 2011

Right now I'm trying: for f in *.txt; do blah; done. I'll give this a whack if that fails. Thank you!

— Sandro