How do I remove specific words from the lines of a text file?

13

My text file looks like this:

Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Now I want to delete Liquid penetration 95% mass (m) from the lines so that only the values remain. How should I do this?

command-line text-processing

— OE

3

Just grep -o '[^[:space:]]\+$' file

— Avinash Raj

@AvinashRaj: at the moment this solution has the "putty medal" :)

— pa4080

2

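@pa4080 At least for the input I tested (10 million lines), Avinash Raj's general approach is an order of magnitude faster when PCRE is used. (I confirmed that the engine, not the pattern, is responsible, since GNU grep accepts \S+$ with either -E or -P.) So this kind of solution is not inherently slow. But I still can't get it anywhere near as fast as αғsнιη's cut method, which also won your benchmark.

— Eliah Kagan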

22

If there is only one = sign, you can delete everything up to and including the = like this:

$ sed -r 's/.* = (.*)/\1/' file
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

If you want to change the original file, add the -i option after testing:

sed -ri 's/.* = (.*)/\1/' file

Notes

-r use ERE so we don't have to escape ( and )
s/old/new replace old with new
.* any number of any characters
(things) save things to backreference later with \1, \2, etc.
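
To make the notes on capture groups concrete, here is a minimal sketch. It assumes a sed that accepts -E, which GNU sed treats the same as the -r used above, and the sample line is simply copied from the question.

```shell
# Sample line copied from the question's data.
line='Liquid penetration 95% mass (m) = 0.000205348'

# \1 alone: keep only what the parentheses captured (the value).
value=$(printf '%s\n' "$line" | sed -E 's/.* = (.*)/\1/')
printf '%s\n' "$value"

# \1 and \2 together: capture the label and the value, then swap them.
swapped=$(printf '%s\n' "$line" | sed -E 's/(.*) = (.*)/\2 = \1/')
printf '%s\n' "$swapped"
```

The second command is only there to show that each pair of parentheses gets its own number, counted from the left.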

— Zanna

Thanks, it worked. I used this command to overwrite the existing file: sed -i -r 's/.*= (.*)/\1/' time.txt Could you explain how it works?

— OE

Why not avoid the backreference? s/^.*= // would work just as well, since the value you want is at the end of the line.

— jpaugh

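@jpaugh Well, partly because it's too late to change my answer (which was the first one posted). Others have already given the solution you mention, along with other, more efficient ways for this case :) But maybe showing how to use \1 etc. has some value for people who land here while searching and whose problem is not this simple.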

— Zanna

@Zanna It is more general, at least.

— jpaugh

21

This is a job for awk, assuming the values occur only in the last field (as in your example):

awk '{print $NF}' file.txt

NF is an awk variable that expands to the number of fields in a record (line), so $NF (note the $ in front) holds the value of the last field.

Example:

% cat temp.txt 
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

% awk '{print $NF}' temp.txt
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341
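
As a small illustration of the difference between NF and $NF, the sketch below prints both for one line from the question. It assumes awk's default field splitting on whitespace; $(NF-1) is included only to show that the field index can be an expression.

```shell
line='Liquid penetration 95% mass (m) = 0.000205348'

# NF is the field count; $NF is the last field; $(NF-1) is the one before it.
summary=$(printf '%s\n' "$line" | awk '{ print NF, $(NF-1), $NF }')
printf '%s\n' "$summary"
```

With this line there are seven whitespace-separated fields, so $NF is the seventh one.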

— heemayl

13

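I decided to compare the different solutions listed here. For this purpose I created a large file based on the content provided by the OP:

I created a simple file named input.file: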

$ cat input.file
Liquid penetration 95% mass (m) = 0.000205348
Liquid penetration 95% mass (m) = 0.000265725
Liquid penetration 95% mass (m) = 0.000322823
Liquid penetration 95% mass (m) = 0.000376445
Liquid penetration 95% mass (m) = 0.000425341

Then I executed this loop:

for i in {1..100}; do cat input.file | tee -a input.file; done

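The terminal window was blocked, so I ran killall tee from another terminal. Then I examined the content of the file with less input.file and cat input.file. It looked good, except for the last line. So I removed the last line and created a backup copy with cp input.file{,.copy} (because some of the commands use the in-place option).
The final line count of input.file is 2 192 473. I got that number with wc: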
```
$ cat input.file | wc -l
2192473
```
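
A likely reason the loop above never finished is that cat was still reading input.file while tee -a kept appending to the same file. If you want to reproduce a test file without that feedback loop, one possible sketch is to read from a small seed file and write to a different file. The names sample.file and big.file and the repeat count of 100 are assumptions made for illustration; they are not what was used for the timings below.

```shell
workdir=$(mktemp -d)

# Seed file with the five lines from the question (printf reuses the format per argument).
printf 'Liquid penetration 95%% mass (m) = %s\n' \
    0.000205348 0.000265725 0.000322823 0.000376445 0.000425341 > "$workdir/sample.file"

# Read from the seed and write to a separate file, so the loop always terminates.
for i in $(seq 1 100); do
    cat "$workdir/sample.file"
done > "$workdir/big.file"

# The arithmetic expansion strips any padding that some wc implementations add.
lines=$(( $(wc -l < "$workdir/big.file") ))
printf '%s\n' "$lines"
```

Raising the repeat count gives a file of whatever size you need, and every line is complete because nothing is interrupted.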

Here are the results of the comparison:

grep -o '[^[:space:]]\+$'

$ time grep -o '[^[:space:]]\+$' input.file > output.file

real    0m58.539s
user    0m58.416s
sys     0m0.108s

sed -ri 's/.* = (.*)/\1/'

$ time sed -ri 's/.* = (.*)/\1/' input.file

real    0m26.936s
user    0m22.836s
sys     0m4.092s

Alternatively, if we redirect the output to a new file, the command is faster:

$ time sed -r 's/.* = (.*)/\1/' input.file > output.file

real    0m19.734s
user    0m19.672s
sys     0m0.056s

gawk '{gsub(".*= ", "");print}'

$ time gawk '{gsub(".*= ", "");print}' input.file > output.file

real    0m5.644s
user    0m5.568s
sys     0m0.072s

rev | cut -d' ' -f1 | rev

$ time rev input.file | cut -d' ' -f1 | rev > output.file

real    0m3.703s
user    0m2.108s
sys     0m4.916s

grep -oP '.*= \K.*'

$ time grep -oP '.*= \K.*' input.file > output.file

real    0m3.328s
user    0m3.252s
sys     0m0.072s

sed 's/.*= //' (adding the -i option makes the command several times slower)

$ time sed 's/.*= //' input.file > output.file

real    0m3.310s
user    0m3.212s
sys     0m0.092s

perl -pe 's/.*= //' (the -i option makes no big difference in performance here)

$ time perl -i.bak -pe 's/.*= //' input.file

real    0m3.187s
user    0m3.128s
sys     0m0.056s

$ time perl -pe 's/.*= //' input.file > output.file

real    0m3.138s
user    0m3.036s
sys     0m0.100s

awk '{print $NF}'

$ time awk '{print $NF}' input.file > output.file

real    0m1.251s
user    0m1.164s
sys     0m0.084s

cut -c 35-

$ time cut -c 35- input.file > output.file

real    0m0.352s
user    0m0.284s
sys     0m0.064s

cut -d= -f2

$ time cut -d= -f2 input.file > output.file

real    0m0.328s
user    0m0.260s
sys     0m0.064s

Source of the idea.

— pa4080

2

So my cut -d= -f2 solution wins. Haha

— αғsнιη

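Could you provide more information on how you created this file? Also, how does wc -l output three numbers? When no other options are passed, the -l option should suppress everything except the line count.

— Eliah Kagan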

@EliahKagan, done. I've updated the answer.

— pa4080

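Ah, I see: the spaces were digit group separators. (Had wc actually shown those spaces? Are there locale settings under which it would?) Thanks for updating!

— Eliah Kagan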

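@EliahKagan: I finally read your questions about wc once more. I don't know where my wits were earlier today, but I really couldn't understand them. So yes, the spaces were digit group separators, and wc doesn't add them :)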

— pa4080

12

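Use grep with -P for PCRE (interpret the pattern as a Perl-Compatible Regular Expression) and -o to print only the matched part. The \K notation discards whatever was matched before it from the reported match.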

$ grep -oP '.*= \K.*' infile
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341

Or you could use the cut command instead.

cut -d= -f2 infile
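
One detail worth checking before relying on the cut variant: with = as the delimiter, the second field still begins with the space that followed the = sign. The sketch below makes that visible and shows one possible way to drop it. Piping through tr -d ' ' is an assumption that the value itself never contains spaces.

```shell
line='Liquid penetration 95% mass (m) = 0.000205348'

# Field 2 is everything after "=", including the leading space.
raw=$(printf '%s\n' "$line" | cut -d= -f2)
printf '[%s]\n' "$raw"        # brackets make the leading space visible

# Remove the space if the bare number is needed.
trimmed=$(printf '%s\n' "$line" | cut -d= -f2 | tr -d ' ')
printf '[%s]\n' "$trimmed"
```

Whether the leading space matters depends on what consumes the output; many numeric tools ignore it.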

— αғsнιη

2

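Besides being the fastest of all the methods tested in pa4080's benchmark, the cut method in this answer was also the clear winner in a smaller benchmark I ran, which tested fewer methods but used a larger input file. It was more than ten times faster than the fast variant of the method I personally like (and that my answer is mainly about).

— Eliah Kagan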

11

Since the line prefix always has the same length (34 characters), you can use cut:

cut -c 35- < input.txt > output.txt
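
If you would rather not count the 34 characters by hand, a quick sketch like the following lets the shell do it. The variable names are illustrative only, and it assumes the prefix is plain ASCII so that characters and bytes coincide.

```shell
prefix='Liquid penetration 95% mass (m) = '

# ${#prefix} is the prefix length; the value starts one character later.
prefix_len=${#prefix}
start=$((prefix_len + 1))

value=$(printf '%s\n' "${prefix}0.000205348" | cut -c "${start}-")
printf '%s %s %s\n' "$prefix_len" "$start" "$value"
```

This only confirms the offset for a fixed prefix; if the prefix length varies between lines, a delimiter-based approach is the safer choice.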

— David Foerster

6

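Reverse each line of the file with rev, pipe the output to cut with a space as the delimiter and 1 as the target field, then reverse it again to get the original number: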

$ rev your_file | cut -d' ' -f1 | rev
0.000205348
0.000265725
0.000322823
0.000376445
0.000425341
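
To see why the double reversal works, it can help to look at each stage of the pipeline separately. This is just the same pipeline split into steps on one sample line; it assumes rev (from util-linux) is installed.

```shell
line='Liquid penetration 95% mass (m) = 0.000205348'

# Stage 1: reverse the characters, so the value comes first.
reversed=$(printf '%s\n' "$line" | rev)

# Stage 2: take the first space-separated field (the reversed value).
first=$(printf '%s\n' "$reversed" | cut -d' ' -f1)

# Stage 3: reverse again to restore the digit order.
restored=$(printf '%s\n' "$first" | rev)

printf '%s\n%s\n%s\n' "$reversed" "$first" "$restored"
```

The trick is useful because cut can only count fields from the left, so reversing the line turns "last field" into "first field".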

— f1nan

5

This is simple, short, and easy to write, understand, and check, and I personally like it:

grep -oE '\S+$' file

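In Ubuntu, grep, when called with -E or -P, takes the shorthand \s to mean a whitespace character (in practice usually a space or tab) and \S to mean anything that is not one. With the quantifier + and the end-of-line anchor $, the pattern \S+$ matches one or more non-blanks at the end of a line. You can use -P instead of -E; the meaning in this case is the same, but a different regular expression engine is used, so the two may have different performance characteristics.

This is equivalent to the solution in Avinash Raj's comment (just with an easier, more compact syntax):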

grep -o '[^[:space:]]\+$' file

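These approaches won't work if there may be trailing whitespace after the number. They can be modified so that they do, but I see no point in going into that here. Although it is sometimes instructive to generalize a solution to cover more cases, doing so is practical far less often than people tend to assume, because one usually has no way of knowing in which of many different, incompatible ways the problem might eventually need to be generalized.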
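
For readers who do hit the trailing-whitespace case, here is one possible adaptation, offered as a sketch rather than as the way to do it: strip the trailing whitespace with sed first, then apply the same end-of-line match. The input line is the question's data with two spaces appended, which is a made-up variation.

```shell
# Hypothetical input: a line from the question with two trailing spaces added.
line='Liquid penetration 95% mass (m) = 0.000205348  '

# The plain pattern finds nothing, because the line now ends in whitespace.
plain=$(printf '%s\n' "$line" | grep -oE '[^[:space:]]+$' || true)

# Stripping trailing whitespace first restores the match.
stripped=$(printf '%s\n' "$line" | sed 's/[[:space:]]*$//' | grep -oE '[^[:space:]]+$')

printf '[%s] [%s]\n' "$plain" "$stripped"
```

The extra sed stage costs time on large inputs, so it is only worth adding when the data really can contain trailing blanks.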

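Performance is sometimes an important consideration. This question doesn't say that the input is very large, and every method posted here is probably fast enough. But in case speed matters, here is a small benchmark on a ten-million-line input file: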

$ perl -e 'print((<>) x 2000000)' file > bigfile
$ du -sh bigfile
439M    bigfile
$ wc -l bigfile
10000000 bigfile
$ TIMEFORMAT=%R
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
819.565
$ time grep -oE '\S+$' bigfile > bigfile.out
816.910
$ time grep -oP '\S+$' bigfile > bigfile.out
67.465
$ time cut -d= -f2 bigfile > bigfile.out
3.902
$ time grep -o '[^[:space:]]\+$' bigfile > bigfile.out
815.183
$ time grep -oE '\S+$' bigfile > bigfile.out
824.546
$ time grep -oP '\S+$' bigfile > bigfile.out
68.692
$ time cut -d= -f2 bigfile > bigfile.out
4.135

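I ran it twice in case the order mattered (as it sometimes does for I/O-heavy tasks) and because I didn't have a machine available that wasn't doing other things in the background that could skew the results. From those results I conclude the following, at least provisionally and for input files of the size I used:

Wow! Passing -P (to use PCRE) rather than -G (the default when no dialect is specified) or -E made grep faster by more than an order of magnitude. So for large files, it may be better to use this command than the one shown above: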
```
grep -oP '\S+$' file
```
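WOW!! The cut method in αғsнιη's answer, cut -d= -f2 file, is more than an order of magnitude faster than even the faster version of my way! It was also the winner in pa4080's benchmark, which covered more methods but with smaller input, and that is why I chose it, of all the other methods, to include in my test. If performance is important or the files are huge, I think αғsнιη's cut method should be used.

This also serves as a reminder that the simple cut and paste utilities shouldn't be forgotten, and should perhaps be preferred when applicable, even though more sophisticated tools like grep are often offered as first-line solutions (and are what I am personally more accustomed to using).

— Eliah Kagan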

4

perl - substitute the pattern /.*= / with an empty string //:

perl -pe 's/.*= //' input.file > output.file

perl -i.bak -pe 's/.*= //' input.file

From perl --help:

-e program        one line of program (several -e's allowed, omit programfile)
-p                assume loop like -n but print line also, like sed
-i[extension]     edit <> files in place (makes backup if extension supplied)
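
To illustrate what the -i[extension] option described above does, the sketch below edits a throwaway file in place and then looks at the backup. The temporary directory and the two-line file are assumptions made so the example does not touch any real data.

```shell
workdir=$(mktemp -d)
printf 'Liquid penetration 95%% mass (m) = %s\n' \
    0.000205348 0.000265725 > "$workdir/input.file"

# Edit in place; the untouched original is kept as input.file.bak.
perl -i.bak -pe 's/.*= //' "$workdir/input.file"

edited_first=$(head -n 1 "$workdir/input.file")
backup_first=$(head -n 1 "$workdir/input.file.bak")
printf '%s\n%s\n' "$edited_first" "$backup_first"
```

Keeping the backup extension while testing means a wrong pattern can be undone by copying the .bak file back.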

sed - substitute the pattern with an empty string:

sed 's/.*= //' input.file > output.file

or (but slower than the above):

sed -i.bak 's/.*= //' input.file

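I mention this approach because it is several times faster than the one in Zanna's answer.

gawk - substitute the pattern ".*= " with an empty string "":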

gawk '{gsub(".*= ", "");print}' input.file > output.file

From man gawk:

gsub(r, s [, t]) For each substring matching the regular expression r in the string t,
                 substitute the string s, and return the number of substitutions. 
                 If t is not supplied, use $0...
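
The manual excerpt mentions two things the one-liner above does not use: the return value and the optional third argument. A small sketch of both follows; the string "a = b = c" is made up for illustration, and plain awk is used because gsub behaves the same way there.

```shell
# Without a third argument, gsub edits $0 and returns how many substitutions it made.
on_record=$(printf '%s\n' 'Liquid penetration 95% mass (m) = 0.000205348' |
    awk '{ n = gsub(/.*= /, ""); print n, $0 }')

# With a third argument, gsub edits that variable instead of $0.
on_variable=$(awk 'BEGIN {
    s = "a = b = c"
    n = gsub(/ = /, "-", s)
    print n, s
}')

printf '%s\n%s\n' "$on_record" "$on_variable"
```

In the first call the greedy .* swallows the whole prefix in one go, which is why only a single substitution is counted.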

— pa4080