Print word containing string and first word

10

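I want to find a string in a line of text and print that string (delimited by spaces) together with the first word of the line.

For example:

"This is a single text line"
"Another thing"
"It is better you try again"
"Better"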

The list of strings is:

text
thing
try
Better

I am trying to get a table like this:

This [tab] text
Another [tab] thing
It [tab] try
Better

I tried grep, but nothing came of it. Any suggestions?

command-line text-processing regex

— Felipe Lira
source

So basically "if the line has the string, print first word + string". Right?

— Sergiy Kolodyazhnyy

12

Bash/grep version:

#!/bin/bash
# string-and-first-word.sh
# Finds a string and the first word of the line that contains that string.

text_file="$1"
shift

for string; do
    # Find string in file. Process output one line at a time.
    grep "$string" "$text_file" | 
        while read -r line
    do
        # Get the first word of the line.
        first_word="${line%% *}"
        # Remove special characters from the first word.
        first_word="${first_word//[^[:alnum:]]/}"

        # If the first word is the same as the string, don't print it twice.
        if [[ "$string" != "$first_word" ]]; then
            echo -ne "$first_word\t"
        fi

        echo "$string"
    done
done

Call it like this:

./string-and-first-word.sh /path/to/file text thing try Better

Output:

This    text
Another thing
It  try
Better

— wjandrea
source

9

Perl to the rescue!

#!/usr/bin/perl
use warnings;
use strict;

my $file = shift;
my $regex = join '|', map quotemeta, @ARGV;
$regex = qr/\b($regex)\b/;

open my $IN, '<', $file or die "$file: $!";
while (<$IN>) {
    if (my ($match) = /$regex/) {
        print my ($first) = /^\S+/g;
        if ($match ne $first) {
            print "\t$match";
        }
        print "\n";
    }
}

Save as first-plus-word, run as

perl first-plus-word file.txt text thing try Better

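It builds a regular expression from the input words, escaping each one and requiring word boundaries on both sides. Each line is then matched against the regex; if there is a match, the first word of the line is printed, and if the matched word differs from the first word, it is printed too, separated by a tab.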
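For readers who do not use Perl, the same idea can be sketched in Python. This is only an illustration of the approach described above, not a port of the answer: the function name `first_plus_word` and the in-memory `sample` list are made up for the example, and `re.escape` stands in for Perl's `quotemeta`.

```python
import re


def first_plus_word(lines, words):
    """Yield 'first<TAB>match' (or just 'first') for each line containing a word."""
    # Escape every search word and require word boundaries on both sides,
    # so that "try" does not match inside "country".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    for line in lines:
        found = pattern.search(line)
        if not found:
            continue
        # Assumption: the first whitespace-separated token counts as the first word.
        first = line.split()[0]
        match = found.group(1)
        # Do not print the word twice when it is also the first word.
        yield first if match == first else first + "\t" + match


sample = [
    "This is a single text line",
    "Another thing",
    "It is better you try again",
    "Better",
]
for row in first_plus_word(sample, ["text", "thing", "try", "Better"]):
    print(row)
```

Because the match is case-sensitive, the lowercase "better" in the third line is ignored and only "try" is reported for it.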

— Choroba
source

9

Here's an awk version:

awk '
  NR==FNR {a[$0]++; next;} 
  {
    gsub(/"/,"",$0);
    for (i=1; i<=NF; i++)
      if ($i in a) printf "%s\n", i==1? $i : $1"\t"$i;
  }
  ' file2 file1

where file2 is the word list and file1 contains the phrases.
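The awk logic (load the word list into a lookup table, then test every field of every phrase) might look like this in Python. It is a rough sketch, assuming both inputs are already in memory as lists of lines; `match_fields` is a hypothetical name, and a `set` plays the role of the awk array `a`.

```python
def match_fields(word_list, phrases):
    """Return one output row per field that appears in the word list."""
    wanted = set(word_list)  # same job as a[$0]++ in the first awk block
    rows = []
    for line in phrases:
        # Drop double quotes, then split on whitespace like awk's default fields.
        fields = line.replace('"', "").split()
        for i, field in enumerate(fields):
            if field in wanted:
                # First field: print it alone; otherwise first field, tab, match.
                rows.append(field if i == 0 else fields[0] + "\t" + field)
    return rows


phrases = [
    '"This is a single text line"',
    '"Another thing"',
    '"It is better you try again"',
    '"Better"',
]
for row in match_fields(["text", "thing", "try", "Better"], phrases):
    print(row)
```

Note that this compares whole fields, so a word list entry "text" would not match a field such as "texts" or "text,".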

— steeldriver
source

2

Nice one! For convenience, I've put it into a script file: paste.ubuntu.com/23063130

— Sergiy Kolodyazhnyy

8

Here's a Python version:

#!/usr/bin/env python
from __future__ import print_function
import sys

# List of strings that you want
# to search for in the file. Change it
# as you see fit. Remember the commas.
strings = [
          'text', 'thing',
          'try', 'Better'
          ]


with open(sys.argv[1]) as input_file:
    for line in input_file:
        for string in strings:
            if string in line:
                words = line.strip().split()
                # Print the first word without a trailing newline.
                print(words[0], end="")
                if len(words) > 1:
                    # sep="" avoids a stray space between the tab and the string.
                    print("\t", string, sep="")
                else:
                    print("")

Demo:

$> cat input_file.txt
This is a single text line
Another thing
It is better you try again
Better
$> python ./initial_word.py input_file.txt
This    text
Another thing
It  try
Better

Side note: the script is python3-compatible, so you can run it with either python2 or python3.

— Sergiy Kolodyazhnyy
source

7

Try this:

$ sed -En 's/(([[:alnum:]]+)[[:space:]].*)?(text|thing|try|Better).*/\2\t\3/p' File
This    text
Another thing
It      try
        Better

If the tab before Better is a problem, then try this:

$ sed -En 's/(([[:alnum:]]+)[[:space:]].*)?(text|thing|try|Better).*/\2\t\3/; ta; b; :a; s/^\t//; p' File
This    text
Another thing
It      try
Better

The above was tested on GNU sed (called gsed on OSX). For BSD sed, some minor changes may be needed.

How it works

s/(([[:alnum:]]+)[[:space:]].*)?(text|thing|try|Better).*/\2\t\3/

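This looks for a word, [[:alnum:]]+, followed by a space, [[:space:]], followed by anything, .*, followed by one of your words, text|thing|try|Better, followed by anything at all. If that is found, it is replaced with the first word on the line (if any), a tab, and the matched word.

ta; b; :a; s/^\t//; p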

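If the substitution command actually made a substitution, meaning that one of your words was found on the line, the ta command tells sed to jump to label a. If not, we branch (b) to the next line. :a defines the label a. So, if one of your words was found, we (a) run the substitution s/^\t//, which removes a leading tab if there is one, and (b) print (p) the line.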
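The control flow above (substitute, skip the line if nothing was replaced, otherwise strip a leading tab and print) can be mirrored in Python. This is a hedged sketch rather than an exact translation: `sed_like` is a hypothetical helper, `[A-Za-z0-9]` is an ASCII-only stand-in for `[[:alnum:]]`, and Python's backtracking regex engine is assumed to pick the same match as sed for inputs like these.

```python
import re

WORDS = ["text", "thing", "try", "Better"]

# Optional prefix "first word, whitespace, anything", then one of the words,
# then the rest of the line. Group 1 is the first word, group 2 the matched word.
PATTERN = re.compile(
    r"(?:([A-Za-z0-9]+)\s.*)?(" + "|".join(map(re.escape, WORDS)) + r").*"
)


def sed_like(line):
    """Return the rewritten line, or None when no word was found."""
    def repl(m):
        # Group 1 is None when the matched word is itself the first word.
        return (m.group(1) or "") + "\t" + m.group(2)

    new_line, count = PATTERN.subn(repl, line, count=1)
    if count == 0:
        return None  # like `b`: nothing substituted, so nothing is printed
    return re.sub(r"^\t", "", new_line)  # like s/^\t//


for line in ["This is a single text line", "Another thing",
             "It is better you try again", "Better", "no match here"]:
    result = sed_like(line)
    if result is not None:
        print(result)
```

Returning None for unmatched lines plays the role of sed's -n option combined with the b branch: only lines where the substitution succeeded produce output.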

— John1024
source

7

A simple bash/sed approach:

$ while read w; do sed -nE "s/\"(\S*).*$w.*/\1\t$w/p" file; done < words 
This    text
Another thing
It  try
    Better

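The while read w; do ...; done < words loop iterates over each line of the file words and saves it as $w. The -n makes sed print nothing by default. The sed command then matches a double quote followed by non-whitespace (\"(\S*); the parentheses "capture" what \S* matches, which is the first word, so we can refer to it later as \1), then 0 or more characters (.*), then the word we are looking for ($w), then 0 or more characters again (.*). If this matches, we replace it with only the first word, a tab, and $w (\1\t$w), and print the line (that is what the p in s///p does). When the word is itself the first word, as with Better, \1 is empty, which is why that output line starts with a tab.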

— terdon
source

5

Here's a Ruby version

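str_list = ['text', 'thing', 'try', 'Better']

File.open(ARGV[0]) do |f|
  f.each_line do |line|
    words = line.split(' ')
    # Check every search string against every line, instead of pairing
    # line N with string N (which fails once the file has more lines
    # than the list has strings, or lists them in a different order).
    str_list.each do |str|
      next unless line.include?(str)
      if words.length == 1
        puts words[0]
      else
        puts words[0] + "\t" + str
      end
    end
  end
end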

The sample text file hello.txt contains

This is a single text line
Another thing
It is better you try again
Better

Running ruby source.rb hello.txt results in

This    text
Another thing
It      try
Better

— Anwar
source