Split a string by spaces in Python, preserving quoted substrings

268

I have a string like this:

this is "a test"

I am trying to write something in Python that splits it by spaces while ignoring spaces inside quotes. The result I am looking for is:

['this','is','a test']

PS. I know you are going to ask, "What happens if there are quotes within the quotes?" In my application, that will never happen.

python regex

— Adam Pierce

1

Thanks for asking this question. It is exactly what I needed to fix the pypar build module.

— Martlark

391

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

— Jerub

13

Use posix=False to preserve the quotes: shlex.split('this is "a test"', posix=False) returns ['this', 'is', '"a test"'].

— Boon
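
For reference, a minimal sketch of the two modes side by side, using only the standard library. The variable names are illustrative.

```python
import shlex

line = 'this is "a test"'

# Default (POSIX) mode: the quotes group the words and are then removed.
stripped = shlex.split(line)

# Non-POSIX mode: the quotes still group the words but stay in the token.
kept = shlex.split(line, posix=False)

print(stripped)  # ['this', 'is', 'a test']
print(kept)      # ['this', 'is', '"a test"']
```

Which mode fits depends on whether the caller needs to know later that a token was quoted.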

@MatthewG. The "fix" in Python 2.7.3 means that passing a unicode string to shlex.split() will trigger a UnicodeEncodeError exception.

— Rockallite

57

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

— Allen

40

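The regex approaches I see here look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you are going to use regexes, why not just say exactly what you mean?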

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.

1

I was thinking much the same, but would suggest instead re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')

— Darius Bacon

2

+1 I am using this because it is a lot faster than shlex.

— hanleyp

3

Why the triple backslash? Wouldn't a single backslash do the same?

— Doppelganger

1

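Actually, one thing I don't like about this is that anything directly before or after the quotes is not split properly. If I have a string like 'PARAMS val1="Thing" val2="Thing2"', I expect it to split into three pieces, but it splits into 5. It has been a while since I did any regex, so I don't feel like trying to fix it using your solution right now.

— leetNightshade, 2013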
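
A small sketch that reproduces the 5-piece result described in this comment and contrasts it with the findall-based pattern from another answer in this thread. The variable names are illustrative, and the patterns are written as raw strings.

```python
import re

line = 'PARAMS val1="Thing" val2="Thing2"'

# re.split with a capturing group: a quoted run is treated as a separator,
# so the text on either side of it becomes its own piece.
split_pieces = [p for p in re.split(r'( |".*?"|\'.*?\')', line) if p.strip()]

# re.findall with a repeated group: quoted runs and non-space characters
# are glued into one token until an unquoted space is reached.
findall_pieces = re.findall(r'(?:".*?"|\S)+', line)

print(split_pieces)    # five pieces
print(findall_pieces)  # three pieces
```

The findall variant keeps the quotes inside each token, so strip them afterwards if they are not wanted.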

1

You should use raw strings when using regular expressions.

— asmeurer, 2013

28

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']

— Ryan Ginstrom

2

Useful when shlex strips out some characters you need.

— scraplesh, 2013

1

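CSV uses two consecutive double quotes (as in side by side, "") to represent one double quote ", so it turns two double quotes into a single one: 'this is "a string""' and 'this is "a string"""' both map to ['this', 'is', 'a string"'].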

— Boris
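
A minimal sketch of that doubled-quote rule on a well-formed input, assuming the default csv dialect with only the delimiter changed. The helper name csv_split is illustrative.

```python
import csv

def csv_split(line):
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line], delimiter=" "))

# Inside a quoted field, "" is read as one literal double quote.
print(csv_split('say "a ""quoted"" word"'))
```

If the input may contain doubled quotes that are not meant as escapes, this behavior is worth testing before relying on csv for splitting.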

15

I used shlex.split to process 70,000,000 lines of squid log, and it was very slow. So I switched to re.

If you have a performance problem with shlex, try this.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

— 戴丹

8

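Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces inside the quoted parts with \x00, then split by spaces, then replace the \x00 back with spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.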

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

— 改良剂

You should use re.Scanner instead. It is more reliable (and I have in fact implemented something shlex-like using re.Scanner).

— Devin Jeanpierre, 2009
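
A rough sketch of what a re.Scanner-based tokenizer could look like. This is not the commenter's implementation. re.Scanner is an undocumented class in CPython's re module, so its availability is an assumption, and the three rules below are illustrative.

```python
import re

# re.Scanner is undocumented; treat it as an assumption, not a stable API.
# Each rule is (pattern, action); an action of None discards the match.
scanner = re.Scanner([
    (r'"[^"]*"', lambda _scanner, token: token[1:-1]),  # quoted run, quotes dropped
    (r'[^\s"]+', lambda _scanner, token: token),         # bare word
    (r'\s+', None),                                      # whitespace is skipped
])

# scan() returns the collected tokens and whatever input it could not match.
tokens, remainder = scanner.scan('this is "a test"')
print(tokens)
print(repr(remainder))
```

A non-empty remainder signals input that no rule matched, such as an unterminated quote.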

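+1 Hm, this is a pretty smart idea: breaking the problem down into multiple steps so the answer isn't terribly complex. shlex didn't do exactly what I needed, even after trying to tweak it. And the single-pass regex solutions were getting really weird and complicated.

— leetNightshade, 2013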

6

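It seems that re is faster, so it is the better choice for performance reasons. Here is my solution, which uses a non-greedy operator and preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result: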

['this', 'is', '"a test"']

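It leaves constructs like aaa"bla blub"bbb together, because these tokens are not separated by spaces. If the string contains escaped characters, you can match like this: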

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.

— hochl

1

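Another important advantage of this solution is its versatility with respect to the delimiting character (e.g. a comma, via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) character(s).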

— a_guest
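
A small sketch of that generalization, assuming a single-character delimiter and double quotes only. The helper name split_on is illustrative.

```python
import re

def split_on(delimiter, line):
    """Split on a single-character delimiter, keeping "quoted" runs intact."""
    # re.escape guards against delimiters that are special inside [...].
    pattern = r'(?:".*?"|[^%s])+' % re.escape(delimiter)
    return re.findall(pattern, line)

print(split_on(',', 'a,"b,c",d'))
```

Note that this findall-based form silently drops empty fields (for example the middle one in 'a,,b'), which may or may not be what you want.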

4

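The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.

I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: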

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which splits a string so that every input string above produces the expected output:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

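So performance is much better than with shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.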
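
A minimal sketch of the precompilation mentioned above, assuming the same pattern as re_split. The names _TOKEN, _strip_quotes and quoted_split_compiled are illustrative and not part of the original answer.

```python
import re

# Compiled once at import time instead of being looked up on every call.
_TOKEN = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')

def _strip_quotes(token):
    # Drop one pair of matching outer quotes, if present.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    # Same post-processing as re_split: unquote, then unescape \" and \'.
    return [_strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in _TOKEN.findall(s)]

print(quoted_split_compiled('"abc def" ghi'))
```

The re module already caches recently compiled patterns, so the gain may be modest; it is worth measuring on your own data.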

— Ton van den Heuvel

Not sure what you're talking about: ```>>> shlex.split('this is "a test"') ['this', 'is', 'a test'] >>> shlex.split('this is \\"a test\\"') ['this', 'is', '"a', 'test"'] >>> shlex.split('this is "a \\"test\\""') ['this', 'is', 'a "test"']```

— morsik

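@morsik, what is your point? Maybe your use case doesn't match mine? If you look at the test cases, you'll see all the cases where shlex does not behave as expected for my use case.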

— Ton van den Heuvel

3

To preserve the quotes, use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

— THE_MAD_KING

Your function is very slow on larger strings.

— Faran2007 '19

3

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
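
The %timeit lines above need IPython. A rough plain-Python equivalent using the standard timeit module might look like the following; the candidates dict and the iteration count are illustrative, and the numbers will differ by machine and Python version.

```python
import csv
import re
import shlex
from timeit import timeit

line = 'this is "a test"'

# One zero-argument callable per approach, so timeit can call each directly.
candidates = {
    "re.split": lambda: [p for p in re.split(r'( |".*?"|\'.*?\')', line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

runs = 10_000
for name, fn in candidates.items():
    seconds = timeit(fn, number=runs)
    # Convert the total to microseconds per call.
    print(f"{name}: {seconds / runs * 1e6:.2f} us per call")
```

Keep in mind that the approaches do not return identical results (for example, the regex variants keep the quotes), so speed is only one part of the comparison.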

— har777

1

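Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but it correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings: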

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

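This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the escape characters left in the strings of the returned list are not wanted, you can use this slightly altered version of the function: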

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

1

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

— Moschlar

For Python 2.7.5 this should be: split = lambda a: [b.decode('utf-8') for b in _split(a)]. Otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)

— Peter Varo, 2013

1

As an option, try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

— Mikhail Zakharov

0

I suggest:

Test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore the empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

— 粗鲁的

This can also be written as re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s).

— hochl

-3

If you don't care about substrings, then a simple

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform better than the string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the RE engine

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory; instead, either split it into lines or use an iterative loop.

— 格雷戈里

11

You seem to have missed the whole point of the question. The quoted sections of the string must not be split.

— rjmunro

-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

— j

Please provide the repr of a string you think will fail.

— pjz

Think? adamsplit("This is 'a test'") → ['This', 'is', "'a", "test'"]

— Matthew Schinckel, 2016

The OP only says "within quotes" and only gives an example with double quotes.

— pjz