Answers:
You want split, from the built-in shlex module.
>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
This should do exactly what you want.
With posix=False the quotation marks are kept:
shlex.split('this is "a test"', posix=False)
returns ['this', 'is', '"a test"'].
On some Python 2 versions, passing a unicode string to shlex.split() raises a UnicodeEncodeError exception.
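As a quick illustration of the two modes on Python 3, here is a minimal sketch. The helper name safe_split is made up for this example; it only wraps shlex.split and, as one possible policy, falls back to plain str.split when shlex raises ValueError on unbalanced quotes.

```python
import shlex


def safe_split(text, keep_quotes=False):
    """Split text the way a shell would.

    safe_split is a hypothetical helper, not part of shlex.
    """
    try:
        # posix=False leaves the quote characters inside the tokens.
        return shlex.split(text, posix=not keep_quotes)
    except ValueError:
        # shlex raises ValueError ("No closing quotation") on unbalanced
        # quotes; falling back to a plain whitespace split is an assumption.
        return text.split()


print(safe_split('this is "a test"'))
print(safe_split('this is "a test"', keep_quotes=True))
print(safe_split('this is "a test'))
```

Whether a silent fallback is acceptable depends on the input; re-raising the error may be the better choice for user-facing command parsing.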
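The regex approaches I see here look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you are going to use regexes, why not just say exactly what you mean?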
import re
test = 'this is "a test"' # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
Explanation:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
shlex probably provides more features, though.
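To see why the non-greedy `.*?` form from the comments matters, here is a small sketch comparing the two patterns on a string with two quoted parts. The names GREEDY, LAZY and pieces are made up for this illustration.

```python
import re

# The first (commented-out) pattern: greedy, any quote may close any quote.
GREEDY = re.compile(r"""( |["'].*["'])""")
# The pattern suggested in the comments: non-greedy, quotes must pair up.
LAZY = re.compile(r"""( |".*?"|'.*?')""")


def pieces(pattern, text):
    # re.split keeps the captured separators; drop blanks and empty strings.
    return [p for p in pattern.split(text) if p.strip()]


text = 'a "b" c "d"'
print(pieces(GREEDY, text))  # the greedy form swallows everything between the outermost quotes
print(pieces(LAZY, text))    # the lazy form stops at the first closing quote
```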
Depending on your use case, you may also want to check out the csv module:
import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)
Output:
['this', 'is', 'a string']
['and', 'more', 'stuff']
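Two details of the csv approach are worth trying out: the quote character is configurable, and consecutive delimiters produce empty fields. The sketch below assumes a single input line; the wrapper name csv_split is just for illustration.

```python
import csv


def csv_split(text, quotechar='"'):
    # csv.reader expects an iterable of lines, so wrap the string in a list
    # and take the first (only) row.
    return next(csv.reader([text], delimiter=" ", quotechar=quotechar))


print(csv_split('this is "a string"'))
print(csv_split("this is 'a string'", quotechar="'"))
# Two spaces in a row are two delimiters, so an empty field appears.
print(csv_split("two  spaces"))
```

Only one quote character is honoured at a time, so mixed single- and double-quoted input needs a different approach.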
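As this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on spaces, then replace the \x00 back to spaces in each part.
Both versions do the same thing, but splitter is a bit more readable than splitter2.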
import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        # Hide the spaces inside a quoted part behind a placeholder.
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    # Restore the hidden spaces in every part.
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
For performance reasons, re seems to be faster. Here is my solution using a non-greedy operator that preserves the outer quotes:
re.findall(r'(?:".*?"|\S)+', s)
Result:
['this', 'is', '"a test"']
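It leaves constructs like aaa"bla blub"bbb together, because these tokens are not separated by spaces. If the string contains escaped characters, you can match like this: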
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""
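Please note that this also matches the empty string "" by means of the \S part of the pattern.
The pattern is also versatile with respect to the delimiter (for example a comma, via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) characters.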
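A minimal sketch of that generalization follows. The function name split_outside_quotes and its parameters are assumptions made for this example, and it only handles a single-character separator and a single quote character.

```python
import re


def split_outside_quotes(text, sep=" ", quote='"'):
    """Split on sep, keeping quoted parts (quotes included) in one token.

    Hypothetical helper: sep and quote must each be a single character.
    """
    q = re.escape(quote)
    # Same shape as '(?:".*?"|[^,])+': either a quoted run or one
    # character that is not the separator, repeated.
    pattern = "(?:{q}.*?{q}|[^{s}])+".format(q=q, s=re.escape(sep))
    return re.findall(pattern, text)


print(split_outside_quotes('this is "a test"'))
print(split_outside_quotes('a,"b,c",d', sep=","))
print(split_outside_quotes("x;'y;z'", sep=";", quote="'"))
```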
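The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.
I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: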
 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]
I ended up with the following function, which splits a string so that every input string above yields the expected output:
import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]
The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:
#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')
Output:
shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration
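So performance is much better than shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.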
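The precompiled variant mentioned above could look like the following sketch. It uses the same pattern and the same quote-stripping logic as the function in this answer; the names TOKEN_RE, strip_quotes and quoted_split_compiled are made up here, and no timing is claimed for it.

```python
import re

# Same pattern as in the answer, compiled once at import time.
TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')


def strip_quotes(token):
    # Remove one pair of matching outer quotes, if present.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token


def quoted_split_compiled(s):
    return [strip_quotes(t).replace('\\"', '"').replace("\\'", "'")
            for t in TOKEN_RE.findall(s)]


print(quoted_split_compiled('"abc \\" def" ghi'))
print(quoted_split_compiled("abc'def' ghi"))
```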
None of these, shlex included, behaves the way my use case expects.
To preserve the quotes, use this function:
def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
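The same character-by-character idea can be extended to both quote styles and to runs of whitespace. This is only a sketch under those assumptions; split_keep_quotes is a made-up name, and it does not handle escaped quotes.

```python
def split_keep_quotes(s):
    """Split on whitespace outside quotes; quotes stay in the tokens."""
    args = []
    cur = []
    quote = None  # the quote character we are currently inside, if any
    for ch in s.strip():
        if quote is None and ch.isspace():
            # Token boundary; skip empty tokens from repeated whitespace.
            if cur:
                args.append(''.join(cur))
                cur = []
        else:
            if quote is None and ch in '"\'':
                quote = ch      # opening quote
            elif ch == quote:
                quote = None    # matching closing quote
            cur.append(ch)
    if cur:
        args.append(''.join(cur))
    return args


print(split_keep_quotes('this  is "a test" \'b c\''))
```

Note that an apostrophe inside a word (as in don't) opens a quote in this sketch, which may or may not be what you want.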
Speed test of the different answers:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
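Outside IPython, a comparable measurement can be scripted with the timeit module. The sketch below is one way to do it on Python 3; the names candidates and benchmark are made up, and the numbers it prints will differ from the ones above depending on machine and Python version.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'
FINDALL_RE = re.compile(r'[^"\s]\S*|".+?"')

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    "re.findall": lambda: FINDALL_RE.findall(line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}


def benchmark(number=10000):
    """Return the average time per call in microseconds for each candidate."""
    return {name: timeit.timeit(fn, number=number) / number * 1e6
            for name, fn in candidates.items()}


for name, usec in benchmark().items():
    print("%-12s %.2f usec per call" % (name, usec))
```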
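Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but it correctly splits strings whose substrings contain escaped quotes, and it also removes the opening and closing quotes of the substrings: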
[i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
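This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).
If you do not want the escape characters to remain in the strings of the returned list, you can use this slightly altered version (Python 2 only, since it relies on the string_escape codec):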
[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
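On Python 3, where str has no decode method, one possible substitute is to remove only the backslashes that escape quotes. The sketch below reuses the split pattern from this answer; split_unescaped and unquote are hypothetical names, and unlike string_escape it does not interpret sequences such as \n. It also slices off one pair of outer quotes instead of calling strip, so an escaped quote at the very end of a substring survives.

```python
import re

# Same split pattern as above: whitespace, or a quoted run whose opening
# and closing quotes are not preceded by a backslash.
SPLIT_RE = re.compile(r'''(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)'.*?(?<!\\)')''')
# A backslash directly followed by a quote character.
UNESCAPE_RE = re.compile(r'''\\(["'])''')


def unquote(piece):
    # Drop one pair of matching outer quotes, then unescape inner quotes.
    if len(piece) >= 2 and piece[0] == piece[-1] and piece[0] in '"\'':
        piece = piece[1:-1]
    return UNESCAPE_RE.sub(r"\1", piece)


def split_unescaped(text):
    return [unquote(p) for p in SPLIT_RE.split(text) if p.strip()]


print(split_unescaped(r'''This is " a \"test\"\'s substring"'''))
```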
To get around the unicode issues in some Python 2 versions, I suggest:
from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]
If the input is already a byte string, skip the encode step:
split = lambda a: [b.decode('utf-8') for b in _split(a)]
Otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)
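For code that has to run on both major versions, the workaround can be wrapped behind a version check. This is a sketch with a made-up name, split_text; on Python 3 shlex.split accepts str directly, so the encode/decode round trip is only taken on Python 2.

```python
import shlex
import sys


def split_text(text):
    """Split a text (unicode) string with shlex on Python 2 or 3.

    split_text is a hypothetical wrapper, not part of shlex.
    """
    if sys.version_info[0] >= 3:
        # Python 3: str is already unicode and shlex handles it as is.
        return shlex.split(text)
    # Python 2: hand shlex UTF-8 bytes, then decode each token again.
    return [part.decode("utf-8") for part in shlex.split(text.encode("utf-8"))]


print(split_text(u'na\xefve "caf\xe9 au lait"'))
```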
As an alternative, try tssplit:
In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']
I suggest:
Test string:
s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
To capture "" and '' as well:
import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)
Result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]
To ignore empty "" and '':
import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)
Result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
This could also be written as:
re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s)
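The two patterns above can be wrapped in one function for experimenting. The name quoted_tokens and the keep_empty flag are made up for this sketch.

```python
import re

# Pattern that also captures empty "" and ''.
WITH_EMPTY = re.compile(r'''"[^"]*"|'[^']*'|[^"'\s]+''')
# Pattern that requires at least one character between the quotes.
WITHOUT_EMPTY = re.compile(r'''"[^"]+"|'[^']+'|[^"'\s]+''')


def quoted_tokens(text, keep_empty=True):
    pattern = WITH_EMPTY if keep_empty else WITHOUT_EMPTY
    return pattern.findall(text)


s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
print(quoted_tokens(s))
print(quoted_tokens(s, keep_empty=False))
```

One caveat with the second pattern: when an empty pair is followed by more quoted text, its second quote can pair up with the next opening quote, so '"" x "y"' yields '" x "' as a token.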
If you don't care about quoted substrings, a simple split will do:
>>> 'a short sized string with spaces '.split()
Performance:
>>> import timeit
>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass
Or the string module:
>>> from string import split as stringsplit;
>>> stringsplit('a short sized string with spaces '*100)
Performance: the string module seems to perform better than the string method
>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass
Or you can use the RE engine:
>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)
Performance:
>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass
For very long strings you should not load the entire string into memory; instead, split it into lines or use an iterative loop.
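One way to follow that advice is a generator that walks the input line by line and yields tokens lazily. The sketch below assumes that quoted substrings never span a line break; iter_tokens and TOKEN_RE are names made up for the example, and the pattern keeps double-quoted parts together.

```python
import re

# A double-quoted run, or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')


def iter_tokens(lines):
    """Yield tokens one at a time from any iterable of lines.

    Works with an open file object, so the whole input never has to be
    held in memory at once.
    """
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            yield match.group(0)


sample = ['this is "a test"\n', 'and "more stuff" here\n']
for token in iter_tokens(sample):
    print(token)
```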
Try this:
def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
        if not inquotes:
            result.extend(substring.split())
        else:
            result.append(substring)
        inquotes = not inquotes
    return result
Some test strings:
'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]
Single quotes are not treated as quotes, though:
adamsplit("This is 'a test'") -> ['This', 'is', "'a", "test'"]