How to split a string in Python but ignore separators inside quoted strings?

Question 1

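I need to split a string like this on semicolons. But I don't want to split on semicolons that are inside a quoted string (' or "). I'm not parsing a file; it is just a simple string with no line breaks.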

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

part 1
"this is ; part 2;"
'this is ; part 3'
part 4
this "is ; part" 5

I suppose this can be done with a regex, but if not, I'm open to another approach.

Question 2

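Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator: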

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

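As Jean-Luc Nacif Coelho correctly points out, this won't handle empty groups correctly. Depending on the situation that may or may not matter. If it does matter, it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' before the split, where <marker> has to be some string (without semicolons) that you know does not appear in the data. You also need to restore the data afterwards: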

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However, this is a kludge. Any better suggestions?

Question 3

re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)

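Each time it finds a semicolon, the lookahead scans the entire remaining string, making sure it contains an even number of single quotes and an even number of double quotes. (Single quotes inside double-quoted fields, or vice versa, are ignored.) If the lookahead succeeds, the semicolon is a delimiter.

Unlike Duncan's solution, which matches the fields instead of the delimiters, this one has no problem with empty fields. (Not even the last one: unlike many other split implementations, Python's does not automatically discard trailing empty fields.)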
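To make the empty-field behaviour concrete, here is a minimal sketch that wraps the lookahead pattern above in a helper. The name split_outside_quotes and the sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Same lookahead as above: a semicolon counts as a delimiter only if the rest
# of the string is made of unquoted characters and complete quoted runs.
SEMICOLON_OUTSIDE_QUOTES = re.compile(r''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')

def split_outside_quotes(text):
    """Split text on semicolons that are not inside '...' or "..."."""
    return SEMICOLON_OUTSIDE_QUOTES.split(text)

# Empty fields in the middle and at the end are preserved.
print(split_outside_quotes("aaa;;aaa;'b;;b';"))
# ['aaa', '', 'aaa', "'b;;b'", '']
```

Because the lookahead rescans the remainder of the string for every semicolon, this is probably best kept to short strings like the one in the question.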

Question 4

>>> a='A,"B,C",D'
>>> a.split(',')
['A', '"B', 'C"', 'D']

It failed. Now try the csv module:
>>> import csv
>>> from io import StringIO
>>> data = StringIO(a)
>>> data
<_io.StringIO object at 0x107eaa368>
>>> reader = csv.reader(data, delimiter=',')
>>> for row in reader: print(row)
...
['A', 'B,C', 'D']

Question 5

Here is an annotated pyparsing approach:

from pyparsing import (printables, originalTextFor, OneOrMore, 
    quotedString, Word, delimitedList)

# unquoted words can contain anything but a semicolon
printables_less_semicolon = printables.replace(';','')

# capture content between ';'s, and preserve original text
content = originalTextFor(
    OneOrMore(quotedString | Word(printables_less_semicolon)))

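# process the string (test is the sample string from the question)
test = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(delimitedList(content, ';').parseString(test))

giving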

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 
 'this "is ; part" 5']

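By using pyparsing's provided quotedString, you also get support for escaped quotes.

You were also unclear about how to handle whitespace before or after a semicolon delimiter, and none of the fields in your sample text has any. Pyparsing would parse "a; b ; c" as: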

['a', 'b', 'c']

Question 6

You appear to have a semicolon-separated string. Why not use the csv module to do all the hard work?

Off the top of my head, this should work:

import csv
from io import StringIO

line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

data = StringIO(line)
reader = csv.reader(data, delimiter=';')
for row in reader:
    print(row)

That should give you something like
("part 1", "this is ; part 2;", 'this is ; part 3', "part 4", "this \"is ; part\" 5")

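Edit:
Unfortunately, this doesn't quite work (even if you do use StringIO, as I intended), because of the mixed string quotes (both single and double). What you actually get is

['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5'].

If you can change the data to contain only single or only double quotes in the appropriate places, it should work fine, but that sort of negates the question a bit.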
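As a small illustration of that last point, the sketch below feeds csv.reader a variant of the sample string in which every quoted field uses double quotes. The modified sample string is an assumption made for the example; it is not the data from the question.

```python
import csv
from io import StringIO

# Assumed variant of the sample data: only double quotes are used for quoting.
line = 'part 1;"this is ; part 2;";"this is ; part 3";part 4'

# csv.reader treats '"' as the quote character by default.
rows = list(csv.reader(StringIO(line), delimiter=';'))
fields = rows[0]
print(fields)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4']
```

Note that csv strips the surrounding quotes from quoted fields, which differs from the output requested in the question.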

Question 7

>>> x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> import re
>>> re.findall(r'''(?:[^;'"]+|'(?:[^']|\\.)*'|"(?:[^"]|\\.)*")+''', x)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

Question 8

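While it could be done with PCRE via lookaheads/lookbehinds/backreferences, this is not really a task that regex is designed for, because of the need to match balanced pairs of quotes.

Instead it's probably best to just make a mini state machine and parse through the string like that.

Edit

As it turns out, thanks to the handy guarantee that Python's re.findall returns non-overlapping matches, this can be more straightforward to do with a regex in Python than it might otherwise be. See the comments for details.

However, if you're curious about what a non-regex implementation might look like: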

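x = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

results = [[]]   # one list of characters per field
quote = None     # the quote character that is currently open, if any
for c in x:
    if c == "'" or c == '"':
        if c == quote:
            quote = None       # closing quote
        elif quote is None:
            quote = c          # opening quote
    elif c == ';':
        if quote is None:
            results.append([])  # unquoted semicolon: start a new field
            continue
    results[-1].append(c)

results = [''.join(chars) for chars in results]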

# results = ['part 1', '"this is ; part 2;"', "'this is ; part 3'",
#            'part 4', 'this "is ; part" 5']

Question 9

We can write a function of our own:

def split_with_commas_outside_of_quotes(string):
    arr = []
    start, flag = 0, False   # flag is True while inside double quotes
    for pos, x in enumerate(string):
        if x == '"':
            flag = not flag
        if not flag and x == ',':
            arr.append(string[start:pos])
            start = pos + 1
    # take everything after the last comma; string[start:pos] would drop the final character
    arr.append(string[start:])
    return arr

Question 10

This regex will do that: (?:^|;)("(?:[^"]+|"")*"|[^;]*)
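The answer gives only the pattern, so here is a minimal sketch of one way to apply it, using re.findall to collect the captured field from each match. The helper name fields and the sample strings are assumptions for illustration. As far as I can tell the pattern only recognises double-quoted fields (with "" as an escaped quote, CSV style), so the samples avoid single-quoted fields.

```python
import re

# The pattern from the answer: a field is either a double-quoted run
# (where "" stands for a literal quote) or a run of non-semicolons.
FIELD = re.compile(r'(?:^|;)("(?:[^"]+|"")*"|[^;]*)')

def fields(text):
    """Return the captured field of every match, in order."""
    return FIELD.findall(text)

print(fields('a;"b;c";;d'))
# ['a', '"b;c"', '', 'd']
print(fields('"say ""hi"";ok";x'))
# ['"say ""hi"";ok"', 'x']
```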

Question 11

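Since you do not have '\n' in the string, use it to replace any ';' that is not inside a quoted string:

>>> s = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> new_s = ''
>>> is_open = False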

>>> for c in s:
...     if c == ';' and not is_open:
...         c = '\n'
...     elif c in ('"',"'"):
...         is_open = not is_open
...     new_s += c

>>> result = new_s.split('\n')

>>> result
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

Question 12

Even though I'm certain there is a clean regex solution (so far I like @noiflection's answer), here is a quick-and-dirty non-regex answer.

s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

inQuotes = False
current = ""
results = []
currentQuote = ""
for c in s:
    if not inQuotes and c == ";":
        results.append(current)
        current = ""
    elif not inQuotes and (c == '"' or c == "'"):
        currentQuote = c
        inQuotes = True
    elif inQuotes and c == currentQuote:
        currentQuote = ""
        inQuotes = False
    else:
        current += c

results.append(current)

print(results)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4', 'this is ; part 5']

(I've never put together something of this sort; feel free to critique my form!)

Question 13

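My approach is to replace all non-quoted occurrences of the semicolon with another character that will never appear in the text, then split on that character. The following code uses the re.sub function with a function argument to search for all occurrences of a srch string that are not enclosed in single or double quotes or in parentheses, brackets or braces, and replace them with a repl string: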

import re

def srchrepl(srch, repl, string):
    """
    Replace non-bracketed/quoted occurrences of srch with repl in string.
    """
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                          + srch + r"""])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)


def _subfact(repl):
    """
    Replacement function factory for regex sub method in srchrepl.
    """
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            if qtflags == 0:
                level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            if qtflags == 0:
                level -= 1
            return mo.group(0)
    return subf

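You can simplify this code if you don't care about the bracket characters.
Say you wanted to use a pipe (vertical bar) as the substitute character; then you would do: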

mylist = srchrepl(';', '|', mytext).split('|')

BTW, this uses nonlocal, which requires Python 3; if you need to, change it to global (with level and qtflags moved to module level).
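For the simplification mentioned above, a sketch that drops the bracket handling might look like the following. The name replace_unquoted is hypothetical, and the quote tracking is an assumption of mine: it remembers which quote character is open instead of toggling bit flags, so an apostrophe inside a double-quoted field is ignored. It also assumes sep is a single character.

```python
import re

def replace_unquoted(sep, repl, text):
    """Replace occurrences of sep that are outside quotes with repl."""
    token = re.compile(r"""(?P<quote>['"])|(?P<sep>%s)""" % re.escape(sep))
    open_quote = None  # the quote character currently open, if any

    def substitute(match):
        nonlocal open_quote
        ch = match.group(0)
        if match.group('quote'):
            if open_quote is None:
                open_quote = ch        # opening quote
            elif open_quote == ch:
                open_quote = None      # matching closing quote
            return ch                  # quotes are always kept as they are
        # a separator: replace it only when no quote is open
        return repl if open_quote is None else ch

    return token.sub(substitute, text)

mytext = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
mylist = replace_unquoted(';', '|', mytext).split('|')
print(mylist)
```

As with the original, this only works if the replacement character never occurs in the text.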

Question 14

A generalized solution:

import re
regex = '''(?:(?:[^{0}"']|"[^"]*(?:"|$)|'[^']*(?:'|$))+|(?={0}{0})|(?={0}$)|(?=^{0}))'''

delimiter = ';'
data2 = ''';field 1;"field 2";;'field;4';;;field';'7;'''
field = re.compile(regex.format(delimiter))
print(field.findall(data2))

Output:

['', 'field 1', '"field 2"', '', "'field;4'", '', '', "field';'7", '']

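This solution:

captures all the empty groups (including at the beginning and at the end)
works for most popular delimiters, including space, tab, and comma
treats quotes inside quotes of the other type as non-special characters
if an unmatched, unquoted quote is encountered, treats the remainder of the line as quoted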
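The properties listed above can be tried with other delimiters using a small wrapper such as this sketch. The function name split_fields and the sample inputs are assumptions for illustration, and the re.escape call is my addition so that a delimiter like '|' is taken literally; the original formats the delimiter into the pattern unescaped.

```python
import re

# The pattern template from the answer, with {0} standing for the delimiter.
TEMPLATE = '''(?:(?:[^{0}"']|"[^"]*(?:"|$)|'[^']*(?:'|$))+|(?={0}{0})|(?={0}$)|(?=^{0}))'''

def split_fields(text, delimiter):
    """Return the fields of text, keeping quotes and empty fields."""
    # Assumption: escape the delimiter so regex metacharacters are safe to use.
    pattern = re.compile(TEMPLATE.format(re.escape(delimiter)))
    return pattern.findall(text)

print(split_fields('a,"b,c",,d', ','))
# ['a', '"b,c"', '', 'd']
print(split_fields('a|"b|c"||d', '|'))
# ['a', '"b|c"', '', 'd']
print(split_fields('a,"b,c', ','))   # unmatched quote: the rest is treated as quoted
# ['a', '"b,c']
```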

Question 15

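Although the topic is old and the previous answers work well, I propose my own implementation of the split function in Python.

It works fine if you don't need to process a large number of strings, and it is easy to customize.

Here's my function: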

# l is the string to parse
# splitchar is the separator
# ignorechar holds the quote character(s) between which you don't want to split

def splitstring(l, splitchar, ignorechar):
    result = []
    string = ""
    quote = None  # the quote character that is currently open, if any
    for c in l:
        if quote is None and c in ignorechar:
            quote = c          # opening quote
        elif c == quote:
            quote = None       # matching closing quote
        if c == splitchar and quote is None:
            result.append(string)
            string = ""
        else:
            string += c        # quote characters are kept in the field
    result.append(string)      # don't lose the last field
    return result

Then you can run:

line= """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
splitted_data = splitstring(line, ';', '"\'')

Result:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

The advantage is that this function works with empty fields and with any number of separators in the string.

Hope this helps!

Question 16

Instead of splitting on a separator pattern, just capture whatever you need:

>>> import re
>>> data = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> re.findall(r';([\'"][^\'"]+[\'"]|[^;]+)', ';' + data)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ', ' part" 5']

Question 17

This seemed to me a semi-elegant solution.

New solution:

import re
reg = re.compile('(\'|").*?\\1')
pp = re.compile('.*?;')
def splitter(string):
    #add a last semicolon
    string += ';'
    replaces = []
    s = string
    i = 1
    #replace the content of each quote for a code
    for quote in reg.finditer(string):
        out = string[quote.start():quote.end()]
        s = s.replace(out, '**' + str(i) + '**')
        replaces.append(out)
        i+=1
    #split the string without quotes
    res = pp.findall(s)

    #add the quotes again
    #TODO this part could be faster.
    #(linear instead of quadratic)
    i = 1
    for replace in replaces:
        for x in range(len(res)):
            res[x] = res[x].replace('**' + str(i) + '**', replace)
        i+=1
    return res

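Old solution:

I chose to match an opening quote if there is one, wait for it to close, and then match an ending semicolon. Each "part" you want to match needs to end in a semicolon. So this matches things like:

'foobar; .sska';
"akjshd; asjkdhkj ..";
asdkjhakjhajsd.jhdf;

Code: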

mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')
res = mm.findall('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')

You may have to do some post-processing on res, but it contains what you want.
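As a sketch of what that post-processing could look like: with two groups in the pattern, re.findall returns (whole match, quote) tuples, so one option is to keep the whole match and cut off its trailing semicolon. The helper name split_parts is hypothetical, and appending a final ';' so that the last part is matched is an assumption carried over from the new solution above.

```python
import re

# The pattern from the old solution: group 1 is the whole part including its
# trailing semicolon, group 2 (named quote) is the opening quote, if any.
mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')

def split_parts(text):
    """Return each matched part without its trailing semicolon."""
    # Append a semicolon so the last part is terminated like the others.
    return [whole[:-1] for whole, _quote in mm.findall(text + ';')]

res = split_parts('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')
print(res)
# ['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4']
```

This keeps the limitation of the pattern itself: a part that contains a quoted section but does not start with a quote, such as the fifth part in the question, is still split at the inner semicolon.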