re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

I ran into this one problem with Python regular expressions. Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

Actual output:

['ab', 'cd']
['cd']

I'm confused: why doesn't the second result contain 'ab' as well?

python regex

— rock

re.findall('(ab|cd)', string) gives ['ab', 'cd'], while re.findall('(ab|cd)+', string) gives ['cd'].

— rock

Answers:

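+ is a repetition quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capturing group (ab|cd) with +. A repeated capturing group keeps only its last iteration.

You can reason about this behavior as follows: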

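Say your string is abcdla and the regex is (ab|cd)+. The regex engine finds a match for the group at positions 0 and 1, ab, and exits the capturing group. It then sees the + quantifier, so it tries the group again and captures cd at positions 2 and 3, overwriting the earlier ab.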
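To check this walk-through, here is a minimal sketch using re.search on the question's string. Note that Python reports spans as half-open intervals, so "positions 2 and 3" shows up as (2, 4).

```python
import re

string = 'abcdla'
m = re.search('(ab|cd)+', string)

# group(0) is the whole match, covering both iterations of the group.
print(m.group(0), m.span(0))  # abcd (0, 4)

# group(1) holds only what the group captured on its last iteration.
print(m.group(1), m.span(1))  # cd (2, 4)
```

The first iteration's 'ab' is part of the overall match, but it is no longer retrievable from group 1.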

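If you want to capture all iterations, capture the repeated group instead with ((ab|cd)+), whose outer and inner groups capture abcd and cd respectively. Since we don't care about the inner group's matches, you can make the inner group non-capturing with ((?:ab|cd)+), which captures abcd.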
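As a quick sketch of the two variants on the question's string (the variable names are only for illustration):

```python
import re

string = 'abcdla'

# Outer group wraps the repeated inner group. With two groups in the
# pattern, findall returns tuples of (outer capture, inner capture).
both = re.findall('((ab|cd)+)', string)

# Inner group made non-capturing with (?:...), so only the outer group remains
# and findall returns plain strings.
outer_only = re.findall('((?:ab|cd)+)', string)

print(both)        # [('abcd', 'cd')]
print(outer_only)  # ['abcd']
```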

https://www.regular-expressions.info/captureall.html

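From the docs:

Let's say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That's easy enough: !(abc|123)! will do the trick.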

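Now let's say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag's label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it stores only abc.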
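The quoted tag example can be reproduced in Python with a short sketch like the one below. The second pattern, which wraps the repetition in a capturing group, is the fix suggested earlier in this answer applied to the tag regex.

```python
import re

tags = ['!abc123!', '!123abcabc!']

# Repeated capturing group: only the last iteration is kept.
last_only = {t: re.fullmatch(r'!(abc|123)+!', t).group(1) for t in tags}

# Capturing group around the repetition: the whole label is kept.
whole_label = {t: re.fullmatch(r'!((?:abc|123)+)!', t).group(1) for t in tags}

print(last_only)    # {'!abc123!': '123', '!123abcabc!': 'abc'}
print(whole_label)  # {'!abc123!': 'abc123', '!123abcabc!': '123abcabc'}
```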

— Shashank V

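Can you link to some documentation that states explicitly that + captures only the last iteration, and that explains what a capture group is?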

— Gulzar

@Gulzar, I've updated the answer. You can read about capture groups here: regular-expressions.info/refcapture.html

— Shashank V

@Shashank, thanks, your answer is exactly what I needed. Sincere thanks.

— rock

@rock Please accept the answer if it solved your problem.

— Shashank V

There's no need to wrap the whole regex in parentheses. Just '(?:ab|cd)+' will work.

— Dukeling

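I don't know if this will clear things up any further, but let's try to picture, in a simplified way, what happens behind the scenes. We'll start by summarizing what happens with match: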

   import re

   # group(0) returns the whole matched string. Captured groups are available via
   # groups(), or individually as group(1), group(2), ... Here there is only one
   # group, and a group holds only one substring at a time.
   string = 'abcdla'
   print(re.match('(ab|cd)', string).group(0))   # 'ab': the match is 'ab', and the group captures 'ab'
   print(re.match('(ab|cd)+', string).group(0))  # 'abcd': the match is 'abcd', but the group keeps only 'cd', the last iteration

findall matches and consumes the string as it goes. Let's imagine what happens with the regex '(ab|cd)':

      'abcdabla' ---> 1:   match: 'ab' |  capture : ab  | left to process:  'cdabla'
      'cdabla'   ---> 2:   match: 'cd' |  capture : cd  | left to process:  'abla'
      'abla'     ---> 3:   match: 'ab' |  capture : ab  | left to process:  'la'
      'la'       ---> 4:   no match |  capture : None  | left to process:  ''

      --- final : result captured ['ab', 'cd', 'ab']

Now the same thing with '(ab|cd)+':

      'abcdabla' ---> 1:   match: 'abcdab' |  capture : 'ab'  | left to process:  'la'
      'la'       ---> 2:   no match |  capture : None  | left to process:  ''
      ---> final result :   ['ab']
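The two traces above can be reproduced with re.finditer, which yields the match objects that findall builds its result from. This is only a sketch: trace is a hypothetical helper written for this illustration, not part of the re module, and it reports successful matches only, so the final step on 'la' does not appear.

```python
import re

def trace(pattern, string):
    """Print and return (full match, group 1, unscanned remainder) per match."""
    rows = []
    for m in re.finditer(pattern, string):
        row = (m.group(0), m.group(1), string[m.end():])
        print(f"match: {row[0]!r} | capture: {row[1]!r} | left to process: {row[2]!r}")
        rows.append(row)
    return rows

string = 'abcdabla'
single = trace('(ab|cd)', string)     # three matches: 'ab', 'cd', 'ab'
repeated = trace('(ab|cd)+', string)  # one match 'abcdab', capture 'ab'
```

Collecting the second element of each row gives the same lists findall returns for these two patterns.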

I hope this clears things up a bit.

— Charif DZ

So, the part that confused me was this:

If one or more groups are present in the pattern, return a list of groups; otherwise, return a list.

docs

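So what it returns is not the full match but only what the group captured. If you make the group non-capturing (re.findall('(?:ab|cd)+', string)), it returns ["abcd"], as I initially expected.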
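A small sketch of how the shape of findall's result depends on the number of capturing groups. The pattern '(ab)(cd)' is not from the question; it is included only to illustrate the multi-group case. re.finditer is one way to get the full matches while keeping the capturing group.

```python
import re

string = 'abcdla'

# No capturing groups: a list of full matches.
no_groups = re.findall('(?:ab|cd)+', string)

# Exactly one capturing group: a list of that group's captures.
one_group = re.findall('(ab|cd)+', string)

# More than one capturing group: a list of tuples, one entry per group.
two_groups = re.findall('(ab)(cd)', string)

# Full matches despite the capturing group, via match objects.
full_matches = [m.group(0) for m in re.finditer('(ab|cd)+', string)]

print(no_groups)     # ['abcd']
print(one_group)     # ['cd']
print(two_groups)    # [('ab', 'cd')]
print(full_matches)  # ['abcd']
```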

— RiaD

Not sure whether that's what you expected or not.

— RiaD