How to find all occurrences of a substring?

365

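Python has string.find() and string.rfind() to get the index of a substring in a string.

I'm wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).

For example: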

string = "test test test test"

print(string.find('test'))  # 0
print(string.rfind('test'))  # 15

# this is the goal
print(string.find_all('test'))  # [0, 5, 10, 15]

python regex string

— nukl
source

11

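What should 'ttt'.find_all('tt') return?

— Santiago Alessandri

2

It should return '0'. Of course, in a perfect world there also has to be 'ttt'.rfind_all('tt'), which should return '1'.

— nukl 2011

2

Looks like this one: stackoverflow.com/questions/3873361/…

— 珠穆朗玛峰, 2016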

523

There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expressions:

import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]

If you want to find overlapping matches, a lookahead will do that:

[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]

If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]

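re.finditer returns an iterator, so you could change the [] in the above to () to get a generator instead of a list, which will be more efficient if you only iterate through the results once.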
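As a rough illustration of the list-versus-generator point, the sketch below wraps the same re.finditer call in a generator expression and consumes it lazily. The helper name iter_starts is made up for this example and is not part of the re module.

```python
import re

def iter_starts(pattern, text):
    """Lazily yield the start index of each non-overlapping regex match."""
    # Generator expression: no intermediate list is built.
    return (m.start() for m in re.finditer(pattern, text))

starts = iter_starts('test', 'test test test test')
first = next(starts)       # only the first match has been consumed so far
remaining = list(starts)   # consume whatever is left
print(first, remaining)    # 0 [5, 10, 15]
```

A generator can only be walked once, so build a list instead if the indexes are needed more than one time.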

— moinudin
source

Hi, regarding this [m.start() for m in re.finditer('test', 'test test test test')], how can we look for test or text? Does it become much more complicated?

— xpanta

7

You want to look into regular expressions in general: docs.python.org/2/howto/regex.html. The solution to your question would be: [m.start() for m in re.finditer('te[sx]t', 'text test text test')]

— Yotam Vaknin 2014

1

What is the time complexity of this method?

— Pranjal Mittal

1

@PranjalMittal. Upper or lower bound? Best, worst or average case?

— Mad Physicist '11

@marcog what if the substring contains parentheses or other special characters?

— Bananach
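One possible way to handle the special-character case raised above is to pass the substring through re.escape before handing it to re.finditer, so that it is matched as literal text. This is only a sketch; the function name find_all_literal is an assumption for illustration.

```python
import re

def find_all_literal(sub, text):
    """Return start indexes of non-overlapping literal occurrences of sub."""
    # re.escape neutralises regex metacharacters such as ( ) . * ?
    return [m.start() for m in re.finditer(re.escape(sub), text)]

print(find_all_literal('(a)', 'x(a)y(a)'))   # [1, 5]
print(find_all_literal('.', 'a.b.c'))        # [1, 3]
```

Without the escaping, '.' would match every character and '(a)' would be treated as a capture group around 'a'.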

109

>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

Thus, we can build it ourselves:

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]

No temporary strings or regexes required.

— Karl Knechtel
source

22

To get overlapping matches, it should suffice to replace start += len(sub) with start += 1.

— Karl Knechtel

4

I believe your previous comment should be a postscript in your answer.

— tzot 2011

1

Your code does not work for finding the substring "ATAT" in "GATATATGCATATACTT"

— Ashish Negi

2

See also my other comment. That is an example of an overlapping match.

— Karl Knechtel
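To make the overlapping case from these comments concrete, here is a small sketch of the same str.find loop with the step made switchable. The overlapping keyword is a hypothetical addition, not part of the answer's code, and the sketch assumes sub is non-empty.

```python
def find_all(a_str, sub, overlapping=False):
    """Yield start indexes of sub in a_str (assumes sub is non-empty)."""
    # Step by 1 to allow overlaps, or by len(sub) to skip past each match.
    step = 1 if overlapping else len(sub)
    start = a_str.find(sub)
    while start != -1:
        yield start
        start = a_str.find(sub, start + step)

dna = "GATATATGCATATACTT"
print(list(find_all(dna, "ATAT")))                    # [1, 9]
print(list(find_all(dna, "ATAT", overlapping=True)))  # [1, 3, 9]
```

The match at index 3 shares two characters with the match at index 1, which is why the non-overlapping search skips it.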

4

To match the behaviour of re.findall, I'd recommend using len(sub) or 1 instead of len(sub), otherwise this generator will never terminate on an empty substring.

— WGH 2015
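The suggestion above can be sketched as follows. It is the answer's generator with the single change from the comment applied, so that an empty substring still advances the search position.

```python
def find_all(a_str, sub):
    """Yield non-overlapping start indexes of sub in a_str."""
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        # `or 1` forces progress when sub is empty (len(sub) == 0)
        start += len(sub) or 1

print(list(find_all('abc', '')))            # [0, 1, 2, 3]
print(list(find_all('spam spam', 'spam')))  # [0, 5]
```

With an empty substring, str.find returns the start position itself until that position passes the end of the string, so the loop ends after yielding len(a_str) + 1 indexes.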

45

Here's a (very inefficient) way to get all (i.e. even overlapping) matches:

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

— Thkala
source

25

Again, old thread, but here's my solution using a generator and plain str.find.

def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)

Example

x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]

returns

[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]

— AkiRoss
source

3

This looks beautiful!

— fabio.sang

21

You can use re.finditer() for non-overlapping matches.

>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print([(a.start(), a.end()) for a in list(re.finditer('is', aString))])
[(2, 4), (5, 7), (38, 40), (42, 44)]

But it won't work for overlapping matches:

In [1]: aString="ababa"

In [2]: print([(a.start(), a.end()) for a in list(re.finditer('aba', aString))])
Output: [(0, 3)]
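If the overlapping spans are wanted, one option is a zero-width lookahead. This is a sketch that reuses the "ababa" example; because the lookahead match itself is empty, the end of each span is computed from the substring length. The names a_string, sub and spans are chosen for this illustration.

```python
import re

a_string = "ababa"
sub = "aba"
# Zero-width lookahead: each match is empty, so derive the end manually.
pattern = '(?=' + re.escape(sub) + ')'
spans = [(m.start(), m.start() + len(sub)) for m in re.finditer(pattern, a_string)]
print(spans)  # [(0, 3), (2, 5)]
```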

— Chinmay Kanchi
source

12

Why would you make a list out of an iterator? It just slows the process down.

— pradyunsg

2

aString VS astring ;)

— NexD。

18

Come, let us recurse together.

def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""

    substring_length = len(substring)    
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found

    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

No regular expressions are needed this way.

— Cody Piersall
source

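I had just started wondering "is there a fancy way to locate a substring inside a string in Python" ... and then after 5 minutes of googling I found your code. Thanks for sharing!!!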

— Geparada

3

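This code has several problems. Since it works on open-ended data, sooner or later you will bump into RecursionError if there are enough occurrences. Another one is the two throw-away lists it creates on each iteration just to append one element, which is far from optimal for a string-finding function that may be called many times. Although recursive functions sometimes look elegant and clear, they should be used with caution.

— Ivan Nikolaev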
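A possible iterative rewrite that sidesteps both concerns from the comment above might look like the sketch below. It keeps the answer's function name, grows a single list in place, and assumes the substring is non-empty.

```python
def locations_of_substring(string, substring):
    """Iterative variant: return a list of non-overlapping match positions.

    Assumes substring is non-empty.
    """
    locations = []
    start = 0
    while True:
        location = string.find(substring, start)
        if location == -1:
            return locations
        locations.append(location)   # one list, grown in place
        start = location + len(substring)

many = 'ab' * 5000   # 5000 occurrences, well past the default recursion limit
print(len(locations_of_substring(many, 'ab')))   # 5000
```

The loop uses constant stack depth, so the number of occurrences is limited only by memory.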

11

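If you're just looking for a single character, this would work:

from functools import reduce  # reduce is not a builtin in Python 3

string = "dooobiedoobiedoobie"
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7

Also,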

string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4

My hunch is that neither of these (especially #2) performs particularly well.

— 斯塔布
source

gr8 solution.. I am impressed by the use of .. split()

— 山塔努帕塔克

9

This is an old thread, but I got interested and wanted to share my solution.

def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1 #change to k += len(sub) to not search overlapping results
    return result

It should return a list of the positions where the substring was found. Please comment if you see an error or room for improvement.

— 苏里恩
source

6

This does the trick for me, using re.finditer

import re

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

#  find all occurrences of the word 'as' in the above text

find_the_word = re.finditer('as', text)

for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))

— Bruno Vermeulen
source

5

This thread is a little old, but this worked for me:

numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"

marker = 0
while marker < len(numberString):
    try:
        # index() raises ValueError once no further match exists
        marker = numberString.index(testString, marker)
        print(marker)
        marker += 1
    except ValueError:
        print("String not found")
        marker = len(numberString)

— Andrew H
source

5

You can try:

>>> string = "test test test test"
>>> for index, value in enumerate(string):
...     if string[index:index + len("test")] == "test":
...         print(index)
...
0
5
10
15

— Harsha Biyani
source

2

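The solutions provided by others all rely on the built-in find() method or other available methods.

What is the core basic algorithm to find all occurrences of a substring in a string?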

def find_all(string,substring):
    """
    Function: Returning all the index of substring in a string
    Arguments: String and the search string
    Return:Returning a list
    """
    length = len(substring)
    c=0
    indexes = []
    while c < len(string):
        if string[c:c+length] == substring:
            indexes.append(c)
        c=c+1
    return indexes

You can also inherit from the str class in a new class and use this function as shown below.

class newstr(str):
    def find_all(string, substring):
        """
        Function: Returning all the index of substring in a string
        Arguments: String and the search string
        Return:Returning a list
        """
        length = len(substring)
        c = 0
        indexes = []
        while c < len(string):
            if string[c:c+length] == substring:
                indexes.append(c)
            c = c + 1
        return indexes

Calling the method

newstr.find_all('Do you find this answer helpful? Then upvote!', 'this')

— 纳文·拉贾
source

2

This function does not look at every position inside the string, so it does not waste computing resources. My attempt:

def findAll(string,word):
    all_positions=[]
    next_pos=-1
    while True:
        next_pos=string.find(word,next_pos+1)
        if(next_pos<0):
            break
        all_positions.append(next_pos)
    return all_positions

It is used like this:

result=findAll('this word is a big word man how many words are there?','word')

— 瓦伦丁·古赫曼
source

1

When looking for a large number of keywords in a document, use flashtext

from flashtext import KeywordProcessor
words = ['test', 'exam', 'quiz']
txt = 'this is a test'
kwp = KeywordProcessor()
kwp.add_keywords_from_list(words)
result = kwp.extract_keywords(txt, span_info=True)

Flashtext runs faster than regex on a large list of search words.

— Uri Goren
source

0

src = input() # we will find substring in this string
sub = input() # substring

res = []
pos = src.find(sub)
while pos != -1:
    res.append(pos)
    pos = src.find(sub, pos + 1)

— 马斯凯
source

1

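Although this code may solve the OP's issue, it is best to include an explanation of how your code addresses it. That way, future visitors can learn from your post and apply it to their own code. SO is not a coding service but a knowledge resource. Also, high-quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts be self-contained, are some of the strengths of SO as a platform that set it apart from forums. You can edit to add more information and/or to supplement your explanations with source documentation.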

— SherylHohman

0

This is the solution to a similar question from hackerrank. I hope this helps you.

import re
a = input()
b = input()
if b not in a:
    print((-1, -1))
else:
    # lookahead finds overlapping matches; re.escape treats b as literal text
    start_indc = [m.start() for m in re.finditer('(?=' + re.escape(b) + ')', a)]
    for i in range(len(start_indc)):
        print((start_indc[i], start_indc[i]+len(b)-1))

Output:

aaadaa
aa
(0, 1)
(1, 2)
(4, 5)

— 鲁曼·汗
source

-1

By slicing, we build all possible substrings and append them to a list, then use the count function to find how many times the target occurs.

s=input()
n=len(s)
l=[]
f=input()
print(s[0])
for i in range(0,n):
    for j in range(1,n+1):
        l.append(s[i:j])
if f in l:
    print(l.count(f))

— BONTHA SREEVIDHYA
source

When s="test test test test" and f="test", your code prints 4, but the OP expected [0,5,10,15]

— barbsan

I wrote it for a single word only; I will update the code.

— BONTHA SREEVIDHYA '19

-2

Please look at the code below

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''


def get_substring_indices(text, s):
    result = [i for i in range(len(text)) if text.startswith(s, i)]
    return result


if __name__ == '__main__':
    text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?"
    s = 'wood'
    print(get_substring_indices(text, s))

— 黄哥Python培训
source

-2

The pythonic way would be:

mystring = 'Hello World, this should work!'
find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# s represents the search string
# c represents the character string

find_all(mystring,'o')    # will return all positions of 'o'

[4, 7, 20, 26] 
>>>

— 哈维
source

3

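1) How does this help a question that was answered 7 years ago? 2) Using lambda this way is not Pythonic and goes against PEP8. 3) This does not provide the correct output for the OP's situation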

— Wondercricket '18

Pythonic does not mean "use as many Python features as possible"

— klutt

-2

You can easily use:

string.count('test')!

https://www.programiz.com/python-programming/methods/string/count

Cheers!

— 雷·萨拉瓦瓦
source

This should be the answer

— Maxwell Chandler

8

The string count() method returns the number of occurrences of a substring in the given string, not their positions.

— Astrid

5

This does not cover all cases: s = 'banana', sub = 'ana'. Sub occurs twice in this situation, but s.count('ana') returns 1

— Joey daniel darko
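The difference described in these comments can be checked with a short sketch. str.count counts non-overlapping occurrences only, while a lookahead-based search (one option among several) also reports the overlapping one and gives positions rather than a count.

```python
import re

s, sub = 'banana', 'ana'
# str.count skips past each match, so the second 'ana' (index 3) is missed.
non_overlapping = s.count(sub)
# Zero-width lookahead reports every position where sub starts.
overlapping = [m.start() for m in re.finditer('(?=' + re.escape(sub) + ')', s)]
print(non_overlapping)   # 1
print(overlapping)       # [1, 3]
```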