How to get a string after a specific substring?


226

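How can I get the string after a specific substring?

For example, I want the part after "world" in my_string="hello python world , i'm a beginner ", which here is " , i'm a beginner ".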

Answers:


399

The easiest way is probably just to split on your target word:

my_string="hello python world , i'm a beginner "
print(my_string.split("world", 1)[1])

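split takes the word (or character) to split on and, optionally, a limit on the number of splits.

In this example, it splits on "world" and limits it to only one split.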
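To see what the split limit changes, here is a small sketch on a made-up string that contains the target word twice (the sample text and variable names are illustrative, not from the answer above):

```python
# Compare a limited split with an unlimited one on a string
# that contains the target word twice (sample text is made up).
text = "one world two world three"

after_first = text.split("world", 1)[1]  # split once, keep the rest intact
unlimited = text.split("world")[1]       # split everywhere, second piece only

print(repr(after_first))
print(repr(unlimited))
```

With the limit, everything after the first occurrence is kept in one piece. Note that indexing with [1] raises IndexError when the target word is absent, because split then returns a one-element list.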


If I need to split a text on the word "low" and it contains the word "lower" before it, this will not work!
Leonardo Hermoso

1
You would simply split twice: target.split('lower',1)[-1].split('low',1)[-1]
Joran Beasley

What if the sentence was "hello python Megaworld world , i'm a beginner "? How can I make it match the whole word and not part of another word such as "Megaworld"? Thanks
pbou

1
Then the string you search for is " world " ... or use a regex with word boundaries
Joran Beasley

6
my_string.partition("world")[-1] (or ...[2]) is faster.
Martijn Pieters

66
s1 = "hello python world , i'm a beginner "
s2 = "world"

print(s1[s1.index(s2) + len(s2):])

If you want to deal with the case where s2 is not present in s1, use s1.find(s2) instead of index. If that call returns -1, then s2 is not in s1.
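A minimal sketch of the find-based variant described above. The function name after_substring and the default parameter are made up for illustration:

```python
def after_substring(s1, s2, default=None):
    """Return the part of s1 after the first s2, or default if s2 is absent."""
    pos = s1.find(s2)  # find() returns -1 instead of raising ValueError
    if pos == -1:
        return default
    return s1[pos + len(s2):]


s1 = "hello python world , i'm a beginner "
print(repr(after_substring(s1, "world")))
print(repr(after_substring(s1, "Monty")))
```

Returning a caller-supplied default is just one possible design; raising an exception or returning the original string would work equally well, depending on what the calling code expects.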


You get distinct ids (separated by several thousand) ... I'm not sure you don't create unnecessary substrings with this
Joran Beasley

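@JoranBeasley, we only call index(), len() and slice. There is no reason for index() and len() to create substrings, and if they do (which I find hard to believe), that is just an unnecessary implementation detail. The same goes for slice: there is no reason for it to create substrings other than the one returned.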
shx2

@shx2 print( s1[s1.index(s2) + len(s2):] is s1[s1.index(s2) + len(s2):])
Joran Beasley

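@JoranBeasley what point are you trying to make with this snippet? That multiple calls return different objects? By "unnecessary substrings" I mean substrings other than the one returned, i.e. substrings that do not need to be created in order to derive the result.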
shx2

56

I'm surprised nobody mentioned partition.

def substring_after(s, delim):
    return s.partition(delim)[2]

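IMHO, this solution is more readable than @arshajii's. Other than that, I think @arshajii's is the best for being the fastest: it does not create any unnecessary copies/substrings.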


2
This is a nice solution, and it handles the case where the substring is not part of the base string nicely.
mattmc3

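You get distinct ids (separated by several thousand) ... I'm not sure you don't create unnecessary substrings with this (and I'm too lazy to profile it properly)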
Joran Beasley

1
@JoranBeasley, it clearly does create unnecessary substrings. I think you misread my answer.
shx2

(I think
Joran Beasley

3
Moreover, this is faster than str.split(..., 1).
Martijn Pieters

20

You want to use str.partition():

>>> my_string.partition("world")[2]
" , i'm a beginner "

because this option is faster than the alternatives.

Note that this produces an empty string if the delimiter is missing:

>>> my_string.partition("Monty")[2]  # delimiter missing
''

If you want to get the original string back in that case, test whether the second value returned from str.partition() is non-empty:

prefix, success, result = my_string.partition(delimiter)
if not success: result = prefix
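The partition-and-test idiom above could be wrapped in a small helper. This is only a sketch, and the function name after_or_original is made up here:

```python
def after_or_original(text, delimiter):
    """Return the text after delimiter, or text itself if delimiter is absent."""
    # partition() always returns a 3-tuple; the middle item is the empty
    # string when the delimiter was not found.
    prefix, found, rest = text.partition(delimiter)
    return rest if found else prefix


my_string = "hello python world , i'm a beginner "
print(repr(after_or_original(my_string, "world")))
print(repr(after_or_original(my_string, "Monty")))
```

Like str.partition() itself, the helper raises ValueError if delimiter is an empty string.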

You could also use str.split() with a limit of 1:

>>> my_string.split("world", 1)[-1]
" , i'm a beginner "
>>> my_string.split("Monty", 1)[-1]  # delimiter missing
"hello python world , i'm a beginner "

However, this option is slower. In the best-case scenario, str.partition() is easily about 15% faster than str.split():

                                missing        first         lower         upper          last
      str.partition(...)[2]:  [3.745 usec]  [0.434 usec]  [1.533 usec]  <3.543 usec>  [4.075 usec]
str.partition(...) and test:   3.793 usec    0.445 usec    1.597 usec    3.208 usec    4.170 usec
      str.split(..., 1)[-1]:  <3.817 usec>  <0.518 usec>  <1.632 usec>  [3.191 usec]  <4.173 usec>
            % best vs worst:         1.9%         16.2%          6.1%          9.9%          2.3%

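This shows the time per execution for inputs where the delimiter is either missing (worst case), placed first (best case), or located in the lower half, the upper half, or the last position. The fastest time is marked with [...] and the slowest with <...>.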

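The table above was produced by a comprehensive time trial of all three options, using the script below. I ran the tests on Python 3.7.4 on a 2017 model 15" Macbook Pro with a 2.9 GHz Intel Core i7 and 16 GB of RAM.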

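The script generates random sentences with and without a randomly selected delimiter, places the delimiter (if present) at different positions in the generated sentence, runs the tests in random order with repeats (which gives the fairest results, since it accounts for random OS events during testing), and then prints a table of the results: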

import random
from itertools import product
from operator import itemgetter
from pathlib import Path
from timeit import Timer

setup = "from __main__ import sentence as s, delimiter as d"
tests = {
    "str.partition(...)[2]": "r = s.partition(d)[2]",
    "str.partition(...) and test": (
        "prefix, success, result = s.partition(d)\n"
        "if not success: result = prefix"
    ),
    "str.split(..., 1)[-1]": "r = s.split(d, 1)[-1]",
}

placement = "missing first lower upper last".split()
delimiter_count = 3

wordfile = Path("/usr/dict/words")  # Linux
if not wordfile.exists():
    # macos
    wordfile = Path("/usr/share/dict/words")
words = [w.strip() for w in wordfile.open()]

def gen_sentence(delimiter, where="missing", l=1000):
    """Generate a random sentence of length l

    The delimiter is incorporated according to the value of where:

    "missing": no delimiter
    "first":   delimiter is the first word
    "lower":   delimiter is present in the first half
    "upper":   delimiter is present in the second half
    "last":    delimiter is the last word

    """
    possible = [w for w in words if delimiter not in w]
    sentence = random.choices(possible, k=l)
    half = l // 2
    if where == "first":
        # best case, at the start
        sentence[0] = delimiter
    elif where == "lower":
        # lower half
        sentence[random.randrange(1, half)] = delimiter
    elif where == "upper":
        sentence[random.randrange(half, l)] = delimiter
    elif where == "last":
        sentence[-1] = delimiter
    # else: worst case, no delimiter

    return " ".join(sentence)

delimiters = random.choices(words, k=delimiter_count)
timings = {}
sentences = [
    # where, delimiter, sentence
    (w, d, gen_sentence(d, w)) for d, w in product(delimiters, placement)
]
test_mix = [
    # label, test, where, delimiter sentence
    (*t, *s) for t, s in product(tests.items(), sentences)
]
random.shuffle(test_mix)

for i, (label, test, where, delimiter, sentence) in enumerate(test_mix, 1):
    print(f"\rRunning timed tests, {i:2d}/{len(test_mix)}", end="")
    t = Timer(test, setup)
    number, _ = t.autorange()
    results = t.repeat(5, number)
    # best time for this specific random sentence and placement
    timings.setdefault(
        label, {}
    ).setdefault(
        where, []
    ).append(min(dt / number for dt in results))

print()

scales = [(1.0, 'sec'), (0.001, 'msec'), (1e-06, 'usec'), (1e-09, 'nsec')]
width = max(map(len, timings))
rows = []
bestrow = dict.fromkeys(placement, (float("inf"), None))
worstrow = dict.fromkeys(placement, (float("-inf"), None))

for row, label in enumerate(tests):
    columns = []
    worst = float("-inf")
    for p in placement:
        timing = min(timings[label][p])
        if timing < bestrow[p][0]:
            bestrow[p] = (timing, row)
        if timing > worstrow[p][0]:
            worstrow[p] = (timing, row)
        worst = max(timing, worst)
        columns.append(timing)

    scale, unit = next((s, u) for s, u in scales if worst >= s)
    rows.append(
        [f"{label:>{width}}:", *(f" {c / scale:.3f} {unit} " for c in columns)]
    )

colwidth = max(len(c) for r in rows for c in r[1:])
print(' ' * (width + 1), *(p.center(colwidth) for p in placement), sep="  ")
for r, row in enumerate(rows):
    for c, p in enumerate(placement, 1):
        if bestrow[p][1] == r:
            row[c] = f"[{row[c][1:-1]}]"
        elif worstrow[p][1] == r:
            row[c] = f"<{row[c][1:-1]}>"
    print(*row, sep="  ")

percentages = []
for p in placement:
    best, worst = bestrow[p][0], worstrow[p][0]
    ratio = ((worst - best) / worst)
    percentages.append(f"{ratio:{colwidth - 1}.1%} ")

print("% best vs worst:".rjust(width + 1), *percentages, sep="  ")

Great answer! Especially because you give the real reason this is better :P
Joran Beasley

18

If you want to do this with a regex, you could simply use a non-capturing group to match the word "world" and then grab everything after it, like so:

(?:world).*

The example string is tested here
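As a hedged illustration in Python's re module: a non-capturing group still takes part in the overall match, so the pattern as written returns text that starts with "world". To get only what follows, one can add a capturing group or use a lookbehind. The variable names below are made up for this sketch:

```python
import re

my_string = "hello python world , i'm a beginner "

# The non-capturing group is still part of the overall match.
whole_match = re.search(r"(?:world).*", my_string).group(0)

# A capturing group after the word isolates the trailing text.
captured = re.search(r"world(.*)", my_string).group(1)

# A lookbehind asserts "world" without including it in the match.
lookbehind = re.search(r"(?<=world).*", my_string).group(0)

print(repr(whole_match))
print(repr(captured))
print(repr(lookbehind))
```

In all three cases re.search() returns None when "world" is absent, so real code should check the result before calling group().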


28
Some people, when confronted with a problem, think "I know, I'll use regular expressions." ... Now you have 2 problems ...
Joran Beasley 2012

2
Haha, my mistake. I thought this was tagged regex, so I tried to give a regex answer. Oh well, it's there now.
Tadgh

1
It's all good ... it's certainly one way of skinning this cat ... overkill for this problem, though

The non-capturing group link no longer points to the right thing.
Apteryx

1
For those interested, here is the full code: result = re.search(r"(?:world)(.*)", "hello python world , i'm a beginner ").group(1)
RaduS

5

You can use a package called "substring". Just run "pip install substring". You can get the substring by giving only the start and end characters/indices.

For example:

import substring

s = substring.substringByChar("abcdefghijklmnop", startChar="d", endChar="n")

print(s)

Output:

s = defghijklmn
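If installing a package is not desired, roughly the same result can be had with built-in string methods. The sketch below assumes, based only on the output shown above, that both the start and end characters are included in the result. The function name between_chars is hypothetical and is not part of the substring package:

```python
def between_chars(text, start_char, end_char):
    """Return text from the first start_char through the next end_char, inclusive."""
    start = text.index(start_char)         # raises ValueError if absent
    end = text.index(end_char, start + 1)  # search only after the start
    return text[start:end + 1]


print(between_chars("abcdefghijklmnop", "d", "n"))
```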


3

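This is an old question, but I faced a very similar scenario: I needed to split a string using the word "low" as the delimiter, and my problem was that the same string also contained the words "below" and "lower".

I solved it using the re module this way: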

import re

string = '...below...as higher prices mean lower demand to be expected. Generally, a high reading is seen as negative (or bearish), while a low reading is seen as positive (or bullish) for the Korean Won.'

Use re.split with a regex to match the exact word:

stringafterword = re.split('\\blow\\b',string)[-1]
print(stringafterword)
' reading is seen as positive (or bullish) for the Korean Won.'

The generic code is:

re.split('\\bTHE_WORD_YOU_WANT\\b',string)[-1]

Hope this can help someone!
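One thing to keep in mind is that re.split(...)[-1] returns the text after the last whole-word match. If the text after the first match is wanted instead, a sketch along these lines may help. The helper name after_word and the sample sentence are made up, and the helper assumes the word consists of word characters so that \b behaves as expected:

```python
import re


def after_word(text, word):
    """Return the text after the first whole-word match of word, or None."""
    # re.escape() guards against regex metacharacters in word.
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    return text[match.end():] if match else None


sample = "prices below target mean lower demand, a low reading is good, low is fine"
first = after_word(sample, "low")           # after the first standalone "low"
last = re.split(r"\blow\b", sample)[-1]     # after the last standalone "low"

print(repr(first))
print(repr(last))
```

Neither "below" nor "lower" matches, because the word boundaries require "low" to stand on its own.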


1
Perhaps you could also just use: string.partition(" low ")[2]? (Note the spaces on either side of low)
Mtl Dev

1

Try this general approach:

import re
my_string="hello python world , i'm a beginner "
p = re.compile("world(.*)")
print (p.findall(my_string))

#[" , i'm a beginner "]
