How do I split a text into sentences?


108

I have a text file. I need to get a list of sentences.

How can this be done? There are a lot of subtleties, such as dots being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

18
Define "sentence".
martineau 2011

I would like to do this, but I want to split wherever there is either a period or a newline.
yishairasowsky 19/12/30

Answers:


152

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates it does the job:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)



4
@Artyom: Here is a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer.
martineau 2011

10
You may have to run nltk.download() first and download the model -> punkt
Martin Thoma 2015
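For reference, a minimal sketch of that download step (run it once before loading the tokenizer; the punkt resource name is taken from the comment above):

import nltk
nltk.download('punkt')  # fetches the pre-trained Punkt sentence-boundary models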

2
This fails on cases with ending quotation marks. If we have a sentence that ends like "this."
福萨

1
Well, you convinced me. But I just tested it and it doesn't seem to fail. My input is 'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.' and my output ['This fails on cases with ending quotation marks.', 'If we have a sentence that ends like "this."', 'This is another sentence.'] seems correct to me.
szedjani

100

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re

# Patterns for periods that should NOT end a sentence (abbreviations, acronyms, websites)
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    # Temporarily rewrite non-terminating periods as <prd>
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    # Keep closing quotation marks attached to the sentence they belong to
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    # Mark the real terminators, restore the protected periods, then split
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
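As a quick usage sketch, calling the function on the example sentence quoted above should give back two sentences, with the abbreviation periods left intact:

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel "
        "before joining Nike Inc. as an engineer. He also worked at craigslist.org "
        "as a business analyst.")
for s in split_into_sentences(text):
    print(s)
# expected: two sentences, split only after "engineer." and "analyst."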

19
This is a great solution. However, I added two more lines to it: digits = "([0-9])" among the regex declarations, and text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) in the function. Now it no longer splits a line at a decimal number such as 5.5. Thank you for this answer.
Ameya Kulkarni

1
How did you parse the whole of Huckleberry Finn? Where is that, in text format?
PascalVKooten

6
A very good solution. In the function I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") and if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>"), and it fully solved my problem.
Sisay Chala

3
Great solution with very helpful comments! Just to make it a little more robust though: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...","<prd><prd><prd>")
Dascienz

1
Can this function be made to treat a sentence like this as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her?
twhale '18

50

Instead of using a regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Reference: https://stackoverflow.com/a/9474645/2877052
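Applied to the original question (a text file in, a list of sentences out), a minimal sketch, assuming the file is UTF-8 and the punkt data is already downloaded (the filename test.txt is reused from the accepted answer):

from nltk.tokenize import sent_tokenize

with open("test.txt", encoding="utf-8") as fp:
    sentences = sent_tokenize(fp.read())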


A better, simpler, and more reusable example than the accepted answer.
Jay D. '19

If you remove the space after the dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() does! Hmm...
Leonid Ganeline

1
for sentence in tokenize.sent_tokenize(text): print(sentence)
Victoria Stuart

11

You can try using spaCy instead of regex. I use it and it does the job.

import spacy

# Newer spaCy versions no longer accept the 'en' shortcut; load an installed model,
# e.g. after running: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())

1
spaCy is great, but if you only need to split text into sentences, passing it through spaCy will take too long when you're dealing with a data pipeline.
Berlines

@Berlines I agree, but I couldn't find any library that does the job as cleanly as spaCy. If you have a suggestion, I can give it a try.
精灵

Also, for the AWS Lambda / serverless users out there: spaCy's supporting data files run to 100MB+ (the large English model is more than 400MB), so you can't use something like this out of the box (big fan of spaCy here, sadly).
Julian H
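If model size and speed are the concern, spaCy also ships a purely rule-based sentencizer component that needs no model download; a minimal sketch, assuming spaCy 3.x:

import spacy

nlp = spacy.blank("en")        # blank pipeline, no model download required
nlp.add_pipe("sentencizer")    # rule-based sentence boundaries (punctuation only)

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])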

9

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
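A quick usage sketch (hypothetical input; based on a trace of the code above, the terminator-plus-wrapper handling should keep the closing quote with its sentence):

paragraph = 'He shouted "Stop!" Then silence.'
print(find_sentences(paragraph))
# expected: ['He shouted "Stop!"', 'Then silence.']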


1
Brilliant approach! The others don't catch '...' and '?!'.
Shane Smiskol '16

6

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, this isn't the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter? A sketch of that idea follows below).
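A rough sketch of that capital-letter check (a hypothetical helper, not part of the answer above; it glues a piece back onto the previous one when it does not start with an uppercase letter, and it guesses a period for the terminator that re.split consumed, so treat it as an illustration only):

def merge_lowercase_fragments(pieces):
    # Reattach splits that were probably caused by an abbreviation:
    # a genuine new sentence is assumed to start with an uppercase letter.
    merged = []
    for piece in pieces:
        if merged and piece and not piece[0].isupper():
            merged[-1] = merged[-1] + ". " + piece
        else:
            merged.append(piece)
    return merged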


29
You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)
Ned Batchelder

@Ned wow, can't believe I was so dumb. I must be drunk.
Rafe Kettler

I'm using Python 2.7.2 on Win 7 x86, and the regex in the code above gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you refer to in your text does not exist in your code sample.
Sabuncu

1
The regex isn't entirely correct; it should rather be r' *[\.\?!][\'"\)\]]* +'
fsociety

This can cause many problems and also chunks sentences into smaller pieces. Consider the case where we have "I paid $3.5 for this ice cream": the chunks become "I paid $3" and "5 for this ice cream". Using the default nltk sent_tokenize is safer!
Reihan_amn

6

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

2

@Artyom,

Hi! You can make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
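For the sample text above, tokens should come out as ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']. Note that this is word- and punctuation-level tokenization rather than sentence splitting, so you would still need to group on the terminator tokens ('.', '!', '?') to recover sentences.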

Good luck, Marilena.


0

No doubt NLTK is the most suitable for this purpose. But getting started with NLTK is quite painful (though once you install it, you reap the rewards).

So here is simple re-based code, taken from http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 

3
Yeah, but this fails so easily with: "Mr. Smith knows this is a sentence."
Thomas 2014

0

I had to read a subtitles file and split it into sentences. After pre-processing (like removing the time information in the .srt files, etc.), the variable fullFile contained the full text of the subtitle file. The crude way below neatly splits it into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences - break after ? . and !
fullFile = re.sub("(\!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh! Well. I now realize that since my content was Spanish, I didn't run into the issues of dealing with "Mr. Smith" and so on. Still, if someone wants a quick and dirty parser...


0

I hope this helps you with Latin, Chinese, and Arabic text.

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

0

I was working on a similar task and came across this question; after following a few links and working through some NLTK exercises, the code below worked like magic for me.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Licensed under cc by-sa 3.0 with attribution required.