I have a text file. I need to get a list of sentences.
How can this be done? There are many subtleties, such as dots being used in abbreviations.
My old regular expression works badly:
re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
Answers:
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
The tokenizer used here is nltk.tokenize.punkt.PunktSentenceTokenizer. Before loading it you may need to run nltk.download() and download the model -> punkt.
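As a minimal sketch of my own (not part of the answer above), the model can also be fetched non-interactively with nltk.download('punkt'); note that some newer NLTK releases name the resource punkt_tab instead.
import nltk
import nltk.data

nltk.download('punkt')  # one-off model download, no interactive window needed

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize("Mr. Smith bought a car. It was red."))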
Given the input 'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.', my output is ['This fails on cases with ending quotation marks.', 'If we have a sentence that ends like "this."', 'This is another sentence.'], which looks correct to me.
This function can split the whole text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."
# -*- coding: utf-8 -*-
import re

alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + " "
    text = text.replace("\n", " ")
    # Protect periods that do not end a sentence by replacing them with <prd>.
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text:
        text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    # Move the terminator outside closing quotes so it marks the sentence end.
    if "”" in text:
        text = text.replace(".”", "”.")
    if "\"" in text:
        text = text.replace(".\"", "\".")
    if "!" in text:
        text = text.replace("!\"", "\"!")
    if "?" in text:
        text = text.replace("?\"", "\"?")
    # Mark real sentence boundaries, then restore the protected periods.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
You can extend the lists further, e.g. prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and handle ellipses with if "..." in text: text = text.replace("...","<prd><prd><prd>").
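A quick usage sketch of my own (the expected output is what the rules above should produce, not something verified in the original post):
text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in "
        "Israel before joining Nike Inc. as an engineer. He also worked at "
        "craigslist.org as a business analyst.")
for s in split_into_sentences(text):
    print(s)
# Expected, roughly:
# Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer.
# He also worked at craigslist.org as a business analyst.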
Instead of using a regular expression to split the text into sentences, you can also use the nltk library.
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
for sentence in tokenize.sent_tokenize(text): print(sentence)
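A side note of mine: sent_tokenize relies on the same punkt models, so you may need nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab'), and it takes a language argument for non-English text, e.g.:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time model download

text = "M. Dupont est arrivé. Il attend dans la salle numéro 3."
print(sent_tokenize(text, language='french'))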
You can try using spaCy instead of regex. I use it and it does the job.
import spacy

nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())
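A small addendum of mine: current spaCy releases load models by their full name (after python -m spacy download en_core_web_sm), and if you only need sentence boundaries, a blank pipeline with the rule-based sentencizer is a lighter-weight sketch (spaCy 3.x API):
import spacy

# No statistical model needed: rule-based sentence boundaries only.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Your text here. Another sentence follows!")
for sent in doc.sents:
    print(sent.text.strip())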
Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'.
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    # Repeatedly find the last sentence boundary and peel sentences off the end.
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    # A terminator may be followed by a closing quote or bracket, e.g. '."' or '.)'.
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    # Drop candidate endings that coincide with the end of a known abbreviation.
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    # Yield the start index of every occurrence of sub in a_str.
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this post: Find all occurrences of a substring in Python.
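To illustrate (my own example, with the result I would expect from the logic above; note the abbreviation lookup is case-sensitive, so something like 'Dr.' would need a lowercased paragraph or an extra key):
paragraph = "Some fruits, e.g. apples, are sweet. Others are sour."
print(find_sentences(paragraph))
# Expected: ['Some fruits, e.g. apples, are sweet.', 'Others are sour.']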
For simple cases (where sentences are terminated normally), this should work:
import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The regex here is ' *\. +', which matches a period surrounded by zero or more spaces on the left and one or more on the right (to prevent something like the period in re.split from being counted as a sentence break).
Obviously, this is not the most robust solution, but it will do fine in most cases. The only thing it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).
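One way to act on that suggestion (my own sketch, not from the answer): after splitting, glue any fragment that does not start with an uppercase letter back onto the previous piece, on the assumption that it followed an abbreviation. Because re.split consumes the terminator, the merged pieces lose their original period, which is a limitation of this quick approach.
import re

def naive_sentences(text):
    parts = re.split(r' *[\.\?!][\'"\)\]]* +', text)
    merged = []
    for part in parts:
        # Heuristic: a fragment starting with a lowercase letter probably follows an abbreviation.
        if merged and part and not part[0].isupper():
            merged[-1] += ' ' + part
        else:
            merged.append(part)
    return [p for p in merged if p]

print(naive_sentences("The meeting lasted approx. two hours. Then we left."))
# Expected: ['The meeting lasted approx two hours', 'Then we left.']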
I get a SyntaxError: EOL while scanning string literal, pointing at the closing parenthesis (after text). Also, the regex you refer to in the text does not appear in your code example.
r' *[\.\?!][\'"\)\]]* +'
You can also use the sentence tokenization function in NLTK:
from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes. Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."
sent_tokenize(sentence)
@Artyom,
Hi! You can make a new tokenizer for Russian (and some other languages) using this function:
def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ')
    # Collapse the runs of spaces introduced by the replacements above.
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result
and then call it this way:
text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
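For what it's worth (my own check, based on the replacements above), the result should come out roughly as:
print(tokens)
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']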
Good luck, Marilena.
No doubt NLTK is the most suitable tool for this purpose. But getting started with NLTK is quite painful (though once you have it installed, you reap the rewards).
So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:
# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence. This is an excited sentence! And do you think this is a question?"""
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
I had to read a subtitle file and split it into sentences. After pre-processing (like removing the time information from the .srt files, etc.), the variable fullFile contained the full text of the subtitle file. The crude way below splits it neatly into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.
# Very approximate way to split the text into sentences - break after ? . and !
import re

fullFile = re.sub(r"(\!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()
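If it helps, the same idea written with a with block so the output file is closed reliably (my variant, assuming fullFile is already populated as described above):
fullFile = re.sub(r"(\!|\?|\.) ", r"\1<BRK>", fullFile)
with open("./sentences.out", "w+") as sentFile:
    for line in fullFile.split("<BRK>"):
        sentFile.write(line + "\n")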
Oh, well! I now realize that since my content was Spanish, I did not run into the issues of dealing with "Mr. Smith" and the like. Still, if someone wants a quick and dirty parser...
I hope this will help you with Latin, Chinese and Arabic texts.
import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|؟|؛)+")
lines = []
with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
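To illustrate with my own in-memory example (expected output based on the pattern above, instead of reading myData.txt):
sample = "This is fine. 这是第一句。这是第二句!"
padded = punctuation.sub(r"\1\2<pad>", sample)
print([s.strip() for s in padded.split("<pad>") if s.strip()])
# Expected: ['This is fine.', '这是第一句。', '这是第二句!']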
I was working on a similar task and came across this question; after following a few links and working through some exercises for nltk, the code below worked for me like magic.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/