Extracting text from an HTML file using Python

243

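I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.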
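The entity decoding asked for here is available in the Python 3 standard library as html.unescape. A minimal sketch, with sample strings made up for illustration:

```python
import html

# Illustrative inputs (not from the question): numeric and named references.
samples = ["It&#39;s", "Tom &amp; Jerry", "&lt;b&gt;", "caf&eacute;"]

# html.unescape converts character references to the characters they stand for.
decoded = [html.unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(before, "->", after)
```

This only decodes entities. It does not remove tags or script content, so it would be one step in a larger pipeline.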

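Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be converted to plain text. It comes with no examples or documentation, but the code looks clean.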

— John D. Cook

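People seem to have found my NLTK answer (a fairly recent one) very useful for quite a while now, so you might want to consider changing the accepted answer. Thanks!

— 2013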

1

I never thought I'd come across a question asked by the author of my favorite blog! The Endeavour!

— Ryan G

1

@Shatu Now that your solution is no longer valid, you might want to delete your comment. Thanks! ;)

— Sнаđошƒаӽ, 2016

136

html2text is a Python program that does a pretty good job at this.

— Rex

5

It is GPL 3.0, though, which means it may be incompatible.

— frog32

138

Amazing! Its author is Aaron Swartz, RIP.

— Atul Arvind

2

Has anyone found an alternative to html2text because of the GPL 3.0 license?

— jontsai, 2014

1

The GPL isn't as bad as people want it to be. Aaron knew best.

— 2014

2

I tried both html2text and nltk, but they didn't work for me. I ended up going with Beautiful Soup 4, which works beautifully (no pun intended).

— Ryan

149

The best piece of code I found for extracting text without getting JavaScript or other unwanted things:

import urllib
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
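The three clean-up steps at the end of the snippet above do not depend on BeautifulSoup, so they can be tried on any string. A sketch that wraps them in a helper (the name tidy and the sample input are assumptions for illustration):

```python
def tidy(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    # Remove leading and trailing whitespace on every line.
    lines = (line.strip() for line in text.splitlines())
    # Treat a run of two spaces as a separator between phrases.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Head  line  \n\n   body text \n"
print(tidy(raw))
```

Splitting on double spaces is a heuristic: it separates headline fragments that were laid out side by side, but it will also split ordinary sentences that happen to contain two consecutive spaces.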

— Y

2

What if we want to select just one particular line, say line 3?

— hepidad, 2014

3

The kill-scripts bit was a lifesaver!!

— 2014

2

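After going through a lot of Stack Overflow answers, I feel this is the best option for me. One problem I ran into was that lines were joined together in some cases. I was able to overcome it by adding a separator to the get_text call: text = soup.get_text(separator=' ')

— Joswin KJ, 2015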

5

Instead of soup.get_text() I used soup.body.get_text(), so that I don't get any text from the <head> element, such as the title.

— Sjoerd

10

For Python 3: from urllib.request import urlopen

— Jacob Kalakal Joseph
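Pulling the comments above together, a Python 3 sketch might look like the following. It parses a string rather than fetching a URL, names the parser explicitly, restricts output to <body> when one exists, and passes a separator to get_text. The function name visible_text and the sample markup are assumptions made for illustration, and collapsing everything to single spaces is one choice among several.

```python
from bs4 import BeautifulSoup


def visible_text(markup):
    """Return the text of an HTML string without script or style content."""
    soup = BeautifulSoup(markup, "html.parser")
    # Remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Prefer <body> so that <head> text such as the title is left out.
    root = soup.body if soup.body is not None else soup
    # A separator keeps text from adjacent elements from running together.
    text = root.get_text(separator=" ")
    # Collapse all runs of whitespace to single spaces.
    return " ".join(text.split())


sample = (
    "<html><head><title>Ignored title</title>"
    "<script>var a = 1;</script></head>"
    "<body><h1>Hi</h1><p>It&#39;s <b>fine</b></p>"
    "<style>p { color: red }</style></body></html>"
)
print(visible_text(sample))
```

If line structure matters, the final join could be replaced with the line-by-line clean-up from the answer above.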

99

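NOTE: NLTK no longer supports the clean_html function

The original answer is below, with an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing the issues with html2text. Luckily I came across NLTK.
It works magically.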

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

— Shatu

8

Sometimes that's enough :)

— Sharmila, 2012

8

I want to upvote this a thousand times. I was stuck in regex hell, but now I see the wisdom of NLTK.

— BenDundee

26

Apparently, clean_html is not supported anymore: github.com/nltk/nltk/commit/…

— alexanderlukanin13

5

Importing a heavy library like nltk for such a simple task would be too much.

— richie

54

@alexanderlukanin13 From the source: raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

— Chris

54

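Found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.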

from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

— Xperroni

5

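This seems to be the most straightforward way of doing this in Python (2.7) using only the default modules. That is really silly, as this is such a commonly needed thing and there is no good reason why the default HTMLParser module doesn't include a parser for it.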

— Ingmar Hupp

2

I don't think this will convert HTML characters into Unicode, right? For example, &amp; won't be converted into &, right?

— Speedplane

For Python 3 use from html.parser import HTMLParser

— sebhaase
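A Python 3 port of the parser above might look like the sketch below. It is an adaptation, not the answer author's code: the class is renamed, the name-mangled attribute is replaced by a single-underscore one, and convert_charrefs=True is relied on so that references such as &amp; arrive in handle_data already decoded.

```python
import re
from html.parser import HTMLParser


class DeHTMLParser(HTMLParser):
    """Collect text content, adding line breaks for <p> and <br>."""

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data is called.
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_data(self, data):
        # Collapse internal whitespace and skip whitespace-only chunks.
        text = re.sub(r"\s+", " ", data.strip())
        if text:
            self._parts.append(text + " ")

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts.append("\n\n")
        elif tag == "br":
            self._parts.append("\n")

    def text(self):
        return "".join(self._parts).strip()


def dehtml(markup):
    parser = DeHTMLParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(dehtml("<b>Project:</b> DeHTML<br>plain &amp; simple"))
```

Like the original, this keeps the content of script and style elements, and it leaves a trailing space before each line break.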

14

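This is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.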

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

— bit4

Python 3 version: gist.github.com/Crazometer/af441bc7dc7353d41390a59f20f07b51

— Crazometer

In get_text, ''.join should be ' '.join. There should be a space, otherwise some of the text will run together.

— Obinna Nnenanya

1

Also, this will not catch all the text unless you include other text container tags such as H1, H2, ..., span, etc. I had to tweak it for better coverage.

— Obinna Nnenanya
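One way to act on these comments in Python 3 is sketched below. It is not the linked gist: the tag sets, the depth counter, and the names HTMLToText and html_to_text are choices made for this illustration. Headings, div and li are treated as block elements, and text inside script or style is dropped.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for illustration; extend them to suit your documents.
BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
HIDDEN_TAGS = {"script", "style"}


class HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes &#39; and &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag in BLOCK_TAGS and not self._hidden_depth:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self._hidden_depth = max(0, self._hidden_depth - 1)
        elif tag in BLOCK_TAGS and tag != "br" and not self._hidden_depth:
            self._buf.append("\n")

    def handle_data(self, data):
        if not self._hidden_depth:
            self._buf.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        text = "".join(self._buf)
        text = re.sub(r" *\n *", "\n", text)    # trim spaces around breaks
        text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line
        return re.sub(r" {2,}", " ", text).strip()


def html_to_text(markup):
    parser = HTMLToText()
    parser.feed(markup)
    parser.close()
    return parser.get_text()


sample = ("<h1>Title</h1><script>var x = 1;</script>"
          "<style>p { color: red }</style><p>It&#39;s &amp; done</p>")
print(html_to_text(sample))
```

Inline tags such as span need no special handling here, because their text reaches handle_data regardless of the enclosing tag.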

11

I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

— Floyd

2

To avoid a warning, specify the parser for BeautifulSoup to use: text = ''.join(BeautifulSoup(some_html_string, "lxml").findAll(text=True))

— Floyd

You can use the stripped_strings generator to avoid excessive whitespace, i.e. clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

— Fraser, '18

8

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

— GeekTantra

23

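According to its PyPI page, this module is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!"

— intuited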

7

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

— 嫩乔

6

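PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of using PyParsing (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and the entities issue is not that hard to deal with: you can convert them before you run BeautifulSoup.

Good luck

— PyNEwbie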

1

The link is dead or has gone sour.

— Yvette

4

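This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import subprocess
import tempfile
# html_source is assumed to be a str holding the page markup
with tempfile.NamedTemporaryFile('w', suffix='.html') as f:
    f.write(html_source)
    f.flush()  # make sure links sees the full file
    proc = subprocess.Popen(['links', '-dump', f.name],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    text = proc.communicate()[0]

— Andrew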

4

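Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console: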

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

— Mark

NB: HTMLError and HTMLParserError should both read HTMLParseError. This works, but does a bad job of maintaining line breaks.

— Dave Knight

4

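I recommend a Python package called goose-extractor. Goose will try to extract the following information:

- Main text of an article
- Main image of the article
- Any YouTube/Vimeo movies embedded in the article
- Meta description
- Meta tags

— Li Yingjun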

4

If you need more speed and less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

— Anton Shelin

4

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")
Hello, world!

— Pravitha V

4

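I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting text from articles on the web, and this library has done an excellent job of it so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requested.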

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

— spatel4140

3

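Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

import copy
import re

import BeautifulSoup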
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

— speedplane

3

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

— John Lucas

2

Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

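I prefer this one over the other alternatives because every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...

— Yakov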

2

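Has anyone tried bleach.clean(html, tags=[], strip=True) from bleach? It's working for me.

— 大约

Seems to work for me too, but they don't recommend using it for this purpose: "This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page." -> bleach.readthedocs.io/en/latest/clean.html#bleach.clean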

— Loktopus

2

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

— Vim

2

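I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.

— 悠闲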

1

In a simple way

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text that start with '<' and end with '>' and replaces each of them with an empty string.
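To see where this approach falls short of the question's requirements, the same pattern can be applied to a small made-up sample. The helper name strip_tags_naive and the inputs are assumptions for illustration:

```python
import re


def strip_tags_naive(markup):
    """Delete anything that looks like a tag, and nothing else."""
    return re.sub(r"<(.*?)>", "", markup)


sample = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"
result = strip_tags_naive(sample)
print(result)

# A tag that spans two lines is not matched, because '.' stops at newlines.
multiline = '<a\nhref="x">link</a>'
print(strip_tags_naive(multiline))
```

The tags disappear, but the entity stays encoded, the script body is kept as if it were page text, and the tag split across lines survives.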

— David Fraga

1

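@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.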

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

— 狂想曲

1

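Thanks, this answer is underrated. For those of us who want a clean text representation that behaves more like a browser (ignoring newlines and only taking paragraphs and line breaks into consideration), BeautifulSoup's get_text simply doesn't cut it.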

— jrial

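@jrial Glad you found it useful, and thanks for the contribution. For anyone else, the linked gist has been enhanced quite a bit. What the OP seems to allude to is a tool that renders HTML to text, much like a text-based browser such as lynx. That's what this solution attempts. What most people are contributing are just text extractors.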

— racitup

1

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

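Now you can print the body variable and it will be in plain-text format :) If it's good enough for you, it would be nice to select it as the accepted answer.

— Wahib Ul Haq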

This doesn't convert anything.

— Antti Haapala '10

1

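This shows how to extract a text/plain part from an email if someone else put one in there. It doesn't do anything to convert HTML into plain text, and does nothing remotely useful if you are trying to convert HTML from, say, a website.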

— 人间

1

You can extract only the text from HTML with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

— 西贡皮N

1

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

Hello world
I love you

Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

— Uri Goren

1

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

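Explanation:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and remove the leading and trailing space on each, then break multi-headlines into a line each: chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the text encoded as UTF-8.

Notes:

Some systems this is run on will fail with https:// connections because of an SSL issue; you can turn off verification to fix that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.

— Mike Q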

0

Here is the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

Hope that helps.

— troymyname00

0

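The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this resolution is a one-off implementation, rather than something to be used as part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

— 1of7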

0

The Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res

— brunql

This is bad practice, for many reasons.

— Uri Goren

Yes! That's true! So don't do it!

— brunql