Parsing HTML using Python

185

I'm looking for an HTML parser module for Python that can give me the tags as Python lists, dictionaries, or objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

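then it should give me a way to access the nested tags by the name or id of the HTML tag, so that I can basically ask it for the content/text of the div tag with class='container' inside the body tag, or something similar.

If you've used Firefox's "Inspect element" feature (view HTML), you'll know that it shows all the tags in a nicely nested, tree-like way.

I'd prefer a built-in module, but that might be asking a bit too much.

I went through a lot of questions on Stack Overflow and a few blogs on the internet. Most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end up debating which one is faster or more efficient.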
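For reference, the built-in route can be sketched with the standard library's html.parser module. HTMLParser only reports start-tag, end-tag, and text events, so the tree has to be built by hand. The Node and TreeBuilder classes below are hypothetical helpers written for this illustration, not part of any library, and the sketch assumes reasonably well-formed input.

```python
from html.parser import HTMLParser

HTML = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>"""


class Node:
    """One element of the tree (hypothetical helper, not part of the stdlib)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)   # attributes as a plain dict
        self.parent = parent
        self.children = []         # child Node objects and text strings, in order

    def text(self):
        """Concatenate all text found below this element."""
        return "".join(
            child.text() if isinstance(child, Node) else child
            for child in self.children
        )

    def find_all(self, tag=None, attrs=None):
        """Yield descendants matching a tag name and exact attribute values."""
        wanted = attrs or {}
        for child in self.children:
            if not isinstance(child, Node):
                continue
            tag_ok = tag is None or child.tag == tag
            if tag_ok and all(child.attrs.get(k) == v for k, v in wanted.items()):
                yield child
            yield from child.find_all(tag, attrs)


class TreeBuilder(HTMLParser):
    """Turn HTMLParser's start/end/data events into a Node tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


builder = TreeBuilder()
builder.feed(HTML)
builder.close()

body = next(builder.root.find_all("body"))
container = next(body.find_all("div", {"class": "container"}))
print(body.attrs)                          # the body tag's attributes as a dict
print(" ".join(container.text().split()))  # text with whitespace collapsed
```

This sketch does not special-case void elements such as an unclosed br or img, and it does no repair of broken markup, which is a large part of why the third-party parsers are usually recommended.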

python xml-parsing html-parsing

— ffledgling

2

Like all the other answerers, I would recommend BeautifulSoup, because it is really good at handling broken HTML.

— Pascal Rosin

195

So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

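try:
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (legacy)
except ImportError:
    from bs4 import BeautifulSoup  # Beautiful Soup 4

# The HTML from the question, condensed onto fewer lines
html = """<html><head>Heading</head><body attr1='val1'>
<div class='container'><div id='class'>Something here</div>
<div>Something else</div></div></body></html>"""
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)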

I guess you don't need performance descriptions; just read up on how BeautifulSoup works. Take a look at its official documentation.
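To make the "tags as Python objects" idea more concrete, here is a minimal sketch that assumes Beautiful Soup 4 (the bs4 package) and the question's HTML. It passes "html.parser" explicitly so that bs4 does not have to guess which parser to use. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

HTML = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>"""

# Name the parser explicitly; "html.parser" is the one built into Python.
soup = BeautifulSoup(HTML, "html.parser")

body = soup.body                                            # navigate by tag name
container = body.find("div", attrs={"class": "container"})  # search by attribute
inner = soup.find(id="class")                               # look up by id

print(body.attrs)                           # attributes behave like a dict
print(inner.text)                           # text of a single tag
print(container.get_text(" ", strip=True))  # all nested text, joined by spaces
print([div.get("id") for div in container.find_all("div")])
```

Each result is a Tag object, so the same navigation and search calls can be chained on it to go further down the tree.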

— Aadaam

2

What exactly is the parsed_html object?

— 2012

1

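parsed_html is a BeautifulSoup object. Think of it as a DOMElement or DOMDocument, except that it has "tricky" properties: for example, "body" refers to the BeautifulSoup object (remember, it's basically a tree node) for the first (and, in this case, only) body element of the root element (html, in our case).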

— Aadaam

18

Just an update: as of BeautifulSoup 4, the import line is now from bs4 import BeautifulSoup

— Bailey Parker

2

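General info: if performance is critical, use the lxml library instead (see the answer below). With cssselect it's quite useful as well, and performance is often 10 to 100 times better than the other libraries available.

— Lenar Hoyt, 2014

Note: the class attribute is special: BeautifulSoup(html).find('div', 'container').text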

— jfs
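A small sketch of what "special" means in practice, assuming Beautiful Soup 4: class is a reserved word in Python, so bs4 offers the class_ keyword and also treats a string passed as the second positional argument as a class name. The sample markup here is made up for the illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two class names on the outer div
HTML = "<body><div class='container main'><div id='class'>Something here</div></div></body>"
soup = BeautifulSoup(HTML, "html.parser")

by_position = soup.find("div", "container")               # second argument = class
by_keyword = soup.find("div", class_="container")         # class_ avoids the keyword
by_attrs = soup.find("div", attrs={"class": "container"}) # explicit attrs dict
by_css = soup.select_one("div.container")                 # CSS selector

# All four spellings return the same Tag object.
print(by_position is by_keyword is by_attrs is by_css)
# class is multi-valued, so it comes back as a list of names.
print(by_position["class"])
```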

85

I guess what you're looking for is pyquery:

pyquery: a jQuery-like library for Python.

An example of what you want might look like:

from pyquery import PyQuery

html = "<html>...</html>"  # your HTML code
pq = PyQuery(html)
tag = pq('div#id')  # or: tag = pq('div.class')
print(tag.text())

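It also uses the same selectors as the inspect element tools in Firefox or Chrome. For example:

[Image: the element selector is 'div#mw-head.noprint']

The inspected element's selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector: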

pq('div#mw-head.noprint')

— 柚三美

2

I love you 3000 for this!

— progyammer

41

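Here you can read more about the different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. For example: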

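from urllib.request import urlopen  # Python 3; this was urllib2 in Python 2
from bs4 import BeautifulSoup       # Beautiful Soup 4

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

# find() returns None when nothing matches, so check before reading .text
div = soup.body.find('div', attrs={'class': 'container'})
x = div.text if div is not None else None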

— au

2

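I was looking for details on features and functionality rather than performance/efficiency. EDIT: Sorry for the premature reply; that link is actually good. Thanks.

— 2012

The first few bulleted lists summarize the features and functionality :)

— Qiau, 2012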

5

If you use BeautifulSoup4 (the latest version): from bs4 import BeautifulSoup

— Franck Dernoncourt, 2014

29

Compared to the other parser libraries, lxml is extremely fast:

And with cssselect it's quite easy to use for scraping HTML pages as well:

from lxml.html import parse

# cssselect() needs the separate cssselect package to be installed
doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html documentation
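As an offline counterpart, here is a rough sketch of how the question's HTML might be queried with lxml.html. It parses a string instead of a URL and uses XPath rather than cssselect, so it only assumes that lxml itself is installed.

```python
from lxml import html as lxml_html

HTML = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>"""

doc = lxml_html.fromstring(HTML)  # root element of the parsed document

container = doc.xpath("//div[@class='container']")[0]  # XPath lookup by class
inner = doc.get_element_by_id("class")                 # lookup by id

print(" ".join(container.text_content().split()))  # nested text, whitespace collapsed
print(inner.text)
print([div.get("id") for div in container.findall("div")])  # direct child divs
```

Elements behave like lists of their children, and get() reads an attribute, returning None when it is missing.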

— Lenar Hoyt

HTTPS is not supported

— Sergio

@Sergio Use import requests (or urllib) and save the buffer to a file: stackoverflow.com/a/14114741/1518921. Then load the saved file with parse: doc = parse('localfile.html').getroot()

— Guilherme Nascimento
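A possible variation on that workaround, sketched under the assumption that saving to a file is not required: download the bytes with the standard library's urllib.request, which handles HTTPS, and pass them straight to lxml.html.fromstring. The function names extract_links and fetch_links are made up for this sketch.

```python
from urllib.request import urlopen

from lxml import html as lxml_html


def extract_links(content):
    """Return (text, href) pairs for every <a> element in an HTML byte string."""
    doc = lxml_html.fromstring(content)
    return [(a.text_content(), a.get("href")) for a in doc.iter("a")]


def fetch_links(url):
    """Download a page (HTTP or HTTPS) and extract its links."""
    with urlopen(url) as response:
        return extract_links(response.read())


# Offline demonstration on a small, made-up document
sample = b"<html><body><a href='https://example.com/'>Example</a></body></html>"
print(extract_links(sample))
```

Calling fetch_links('https://www.google.com/') would perform the download; it is left out here so the example runs without network access.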

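I parse a lot of HTML to get specific data. Doing it with BeautifulSoup took 1.7 seconds, but using lxml instead made it nearly 100 times faster! If you care about performance, lxml is the best option.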

— Alex-Bogdanov

9

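I would recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, but rather a very good string analyzer.

— Love and peace - Joe Codeswell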

3

AIUI, Beautiful Soup can be made to work with most "backend" XML parsers, and lxml seems to be one of the supported ones: crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

— 2014
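A minimal sketch of what choosing a backend looks like, assuming Beautiful Soup 4: the second argument names the parser, and bs4 raises FeatureNotFound when the requested one is not installed. Only "html.parser" ships with Python; "lxml" and "html5lib" are separate packages and may be missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML = "<div class='container'><div id='class'>Something here</div></div>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(HTML, parser)
    except FeatureNotFound:
        results[parser] = None  # this backend is not installed
        continue
    # The search API is the same whichever backend built the tree.
    results[parser] = soup.find("div", id="class").text

print(results)
```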

@ffledgling Some functions of BeautifulSoup are quite slow, however.

— Lenar Hoyt, 2014

2

I recommend using the justext library:

https://github.com/miso-belica/jusText

Usage, Python 2:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print (paragraph.text)

— 卫珊娜

0

I would use EHP

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print ind.text()

Output:

Something here
Something else

— Unknown Soldier

5

Please explain. Why would you use EHP over the popular BeautifulSoup or lxml?

— ChaimG '16