BeautifulSoup: grab visible webpage text

124

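Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article), and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments findAll() needs in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?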

— 用户名
source

239

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

— 博博
source

47

+1 for soup.findAll(text=True); I never knew about that feature.

— Hartley Brody, 2012

7

With recent BS4 (at least), you can identify comments with isinstance(element, Comment) instead of matching against a regular expression.

— 2013
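
To illustrate the comment above, here is a minimal sketch of separating HTML comments from ordinary text nodes with isinstance(). The sample markup and the variable names are made up for illustration and are not taken from the question's test page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup, not the page from the question.
SAMPLE_HTML = "<p>Hello<!-- hidden note --> world</p>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect every text node, comments included.
all_strings = soup.find_all(string=True)

# Comment is a subclass of NavigableString, so isinstance() tells them apart.
comments = [s for s in all_strings if isinstance(s, Comment)]
visible = [s for s in all_strings if not isinstance(s, Comment)]

print(comments)
print(visible)
```

No regular expression is needed here because the parser has already classified the node.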

5

I think line 2 should be soup = BeautifulSoup(html)

— jczaplew

11

In the visible function, the elif for finding comments didn't seem to work. I had to update it to elif isinstance(element, bs4.element.Comment):. I also added 'meta' to the list of parents.

— Russ Savage, 2015

4

The filter above leaves a lot of \n in the result. Add the following code to eliminate whitespace and newlines: elif re.match(r"[\s\r\n]+",str(element)): return False

— 换
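
A possible variant of the whitespace filter from the comment above, as a sketch only. Note that re.match anchors at the start of the string, so the pattern as written would also reject strings that merely begin with whitespace. This version checks whether the whole string is blank. HIDDEN_PARENTS and SAMPLE_HTML are names and markup invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not displayed (same list as the accepted answer).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    """Return True for text nodes that should be kept."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    # Skip nodes that contain only whitespace (spaces, tabs, newlines).
    if not element.strip():
        return False
    return True


# Hypothetical sample markup for illustration.
SAMPLE_HTML = """<html><head><title>T</title></head>
<body>
  <p>First</p>

  <p>Second</p>
  <script>var x = 1;</script>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
kept = [s.strip() for s in soup.find_all(string=True) if tag_visible(s)]
print(kept)
```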

37

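The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to visible text.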

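from bs4 import BeautifulSoup

# Read bytes so that BeautifulSoup can detect the encoding itself.
html = open('21storm.html', 'rb').read()
soup = BeautifulSoup(html, 'html.parser')
# Remove elements whose content is never displayed.
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.get_text()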

— 极客
source

1

If str(element) fails because of encoding problems, you should try unicode(element) instead if you are using Python 2.

— mknaf

31

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# remove all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

— 包金
source
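
The line and chunk cleanup in the answer above can be tried without any network access. This sketch wraps the same three steps in a function and runs them on a made-up string. The name tidy_text and the sample input are assumptions for illustration.

```python
def tidy_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    # Remove leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Treat a run of two spaces as a separator between phrases.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical input resembling the raw output of get_text().
raw = "  Headline one  Headline two \n\n\n   Body text here.  \n"
print(tidy_text(raw))
```

The two headlines end up on separate lines and the blank lines disappear.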

4

The previous answers did not work for me, but this one did :)

— rjurney

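If I try this on the URL imfuna.com, it returns only 6 words (Imfuna Property Inventory and Inspection Apps) even though there is much more text on the page... any idea why this answer doesn't work for that URL? @bumpkin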

— the_t_test_1

10

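I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content of a page.

I had a similar problem getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. Here the non-displayable tag is nested inside a style tag and is not visible in the many browsers I checked. Other variations exist, such as defining a class that sets display to none and then using that class on a div.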

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One of the solutions posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

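This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up: "BeautifulSoup get_text does not strip all tags and JavaScript" and "Rendered HTML to plain text using Python".

I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data.

One answer, from @Helge, was about using nltk, of all things.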

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well, returning a string with the rendered HTML. This nltk module was even faster than html2text, though html2text is perhaps more robust.

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop

— 保罗
source
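
For the two cases described in the answer above (text nested in a style tag, and an element hidden with display set to none), a rough sketch using BeautifulSoup alone might look like this. It only inspects inline style attributes. Rules that come from a stylesheet or a CSS class are not evaluated, so this is an approximation and not a renderer. The function name rendered_text, the DISPLAY_NONE pattern, and the inline-styled div in the sample are assumptions added for illustration.

```python
import re
from bs4 import BeautifulSoup

# Matches an inline "display: none" declaration, ignoring case and spacing.
DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def rendered_text(html):
    """Approximate the rendered text; stylesheet rules are NOT evaluated."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is never displayed.
    for tag in soup(["style", "script", "head", "title"]):
        tag.extract()
    # Drop elements hidden through an inline style attribute.
    for tag in soup.find_all(style=DISPLAY_NONE):
        tag.extract()
    return " ".join(soup.stripped_strings)


# Based on the example in the answer, plus one hypothetical inline-styled div.
SAMPLE_HTML = """<html><title>Title here</title><body>
lots of text here <h1> even headings </h1>
<style type="text/css"><div> this will not be visible </div></style>
<div style="display: none">hidden inline</div>
</body></html>"""

print(rendered_text(SAMPLE_HTML))
```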

3

If you care about performance, here is another, more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator that yields NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.

— 波兰啤酒
source

2

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with the ID "article".

soup.findAll('nyt_headline', limit=1)

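should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with the ID "articleBody". Inside the <nyt_text> element, the text itself is contained in <p> tags. Images are not inside those <p> tags. It is hard for me to experiment with the syntax, but I would expect a working scrape to look something like this.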

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')

— 伊万·托德
source
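
A small offline sketch of the approach in the answer above. The markup below is a made-up imitation of the structure the answer describes (a div with the ID "articleBody" wrapping an nyt_text element), not the real page. It uses find() so that a missing element yields an empty list, whereas indexing an empty result raises an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the answer.
SAMPLE_HTML = """
<div id="articleBody">
  <nyt_text>
    <p>First paragraph.</p>
    <img src="photo.jpg">
    <p>Second paragraph.</p>
  </nyt_text>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns None when the element is missing.
body = soup.find("nyt_text")
paragraphs = [p.get_text(strip=True) for p in body.find_all("p")] if body else []
print(paragraphs)
```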

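I'm sure this works for this test case, but I'm looking for a more generic answer that can be applied to various other websites... So far I've tried using regular expressions to find <script></script> tags and <!-- .* --> comments and replace them with "", but even that is proving kind of difficult for some reason.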

— 。–

2

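While I would completely recommend Beautiful Soup in general, if anyone needs to display the visible parts of malformed HTML (for example, when you only have a segment or a single line of a web page) for whatever reason, the following removes the content between < and > characters:

import re   # only use with malformed html; this is not efficient

def display_visible_html_using_re(text):
    # Remove every <...> tag; the text between tags is kept.
    return re.sub(r"<.*?>", "", text)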

— 凯里尼亚
source
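
One thing worth seeing with the regex approach above: it removes only the tags themselves, so the body of a script element survives as text. A minimal sketch follows. The name strip_tags and the sample fragment are invented for the example.

```python
import re

# Non-greedy match for anything that looks like <...>.
TAG_RE = re.compile(r"<.*?>")


def strip_tags(text):
    """Remove tag markup only; element contents are kept as-is."""
    return TAG_RE.sub("", text)


# Hypothetical fragment containing a script element.
fragment = "<p>Visible <b>text</b></p><script>var hidden = 1;</script>"
print(strip_tags(fragment))
```

The JavaScript source ends up in the output, which is why this is only suitable for small fragments where no parser can be used.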

2

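Using BeautifulSoup is the easiest way, with less code, to get just the strings, without empty lines and junk.

tag = "<Parent_Tag_that_contains_the_data>"  # placeholder: the HTML string to parse
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print(repr(i))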

— Diego Suarez
source
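
A self-contained sketch of the stripped_strings approach above, using a made-up fragment in place of the placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the parent tag's HTML.
fragment = """
<div id="content">
    <h2>  Storm update  </h2>

    <p>Snow fell overnight.</p>
    <p>   </p>
</div>
"""

soup = BeautifulSoup(fragment, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are empty after stripping.
strings = list(soup.stripped_strings)
print(strings)
```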

0

The simplest way to handle this case is to use getattr(). You can adapt this example to suit your needs:

from bs4 import BeautifulSoup

source_html = """
<span class="ratingsDisplay">
    <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
        <span class="ratingsContent">3.7</span>
    </a>
</span>
"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)

If the tag exists, this finds the text "3.7" inside the tag object <span class="ratingsContent">3.7</span>; if the tag does not exist, it defaults to None.

getattr(object, name[, default])

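Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided; otherwise AttributeError is raised.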

— 戴维·耶灵顿
source
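
The answer above shows the case where the tag is present. This sketch covers both cases side by side. The helper name rating_text and the second markup string are assumptions for illustration, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def rating_text(html):
    """Return the rating text, or None when the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("span", {"class": "ratingsContent"})
    # find() returns None on a miss; getattr() then falls back to the default.
    return getattr(node, "text", None)


with_rating = '<span class="ratingsContent">3.7</span>'
without_rating = '<span class="other">n/a</span>'  # hypothetical non-matching markup

print(rating_text(with_rating))
print(rating_text(without_rating))
```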

0

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    if re.match(r"[\n]+",str(element)): return False
    return True
def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))

— 坎兰·考萨
source