Test whether an attribute is present in a tag in BeautifulSoup

74

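I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.

For example, for each <script> tag, if the attribute for is present, do something; otherwise, if the attribute bar is present, do something else.

Here is what I am currently doing: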

outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})

But this way I only get the <script> tags that have the for attribute, and I lose the others (those without the for attribute).

python beautifulsoup

— LB40

1

"But if ... in doesn't work"? What does that mean? A syntax error? What do you mean by "doesn't work"? Please be very specific about what goes wrong.

— S.Lott

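Do you want to test whether the attribute is present in any tag, in all tags, or do you want to handle each occurrence of the tag separately?

— 2011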

111

If I understand correctly, you just want all the script tags, and then to check each one for certain attributes?

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
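
A minimal sketch of the whole flow, assuming the bs4 package is installed. The sample HTML, the handle_scripts helper, and the labels it returns are made up for illustration. It calls find_all once and branches on has_attr for each tag, so the tags without for are not lost:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; "for" and "bar" are the attributes from the question.
SAMPLE_HTML = """
<script for="window" event="onload">a()</script>
<script bar="1">b()</script>
<script>c()</script>
"""

def handle_scripts(html):
    """Classify every <script> tag by which attribute it carries."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for script in soup.find_all("script"):  # one query returns all script tags
        if script.has_attr("for"):
            results.append("has for")
        elif script.has_attr("bar"):
            results.append("has bar")
        else:
            results.append("neither")
    return results

print(handle_scripts(SAMPLE_HTML))
```

Replace the appended labels with whatever processing each case needs.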

— Lucas S.

Can't I do something like: if 'some_attribute' in script? That's what I'm after, and I would like to avoid calling findAll again and again...

— LB40, 2011

5

To check for available attributes you have to use the Python dict methods, e.g.: script.has_key('some_attribute')

— Lucas S.

1

How do I check whether a tag has any attributes at all? tag.has_key('some_attribute') works fine, but tag.keys() raises an exception ('NoneType' object is not callable).

— Georg Pfolz, 2013
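
One way around this, as a sketch: a tag's attributes live in the tag.attrs dictionary, so ordinary dict operations can be applied to that instead. As far as I can tell, tag.keys() fails because Beautiful Soup treats an unknown attribute name on a tag as a search for a child tag of that name, which returns None here. The sample markup is invented:

```python
from bs4 import BeautifulSoup

# Invented sample: one tag with attributes, one without.
soup = BeautifulSoup(
    '<script src="a.js" async></script><script></script>', "html.parser"
)
with_attrs, without_attrs = soup.find_all("script")

# tag.attrs is a plain dict mapping attribute names to values.
print(sorted(with_attrs.attrs))   # the attribute names
print(bool(with_attrs.attrs))     # truthy when the tag has any attribute
print(bool(without_attrs.attrs))  # falsy when it has none
print("src" in with_attrs.attrs)  # membership test on the dict
```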

12

Please update this post: has_key is deprecated. Use has_attr instead.

— RvdK, 2014

3

Sadly, this didn't work for me. Maybe a check like soup_response.find('err').string is not None would also work for other attributes...

— im_infamous

32

For future reference: has_key is deprecated in BeautifulSoup 4. You now need to use has_attr.

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()

— miah

english.stackexchange.com/questions/45295/...

— Gallaecio

@gallaecio Fixed.

— miah

@e-info128 I think you may need to post your code as a separate question.

— miah

@e-info128 I am getting the same error. Did you find a solution?

— Harshil Doshi

31

You don't need any lambdas to filter by attribute; you can simply pass some_attribute=True to find or find_all.

script_tags = soup.find_all('script', some_attribute=True)

# or

script_tags = soup.find_all('script', {"some-data-attribute": True})
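
An attribute whose name cannot be written as a Python keyword argument, such as the for from the question (a reserved word) or a hyphenated data-* name, has to go through the attrs dictionary. A small sketch with invented markup, which also picks out the tags lacking the attribute by using an explicit predicate:

```python
from bs4 import BeautifulSoup

html = (
    '<script for="window"></script>'
    '<script data-main="app"></script>'
    "<script></script>"
)
soup = BeautifulSoup(html, "html.parser")

# for=True would be a syntax error, so the name is passed as a dictionary key.
with_for = soup.find_all("script", attrs={"for": True})
with_data = soup.find_all("script", attrs={"data-main": True})

# Script tags that do NOT carry the attribute.
without_for = soup.find_all(
    lambda tag: tag.name == "script" and not tag.has_attr("for")
)

print(len(with_for), len(with_data), len(without_for))
```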

Here are more examples of other approaches:

import bs4

soup = bs4.BeautifulSoup(html)

# Find all with a specific attribute

tags = soup.find_all(src=True)
tags = soup.select("[src]")

# Find all meta with either name or http-equiv attribute.

soup.select("meta[name],meta[http-equiv]")

# Find any tags with a name or src attribute.

soup.select("[name], [src]")

# find first/any script with a src attribute.

tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")

# Find all tags with a name attribute beginning with foo
# or any src beginning with /path
soup.select('[name^="foo"], [src^="/path"]')

# Find all tags with a name attribute that contains foo
# or any src containing whatever
soup.select('[name*="foo"], [src*="whatever"]')

# Find all tags with a name attribute that ends with foo
# or any src that ends with whatever
soup.select('[name$="foo"], [src$="whatever"]')
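
To see these selectors return something concrete, here is a small sketch against an invented document. It assumes a Beautiful Soup version whose select is backed by the soupsieve package. Attribute values are quoted so that characters such as / parse reliably, and :not([src]) picks the tags that lack the attribute:

```python
from bs4 import BeautifulSoup

html = """
<meta name="foobar" content="x">
<meta http-equiv="refresh" content="5">
<script src="/path/app.js"></script>
<script src="cdn/whatever.js"></script>
<script></script>
"""
soup = BeautifulSoup(html, "html.parser")

has_src = soup.select("script[src]")         # attribute present
starts = soup.select('[src^="/path"]')       # value starts with /path
contains = soup.select('[src*="whatever"]')  # value contains whatever
ends = soup.select('[name$="bar"]')          # value ends with bar
no_src = soup.select("script:not([src])")    # attribute absent

print(len(has_src), len(starts), len(contains), len(ends), len(no_src))
```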

You can also use regular expressions with find or find_all:

import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with 
soup.find_all("script", src=re.compile("whatever$"))
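
A runnable version of the same idea on invented markup. Note that a script tag with no src at all is simply skipped by the regular-expression filter:

```python
import re
from bs4 import BeautifulSoup

html = (
    '<script src="lib/jquery.js"></script>'
    '<script src="app.min.js"></script>'
    "<script></script>"
)
soup = BeautifulSoup(html, "html.parser")

starts = soup.find_all("script", src=re.compile(r"^lib/"))     # starts with
ends = soup.find_all("script", src=re.compile(r"\.min\.js$"))  # ends with
any_js = soup.find_all("script", src=re.compile(r"\.js"))      # contains

print([t["src"] for t in starts], [t["src"] for t in ends], len(any_js))
```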

— Padraic Cunningham

I agree this should be the accepted answer. I simplified the main example to make it stand out more.

— mihow

17

If you only need to get the tags that have a given attribute, you can use a lambda:

soup = bs4.BeautifulSoup(YOUR_CONTENT)

Tags with an attribute

tags = soup.find_all(lambda tag: 'src' in tag.attrs)

or

tags = soup.find_all(lambda tag: tag.has_attr('src'))

A specific tag with an attribute

tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)

And so on...

Thought this might be useful.
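
Because the filter is just a function, conditions on several attributes can be combined. A sketch, where has_all_attrs is a hypothetical helper (not part of bs4) and the markup is invented:

```python
from bs4 import BeautifulSoup

def has_all_attrs(*names):
    """Hypothetical helper: build a predicate matching tags that carry every named attribute."""
    return lambda tag: all(tag.has_attr(name) for name in names)

soup = BeautifulSoup(
    '<script src="a.js" defer></script>'
    '<script src="b.js"></script>'
    '<img src="c.png">',
    "html.parser",
)

# Only the first tag has both src and defer.
both = soup.find_all(has_all_attrs("src", "defer"))
print([tag["src"] for tag in both])
```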

— Guest

1

Elegant solution!

— Andor

3

You can check whether a given attribute is present:

scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()

— Charles Ma

1

Using the pprint module, you can inspect the contents of an element.

from pprint import pprint

pprint(vars(element))

Running this on a bs4 element prints something similar to the following:

{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
 'can_be_empty_element': False,
 'contents': [u'\n\t\t\t\tNESNA\n\t'],
 'hidden': False,
 'name': u'span',
 'namespace': None,
 'next_element': u'\n\t\t\t\tNESNA\n\t',
 'next_sibling': u'\n',
 'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
 'parser_class': <class 'bs4.BeautifulSoup'>,
 'prefix': None,
 'previous_element': u'\n',
 'previous_sibling': u'\n'}

To access an attribute (say, the class list), use the following:

class_list = element.attrs.get('class', [])
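
A short sketch of what that lookup returns, on invented markup. With the parser used here, class comes back as a list of names, and the default covers tags that have no class at all:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span class="size-3 name">x</span><span>y</span>', "html.parser"
)
first, second = soup.find_all("span")

print(first.attrs.get("class", []))   # list of class names
print(second.attrs.get("class", []))  # the default, since the attribute is missing
print(second.get("class"))            # Tag.get returns None when the attribute is absent
```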

You can filter elements using the following approach:

for script in soup.find_all('script'):
    if script.has_attr('for'):
        pass  # ... has a 'for' attribute (even if its value is empty)
    elif "myClass" in script.attrs.get('class', []):
        pass  # ... has class "myClass"
    else:
        pass  # ... do something else

— Adam Salma