How to find elements by class

386

I'm having trouble using Beautifulsoup to parse HTML elements that have a "class" attribute. The code looks like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

I get an error on that same line after the script finishes.

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

How do I get rid of this error?

— Neo

646

With BS3 you can refine the search so that it only finds the divs with a given class:

mydivs = soup.findAll("div", {"class": "stylelistrow"})
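The same filter can be sketched under Beautiful Soup 4 (an assumption, since the answer above targets BS3). The `sdata` string below is made-up sample markup, not the asker's data. The sketch also illustrates why the loop in the question raised KeyError: indexing a tag that has no class attribute fails, while `tag.get` returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the question's `sdata`.
sdata = """
<div class="stylelistrow">first</div>
<div>no class attribute here</div>
<div class="other">second</div>
"""

soup = BeautifulSoup(sdata, "html.parser")

# Let the library filter by class instead of indexing every div.
mydivs = soup.find_all("div", {"class": "stylelistrow"})
print([div.get_text() for div in mydivs])

# Indexing a tag without the attribute raises KeyError; .get() does not.
plain_div = soup.find_all("div")[1]
print(plain_div.get("class"))
```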

— Klaus Byskov Pedersen

@Klaus - what if I want to use findAll instead?

1

Thanks for this. It works not only for @class but for any attribute.

— prageeth

41

This only works for exact matches. <.. class="stylelistrow"> matches, but <.. class="stylelistrow button"> does not.

— 2014

4

@pyCthon See @jmunsch's answer; BS now supports class_, which works properly.

— 2014

25

As of beautifulsoup4, findAll is now find_all

— Neoecos

273

From the documentation:

As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")

Which in this case would be:

soup.find_all("div", class_="stylelistrow")

It also works for:

soup.find_all("div", class_="stylelistrowone stylelistrowtwo")
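A small sketch of how the two forms differ, assuming Beautiful Soup 4 with the built-in "html.parser" and made-up sample markup. A single class name matches any tag carrying that class, while a space-separated string is compared against the whole attribute value, so the order of the names matters.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div class="stylelistrow">a</div>
<div class="stylelistrow button">b</div>
<div class="button stylelistrow">c</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One class name: matches every div that carries that class.
single = soup.find_all("div", class_="stylelistrow")

# Two names in one string: matches only the exact attribute value.
exact = soup.find_all("div", class_="stylelistrow button")

print([d.get_text() for d in single])
print([d.get_text() for d in exact])
```

To match both classes regardless of order, a CSS selector such as `soup.select("div.stylelistrow.button")` is usually the safer choice.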

— jmunsch

5

You can also use a list: soup.find_all("a", ["stylelistrowone", "stylelistrow"]). That is safer if you don't have many classes.

— Nuno André

4

This should be the accepted answer; it is both more correct and more concise than the alternatives.

— goncalopp

1

A supplement to @NunoAndré's comment, for BeautifulSoup 3: soup.findAll("a", {'class':['stylelistrowone', 'stylelistrow']}).

— Brad

55

Update: 2016. In the latest version of beautifulsoup, the method "findAll" has been renamed to "find_all". See the official documentation.

So the answer will be

soup.find_all("html_element", class_="your_class_name")

— overlord

18

Specific to BeautifulSoup 3:

soup.findAll('div',
             {'class': lambda x: x 
                       and 'stylelistrow' in x.split()
             }
            )

This will find all of these:

<div class="stylelistrow">
<div class="stylelistrow button">
<div class="button stylelistrow">
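The same idea can be sketched for Beautiful Soup 4 (an assumption; the answer above is BS3-specific) by passing a function to class_. The helper name `has_stylelistrow` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div class="stylelistrow">a</div>
<div class="stylelistrow button">b</div>
<div class="button stylelistrow">c</div>
<div class="stylelistrow2">d</div>
<div>e</div>
"""
soup = BeautifulSoup(html, "html.parser")

def has_stylelistrow(css_class):
    # css_class can be None for tags without a class attribute,
    # so guard before splitting on whitespace.
    return bool(css_class) and "stylelistrow" in css_class.split()

matches = soup.find_all("div", class_=has_stylelistrow)
print([div.get_text() for div in matches])
```

Splitting on whitespace compares whole class names, which is why "stylelistrow2" is not picked up the way a substring or regex search would.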

— FlipMcF

Why not re.search('.*stylelistrow.*', x)?

— rjurney, 2015

Because then stylelistrow2 would match. A better comment would be "why not use string.find() instead of re?"

— FlipMcF, 2015

2

lambda x: 'stylelistrow' in x.split() is simple and beautiful

— fferri

I hate regular expressions. Thanks! (answer updated) | Keeping the "x and" to test for None

— FlipMcF

16

A straightforward way would be:

soup = BeautifulSoup(sdata)
for each_div in soup.findAll('div',{'class':'stylelist'}):
    print each_div

Make sure you spell it findAll (capital A), not findall.

— Konark Modi

4

This only works for exact matches. <.. class="stylelistrow"> matches, but <.. class="stylelistrow button"> does not.

— 2014

11

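How to find elements by class

I'm having trouble using Beautifulsoup to parse html elements with a "class" attribute.

You can easily find elements by one class, but finding elements by the intersection of two classes is a little more difficult.

From the documentation (emphasis added):

If you want to search for tags that match two or more CSS classes, you should use a CSS selector: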
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

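To be clear, this selects only the p tags that have both the strikeout class and the body class.

To find tags that match any class in a set (the union, not the intersection), you can pass a list to the class_ keyword argument (as of 4.1.2):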

soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list)
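A side-by-side sketch of the two behaviours, assuming Beautiful Soup 4 (4.7 or later, so that select is backed by a full CSS selector engine) and made-up sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<p class="body strikeout">both</p>
<p class="body">body only</p>
<p class="strikeout">strikeout only</p>
<p class="other">neither</p>
"""
css_soup = BeautifulSoup(html, "html.parser")

# Intersection: the tag must carry both classes, in any order.
both = css_soup.select("p.strikeout.body")

# Union: the tag must carry at least one class from the list.
either = css_soup.find_all("p", class_=["strikeout", "body"])

print([p.get_text() for p in both])
print([p.get_text() for p in either])
```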

Also note that findAll has been renamed from camelCase to the more Pythonic find_all.

— Aaron Hall

11

CSS selectors

Single class, first match

soup.select_one('.stylelistrow')

List of matches

soup.select('.stylelistrow')

Compound class (i.e. AND another class)

soup.select_one('.stylelistrow.otherclassname')
soup.select('.stylelistrow.otherclassname')

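Spaces in a compound class value, e.g. class="stylelistrow otherclassname", are replaced with "." in the selector. You can keep chaining more classes.

List of classes (OR: match whichever is present)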

soup.select_one('.stylelistrow, .otherclassname')
soup.select('.stylelistrow, .otherclassname')

bs4 4.7.1 +

Specific class whose innerText contains a string

soup.select_one('.stylelistrow:contains("some string")')
soup.select('.stylelistrow:contains("some string")')

Specific class that has a particular child element, e.g. an a tag

soup.select_one('.stylelistrow:has(a)')
soup.select('.stylelistrow:has(a)')
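The selectors above can be tried on a small document. This is a sketch assuming bs4 4.7.1 or later and made-up sample markup; the :contains form is left out because newer releases of the underlying selector library reportedly prefer the spelling :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div class="stylelistrow otherclassname"><a href="#">link</a></div>
<div class="stylelistrow">plain</div>
<div class="otherclassname">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".stylelistrow")                  # first match or None
rows = soup.select(".stylelistrow")                       # every match
compound = soup.select(".stylelistrow.otherclassname")    # AND
either = soup.select(".stylelistrow, .otherclassname")    # OR
with_link = soup.select(".stylelistrow:has(a)")           # has an <a> descendant

print(first.get_text(), len(rows), len(compound), len(either), len(with_link))
```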

— QHarr

5

As of BeautifulSoup 4+,

If you have a single class name, you can just pass the class name as the second argument:

mydivs = soup.find_all('div', 'class_name')

Or, if you have more than one class name, pass the list of class names as the second argument:

mydivs = soup.find_all('div', ['class1', 'class2'])

— Shivam Shah

3

Try checking whether the div has a class attribute first, like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    if "class" in div:
        if (div["class"]=="stylelistrow"):
            print div

— Mew

1

That doesn't work. I think your approach is right, but the 4th line doesn't work as intended.

— Neo

1

Ah, I thought div worked like a dictionary. I'm not really familiar with Beautiful Soup, so that was just a guess.

— Mew

3

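This works for me to access the class attribute (on beautifulsoup 4, contrary to what the documentation says). The KeyError arises because a list is returned, not a dictionary.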

for hit in soup.findAll(name='span'):
    print hit.contents[1]['class']
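For reference, a minimal sketch of what the class attribute looks like in Beautiful Soup 4, using made-up markup: the value is a list of class names, and `get` with a default avoids KeyError on tags that have no class.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup('<span class="a b">x</span><span>y</span>', "html.parser")
first, second = soup.find_all("span")

# class is a multi-valued attribute, so bs4 exposes it as a list.
print(first["class"])
print("b" in first["class"])

# A tag without the attribute: use .get() with a default instead of indexing.
print(second.get("class", []))
```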

— Stigz

3

The following worked for me

a_tag = soup.find_all("div",class_='full tabpublist')

— Preetham DP

1

This worked for me:

for div in mydivs:
    try:
        clazz = div["class"]
    except KeyError:
        clazz = ""
    if (clazz == "stylelistrow"):
        print div

— Larry Symms

1

Alternatively, we can use lxml, which supports XPath and is very fast!

from lxml import html, etree 

attr = html.fromstring(html_text)  # pass in the raw HTML
handles = attr.xpath('//div[@class="stylelistrow"]')  # XPath expression that finds that specific class

for each in handles:
    print(etree.tostring(each))#printing the html as string
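Note that @class="stylelistrow" compares the whole attribute value. A common XPath 1.0 idiom for matching one class among several pads the value with spaces and uses contains; the sketch below assumes lxml is installed and uses made-up markup.

```python
from lxml import html

# Hypothetical sample document for illustration only.
html_text = """
<html><body>
<div class="stylelistrow">a</div>
<div class="stylelistrow button">b</div>
<div class="stylelistrow2">c</div>
</body></html>
"""
doc = html.fromstring(html_text)

# Exact comparison against the full attribute value.
exact = doc.xpath('//div[@class="stylelistrow"]')

# Token match: pad with spaces so only whole class names match.
token = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " stylelistrow ")]'
)

print([d.text_content() for d in exact])
print([d.text_content() for d in token])
```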

— Sohan Das

0

This should work:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if div.find(class_="stylelistrow"):
        print div

— Blue Sky

0

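The other answers did not work for me.

In the other answers, findAll is called on the soup object itself, but I needed a way to search by class name inside a specific element taken from the results of an earlier findAll.

If you are trying to search inside nested HTML elements to get objects by class name, try the following: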

# parse html
page_soup = soup(web_page.read(), "html.parser")

# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")

# traverse through all_songs
for song in all_songs:

    # get text out of span element matching class 'song_name'
    # doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
    song.find("span", "song_name").text

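Points to note:

I am not explicitly naming the 'class' attribute, as in findAll("li", {"class": "song_item"}), because it is the only attribute I am searching on; if you do not say which attribute to search, the second argument is matched against the class attribute by default.
When you call findAll, the result is a bs4.element.ResultSet, which is a subclass of list whose items are Tag objects; find returns a single Tag (or None). You can call find or findAll again on any of those Tag objects, to any depth of nesting.
My BS4 version: 4.9.1; Python version: 3.8.1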
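A self-contained version of the nested lookup might look like the sketch below. The markup is made up, and the guard against a missing span is an addition for safety rather than part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page for illustration only.
web_page = """
<ul>
  <li class="song_item"><span class="song_name">First Song</span></li>
  <li class="song_item"><span class="song_name">Second Song</span></li>
  <li class="ad">not a song</li>
</ul>
"""
page_soup = BeautifulSoup(web_page, "html.parser")

names = []
# Outer search: every <li> that carries the class "song_item".
for song in page_soup.find_all("li", "song_item"):
    # Inner search: runs only within this <li> element.
    name_tag = song.find("span", "song_name")
    if name_tag is not None:  # find returns None when nothing matches
        names.append(name_tag.text)

print(names)
```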

— ZeroFlex

0

The following should work

soup.find('span', attrs={'class':'totalcount'})

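Replace 'totalcount' with your class name and 'span' with the tag you are looking for. Also, if your class attribute contains multiple names separated by spaces, just pick one of them and use it.

P.S. This finds the first element that meets the given criteria. If you want to find all elements, replace 'find' with 'find_all'.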
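A short sketch of the difference, assuming Beautiful Soup 4 and made-up markup. It also shows that find returns None when nothing matches, so the result is worth checking before use.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = '<span class="totalcount big">42</span><span class="totalcount">7</span>'
soup = BeautifulSoup(html, "html.parser")

# find: the first matching element only.
first = soup.find("span", attrs={"class": "totalcount"})

# find_all: every matching element.
every = soup.find_all("span", attrs={"class": "totalcount"})

# No match: find gives None rather than raising an error.
missing = soup.find("span", attrs={"class": "nosuchclass"})

print(first.text, [s.text for s in every], missing)
```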

— Hari Sultan