How can I retrieve the page title of a webpage using Python?

78

How can I retrieve the page title of a webpage (the title HTML tag) using Python?

python html

— cschol
source

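Since this question was asked, many web pages have started using the og:title meta tag, which contains the original title, while <title> often carries prefixes and suffixes with other data. Originally og:title was used only by Facebook as part of OpenGraph, but many sites now serve OpenGraph metadata. og:title has become a standard source for page titles, especially for news articles.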

— Nicolas
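
One way to act on the comment above is to prefer og:title and fall back to <title> when it is absent. The following is a minimal sketch using only the standard library; the names TitleExtractor and page_title and the sample markup are made up for illustration, and it assumes the page has already been downloaded and decoded to a string.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect og:title and the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_parts = []
        self._in_title = False
        self._title_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:title":
            # Keep only the first og:title that appears.
            if self.og_title is None:
                self.og_title = attrs.get("content")
        elif tag == "title" and not self._title_done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title_parts.append(data)


def page_title(html_text):
    """Return og:title if present, else the <title> text, else None."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    if parser.og_title:
        return parser.og_title.strip()
    title = "".join(parser.title_parts).strip()
    return title or None


# Hypothetical sample page: <title> carries site prefixes, og:title does not.
sample = (
    '<html><head><title>Site | Breaking: Example story | News</title>'
    '<meta property="og:title" content="Example story"></head></html>'
)
print(page_title(sample))
```

Whether og:title or <title> is the better source depends on the site, so the order of preference here is a design choice rather than a rule.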

64

I always use lxml for tasks like this. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print t.find(".//title").text

Edit based on a comment:

from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print p.find(".//title").text

— Peter Hoffmann
source

5

In case you get an IOError with the above code: stackoverflow.com/questions/3116269/...

— Yosh

1

lxml may have problems with Unicode; you can use bs4.UnicodeDammit to help it find the correct character encoding.

— jfs

91

Here is a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

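Note:

soup.title finds the first title element anywhere in the HTML document
title.string assumes the element has only one child node, and that this child node is a string

For BeautifulSoup 4.x, use a different import: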

from bs4 import BeautifulSoup

— jfs
source

7

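Thanks! In case anyone runs into a similar problem: in my Python 3 environment I had to use urllib.request instead of urllib2. Not sure why. To avoid the BeautifulSoup warning about the parser, I had to write soup = BeautifulSoup(urllib.request.urlopen(url), "lxml").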

— sudo

For Python 3, use import urllib.request as urllib instead of import urllib2

— blueray
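
The two comments above come down to where urlopen lives in each Python version. A small sketch of an import that works on both is shown below; the helper name fetch is hypothetical and not part of either library.

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen lives in urllib2
    from urllib2 import urlopen


def fetch(url):
    """Return the raw response body as bytes."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

The result of fetch(url) can then be passed to whichever parser the answers in this thread use.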

Note that if the title is missing or empty (<title></title>), evaluating soup.title.string will return None.

— Eitanmg

@Eitanmg: indeed, repl.it

— jfs

14

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()

— codeape
source

13

No need to import other parsing libraries. The text returned by requests can be sliced directly.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'

— Rahul Chawla
source

11

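This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regular expressions or some other parser to parse the HTML).

Links: BeautifulSoup, mechanize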

#!/usr/bin/env python
#coding:utf-8

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data() 

#This parses the content
soup = BeautifulSoup(data)
title = soup.find('title')

#This outputs the content :)
print title.renderContents()

— Vinko Vrsalovic
source

6

Use soup.select_one to target the title tag

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

— QHarr
source

6

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
html_string = str(urlopen(url).read())

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain

— Finn
source

It is worth mentioning that this script is for Python 3. In Python 3.x, the HTMLParser module was renamed to html.parser. Similarly, urllib.request was added in Python 3.

— satishgoda

1

It might be better to convert the bytes to a string explicitly: r = urlopen(url), encoding = r.info().get_content_charset(), and html_string = r.read().decode(encoding).

— reubano
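
The suggestion above can be sketched as follows. One caveat is that get_content_charset() returns None when the server declares no charset, so this sketch falls back to UTF-8 in that case; that fallback, the errors="replace" policy, and the names decode_body and fetch_html are all assumptions made for illustration.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode raw bytes using the declared charset, or a fallback."""
    try:
        return raw.decode(charset or fallback, errors="replace")
    except LookupError:
        # The server declared an encoding name Python does not know.
        return raw.decode(fallback, errors="replace")


def fetch_html(url):
    """Download a page and return its body as text."""
    with urlopen(url) as response:
        charset = response.info().get_content_charset()  # may be None
        return decode_body(response.read(), charset)
```

The string returned by fetch_html(url) can be passed to parser.feed() in place of str(urlopen(url).read()).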

4

Using regular expressions

import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'

— Finn
source

What exactly is .group(1)? Is there a reference?

— pije76

Hi, group(0) would return the entire match. See the match objects documentation for reference.

— Finn

1

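This will miss every case where the title tag is not formatted exactly as <title></title> (uppercase, mixed case, spacing).

— Luke Rehmann

In case there is other data in the title tag, I would also include <title.*?>.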

— Pranav Wadhwa
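
The points raised in the comments (case, spacing, attributes on the tag) can be folded into the pattern. This is only a sketch, and a regular expression is still not a real HTML parser; the names TITLE_RE and regex_title are made up for this example.

```python
import re

# Case-insensitive, allows attributes on <title ...>, whitespace before the
# closing ">", and titles that span several lines (DOTALL).
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def regex_title(raw_html, default="No title"):
    """Return the text of the first <title> element, or a default."""
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    # group(1) is the first parenthesised group; collapse runs of whitespace.
    return " ".join(match.group(1).split())


print(regex_title('<TITLE lang="en">\n  Hello\n  World </Title >'))
```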

1

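soup.title.string actually returns a unicode string. To convert it into a normal string, you need to do string=string.encode('ascii','ignore')

— Sai Kiriti Badam
source

This will just remove all non-ASCII characters, which is probably not what you want. If you really want bytes (which is what encode gives you) rather than a string, encode with the correct charset, for example string.encode('utf-8').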

— reubano
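
The difference between the two encodings can be seen on a title that contains an en dash. This is a small Python 3 illustration; the sample string is the IMDb title quoted in another answer in this thread.

```python
title = "Friends (TV Series 1994\u20132004) - IMDb"

# 'ignore' silently drops the en dash, so the two years run together.
ascii_bytes = title.encode("ascii", "ignore")
print(ascii_bytes)

# UTF-8 keeps every character and decodes back to the same string.
utf8_bytes = title.encode("utf-8")
print(utf8_bytes.decode("utf-8") == title)
```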

1

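Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without breaking it; if anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it converts it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data to UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().
By default, HTMLParser() will break on broken HTML; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.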

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before feeding, so it is in
        # effect while the page is being parsed.
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))

— Ricky Wilson
source

0

Using lxml...

Getting it from the page meta tags, marked up according to the Facebook OpenGraph protocol:

import lxml.html
html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

Or using .xpath with lxml:

t = html_doc.xpath(".//title")[0].text

— markling
source