How do you parse HTML with a variety of languages and libraries?
When answering:
Individual comments will be linked to in answers to questions about how to parse HTML with regular expressions, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to its documentation. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Answers:
Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using Firebug's console.debug for output...)

And to load any HTML page:

$.get('http://stackoverflow.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

This one uses a different form of the each function, which I think is cleaner when chaining methods.
Language: C#
Library: HtmlAgilityPack

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");
        // select only anchors that actually carry an href attribute
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        foreach (var node in nodes)
        {
            Console.WriteLine(node.Attributes["href"].Value);
        }
    }
}
Language: Python
Library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)  # find <a> with a defined href attribute
print links
Output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

It is also possible to:

for link in links:
    print link['href']

Output:

http://foo.com
http://bar.com
http://baz.com
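The import above is the legacy BeautifulSoup 3 package. As a rough sketch, assuming the newer beautifulsoup4 package is installed instead, the same extraction under bs4 might look like:

from bs4 import BeautifulSoup  # beautifulsoup4; assumed installed

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html, "html.parser")  # explicit stdlib parser backend
for a in soup.find_all('a', href=True):    # bs4 spells findAll as find_all
    print(a['href'])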
Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);
Language: shell
Library: lynx (well, it's not a library, but in the shell every program is a library of sorts)
lynx -dump -listonly http://news.google.com/
Language: Python
Library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
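This is Python 2 code; as a minimal sketch (assuming Python 3, where the module moved to html.parser), the same parser would look like:

from html.parser import HTMLParser  # Python 3 location of the module

class FindLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print(at['href'])

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)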
Language: Perl
Library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
Language: Perl
Library: HTML::LinkExtor

Perl's strength is that you have modules for very specific tasks, like link extraction. The whole program:
#!/usr/bin/perl -w

use strict;
use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };
    print "- $attr{'href'}\n";
    return;
}
That's all there is to it.
Language: Common Lisp
Library: Closure HTML, Closure XML, CL-WHO

(shown using the DOM API, without using the XPATH or STP APIs)
(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=>
("http://foo.com/" "http://bar.com/" "http://baz.com/")
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

A selector expression:

(def test-select
  (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at a REPL (I've added line breaks in test-select):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
{:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
{:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
To try this out yourself, you'll need the following.

Preamble:

(require '[net.cgrand.enlive-html :as html])

The test HTML:

(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))
Language: Perl
Library: XML::Twig

#!/usr/bin/perl

use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url     = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;
Caveat: wide-character errors can occur with pages like this one (changing the URL to the commented-out one triggers that error), whereas the HTML::Parser solution above does not share the problem.
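To illustrate the general shape of that class of problem (this is not the original author's fix), here is a small Python 2 sketch: fetch the raw bytes and decode them explicitly before printing, so multi-byte characters are handled deliberately. The URL and charset here are assumptions.

import urllib2

raw = urllib2.urlopen('http://www.google.com').read()  # raw bytes
# Assumed charset; a robust version would read it from the
# Content-Type header or the page's <meta> tag instead.
text = raw.decode('utf-8', 'replace')
print text.encode('utf-8')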
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
Language: Java
Library: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.
import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}
By default, TagSoup adds the XHTML namespace to the document it produces; in this sample I chose to suppress that. Using the default behavior would instead require the call to root.query to include the namespace, like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()))
Language: C#
Library: System.Xml (standard .NET)

using System.Collections.Generic;
using System.Xml;

public class Program
{
    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();
        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");
        FindHrefs(xd.FirstChild, matches);
    }

    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].InnerXml);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }
}
Language: JavaScript
Library: DOM

var links = document.links;
for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (href != null) console.debug(href);
}

(using Firebug's console.debug for output...)
Language: Racket
Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))
The same example as above, using the packages from the newer package system: html-parsing and sxml.

(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: install the required packages from the command line using raco:

raco pkg install html-parsing

and:

raco pkg install sxml
Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it feel very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')
Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id)query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}
Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}
Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
Here's a small tweak of the above, resulting in output usable for a report. I only return the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

Which outputs:

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}
Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');

$xpath = new DOMXpath($doc);
$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put an @ in front of $doc->loadHTMLFile to suppress warnings about invalid HTML while parsing.
Using PhantomJS, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Run it with:

$ ../path/to/bin/phantomjs extract-links.js
Language: ColdFusion 9.0.1+
Library: jsoup

<cfscript>
function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    for(var a=1;a LTE arrayLen(links);a++){ // CFML arrays are 1-based; LTE so the last link isn't skipped
        var s={};s.href= links[a].attr('href'); s.text= links[a].text();
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
    }
    return res;
}
//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

It returns an array of structures, each containing HREF and TEXT keys.
Language: JavaScript/Node.js
Library: request and cheerio

var request = require('request');
var cheerio = require('cheerio');

var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');

        anchorTags.each(function(i, element){
            console.log(element["attribs"]["href"]);
        });
    }
});

The request library downloads the HTML document, and cheerio lets you use jQuery CSS selectors to target elements within it.