How to use HTML Agility Pack

629

How do I use the HTML Agility Pack?

My XHTML document is not completely valid, which is why I want to use it. How do I use it in my project? My project is in C#.

c# html html-agility-pack

— Kara

79

This question was very helpful to me.

— BigJoe714, 2010

26

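Side note: in a Visual Studio that supports NuGet, you can now right-click "References", choose "Manage NuGet Packages...", search for "HtmlAgilityPack", and click "Install". Then add a using/Import statement and start working with it in code.

— patridge, 2011

Regarding @patridge's comment above: when I first fetched the project from svn via ankhsvn, I found that I had to remove and then re-add the reference to HtmlAgilityPack.

— Andrew Coonce, 2013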

14

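Anyone looking into HTMLAgilityPack should consider CsQuery. In my experience it is a much more recently updated library with a more modern interface. For example, the entire code from the first answer can be summed up in CsQuery as var body = CQ.CreateFromFile(filePath)["body"].

— Benjamin Gruenbaum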

2

@BenjaminGruenbaum: +1 for your CsQuery suggestion. It took only a few minutes to set up and is very easy to use.

— Neolisk, 2014

358

First, install the HtmlAgilityPack NuGet package into your project.

Then, as an example:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;

// filePath is a path to a file containing the html
htmlDoc.Load(filePath);

// To load from a string instead, use htmlDoc.LoadHtml(xmlString); (formerly htmlDoc.LoadXML(xmlString))

// ParseErrors is an IEnumerable<HtmlParseError> holding any errors from the Load call.
// Count() is the LINQ extension method, so this needs "using System.Linq;"
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
    // Handle any parse errors as required

}
else
{

    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

        if (bodyNode != null)
        {
            // Do something with bodyNode
        }
    }
}

(Note: this code is only an example and not necessarily the best or only approach. Do not use it blindly in your own application.)

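The HtmlDocument.Load() method also accepts a stream, which is very useful for integrating with other stream-oriented classes in the .NET framework. HtmlEntity.DeEntitize() is another useful method for processing HTML entities correctly. (Thanks, Matthew)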
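
As a rough illustration of those two calls, here is a minimal sketch. The in-memory stream and the sample markup are made up for the example; in practice the stream could come from WebResponse.GetResponseStream() or any other stream source.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class StreamLoadSketch
{
    static void Main()
    {
        // Hypothetical input: any readable Stream would work here
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><body><p>Fish &amp; chips</p></body></html>");

        var htmlDoc = new HtmlDocument();
        using (var stream = new MemoryStream(bytes))
        {
            // Load from a stream, stating the encoding explicitly
            htmlDoc.Load(stream, Encoding.UTF8);
        }

        HtmlNode paragraph = htmlDoc.DocumentNode.SelectSingleNode("//p");
        if (paragraph != null)
        {
            // InnerHtml keeps the entity as it was written in the markup
            Console.WriteLine(paragraph.InnerHtml);

            // DeEntitize converts entities such as &amp; back to plain characters
            Console.WriteLine(HtmlEntity.DeEntitize(paragraph.InnerHtml));
        }
    }
}
```

Passing the encoding explicitly is a choice made for this sketch; Load(Stream) without an encoding argument is also available.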

HtmlDocument and HtmlNode are the classes you will use most. Much like an XML parser, HtmlNode provides SelectSingleNode and SelectNodes methods that accept XPath expressions.
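
To make the difference between the two methods concrete, here is a small sketch. The markup and the id value are invented for illustration only.

```csharp
using System;
using HtmlAgilityPack;

class XPathSelectSketch
{
    static void Main()
    {
        var htmlDoc = new HtmlDocument();
        // Hypothetical markup, used only for this example
        htmlDoc.LoadHtml(
            "<html><body><div id='menu'><a href='/a'>A</a><a href='/b'>B</a></div></body></html>");

        // SelectSingleNode returns the first match, or null if there is none
        HtmlNode menu = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='menu']");
        if (menu == null)
        {
            Console.WriteLine("No menu found");
            return;
        }

        // SelectNodes returns all matches; it can be null when nothing matches,
        // so check before iterating. The leading "." keeps the search inside menu.
        HtmlNodeCollection links = menu.SelectNodes(".//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
            }
        }
    }
}
```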

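Pay attention to the HtmlDocument.Option?????? boolean properties. They control how the Load and LoadHtml (formerly LoadXML) methods process your HTML/XHTML.

There is also a compiled help file called HtmlAgilityPack.chm that contains a complete reference for every object. It is normally found in the base folder of the solution.

— Ash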
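
A short sketch of how a few of these options might be set. Which options you actually need depends on your input; the three below and the broken markup are just picked for illustration.

```csharp
using System;
using HtmlAgilityPack;

class OptionsSketch
{
    static void Main()
    {
        var htmlDoc = new HtmlDocument();

        // Set options before loading so that they apply to the parse
        htmlDoc.OptionFixNestedTags = true;   // partially fix LI/TR/TH/TD nesting errors
        htmlDoc.OptionAutoCloseOnEnd = true;  // close unclosed nodes at the end rather than in place
        htmlDoc.OptionCheckSyntax = true;     // check for unclosed nodes when parsing finishes

        // Hypothetical input with unclosed <li> and <ul> tags
        htmlDoc.LoadHtml("<ul><li>one<li>two");

        // Report whatever the parser complained about, if anything
        foreach (HtmlParseError error in htmlDoc.ParseErrors)
        {
            Console.WriteLine("Line " + error.Line + ": " + error.Reason);
        }

        // Print the tree as the parser understood it
        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

Printing OuterHtml after loading is a quick way to see what effect a given option had on your own documents.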

11

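Also note that Load accepts a Stream parameter, which is convenient in many situations. I use it for an HTTP stream (WebResponse.GetResponseStream). Another good method to be aware of is HtmlEntity.DeEntitize (part of HTML Agility Pack). In some cases you need to process entities manually.

— Matthew Flaschen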

1

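Note: in the latest release of Html Agility Pack (1.4.0 Beta 2, released October 3, 2009), the help file has been moved to a separate download because of its dependencies on Sandcastle, DocProject, and the Visual Studio 2008 SDK.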

— rtpHarry

SelectSingleNode() seems to have been removed

— Chris S

3

No, SelectSingleNode and SelectNodes are definitely still there. I find it a bit interesting that it should be htmlDoc.ParseErrors.Count() and not .Count

— Mike Blandford

1

@MikeBlandford Partially, yes. It seems to have been removed (or was never there in the first place) in the PCL version of HtmlAgilityPack. nuget.org/packages/HtmlAgilityPack-PCL

— Joon Hong

166

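I don't know whether this will help you, but I have written a couple of articles that introduce the basics.

The next article is 95% complete; I only need to write up explanations for the last few parts of the code I wrote. If you are interested, I will try to remember to post here when I publish it.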

— rtpHarry

16

Finally finished that article, two years later :) An easy way to detect RSS and Atom feeds in websites with HtmlAgilityPack

— rtpHarry, 2012

3

A good article on HTMLAgilityPack was recently published on Code Project. You can find it here

— Victor Sigler, 2014

64

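HtmlAgilityPack uses XPath syntax, and although many people argue that it is poorly documented, I had no trouble using it with the help of this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp

To parse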

<h2>
  <a href="">Jack</a>
</h2>
<ul>
  <li class="tel">
    <a href="">81 75 53 60</a>
  </li>
</ul>
<h2>
  <a href="">Roy</a>
</h2>
<ul>
  <li class="tel">
    <a href="">44 52 16 87</a>
  </li>
</ul>

I did this:

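string url = "http://website.com";
// Result lists (List<T> needs "using System.Collections.Generic;")
var names = new List<string>();
var phones = new List<string>();
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
// SelectNodes can return null when nothing matches, so check before looping
var nameNodes = doc.DocumentNode.SelectNodes("//h2//a");
if (nameNodes != null)
{
  foreach (HtmlNode node in nameNodes)
  {
    names.Add(node.ChildNodes[0].InnerHtml);
  }
}
var phoneNodes = doc.DocumentNode.SelectNodes("//li[@class='tel']//a");
if (phoneNodes != null)
{
  foreach (HtmlNode node in phoneNodes)
  {
    phones.Add(node.ChildNodes[0].InnerHtml);
  }
}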
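
If each name needs to stay paired with its own phone number, one possible variation is to walk the h2 elements and use a relative XPath for the list that follows each one. This is only a sketch: it loads the sample markup above from a string instead of a URL, and it assumes every h2 is directly followed by its ul.

```csharp
using System;
using HtmlAgilityPack;

class NamePhoneSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // The sample markup from above, inlined as a string for the example
        doc.LoadHtml(
            "<h2><a href=''>Jack</a></h2>" +
            "<ul><li class='tel'><a href=''>81 75 53 60</a></li></ul>" +
            "<h2><a href=''>Roy</a></h2>" +
            "<ul><li class='tel'><a href=''>44 52 16 87</a></li></ul>");

        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h2");
        if (headings == null)
        {
            return; // nothing matched
        }

        foreach (HtmlNode heading in headings)
        {
            HtmlNode nameLink = heading.SelectSingleNode(".//a");

            // following-sibling::ul[1] selects the first <ul> after this <h2>
            HtmlNode phoneLink = heading.SelectSingleNode(
                "following-sibling::ul[1]/li[@class='tel']//a");

            if (nameLink != null && phoneLink != null)
            {
                Console.WriteLine(nameLink.InnerText.Trim() + ": " + phoneLink.InnerText.Trim());
            }
        }
    }
}
```

Compared with two separate lists, this keeps a name and its number together even if one entry on the page happens to lack a phone number.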

— Kent Munthe Caspersen

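Absolutely right. It depends entirely on the XPath standard. Learn that standard first, and everything else becomes easy.

— FindOut_Quran, 2015

The link you provided is no longer available. This may be the new one: w3schools.com/xsl/xpath_syntax.asp

— Piotrek, 2016

I also cannot see any SelectNodes() function on the DocumentNode object. Has it been renamed?

— Piotrek, 2016

Which version are you using, and where did you download it from? According to htmlagilitypack.codeplex.com/SourceControl/latest#Release/1_4_0/…, there should be a SelectNodes method on the HtmlNode class.

— Kent Munthe Caspersen

The link is unavailable. New link: www.w3schools.com/xml/xpath_syntax.asp

— Tyrmos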

6

The main code related to HTMLAgilityPack is as follows

using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace GetMetaData
{
    /// <summary>
    /// Summary description for MetaDataWebService
    /// </summary>
    [WebService(Namespace = "http://tempuri.org/")]
    [WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
    [System.ComponentModel.ToolboxItem(false)]
    // To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
    [System.Web.Script.Services.ScriptService]
    public class MetaDataWebService: System.Web.Services.WebService
    {
        [WebMethod]
        [ScriptMethod(UseHttpGet = false)]
        public MetaData GetMetaData(string url)
        {
            MetaData objMetaData = new MetaData();

            //Get Title
            WebClient client = new WebClient();
            string sourceUrl = client.DownloadString(url);

            // The @ must sit directly in front of the opening quote of a verbatim string
            objMetaData.PageTitle = Regex.Match(sourceUrl,
                @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

            //Method to get Meta Tags
            objMetaData.MetaDescription = GetMetaDescription(url);
            return objMetaData;
        }

        private string GetMetaDescription(string url)
        {
            string description = string.Empty;

            //Get Meta Tags
            var webGet = new HtmlWeb();
            var document = webGet.Load(url);
            var metaTags = document.DocumentNode.SelectNodes("//meta");

            if (metaTags != null)
            {
                foreach(var tag in metaTags)
                {
                    if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
                    {
                        description = tag.Attributes["content"].Value;
                    }
                }
            } 
            else
            {
                description = string.Empty;
            }
            return description;
        }
    }
}

— Captain

4

That website is no longer available

— Dimitar Tsonev

5

    public string HtmlAgi(string url, string key)
    {

        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);
        HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name='{0}']", key));

        if (ourNode != null)
        {


                return ourNode.GetAttributeValue("content", "");

        }
        else
        {
            return "not found";
        }

    }

— Ibrahim Ozboluk

0

Getting Started - HTML Agility Pack

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

— Meysam

0

Try this

string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());

private string ParseHmlBody(string html)
        {
            string body = string.Empty;
            try
            {
                var htmlDoc = new HtmlDocument();
                htmlDoc.LoadHtml(html);
                var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
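                // SelectSingleNode returns null when the input has no <body> element
                body = htmlBody != null ? htmlBody.OuterHtml : string.Empty;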
            }
            catch (Exception ex)
            {

                dalPendingOrders.LogMessage("Error in ParseHmlBody" + ex.Message);
            }
            return body;
        }

— PK-1825