How to get a web page's content and save it to a string variable

75

How can I get the content of a web page using ASP.NET? I need to write a program that fetches a page's HTML and stores it in a string variable.

c# asp.net screen-scraping

— kamiar3001

116

You can use WebClient:

using System.Net;

WebClient client = new WebClient();
string downloadString = client.DownloadString("http://www.gooogle.com");

— Dhinesh

Unfortunately, DownloadString (as of .NET 3.5) is not smart enough to handle a BOM. I have included an alternative in my answer.

— user2246674

13

No upvote, because it doesn't use using (WebClient client = new WebClient()) {} :)

— DavidKarlaš, 2013
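
The comment above is about disposing the client. A minimal sketch of that pattern follows; the class `PageFetcher` and the method `DownloadHtml` are hypothetical names used for illustration and do not come from the answer. WebClient implements IDisposable, so a using block releases it once the download has finished:

```csharp
using System.Net;

static class PageFetcher
{
    // Hypothetical helper: downloads the resource at `url` and returns it as a string.
    public static string DownloadHtml(string url)
    {
        // The using block disposes the WebClient even if DownloadString throws.
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

It could be called as `string html = PageFetcher.DownloadHtml("http://www.gooogle.com");`.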

3

This is equivalent to the answer Steven Spielberg posted 3 minutes earlier, so no +1.

— BalinKingOfMoria Reinstate CMs, 2015

72

I have had problems with WebClient.DownloadString before. If you run into them too, you can try this instead:

// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create("http://www.google.com");
string html;
// Dispose the response, its stream, and the reader once the body has been read.
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}

— Scott

6

Can you elaborate on the problems you ran into?

— Greg, 2010

17

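@Greg, it was a performance issue. I never really resolved it, but WebClient.DownloadString took 5 to 10 seconds to pull down the HTML, whereas WebRequest/WebResponse was almost instant. I just wanted to offer an alternative in case the OP hits a similar problem or wants more control over the request and response.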

— Scott, 2010

7

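@Scott: +1 for finding that. I just ran some tests. DownloadString took much longer on first use (5299 ms for DownloadString vs. 200 ms for WebRequest). I tested in a loop over 50 BBC, 50 CNN, and 50 other RSS feed URLs, using different URLs to avoid caching. After the initial load, DownloadString was 20 ms faster for BBC and 300 ms faster for CNN. For the other RSS feeds, WebRequest was 3 ms faster. In general, I think I will use WebRequest for single requests and DownloadString when looping over URLs.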

— HockeyJ
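
For anyone who wants to repeat this kind of comparison, a timing helper might look like the sketch below. It is not the code used for the measurements quoted in the comments; `Timing` and `MeasureMs` are hypothetical names, and the helper only times a single call with Stopwatch:

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Hypothetical helper: runs `fetch` once and returns the elapsed
    // wall-clock time in milliseconds.
    public static long MeasureMs(Func<string> fetch)
    {
        Stopwatch watch = Stopwatch.StartNew();
        fetch();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

Passing a lambda such as `() => client.DownloadString(url)` would time one download. Since the comments report the largest gap on first use, it probably makes sense to record the first call in a process separately from later ones.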

4

This worked great for me, thanks! To save others the search: WebRequest is in System.Net and Stream is in System.IO.

— Eric Barr, 2014

1

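Scott, @HockeyJ: I don't know what has changed since you used WebClient, but when I tested it (with .NET 4.5.2) it was fast enough: 950 ms. That is still a bit slower than a single WebRequest, which took 450 ms, but certainly not 5 to 10 seconds.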

— BornToCode

27

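I recommend against using WebClient.DownloadString. This is because (at least in .NET 3.5) DownloadString is not smart enough to use or remove the BOM, should one be present. When UTF-8 data is returned (at least without a charset), this can result in the BOM (ï»¿) incorrectly appearing as part of the string. Ick!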

Instead, this slight variation works correctly with BOMs:

string ReadTextFromUrl(string url) {
    // WebClient is still convenient
    // Assume UTF8, but detect BOM - could also honor response charset I suppose
    using (var client = new WebClient())
    using (var stream = client.OpenRead(url))
    using (var textReader = new StreamReader(stream, Encoding.UTF8, true)) {
        return textReader.ReadToEnd();
    }
}
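
To see the effect without a network call, here is a small in-memory sketch. The names `BomDemo`, `SampleBytes`, `DecodeWithReader`, and `DecodeAsLatin1` are made up for illustration, and decoding as ISO-8859-1 is only an assumed stand-in for "wrong encoding chosen", not a claim about what DownloadString does internally:

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Builds the UTF-8 bytes of "<html>" preceded by the UTF-8 BOM (EF BB BF).
    public static byte[] SampleBytes()
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<html>");
        byte[] all = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);
        return all;
    }

    // Same reader setup as ReadTextFromUrl: the BOM is detected and dropped.
    public static string DecodeWithReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Decoding the same bytes as ISO-8859-1 turns the BOM into "ï»¿".
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

With these bytes, `DecodeWithReader` returns `<html>`, while `DecodeAsLatin1` returns `ï»¿<html>`, which is the garbage prefix described above.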

— username

File a bug report.

— JoelFan

12

// Requires: using System.Net;
WebClient client = new WebClient();
string content = client.DownloadString(url);

Pass in the URL of the page you want to fetch. You can parse the result with HtmlAgilityPack.

— Laurel