How do you programmatically download a webpage in Java?

116

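I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how could I handle various types of compression?

How would I go about doing that using Java?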

java http compression

— 金吉
source

This is basically the same as stackoverflow.com/questions/921262/…

— Robin Green

110

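Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.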

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class DownloadPage {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://stackoverflow.com/");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            // print the page line by line
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}

— Bill the Lizard
source

16

DataInputStream.readLine() is deprecated, but other than that it's a very good example. I used an InputStreamReader wrapped in a BufferedReader to get the readLine() function.

— mjh2007 '02

2

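This doesn't take the character encoding into account, so although it will appear to work for ASCII text, it will eventually produce "strange characters" when there is a mismatch.

— artbristol, 2012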
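
To make the encoding concern concrete, here is a minimal sketch, assuming the charset of the response is known up front. The class `CharsetAwareReader` and its `readAll` helper are made-up names for illustration, and an in-memory byte array stands in for the network stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetAwareReader {

    // Hypothetical helper: decode the whole stream with an explicit charset
    // instead of relying on the platform default.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = br.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a response body encoded as UTF-8 ("cafe" with an accented e).
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: decoded correctly
        System.out.println(wrong.length()); // 5: the two bytes of the accented letter became two characters
    }
}
```

With a real connection, the same helper could be fed `url.openStream()` together with whatever charset the server declares.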

Replace DataInputStream with BufferedReader in the third line, and replace "dis = new DataInputStream(new BufferedInputStream(is));" with "dis = new BufferedReader(new InputStreamReader(is));"

— kolobok

1

@akapelko Thanks. I updated my answer to remove the call to the deprecated method.

— Bill the Lizard

2

What about closing the InputStreamReader?

— Alexander - Reinstate Monica
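
On the closing question, one possible approach is try-with-resources: closing the outermost reader also closes the InputStreamReader and the stream beneath it. The sketch below is only an illustration; `CloseReaderSketch`, `TrackingStream`, and `firstLine` are made-up names, and the tracking stream exists purely to show that `close()` gets called.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CloseReaderSketch {

    // Illustrative stream that records whether close() was called.
    public static class TrackingStream extends ByteArrayInputStream {
        public boolean closed = false;

        public TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Reads the first line; try-with-resources closes the BufferedReader,
    // which closes the InputStreamReader and the stream under it.
    public static String firstLine(InputStream in) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream in = new TrackingStream(
                "<html>\n</html>\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstLine(in)); // <html>
        System.out.println(in.closed);     // true
    }
}
```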

170

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

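It handles GZIP and chunked responses and character encoding fully transparently. It offers further advantages as well, such as HTML traversal and manipulation with CSS selectors, as jQuery does. You only have to grab it as a Document rather than as a String.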

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods, or even regex, on HTML to process it.

See also:

What are the pros and cons of the leading Java HTML parsers?

— BalusC
source

3

Good answer. A little late. ;)

— jjnguy, 2011

59

Better than never.

— BalusC

Great library :) Thanks.

— Jakub P.

Why has nobody told me about .html() before? I looked very hard for an easy way to store the HTML fetched by Jsoup, and this helps a lot.

— 阿瓦曼德

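For newcomers: if you use this library on Android, you need to call it from a different thread, because by default it runs on the application's main thread, which will cause the application to throw a NetworkOnMainThreadException.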

— Mohammed Elrashied
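
In plain Java terms, the idea is to hand the blocking call to a worker thread. The following is a rough sketch using `ExecutorService`, not Android-specific code: `BackgroundFetchSketch` and `runInBackground` are made-up names, and the placeholder task stands in for the real download call. On Android the result would still have to be delivered back to the UI thread by some other means.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundFetchSketch {

    // Runs a blocking task on a worker thread and returns a Future for its result.
    public static Future<String> runInBackground(ExecutorService pool, Callable<String> task) {
        return pool.submit(task);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Placeholder task; in real code this would be the blocking
            // download call (for example the Jsoup line from the answer).
            Callable<String> fetch = new Callable<String>() {
                @Override
                public String call() {
                    return "<html>fetched on " + Thread.currentThread().getName() + "</html>";
                }
            };

            Future<String> result = runInBackground(pool, fetch);

            // get() blocks the calling thread until the task finishes,
            // so on Android it should not be called from the main thread.
            System.out.println(result.get());
        } finally {
            pool.shutdown();
        }
    }
}
```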

24

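Bill's answer is very good, but you may want to do some things with the request, such as compression or user agents. The following code shows how you can handle various types of compression for your requests.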

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
// per-connection setting; the static setFollowRedirects only affects connections opened afterwards
conn.setInstanceFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user agent, add the following code:

conn.setRequestProperty ( "User-agent", "my agent name");
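
The stream-wrapping step can be tried without a network connection. The sketch below is an illustration under stated assumptions: `ContentDecodingSketch`, `decode`, `gzip`, and `readUtf8` are made-up names, the body is compressed in memory, and the deflate branch assumes zlib-wrapped data (the code above uses `new Inflater(true)`, which expects raw deflate instead; servers differ on this).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class ContentDecodingSketch {

    // Hypothetical helper: pick a decompressing wrapper from the Content-Encoding value.
    public static InputStream decode(InputStream raw, String encoding) throws IOException {
        if ("gzip".equalsIgnoreCase(encoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(encoding)) {
            // Assumption: zlib-wrapped deflate data.
            return new InflaterInputStream(raw);
        }
        return raw; // null or unrecognized encoding: pass the bytes through
    }

    // Compresses data in memory so the example needs no server.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Reads the stream to the end and decodes it as UTF-8.
    public static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        InputStream body = decode(new ByteArrayInputStream(compressed), "gzip");
        System.out.println(readUtf8(body)); // <html>hello</html>
    }
}
```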

— 金吉
source

For those who want to convert the InputStream to a String, see this answer.

— SSight3

12

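Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.

~~Personally I'd go with the Apache HTTPClient library.~~
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components

— Jon Skeet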
source

There's no Java equivalent of System.Net.WebRequest?

— FlySwat

1

Sort of, that would be URL. :-) For example: new URL("google.com").openStream() // => InputStream

— Daniel Spiewak

1

@Jonathan: What Daniel said, for the most part, although WebRequest gives you more control than URL. HTTPClient is closer in functionality, IMO.

— Jon Skeet

9

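None of the approaches mentioned above download the web page text as it looks in a browser. These days a lot of data is loaded into browsers through scripts in HTML pages. None of the techniques above support scripts; they just download the HTML text. HtmlUnit supports JavaScript. So if you want to download the web page text as it appears in the browser, you should use HtmlUnit.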

— 用户名
source

1

You'd most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips the line terminator, so add it back
        }
        in.close(); 
        bw.close();
    }
}
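
The example above hard-codes Windows-1252. As a rough alternative, the charset can be taken from the Content-Type header value (what `URLConnection.getContentType()` returns) with a fallback when none is declared. `ContentTypeCharsetSketch` and `charsetOf` are made-up names, and the parsing is deliberately simplistic: it ignores any charset declared inside the HTML itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharsetSketch {

    // Hypothetical helper: extract the charset parameter from a Content-Type
    // value such as "text/html; charset=UTF-8", or return the fallback.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1)); // UTF-8
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));                // ISO-8859-1
    }
}
```

In the HTTPS example, the reader could then be built as `new InputStreamReader(ins, charsetOf(con.getContentType(), StandardCharsets.UTF_8))`, assuming UTF-8 is an acceptable fallback for the pages being fetched.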

— 质量检查专员
source

0

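On a Unix/Linux box you could just run 'wget', but that's not really an option if you're writing a cross-platform client. Of course, this assumes that you don't really want to do much with the downloaded data between the point of downloading it and it hitting the disk.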

— 蒂莫·格什
source

I would also start with this approach and refactor it later if it is insufficient.

— Dustin Getz

0

Jetty has an HTTP client which can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

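The example prints the contents of a simple web page.

In the Reading a web page in Java tutorial I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

— Jan Bodnar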
source

0

Get help from this class; it fetches the code and filters some information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            url.setText( l );
            url.setText( " " );

            // t1[1] is the text after the first description tag;
            // t1[0] would be everything before it
            String[] t1 = l.split(tag1);
            String[] t2 = t1[1].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}
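
The filtering step can also be written with `indexOf`, which avoids the regex semantics of `String.split`. This is only a sketch of the same idea in plain Java: `ExtractBetweenSketch` and `extractBetween` are made-up names and the sample HTML is invented. For anything beyond a quick extraction, an HTML parser is the safer choice.

```java
public class ExtractBetweenSketch {

    // Returns the text between the first startTag and the next endTag,
    // or null if either marker is missing.
    public static String extractBetween(String html, String startTag, String endTag) {
        int start = html.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        start += startTag.length();
        int end = html.indexOf(endTag, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Invented sample page, not the real content of the URL used above.
        String html = "<html><div class=\"description\">Faculty list</div><div>other</div></html>";
        System.out.println(extractBetween(html, "<div class=\"description\">", "</div>")); // Faculty list
    }
}
```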

— Sohaib Aslam
source

0

For this, use the powerful NIO.2 Files.copy(InputStream in, Path target):

URL url = new URL( "http://download.me/" );
Files.copy( url.openStream(), Paths.get("downloaded.html" ) );
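
Two details are worth noting with this call: the two-argument form throws FileAlreadyExistsException when the target already exists, and the stream from `url.openStream()` is not closed by `Files.copy`. A possible variant, sketched with made-up names (`CopyStreamSketch`, `save`) and an in-memory stream standing in for the network stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyStreamSketch {

    // Copies the stream to target, overwriting an existing file,
    // and closes the stream afterwards. Returns the number of bytes written.
    public static long save(InputStream source, Path target) throws IOException {
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for url.openStream()
        byte[] page = "<html></html>".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("downloaded", ".html");

        long written = save(new ByteArrayInputStream(page), target);
        System.out.println(written); // 13

        Files.delete(target);
    }
}
```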

— 扬·提巴尔
source

-1

I used the actual answer to this post (url) and wrote the output into a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.fetagracollege.org");
        String fileName = "D:\\a_01\\output.txt";

        // try-with-resources closes both ends, which also flushes the writer;
        // without that the buffered output may never reach the file
        try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
             PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
            String inputLine;

            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
        } catch (IOException e) {
            // report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}

— A_01
source