Recursive download with wget

32

I have a problem with the following wget command:

wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

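It should recursively download all of the linked documents from the original site, but it downloads only two files (index.html and robots.txt).

How can I get a recursive download of this site?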

wget

— Ralf

40

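By default, wget honors the robots.txt standard when crawling pages, just as search engines do, and archive.org's robots.txt disallows the entire /web/ subdirectory. To override this, use -e robots=off: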

wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

— Ulrich Schwarz
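
To see what a site's robots.txt disallows before overriding it, one option is to print the file. This is only a sketch: show_robots is a hypothetical helper name, and it assumes the file lives at the site root.

```shell
#!/bin/sh
# show_robots is a hypothetical helper name, not a wget feature.
show_robots() {
    # $1 is the site root, for example http://web.archive.org
    # -q    suppress wget's progress and status output
    # -O -  write the downloaded file to standard output
    # ${1%/} drops one trailing slash so the path is not doubled
    wget -q -O - "${1%/}/robots.txt"
}

# Usage (needs network access):
# show_robots 'http://web.archive.org'
```

Any Disallow line that covers the path being fetched explains why a recursive run stops after index.html and robots.txt.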

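Thanks. Is it possible to store each link only once? Maybe I should lower the 10 to a smaller number, but it's hard to guess. Right now there are files introduction.html, introduction.html.1 and introduction.html.2, and I ended up stopping the process.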

— xralf, 2011

Also, the links point to the web. Is there an option, something like --mirror, that makes the links point to the local file system?

— xralf, 2011

1

@xralf: Well, you're using -nd, so the different index.html files all land in the same directory, and without -k the links won't be rewritten.

— Ulrich Schwarz
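
A minimal sketch of what the comment above suggests, assuming a hypothetical wrapper named fetch_docs: drop -nd so that wget recreates the remote directory layout, and add -k (--convert-links) so that links in the saved pages point at the local copies.

```shell
#!/bin/sh
# fetch_docs is a hypothetical helper name, not part of wget.
fetch_docs() {
    # -r -l 10       recurse, at most 10 levels deep (as in the question)
    # -k             after the download, rewrite links to point at local files
    # -e robots=off  ignore robots.txt
    # -nd is left out on purpose: with directories kept, same-named files
    # such as index.html no longer collide in a single directory.
    wget -r -l 10 -k -e robots=off "$1"
}

# Usage (needs network access):
# fetch_docs 'http://web.archive.org/web/20110726051510/http://feedparser.org/docs/'
```

Whether this removes every duplicate depends on how the archived pages link to each other, so the depth may still need tuning.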

12

$ wget --random-wait -r -p -e robots=off -U Mozilla \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

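This recursively downloads the contents of the URL.

--random-wait - vary the delay between requests from 0.5 to 1.5 times the --wait value; without --wait it has no effect.
-r - turn on recursive retrieving.
-p - download all page requisites (images, stylesheets and so on) needed to display each page.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla", though a better choice is a real User-Agent string such as "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)".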

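Some other useful options are:

--limit-rate=20k - limit the download speed to 20 KB/s (kilobytes per second).
-o logfile.txt - write wget's log messages to logfile.txt instead of the terminal.
-l 0 - remove the recursion depth limit (the default maximum depth is 5).
--wait=1h - be sneaky, download one file every hour.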

— Nikhil Mulley
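
The options listed above can be combined into one command. The following is a sketch under assumptions: polite_fetch is a hypothetical helper name, and the 2-second base wait is an illustrative value that does not come from the answer.

```shell
#!/bin/sh
# polite_fetch is a hypothetical helper name, not part of wget.
polite_fetch() {
    # --wait=2 --random-wait  pause roughly 1 to 3 seconds between requests
    # --limit-rate=20k        cap the transfer rate at 20 KB/s
    # -o logfile.txt          send wget's messages to logfile.txt
    # -r -l 0                 recurse with no depth limit (can fetch a lot)
    # -p                      also fetch images, CSS and other page requisites
    # -e robots=off           ignore robots.txt
    # -U Mozilla              send "Mozilla" as the User-Agent header
    wget --wait=2 --random-wait --limit-rate=20k -o logfile.txt \
        -r -l 0 -p -e robots=off -U Mozilla "$1"
}

# Usage (needs network access):
# polite_fetch 'http://web.archive.org/web/20110726051510/http://feedparser.org/docs/'
```

With -l 0 the crawl is bounded only by which links wget is willing to follow, so a finite depth such as -l 10 may be the safer starting point.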

"-l 0 - remove recursion depth (which is 5 by default)": +1

— Dani