Download a working local copy of a webpage [closed]



I would like to download a local copy of a web page and get all of its CSS, images, JavaScript, etc.

In previous discussions (e.g. here and here, both of which are more than two years old), two suggestions are generally put forward: wget -p and httrack. However, these suggestions both fail. I would very much appreciate help with using either of these tools to accomplish the task; alternatives are also lovely.


Option 1: wget -p

wget -p successfully downloads all of the web page's prerequisites (CSS, images, JS). However, when I load the local copy in a web browser, the page cannot load those prerequisites, because their paths haven't been modified from the version on the web.

For example:

  • In the page's HTML, <link rel="stylesheet" href="https://stackoverflow.com/stylesheets/foo.css" /> will need to be corrected to point to the new relative path of foo.css
  • In the CSS file, background-image: url(/images/bar.png) will similarly need to be adjusted.

Is there a way to get wget -p to make the paths correct?


Option 2: httrack

httrack seems like a great tool for mirroring entire websites, but it is unclear to me how to use it to create a local copy of a single page. There is a great deal of discussion on this topic in the httrack forums (e.g. here), but no one seems to have a bullet-proof solution.
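For reference, a hedged sketch of a shallow single-page HTTrack mirror, based on its documented options (-O output dir, -r crawl depth, --near for non-HTML files such as images and CSS); the URL and output directory are placeholders, and I have not verified this produces a clean copy for every site layout:

```shell
# Placeholder URL; substitute the page you actually want to copy.
url="http://www.example.com/page.html"

# Build the command as a string so it can be inspected before running:
#   -O ./page-copy  write the mirror into ./page-copy
#   -r2             keep the recursion depth shallow (the page and what
#                   it links to directly, not the whole site)
#   --near          also fetch non-HTML files referenced by the page
cmd="httrack $url -O ./page-copy -r2 --near"

echo "$cmd"   # review, then run with: eval "$cmd"
```

Whether -r2 is shallow enough (or too shallow) depends on the page, so adjusting the depth per site is expected.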


Option 3: another tool?

Some people have suggested paid tools, but I just can't believe there is no free solution out there.


If the answer doesn't work, try: wget -E -H -k -K -p http://example.com (only this worked for me). Credit: superuser.com/a/136335/94039
its_me, 2013

There is also software that can do this, namely Teleport Pro.
pbies, 2016

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
davidcondrey

Answers:



wget is capable of doing what you're asking. Just try the following:

wget -p -k http://www.example.com/

-p will get you all the required elements to view the site correctly (CSS, images, etc.). -k will change all links (including those to CSS and images) so that the page can be viewed offline as it appeared online.

From the Wget documentation:

‘-k’
‘--convert-links’
After the download is complete, convert the links in the document to make them
suitable for local viewing. This affects not only the visible hyperlinks, but
any part of the document that links to external content, such as embedded images,
links to style sheets, hyperlinks to non-html content, etc.

Each link will be changed in one of the two ways:

    The links to files that have been downloaded by Wget will be changed to refer
    to the file they point to as a relative link.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also
    downloaded, then the link in doc.html will be modified to point to
    ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary
    combinations of directories.

    The links to files that have not been downloaded by Wget will be changed to
    include host name and absolute path of the location they point to.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to
    ../bar/img.gif), then the link in doc.html will be modified to point to
    http://hostname/bar/img.gif. 

Because of this, local browsing works reliably: if a linked file was downloaded,
the link will refer to its local name; if it was not downloaded, the link will
refer to its full Internet address rather than presenting a broken link. The fact
that the former links are converted to relative links ensures that you can move
the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been
downloaded. Because of that, the work done by ‘-k’ will be performed at the end
of all the downloads. 

I tried this, but somehow internal links like index.html#link-to-element-on-same-page stopped working.
rhand


Some servers will respond with a 403 code if you use wget without a user agent; you can add -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4'
nikoskip, 2014

If you find you're still missing images etc., then try adding this: -e robots=off ..... wget actually reads and respects robots.txt, which made it really hard for me to figure out why nothing worked!
John Hunt, 2014

24
To get resources from foreign hosts, use -H, --span-hosts
davidhq, 2015
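Pulling together the answer's -p -k with the flags suggested in the comments above, a consolidated invocation might look like the sketch below. The URL is a placeholder, and which flags a given site actually needs will vary:

```shell
# Placeholder URL; substitute the page you want a local copy of.
url="http://www.example.com/"

# Build the command as a string so it can be inspected before running:
#   -p            download page requisites (CSS, images, JS)
#   -k            convert links for offline viewing
#   -E            save files with an .html extension where appropriate
#   -K            back up the original file before -k rewrites it
#   -H            span hosts, for assets served from CDNs
#   -e robots=off do not let robots.txt suppress requisites
#   -U ...        send a browser-like user agent to avoid 403 responses
cmd="wget -p -k -E -K -H -e robots=off -U Mozilla/5.0 $url"

echo "$cmd"   # review, then run with: eval "$cmd"
```

Note that -H without filters can pull in assets from many third-party hosts, so it's worth reviewing what gets downloaded.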
Licensed under cc by-sa 3.0 with attribution required.