wget: how to download recursively and only specific MIME types/extensions (i.e. text only)

22

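How do I download a complete website while ignoring all binary files?

wget can do this with the -r flag, but it downloads everything. Some websites are simply too much for a low-resource machine, and the binary files are of no use for the specific reason I am downloading the site.

Here is the command line I use: wget -P 20 -r -l 0 http://www.omardo.com/blog (my own blog)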

— Omar Al-Ithawi

1

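wget can only filter by file suffix

— daisy, 2012

@warl0ck I didn't know that, thanks! The -A and -R options turned out to be very useful for what I am doing.

— Omar Al-Ithawi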

21

You can specify a list of allowed or disallowed filename patterns:

Allowed:

-A LIST
--accept LIST

Disallowed:

-R LIST
--reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following reserved characters to specify patterns:

*
?
[
]

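Examples:

Download only PNG files: -A png
Don't download CSS files: -R css
Don't download PNG files whose names start with "avatar": -R 'avatar*.png' (quoted so the shell does not expand the pattern)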
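
Applied to the question, the accept list can be combined with the recursive flags. The following is only a minimal sketch: the suffix list and the helper name `fetch_text_only` are assumptions made for illustration, and the flags -r, -l 0 and -P are the ones from the question's command line.

```shell
#!/bin/sh
# Sketch: recursive download that keeps only files with text-like suffixes.
# The suffix list is an assumption; adjust it to the site being mirrored.
TEXT_SUFFIXES='html,htm,txt,css,js,xml'

# fetch_text_only URL [DIR]   (hypothetical helper name)
fetch_text_only() {
    url=$1
    dir=${2:-.}
    # -r    recurse into links
    # -l 0  no depth limit (as in the question)
    # -P    directory prefix under which files are saved
    # -A    accept list: keep only files whose names match these suffixes
    wget -r -l 0 -P "$dir" -A "$TEXT_SUFFIXES" "$url"
}

# Usage: fetch_text_only http://www.omardo.com/blog mirror
```

Note that in recursive mode wget still fetches HTML pages in order to follow their links, and deletes the ones that do not match the accept list afterwards, so pages saved without a matching suffix will not be kept.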

If a file has no extension, or its name has no pattern you could make use of, I guess you would need MIME type parsing (see Lars Kotthoff's answer).

— Unor

2

You could try patching wget with this patch (also available here) to filter by MIME type. The patch is quite old by now, though, so it might not work anymore.

— Lars Kotthoff

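Giving this a shot... ftp.gnu.org/gnu/wget I rolled the dice on simply applying this patch to the newest version of wget, but no luck (of course). I would try to update the patch, but frankly I don't have the chops in C++ yet for that not to become a time sink. I did manage to grab the version of wget the patch was written for and get that running. I had trouble compiling it with SSL support, though, because I couldn't figure out which version of openssl I needed to grab.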

— MageProspero

This looks great. Any idea why this patch still hasn't been accepted (four years later)?

— David Portabella

2

The new Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

### `--filter-mime-type=list`

Specify a comma-separated list of MIME types that will be downloaded.  Elements of list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
something with exceptions. For example, download everything except images:

  wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

  wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
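
For the text-only case from the question, the same option can presumably be given a single wildcard type. This is a sketch based only on the option description quoted above; the helper name `fetch_text_mime` is hypothetical, and the exact behaviour should be checked against the wget2 version in use.

```shell
#!/bin/sh
# Sketch: recursive download with wget2, keeping only text/* responses.
# fetch_text_mime URL   (hypothetical helper name)
fetch_text_mime() {
    # The pattern is quoted so the shell does not try to expand the '*'.
    wget2 -r --filter-mime-type='text/*' "$1"
}

# Usage: fetch_text_mime 'https://<site>/<document>'
```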

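Wget2 has not been released as of today, but it will be soon. Debian unstable already ships an alpha version.

See https://gitlab.com/gnuwget/wget2 for more information. You can send questions/comments directly to bug-wget@gnu.org.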

— Tim Ruehsen rockdaboot

1

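I tried a completely different approach, using Scrapy, but it has the same problem! Here is how I solved it: SO: Python Scrapy - mimetype based filter to avoid non-text file downloads?

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do is:

Take HTTP requests from Scrapy and send them to the server being crawled. Then it hands the response back to Scrapy, i.e. it intercepts all HTTP traffic.

For binary files (based on a heuristic you implement), it sends a 403 Forbidden error to Scrapy and immediately closes the request/response. This helps save time and traffic, and Scrapy won't crash.

Sample proxy code that actually works!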

var http = require('http');

// Minimal forward proxy: relays requests and refuses non-text responses.
http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Non-text response: drop the upstream connection and answer
            // 403 instead of relaying the body.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return; // without this, the headers would be written a second time
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxyReq: ' + e.message);
    });

    // Forward the request body (if any); this also ends the upstream request.
    clientReq.pipe(proxyReq);

}).listen(8080);
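
With the proxy running, Scrapy can be pointed at it through the environment variable mentioned above. A minimal sketch, assuming the proxy listens on localhost port 8080 (the port in the sample code); the helper name `crawl_via_proxy` and the spider name in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: run a Scrapy spider with its HTTP traffic routed through the proxy.
# Assumption: the proxy is reachable at http://localhost:8080.
# crawl_via_proxy SPIDER   (hypothetical helper name)
crawl_via_proxy() {
    spider=$1
    # http_proxy is set only for this one command.
    http_proxy='http://localhost:8080' scrapy crawl "$spider"
}

# Usage: crawl_via_proxy myspider
```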

— Omar Al-Ithawi