Using wget to recursively fetch a directory with arbitrary files in it

573

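I have a web directory where I store some config files. I'd like to use wget to pull those files down and keep their current structure. For instance, the remote directory looks like this:

http://mysite.com/configs/.vim/

.vim holds multiple files and directories. I want to replicate that on the client using wget. I can't seem to find the right combination of wget flags to get this done. Any ideas?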

shell wget

— jerodsanto

986

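You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:

wget --recursive --no-parent http://example.com/configs/.vim/

To avoid downloading the auto-generated index.html files, use the -R/--reject option: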

wget -r -np -R "index.html*" http://example.com/configs/.vim/

— Jeremy Ruten

52

Add -nH (cuts out the hostname) and --cut-dirs=X (cuts out X directories). It's a bit annoying to have to count the directories for X by hand.

— lkraav

3

Why doesn't any of this work for w3.org/History/1991-WWW-NeXT/Implementation? It only downloads robots.txt.

— matteo

31

@matteo Because the robots.txt probably disallows crawling the site. You should add -e robots=off to force crawling.

— Amazing 2014

Add -X /absolute/path/to/folder to exclude a particular directory.

— vishnu narayanan

3

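If you don't want to download the entire content, you can use -l1 to download just the directory (example.com in your case), -l2 to download the directory and all level-1 subfolders ('example.com/something' but not 'example.com/something/foo'), and so on. If you pass no -l option, wget uses -l 5 automatically. If you pass -l 0, the depth is unlimited and wget will follow every link it finds (by default only on the same host), so you can end up downloading far more than you intended. stackoverflow.com/a/19695143/6785908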

— so-random-dude

123

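To download a directory recursively while rejecting index.html* files, and to save it without the hostname, the parent directories, and the rest of the directory structure: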

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data

— Sriram

I can't get this to work: wget -r -nH --cut-dirs=3 --no-parent --reject="index.html*" w3.org/History/1991-WWW-NeXT/Implementation. --cut-dirs=2 doesn't work either. It only downloads robots.txt, which actually lives in the root folder. What am I still missing?

— matteo

34

@matteo Try adding: -e robots=off

— Paul J

To recursively fetch all the directories within a directory, use wget -r -nH --reject="index.html*" mysite.io:1234/dir1/dir2

— Prasanth Ganesan

115

For anyone else having similar issues: wget honors robots.txt, which might not allow you to grab the site. No worries, you can turn that off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html

— Sean Villani

When you ignore robots.txt, you should at least throttle your requests. The behaviour suggested in this answer is highly impolite.

— Nobody

@Nobody So what's the polite answer to this?

— Phani Rithvij

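@PhaniRithvij Rate-limit your requests; wget has parameters for that. Note that some people might still take issue with it, and since the robots file explicitly tells you that what you are doing is not allowed, you might even get into legal trouble.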

— Nobody
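
The rate limiting mentioned in the comments above could look something like the following sketch. The function name polite_fetch and the particular values (a 2-second wait, a 200 KB/s cap) are illustrative assumptions, not recommendations from the thread; --wait, --random-wait and --limit-rate are standard wget options.

```shell
# Hypothetical wrapper: a recursive fetch that throttles itself.
polite_fetch() {
    # --wait=2        pause about 2 seconds between retrievals
    # --random-wait   vary that pause so requests are not perfectly regular
    # --limit-rate    cap the download bandwidth (here 200 KB/s)
    wget --recursive --no-parent \
         --wait=2 --random-wait --limit-rate=200k \
         "$1"
}

# Usage (performs real network requests):
# polite_fetch http://example.com/configs/.vim/
```

Throttling only reduces load on the server; it does not change whether the site's robots.txt permits the crawl.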

37

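You should use the -m (mirror) flag, as it takes care not to mess with timestamps and recurses indefinitely.

wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it becomes: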

wget -m -e robots=off --no-parent http://example.com/configs/.vim/

— Sam Goody
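
For readers wondering what -m actually switches on: the wget manual describes it as currently equivalent to -r -N -l inf --no-remove-listing. The sketch below only spells that out; the function name mirror_expanded is made up for illustration, and --no-parent is added as in the command above.

```shell
# Hypothetical helper: the long form of "wget -m --no-parent URL".
mirror_expanded() {
    # -r                    recurse into links
    # -N                    time-stamping: skip files that are not newer than the local copy
    # -l inf                no limit on recursion depth
    # --no-remove-listing   keep the .listing files produced by FTP retrievals
    wget -r -N -l inf --no-remove-listing --no-parent "$1"
}

# Usage (performs real network requests):
# mirror_expanded http://example.com/configs/.vim/
```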

34

Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

— Erich Eichinger

8

If --no-parent does not help, you might use the --include option.

Directory structure:

http://<host>/downloads/good
http://<host>/downloads/bad

And you want to download the downloads/good directory but not downloads/bad:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

5

wget -r http://mysite.com/configs/.vim/

Works for me.

Perhaps you have a .wgetrc that is interfering with it?

— Conor McDermott
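
One way to check for an interfering startup file is to list the wgetrc files that exist on the machine. This is a minimal sketch: the function name find_wgetrc is made up, and the system-wide paths are assumptions about a typical install (the actual location depends on how wget was built).

```shell
# Hypothetical helper: print every candidate wgetrc file that exists.
# $WGETRC, when set, names the user startup file wget reads.
find_wgetrc() {
    for f in "${WGETRC:-}" "$HOME/.wgetrc" /etc/wgetrc /usr/local/etc/wgetrc; do
        [ -n "$f" ] && [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Usage:
# find_wgetrc
# Recent wget releases also accept --no-config to skip startup files entirely:
# wget --no-config -r http://mysite.com/configs/.vim/
```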

5

To fetch a directory recursively with a username and password, use the following command:

wget -r --user='YOUR_USERNAME' --password='YOUR_PASSWORD' --no-parent http://example.com/

— Pray
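
A password given on the command line ends up in the shell history and is visible in the process list. As a possible alternative, wget's --ask-password option prompts for it instead. The sketch below assumes an interactive terminal; the function name fetch_with_login is made up for illustration.

```shell
# Hypothetical wrapper: recursive fetch with HTTP authentication.
# $1 = username, $2 = URL
fetch_with_login() {
    # --ask-password prompts for the password, so it never appears
    # on the command line or in the shell history.
    wget -r --no-parent --user="$1" --ask-password "$2"
}

# Usage (performs real network requests and prompts for input):
# fetch_with_login alice http://example.com/
```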

2

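Wget 1.18 may work better; for example, I got bitten by a bug in version 1.12 where...

wget --recursive (...)

...only retrieved index.html instead of all the files.

The workaround was to notice some 301 redirects and try the new location. Given the new URL, wget fetched all the files in the directory.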

— Devon
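
Spotting such a redirect can be done by looking at the response headers. The following is a rough sketch, not the answerer's method: the function name last_location is made up, and it simply prints the last Location header found in whatever header text it is given.

```shell
# Hypothetical helper: read HTTP response headers on stdin and
# print the value of the last Location header, if any.
last_location() {
    tr -d '\r' |
        awk 'tolower($1) == "location:" { loc = $2 }
             END { if (loc != "") print loc }'
}

# Usage (performs a real network request; wget writes headers to stderr):
# wget --spider --server-response http://example.com/old/ 2>&1 | last_location
```

If this prints a URL, passing that URL to the recursive command instead of the original one is the workaround described above.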

2

You need just two flags: -r for recursion and --no-parent (or -np) so that wget does not climb into the parent directory (".."). Like this:

wget -r --no-parent http://example.com/configs/.vim/

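That's it. It will download into the following local tree: ./example.com/configs/.vim. However, if you do not want the first two directories, add -nH to drop the host directory and --cut-dirs=1 to drop configs, as suggested in earlier replies:

wget -r --no-parent -nH --cut-dirs=1 http://example.com/configs/.vim/

It will then download your file tree into ./.vim/ only. Note that --cut-dirs on its own never removes the host directory: with --cut-dirs=2 and no -nH, the files would land directly in ./example.com/.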

In fact, I got the first line of this answer straight from the wget manual, which has a very clean example toward the end of section 4.3.

— Jordan Gee
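
How -nH and --cut-dirs shape the local path can be sketched in a few lines of shell. This is only a model based on the table in the wget manual's description of --cut-dirs, not wget's own code; the function name wget_local_dir and its argument convention are made up, and it assumes the URL points at a directory.

```shell
# Hypothetical helper: predict the directory "wget -r" would save into.
# $1 = URL ending in a directory
# $2 = value passed to --cut-dirs (default 0)
# $3 = "nH" if -nH / --no-host-directories is used
wget_local_dir() (
    url=$1; cut=${2:-0}; nohost=${3:-}
    rest=${url#*://}                 # drop the scheme
    host=${rest%%/*}                 # e.g. example.com
    path=${rest#"$host"}             # e.g. /configs/.vim/
    path=${path#/}; path=${path%/}   # e.g. configs/.vim
    i=0
    while [ "$i" -lt "$cut" ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;  # drop one leading component
            *)   path= ;;            # the last component is gone
        esac
        i=$((i + 1))
    done
    if [ "$nohost" = "nH" ]; then dir=.; else dir=./$host; fi
    if [ -n "$path" ]; then dir=$dir/$path; fi
    printf '%s\n' "$dir"
)

# wget_local_dir http://example.com/configs/.vim/        prints ./example.com/configs/.vim
# wget_local_dir http://example.com/configs/.vim/ 1 nH   prints ./.vim
# wget_local_dir http://example.com/configs/.vim/ 2      prints ./example.com
```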

2

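When handling recursive downloads, the following options seem to be the perfect combination:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

For convenience, here are the relevant snippets from the man page: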

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

— Caveman

1

You can do this simply by adding -r

wget -r http://stackoverflow.com/

— Kaspersky

9

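This doesn't actually download a directory; it fetches every file it can find on the server, including directories above the one you want to download.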

— 2013

1

This version downloads recursively and doesn't create parent directories.

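wgetod() {
    # Count the slashes in the URL path (scheme, host, and trailing slash stripped).
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    # Cut every leading directory except the last, so only the target directory is created.
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs="$NCUT" --no-parent --reject="index.html*" "$1"
}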

Usage:

Add it to ~/.bashrc or paste it into a terminal, then call it:
wgetod "http://example.com/x/"

— rkok