How to add a huge file to an archive and delete it in parallel

8

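Say I have an 80GB file /root/bigfile on a 100GB system and want to put this file into an archive /root/bigarchive.tar

I obviously need to delete this file at the same time that it is added to the archive. Hence my question:

How can I delete a file at the same time that it is added to an archive?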

— user123456

0

If you are using the GNU tar command, you can use the --remove-files option:

--remove-files

remove files after adding them to the archive

tar -cvf files.tar --remove-files my_directory
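To see what the option does on a small scale, here is a minimal sketch, assuming GNU tar. The function name remove_files_demo and the names demo_dir and demo.tar are made up for illustration. Keep in mind that tar removes a file only after its data has been written to the archive, so for a moment both copies occupy disk space.

```shell
# Sketch, assuming GNU tar. All names here are illustrative, not from the question.
remove_files_demo() (
    # Run in a subshell so the cd does not affect the caller.
    workdir=$(mktemp -d) || exit 1
    cd "$workdir" || exit 1

    mkdir demo_dir
    echo "hello" > demo_dir/a.txt

    # Archive the directory and let tar delete what it has archived.
    tar -cf demo.tar --remove-files demo_dir

    # List the archive contents, then check that the source file is gone.
    tar -tf demo.tar
    [ ! -e demo_dir/a.txt ] && echo "source file removed"
)
```

Calling remove_files_demo lists the archived paths and then confirms that the original file no longer exists.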

— 达巴比

5

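I think the OP wants to remove the file while it is being archived, so if --remove-files only removes the file after it has been added to the .tar, it won't help him, because his hard disk will run out of space.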

— Zumo de Vidrio

6

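An uncompressed tar archive of a single file consists of a header, the file, and a trailing pad. So your principal problem is how to add 512 bytes of header to the start of your file. You can start by creating the wanted result with just the header: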

tar cf - bigfile | dd count=1 >bigarchive.tar

Then copy the first 10G of your file. For simplicity, we assume your dd can read/write 1GiB at a time:

dd count=10 bs=1G if=bigfile >>bigarchive.tar

Now we deallocate the copied data from the original file:

fallocate --punch-hole -o 0 -l 10GiB bigfile

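This replaces the data with sparse zeros that take no space on the filesystem. Continue in this way, adding skip=10 to the next dd and then incrementing the fallocate starting offset to -o 10GiB. At the very end, add some nul characters to pad out the final tar file.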
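The steps above can be put into a loop. What follows is only a sketch under several assumptions: GNU tar, GNU dd (for iflag=skip_bytes,count_bytes), util-linux fallocate, a filesystem that supports hole punching, and a file name short enough to fit in a single 512-byte header. The function name archive_in_place and its chunk parameter are my own invention. The final padding rounds the data up to a multiple of 512 bytes and appends two zero blocks as the end-of-archive marker.

```shell
# Sketch only; archive_in_place is a hypothetical helper, not a standard tool.
# Usage: archive_in_place FILE ARCHIVE [CHUNK_BYTES]
archive_in_place() {
    local file="$1" archive="$2" chunk="${3:-1073741824}"   # default chunk: 1 GiB
    local size offset len pad

    size=$(stat -c %s "$file") || return 1

    # Step 1: keep only the first 512-byte block of a normal tar stream (the header).
    tar cf - "$file" | dd count=1 2>/dev/null > "$archive"

    # Step 2: copy one chunk, then free that chunk in the source file.
    offset=0
    while [ "$offset" -lt "$size" ]; do
        len=$((size - offset))
        if [ "$len" -gt "$chunk" ]; then len=$chunk; fi

        dd if="$file" bs=1M iflag=skip_bytes,count_bytes \
           skip="$offset" count="$len" 2>/dev/null >> "$archive" || return 1

        # Abort if the filesystem cannot punch holes; otherwise the disk would fill up.
        fallocate --punch-hole --offset "$offset" --length "$len" "$file" || return 1

        offset=$((offset + len))
    done

    # Step 3: pad the data to a 512-byte boundary and add two zero blocks.
    pad=$(( (512 - size % 512) % 512 + 1024 ))
    head -c "$pad" /dev/zero >> "$archive"

    # The source now consists only of holes, so it can be removed.
    rm -f "$file"
}
```

For the case in the question this would be called as archive_in_place bigfile bigarchive.tar from inside /root, but try it on a small throwaway file first, since the source is destroyed as it goes.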

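If your filesystem does not support fallocate, you can do something similar, but starting at the end of the file. First copy the last 10GiB of the file to an intermediate file called, say, part8. Then use the truncate command to reduce the size of the original file. Proceed similarly until you have 8 files of 10GiB each. You can then concatenate the header and part1 to bigarchive.tar, then remove part1, then concatenate part2 and remove it, and so on.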
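The truncate variant can be sketched the same way. Again this assumes GNU tar, GNU dd and coreutils truncate, and a file name that fits in one header block. The function name archive_via_parts and the FILE.partN naming are hypothetical. Note that the header has to be generated before the first truncate, because tar records the file size in it.

```shell
# Sketch only; archive_via_parts is a hypothetical helper, not a standard tool.
# Usage: archive_via_parts FILE ARCHIVE [CHUNK_BYTES]
archive_via_parts() {
    local file="$1" archive="$2" chunk="${3:-1073741824}"   # default chunk: 1 GiB
    local size parts i offset pad

    size=$(stat -c %s "$file") || return 1
    parts=$(( (size + chunk - 1) / chunk ))

    # Header first, while the file still has its original size.
    tar cf - "$file" | dd count=1 2>/dev/null > "$archive"

    # Peel chunks off the end, last one first, shrinking the file each time.
    i=$parts
    while [ "$i" -ge 1 ]; do
        offset=$(( (i - 1) * chunk ))
        dd if="$file" bs=1M iflag=skip_bytes skip="$offset" \
           2>/dev/null > "$file.part$i" || return 1
        truncate -s "$offset" "$file" || return 1
        i=$((i - 1))
    done
    rm -f "$file"   # now zero bytes long

    # Append the parts in order, deleting each one as soon as it is appended.
    i=1
    while [ "$i" -le "$parts" ]; do
        cat "$file.part$i" >> "$archive" || return 1
        rm -f "$file.part$i"
        i=$((i + 1))
    done

    # Pad the data to a 512-byte boundary and add two zero blocks.
    pad=$(( (512 - size % 512) % 512 + 1024 ))
    head -c "$pad" /dev/zero >> "$archive"
}
```

At any moment this needs roughly one extra chunk of free space, so the chunk size has to be smaller than what is left on the disk.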

— u

5

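Deleting a file doesn't necessarily do what you think it does. That's why in UNIX-like systems the system call is called unlink and not delete. From the manual page: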

unlink() deletes a name from the filesystem.  If that name was the last
link to a file and no processes have the file open, the file is deleted
and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have
the file open, the file will remain in existence until  the  last  file
descriptor referring to it is closed.

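As a consequence, as long as the data compressor/archiver is reading from the file, that file remains in existence, occupying space in the filesystem.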
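This is easy to observe from a shell. The following is a small sketch; the function name unlink_demo and the use of file descriptor 3 are arbitrary choices of mine. It removes the only name of a temporary file while a read descriptor is still open, then reads the data anyway.

```shell
# Sketch: the data of an unlinked file survives as long as a descriptor is open.
unlink_demo() {
    tmp=$(mktemp) || return 1
    echo "still readable" > "$tmp"

    exec 3< "$tmp"      # keep a read descriptor open on the file
    rm "$tmp"           # unlink its only name

    [ -e "$tmp" ] || echo "name is gone"
    cat <&3             # the contents are still there

    exec 3<&-           # only now can the filesystem reclaim the space
}
```

Running unlink_demo prints "name is gone" followed by "still readable".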

— 亚历克斯

1

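How can I delete a file at the same time that it is added to an archive?

Given the context, I will interpret this question as:

How to remove data from disk immediately after it is read, before the full file has been read, so that there is enough space for the transformed file.

The transformation can be anything you want to do with the data: compressing, encrypting, etc.

The answer is this: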

<$file gzip | dd bs=$buffer iflag=fullblock of=$file conv=notrunc

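In short: read the data, feed it into gzip (or whatever you want to do with it), buffer the output so we're sure to read more than we write, and write it back to the file. Here is a prettier version that also shows progress while running: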

cat "$file" \
| pv -cN 'bytes read from file' \
| gzip \
| pv -cN 'bytes received from compressor' \
| dd bs=$buffer iflag=fullblock 2>/dev/null \
| pv -cN 'bytes written back to file' \
| dd of="$file" conv=notrunc 2>/dev/null

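I'll go through it line by line:

cat "$file" reads the file you want to compress. It's a useless use of cat (UUOC), since the next part, pv, can also read the file, but I find this prettier.

It pipes it into pv, which shows progress information (-cN tells it "use some kind of [c]ursor" and gives it a [N]ame).

That pipes into gzip, which obviously does the compression (reading from stdin, writing to stdout).

That pipes into another pv (pipe view).

That pipes into dd bs=$buffer iflag=fullblock. The $buffer variable is a number, something like 50 megabytes. It is however much RAM you want to dedicate to safe handling of your file (as a data point, a 50MB buffer for a 2GB file was fine). The iflag=fullblock tells dd to read up to $buffer bytes before passing them on. At the start, gzip will write a header, so gzip's output lands in this dd line. Then dd waits until it has enough data before passing it through, which lets the input be read further ahead. Additionally, if you have incompressible parts, the output may be bigger than the input. This buffer makes sure that, up to $buffer bytes, this is not a problem.

Then we go into another pipe view line and finally into our output dd line. This line has of (output file) and conv=notrunc specified, where notrunc tells dd not to truncate (delete) the output file before writing. So if you have 500 bytes of A and you write 3 bytes of B, the file will be BBBAAAAA... (instead of being replaced by BBB).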

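I didn't cover the 2>/dev/null parts, and they are not necessary. They just tidy up the output a little by suppressing dd's "I'm done and wrote this many bytes" message. The backslashes at the end of each line (\) make bash treat the whole thing as one big command whose parts pipe into each other.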
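The effect of conv=notrunc described above is easy to check on a scratch file. This is just a small sketch; the function name notrunc_demo is made up. It writes 3 bytes of B over 8 bytes of A, once with notrunc and once without.

```shell
# Sketch: compare dd with and without conv=notrunc on a scratch file.
notrunc_demo() {
    f=$(mktemp) || return 1

    # With notrunc: only the first 3 bytes are overwritten.
    printf 'AAAAAAAA' > "$f"
    printf 'BBB' | dd of="$f" conv=notrunc 2>/dev/null
    cat "$f"; echo

    # Without notrunc: dd truncates the file first, so only BBB remains.
    printf 'AAAAAAAA' > "$f"
    printf 'BBB' | dd of="$f" 2>/dev/null
    cat "$f"; echo

    rm -f "$f"
}
```

Running notrunc_demo prints BBBAAAAA on the first line and BBB on the second.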

Here is a complete script for easier use. Anecdotally, I put it in a folder called "gz-in-place". Then I realized the acronym I had made: GZIP: gnu zip in-place. So hereby I present GZIP.sh:

#!/usr/bin/env bash

### Settings

# Buffer is how many bytes to buffer before writing back to the original file.
# It is meant to prevent the gzip header from overwriting data, and in case
# there are parts that are uncompressible where the compressor might exceed
# the original filesize. In these cases, the buffer will help prevent damage.
buffer=$((1024*1024*50)) # 50 MiB

# You will need something that can work in stream mode from stdin to stdout.
compressor="gzip"

# For gzip, you might want to pass -9 for better compression. The default is
# (typically?) 6.
compressorargs=""

### End of settings

# FYI I'm aware of the UUOC but it's prettier this way

if [ $# -ne 1 ] || [ "x$1" == "x-h" ] || [ "x$1" == "x--help" ]; then
    cat << EOF
Usage: $0 filename
Where 'filename' is the file to compress in-place.

NO GUARANTEES ARE GIVEN THAT THIS WILL WORK!
Only operate on data that you have backups of.
(But you always back up important data anyway, right?)

See the source for more settings, such as buffer size (more is safer) and
compression level.

The only non-standard dependency is pv, though you could take it out
with no adverse effects, other than having no info about progress.
EOF
    exit 1;
fi;

b=$(($buffer/1024/1024));
echo "Processing '$1' with ${b}MiB buffer...";
echo "Note: I have no means of detecting this, but if you see the 'bytes read from";
echo "file' exceed 'bytes written back to file', your file is now garbage.";
echo "";

cat "$1" \
| pv -cN 'bytes read from file' \
| $compressor $compressorargs \
| pv -cN 'bytes received from compressor' \
| dd bs=$buffer iflag=fullblock 2>/dev/null \
| pv -cN 'bytes written back to file' \
| dd of="$1" conv=notrunc 2>/dev/null

echo "Done!";

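I felt like adding another buffering line before gzip, to prevent it from writing too far when the buffering dd line flushes, but with only a 50MiB buffer and 1900MB of /dev/urandom data, it seems to work already anyway (the md5sums matched after decompressing). Good enough for me.

Another improvement would be detecting when it writes too far, but I don't see how to do that without removing the beauty of the thing and adding a lot of complexity. At that point, you might as well make it a full-fledged python program that does everything properly (with failsafes to prevent data destruction).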

— 卢克