Boto3 to download all files from an S3 bucket


82

I'm using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is:

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present in the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the proper way to download a complete S3 bucket using boto3? And how do I download folders?


Answers:


34

When working with buckets that have 1000+ objects, it's necessary to implement a solution that uses the NextContinuationToken to page through sequential sets of, at most, 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.

import boto3
import os

s3_client = boto3.client('s3')

def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)

Changed this to be the accepted answer, since it handles a wider range of use cases. Thanks Grant

My code gets stuck in a loop at while next_token is not None:
gpd

@gpd That shouldn't happen, because when the boto3 client reaches the last page it will return a page with no NextContinuationToken and exit the while statement. If you paste the last response you get from the boto3 API (whatever is stored in the response variable), I think it will be clearer what is happening in your specific case. Try printing out the 'results' variable to test. My guess is that you supplied a prefix that doesn't match any contents of the bucket. Did you check that?
Grant Langseth

Note that you would need to make some minor changes to get this to work with Digital Ocean, as explained here.
David D.

1
With this code I get this error: TypeError: 'NoneType' object is not iterable
NJones

76

I had the same need and created the following function that downloads the files recursively.

Directories are only created locally if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called like this:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')

6
I don't think you need to create both a resource and a client. I believe a client is always available from the resource. You can just use resource.meta.client.
TheHerk '16
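
For illustration, a minimal sketch of what that comment suggests, reusing this answer's download_dir with the same example arguments (the bucket name is a placeholder):

import boto3

s3_resource = boto3.resource('s3')
# The resource already carries a low-level client; no separate boto3.client('s3') is needed.
download_dir(s3_resource.meta.client, s3_resource, 'clientconf/', '/tmp', bucket='my-bucket')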

2
I think it should be "download_dir(client, resource, subdir.get('Prefix'), local, bucket)"
rm999

6
I was getting OSError: [Errno 21] Is a directory, so I wrapped the call to download_file with if not file.get('Key').endswith('/') to resolve it. Thank you @glefait and @Shan
user336828

5
Isn't there an equivalent of the aws-cli command aws s3 sync in the boto3 library?
greperror '17

8
What is dist here?
Rob Rose

48

Amazon S3 does not have folders/directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to save only the .8Df54234 portion, or you would have to create the necessary directories before writing the file. Note that it could be multi-level nested directories.
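
A minimal sketch of the second approach, using the example key from the question (the bucket and local folder names are placeholders):

import os
import boto3

s3 = boto3.client('s3')
key = 'my_folder/.8Df54234'                     # example key with a "directory" prefix
local_path = os.path.join('local_folder', key)  # where to save it locally

# Create any nested directories implied by the key before downloading.
directory = os.path.dirname(local_path)
if directory and not os.path.exists(directory):
    os.makedirs(directory)

s3.download_file('my_bucket_name', key, local_path)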

An easier method would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, eg:

aws s3 cp --recursive s3://my_bucket_name local_folder

There is also a sync option that will only copy new and modified files.
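
For example, the sync equivalent of the command above would be:

aws s3 sync s3://my_bucket_name local_folder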


1
@j I understand. But I need the folders to be created automatically, just like aws s3 sync does. Is that possible in boto3?
2015

4
You would have to make the creation of the directories part of your Python code. If the Key contains a directory (eg foo/bar.txt), you would be responsible for creating the directory (foo) before calling s3.download_file. It is not an automatic capability of boto.
John Rotenstein

Here, the contents of the S3 bucket are dynamic, so I have to check s3.list_objects(Bucket='my_bucket_name')['Contents'], filter for folder keys, and create them.

2
After playing around with Boto3 for a while, the AWS CLI command listed here is definitely the easiest way to do this.
AdjunctProfessorFalcon

1
@Ben Please start a new question rather than asking it as a comment on an old (2015) question.
John Rotenstein

43
import os
import boto3

#initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download file into current directory
for s3_object in my_bucket.objects.all():
    # Need to split s3_object.key into path and file name, otherwise it will give a "file not found" error.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)

3
Clean and simple, why wouldn't you use this? It's much easier to understand than all the other solutions. The collections seem to do a lot of things for you in the background.
Joost

3
I guess you should create all the subfolders first for this to work properly.
rpanai '18

2
This code will put everything into the top-level output directory, regardless of how deeply it is nested in S3. And if multiple files have the same name in different directories, they will stomp on one another. I think you need one more line: os.makedirs(path), and then the download destination should be object.key.
Scott Smith,
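
A sketch of the resource-based answer above with the fixes that comment describes (the bucket name is a placeholder; keys ending in '/' are skipped):

import os
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for s3_object in my_bucket.objects.all():
    path, filename = os.path.split(s3_object.key)
    # Recreate the key's directory structure locally so that nested files
    # do not collide in the top-level output directory.
    if path and not os.path.exists(path):
        os.makedirs(path)
    if not s3_object.key.endswith('/'):
        my_bucket.download_file(s3_object.key, s3_object.key)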

13

I am currently achieving the task by using the following:

#!/usr/bin/python
import os
import boto3

s3 = boto3.client('s3')
list = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in list:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

While it does the job, I'm not sure it's a good way to do it. I'm leaving it here to help other users and prompt further answers that achieve this in a better manner.


9

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you may end up hitting Python's recursion limit. Here is an alternative approach, with a couple of extra checks.

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Recursively downloads the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result['Contents']:
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

1
Got KeyError: 'Contents'. Input path '/arch/R/storeincomelogs/', full path /arch/R/storeincomelogs/201901/01/xxx.parquet.
Mithril

3

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os

from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process; CLIDriver.main expects the arguments as a single list
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

I used the same idea, but instead of the sync command I simply executed aws s3 cp s3://{bucket}/{folder} {local_folder} --recursive. The run time went down from minutes (almost 1 hour) to literally seconds.
acaruci

I'm using this code but have an issue where all the debug logs are being shown. I have this declared globally: logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.WARNING) logger = logging.getLogger(). I only want logs to be output from root. Any ideas?
April Polubiec

1

Fetching all the files in one go is a very bad idea; you should rather fetch them in batches.

One implementation I use to fetch a particular folder (directory) from S3 is:

from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (credentials, region and bucket_name are assumed to be defined elsewhere)
    session = Session(aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key,
                      region_name=region_name)

    # get instances for client, resource and bucket
    client = session.client('s3')
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    for s3_key in client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))

Still, if you want the whole bucket, use it via the CLI as @John Rotenstein mentioned, like below:

aws s3 cp --recursive s3://bucket_name download_path

0
import os
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')  # placeholder bucket name

for objs in my_bucket.objects.all():
    print(objs.key)
    path = '/tmp/' + os.sep.join(objs.key.split(os.sep)[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError as fe:
        print(objs.key + ' exists')

This code will download the contents into the /tmp/ directory. You can change the directory if you want.


0

If you want to call a bash script using Python, here is a simple method to load a file from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
#     print(object_summary.key)
    FILES_NAMES.append(object_summary.key)

# 2.List only new files that do not exist in local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES )-set(os.listdir(local_folder_path)))

# 3.Time to load files in your destination folder 
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name,bucket_folder_name,new_filename ,local_folder_path)

    subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")

0

I had a similar requirement and got help from reading the above solutions and content across other websites; I came up with the following script and just wanted to share it in case it helps anyone.

from boto3.session import Session
import os

def sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path):    
    session = Session(aws_access_key_id=access_key_id,aws_secret_access_key=secret_access_key)
    s3 = session.resource('s3')
    your_bucket = s3.Bucket(bucket_name)
    for s3_file in your_bucket.objects.all():
        if folder in s3_file.key:
            file=os.path.join(destination_path,s3_file.key.replace('/','\\'))
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            your_bucket.download_file(s3_file.key,file)
sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path)

0

Adding to @glefait's answer with an if condition at the end, to avoid OS error 20.

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            print("Content: ",result)
            dest_pathname = os.path.join(local, file.get('Key'))
            print("Dest path: ",dest_pathname)
            if not os.path.exists(os.path.dirname(dest_pathname)):
                print("here last if")
                os.makedirs(os.path.dirname(dest_pathname))
            print("else file key: ", file.get('Key'))
            if not file.get('Key') == dist:
                print("Key not equal? ",file.get('Key'))
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

0

I have been running into this problem for quite a while, and across all of the various forums I've been through I hadn't seen a full end-to-end snippet that works. So, I went ahead and took all the pieces (plus added some things of my own) and created a full end-to-end S3 downloader!

This will not only download files automatically, but if the S3 files are in subdirectories, it will create them on the local storage as well. In my application's instance I need to set permissions and owners, so I have added that too (it can be commented out if not needed).

This has been tested and works in a Docker environment (K8s), but I have added the environment variables in the script in case you want to test/run it locally.

I hope this helps someone out in their quest for S3 download automation. I also welcome any advice, info, etc. on how this could be better optimized where needed.

#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime

import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger

formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')

json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)

#Manual Testing Variables If Needed
#os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
#os.environ["BUCKET_NAME"] = "some_bucket"
#os.environ["AWS_ACCESS_KEY"] = "some_access_key"
#os.environ["AWS_SECRET_KEY"] = "some_secret"
#os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"

#Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
    'ERROR' : logging.ERROR,
    'INFO' : logging.INFO,
    'DEBUG' : logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_level_selector)

#Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))

#Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
logger.debug("Secret: %s" % aws_access_secret_key)
logger.debug("Download location path: %s" % download_location_path)

#Creating Download Directory
if not os.path.exists(download_location_path):
    logger.info("Making download directory")
    os.makedirs(download_location_path)

#Signal Hooks are fun
class GracefulKiller:
    kill_now = False
    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)
    def exit_gracefully(self, signum, frame):
        self.kill_now = True

#Downloading from S3 Bucket
def download_s3_bucket():
    conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
    logger.debug("Connection established: ")
    bucket = conn.get_bucket(bucket_name)
    logger.debug("Bucket: %s" % str(bucket))
    bucket_list = bucket.list()
#    logger.info("Number of items to download: {0}".format(len(bucket_list)))

    for s3_item in bucket_list:
        key_string = str(s3_item.key)
        logger.debug("S3 Bucket Item to download: %s" % key_string)
        s3_path = download_location_path + "/" + key_string
        logger.debug("Downloading to: %s" % s3_path)
        local_dir = os.path.dirname(s3_path)

        if not os.path.exists(local_dir):
            logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
            os.makedirs(local_dir)
            logger.info("Updating local directory permissions to %s" % local_dir)
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(local_dir, 0o775)
            os.chown(local_dir, 60001, 60001)
        logger.debug("Local directory for download: %s" % local_dir)
        try:
            logger.info("Downloading File: %s" % key_string)
            s3_item.get_contents_to_filename(s3_path)
            logger.info("Successfully downloaded File: %s" % s3_path)
            #Updating Permissions
            logger.info("Updating Permissions for %s" % str(s3_path))
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(s3_path, 0o664)
            os.chown(s3_path, 60001, 60001)
        except (OSError, S3ResponseError) as e:
            logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
            # logger.error("Exception in file download from S3: {}".format(e))
            continue
        logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
        s3_item.delete()

def main():
    killer = GracefulKiller()
    while not killer.kill_now:
        logger.info("Checking for new files on S3 to download...")
        download_s3_bucket()
        logger.info("Done checking for new files, will check in 120s...")
        gc.collect()
        sys.stdout.flush()
        time.sleep(120)
if __name__ == '__main__':
    main()

0

From the AWS S3 Docs (How do I use folders in an S3 bucket?):

In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.

For example, you can create a folder on the console called photos and store an object named myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.

To download all files from "mybucket" into the current directory, respecting the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):

import boto3
import os

bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket = bucket_name)["Contents"]
for s3_object in objects:
    s3_key = s3_object["Key"]
    path, filename = os.path.split(s3_key)
    if len(path) != 0 and not os.path.exists(path):
        os.makedirs(path)
    if not s3_key.endswith("/"):
        download_to = path + '/' + filename if path else filename
        s3.download_file(bucket_name, s3_key, download_to)

It would be better if you could include some explanation of the code.
johan

1
@johan, thanks for the feedback! I added the relevant explanation.
Daria