152

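I know the URL of an image on the Internet.

For example, http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains Google's logo.

Now, how can I download this image using Python, without actually opening the URL in a browser and saving the file manually?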

python web-scraping

— Pankaj Vatsa
source

1

How do I download a file over HTTP using Python?

— Jaydev

316

Python 2

If all you want to do is save it as a file, this is a simpler way:

import urllib

urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

The second argument is the local path where the file should be saved.

Python 3

As SergO suggested, the code below should work with Python 3.

import urllib.request

urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

— Liquid_Fire
source

55

A good way to get the filename from the link is filename = link.split('/')[-1]

— heltonbiker
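One caveat with split('/'): it keeps any query string and percent-escapes in the resulting name. A minimal Python 3 sketch that avoids this with the standard library; the helper name filename_from_url and its default fallback are hypothetical, introduced here only for illustration:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path                 # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))  # decode %20 etc., keep the last segment
    return name or default                    # URLs ending in "/" carry no file name


print(filename_from_url("http://www.digimouth.com/news/media/2011/09/google-logo.jpg"))
# google-logo.jpg
```

For plain image URLs like the one in the question, this gives the same result as link.split('/')[-1].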

2

Using urlretrieve I only get a 1KB file containing a dict and 404 error text. Why? If I enter the URL in my browser, I get the picture.

— Yebach, 2014

2

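@Yebach: The site you are downloading from may be using cookies, the User-Agent, or other headers to decide what content to serve you. These will differ between your browser and Python.

— Liquid_Fire, 2014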
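If the User-Agent is what the site checks, one option in Python 3 is to build the request yourself instead of calling urlretrieve. This is only a sketch: download_with_headers is a hypothetical helper, the User-Agent string is a made-up example, and whether a given site accepts it is entirely up to that site.

```python
import shutil
import urllib.request


def download_with_headers(url, local_path, user_agent="Mozilla/5.0 (example)"):
    """Fetch url with a custom User-Agent and stream the body to local_path."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Stream the response to disk in binary mode instead of reading it all at once.
    with urllib.request.urlopen(request, timeout=10) as response, \
            open(local_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return local_path


# Example call (performs a network request):
# download_with_headers(
#     "http://www.digimouth.com/news/media/2011/09/google-logo.jpg",
#     "local-filename.jpg",
# )
```

Sites that rely on cookies or a login session need more than this; the sketch only covers the header case.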

27

Python 3: import urllib.request and urllib.request.urlretrieve(), accordingly.

— SergO '16

1

@SergO - could you add the Python 3 part to the original answer?

— Sreejith Menon

27

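import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead

resource = urllib.urlopen("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
# Open the file in binary mode so the image bytes are written unmodified
with open("file01.jpg", "wb") as output:
    output.write(resource.read())

file01.jpg will contain your image.

— Noufal Ibrahim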
source

2

You should open the file in binary mode: open("file01.jpg", "wb"). Otherwise you may corrupt the image.

— Liquid_Fire, 2011

2

urllib.urlretrieve can save the image directly.

— heltonbiker

17

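I wrote a script that does just this, and it is available on my github for your use.

I used BeautifulSoup so that I can parse any website for images. If you will be doing a lot of web scraping (or intend to use my tool), I suggest you sudo pip install beautifulsoup4, which is the package that provides the bs4 module imported in the code. Information on BeautifulSoup is available here.

For convenience, here is my code: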

# Python 2 script (uses urllib2 and the print statement)
from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib

# Run this image scraper from the directory
# that you want to save the scraped images to

def make_soup(url):
    html = urlopen(url).read()
    # name the parser explicitly so bs4 does not have to guess one
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urllib.urlretrieve(each, filename)
    return image_links

# a standard call looks like this
# get_images('http://www.wookmark.com')

— Yup.
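In the script above, each.get('src') returns None for an img tag that has no src attribute, and the later each.split('/') would then fail. As a rough illustration of collecting only usable links, here is a Python 3 sketch that swaps BeautifulSoup for the standard library's html.parser; the names ImgSrcCollector and find_image_sources are hypothetical and not taken from the author's script.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that actually has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value).
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags with a missing or empty src
            self.sources.append(src.strip())


def find_image_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


sample = '<p><img src="/a.jpg"><img alt="no source"><IMG SRC="pics/b.png"/></p>'
print(find_image_sources(sample))
# ['/a.jpg', 'pics/b.png']
```

The values are returned exactly as written in the page, so relative links still need to be resolved against the page URL before downloading.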
source

11

This can be done with requests. Load the page and dump the binary content to a file.

import os
import requests

url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
page = requests.get(url)

f_ext = os.path.splitext(url)[-1]
f_name = 'img{}'.format(f_ext)
with open(f_name, 'wb') as f:
    f.write(page.content)

— Alex
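Because the response body is written out as-is, an HTML error page would be saved under an image file name without complaint. One cheap safeguard is to check the leading bytes against well-known image signatures before writing. The sketch below covers only JPEG, PNG and GIF; sniff_image_type is a hypothetical helper name.

```python
# Leading "magic" bytes of three common image formats.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return 'jpeg', 'png' or 'gif' if data starts with a known signature, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None


print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # png
print(sniff_image_type(b"<html>404 Not Found</html>"))        # None
```

With the code above, this could be used as a guard such as `if sniff_image_type(page.content) is None:` before opening the output file.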
source

1

Pass headers in requests if you are getting a bad request :)

— 1UC1F3R616

8

Python 3

urllib.request: extensible library for opening URLs

from urllib.error import HTTPError
from urllib.request import urlretrieve

try:
    urlretrieve(image_url, image_local_path)
except FileNotFoundError as err:
    print(err)   # something wrong with local path
except HTTPError as err:
    print(err)  # something wrong with url

— SergO
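To go one step beyond printing, the same idea can be wrapped in a function that reports whether the download worked. This is a sketch, with try_download as a hypothetical name. HTTPError is a subclass of URLError, so it has to be caught first; URLError in turn covers failures where no HTTP response arrived at all.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve


def try_download(image_url, image_local_path):
    """Return (True, None) on success or (False, reason) on failure."""
    try:
        urlretrieve(image_url, image_local_path)
    except HTTPError as err:   # the server answered with an error status (404, 403, ...)
        return False, "HTTP error {}: {}".format(err.code, err.reason)
    except URLError as err:    # no usable answer: DNS failure, refused connection, ...
        return False, "URL error: {}".format(err.reason)
    except OSError as err:     # e.g. the local path cannot be opened for writing
        return False, "OS error: {}".format(err)
    return True, None


# Example call (performs a network request):
# ok, reason = try_download(
#     "http://www.digimouth.com/news/media/2011/09/google-logo.jpg",
#     "local-filename.jpg",
# )
```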
source

6

A solution that works with Python 2 and Python 3:

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
urlretrieve(url, "local-filename.jpg")

Or, if the additional requirement of requests is acceptable and it is an http(s) URL:

def load_requests(source_url, sink_path):
    """
    Load a file from an URL (e.g. http).

    Parameters
    ----------
    source_url : str
        Where to load the file from.
    sink_path : str
        Where the loaded file is stored.
    """
    import requests
    r = requests.get(source_url, stream=True)
    if r.status_code == 200:
        with open(sink_path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

— Martin Thoma
source

5

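I made a script that expands on Yup.'s script. I fixed a few things. It now gets around 403: Forbidden problems. It won't crash when an image cannot be retrieved. It tries to avoid corrupted previews. It builds the correct absolute URLs. It prints more information. It can be run with an argument from the command line.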

# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are

from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print '  An error occurred. Continuing.'
    print 'Done.'

if __name__ == '__main__':
    url = sys.argv[1]
    get_images(url)

— madprops
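The absolute-URL step is what urljoin handles. A small Python 3 illustration (the script above imports it from urlparse because it targets Python 2); the page and image URLs here are made-up examples:

```python
from urllib.parse import urljoin

page_url = "http://www.wookmark.com/gallery/index.html"

print(urljoin(page_url, "img/a.jpg"))                    # relative to the page's directory
print(urljoin(page_url, "/static/b.png"))                # relative to the site root
print(urljoin(page_url, "//cdn.example.com/c.gif"))      # scheme-relative: inherits "http"
print(urljoin(page_url, "https://other.example/d.jpg"))  # already absolute: unchanged
# http://www.wookmark.com/gallery/img/a.jpg
# http://www.wookmark.com/static/b.png
# http://cdn.example.com/c.gif
# https://other.example/d.jpg
```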
source

3

Using the requests library

import requests
import shutil,os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir,'Images')#saving images to Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:#retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url,headers=headers,stream=True,timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path,filename),'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw,f)
            print(filename)
            break
        except Exception as e:
            attempts+=1
            print(e)


ImageDl(url)

— Sohan Das
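In the loop above, a non-200 response is not retried, and running out of attempts ends silently. If you would rather get the last error back and wait a little longer between tries, the retry logic can be pulled out into a small helper. This is a sketch under those assumptions; retry and its parameters are hypothetical names and are not part of requests.

```python
import time


def retry(action, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise            # out of attempts: let the caller see the failure
            sleep(delay)         # wait before the next try
            delay *= backoff     # and wait longer each time


# Example (download_once is a placeholder for your own download function):
# retry(lambda: download_once(url), attempts=5)
```

To have a bad status code retried as well, the wrapped function can raise on it, for example by calling r.raise_for_status() on the requests response.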
source

In my case the headers really mattered; I was getting a 403 error. It worked.

— Ishtiyaq Husain

2

This is a very short answer.

import urllib
urllib.urlretrieve("http://photogallery.sandesh.com/Picture.aspx?AlubumId=422040", "Abc.jpg")

— OO7
source

2

Python 3 version

I adjusted the code of @madprops for Python 3

# getem.py
# python3 script to download all images in a given url
# use: python3 getem.py (the start URL is set in the __main__ block at the bottom)

from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time

def make_soup(url):
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"}) 
    html = urllib.request.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print('  An error occurred. Continuing.')
    print('Done.')

if __name__ == '__main__':
    get_images('http://www.wookmark.com')

— Giovanni PY
source

1

Something fresh for Python 3, using Requests:

Comments are in the code. Ready-to-use function.


import requests
from os import path

def get_image(image_url):
    """
    Get image based on url.
    :return: Image name if everything OK, False otherwise
    """
    image_name = path.split(image_url)[1]
    try:
        image = requests.get(image_url)
    except OSError:  # A little too broad, but works OK with no additional imports. Catches all connection problems
        return False
    if image.status_code == 200:  # we could have retrieved error page
        base_dir = path.join(path.dirname(path.realpath(__file__)), "images") # Use your own path or "" to use current working directory. Folder must exist.
        with open(path.join(base_dir, image_name), "wb") as f:
            f.write(image.content)
        return image_name
    return False  # non-200 status: nothing was saved

get_image("https://apod.nasddfda.gov/apod/image/2003/S106_Mishra_1947.jpg")

— Pavel Pančocha
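The function above expects the images folder to exist already. If you would rather have it created on demand, a minimal sketch with pathlib follows; target_path is a hypothetical helper, not part of the answer's code.

```python
from pathlib import Path


def target_path(base_dir, image_name):
    """Make sure base_dir exists and return the full path for image_name inside it."""
    directory = Path(base_dir)
    # parents=True creates missing parent folders; exist_ok=True makes repeat calls harmless.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / image_name


# Example: open(target_path("images", "S106_Mishra_1947.jpg"), "wb")
```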
source

0

Late answer, but for python>=3.6 you can use dload, i.e.:

import dload
dload.save("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")

If you need the image as bytes, use:

img_bytes = dload.bytes("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")

Install with pip3 install dload

— CONvid19
source

-2

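import requests

img_data = requests.get('https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg')

# write the raw response bytes (.content), not the Response object itself
with open('file_name.jpg', 'wb') as handler:
    handler.write(img_data.content)

— Lewis Mann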
source

4

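Welcome to Stack Overflow! While you may have solved this user's problem, code-only answers are not very helpful to users who come to this question in the future. Please edit your answer to explain why your code solves the original problem.

— Joe C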

1

TypeError: a bytes-like object is required, not 'Response'. It has to be handler.write(img_data.content)

— TitanFighter

It should be handler.write(img_data.read()).

— jdhao

How to save an image locally using Python when I already know its URL address?