How can I scrape faster?


16

The job here is to scrape an API of a site, starting at https://xxx.xxx.xxx/xxx/1.json and going all the way up to https://xxx.xxx.xxx/xxx/1417749.json, and write the results to MongoDB. For that I have the following code:

import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min = 1
max = 1417749
for n in range(min, max):
    response = requests.get("https://xx.xxx.xxx/{}.json".format(str(n)))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()

However, completing this task takes a huge amount of time. The question is how to speed up this process.


Did you first try to measure how long processing a single json takes? Assuming 300 ms per record, you could process all of them sequentially in about 5 days.
tuxdna
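A quick check of that estimate (assuming ~300 ms per record, as above):

# 1,417,749 records at ~0.3 s each, processed sequentially
total_seconds = 1417749 * 0.3            # ≈ 425,325 seconds
print(total_seconds / (60 * 60 * 24))    # ≈ 4.9 days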

Answers:


5

If you don't want to use multithreading, asyncio is also an option:

import time
import pymongo
import json
import asyncio
from aiohttp import ClientSession


async def get_url(url, session):
    async with session.get(url) as response:
        if response.status == 200:
            return await response.text()


async def create_task(sem, url, session):
    async with sem:
        response = await get_url(url, session)
        if response:
            parsed = json.loads(response)
            n = url.rsplit('/', 1)[1]
            inserted = com.insert_one(parsed)
            write_log.write(str(n) + "\t" + str(inserted) + "\n")
            print(str(n) + "\t" + str(inserted) + "\n")


async def run(minimum, maximum):
    url = 'https://xx.xxx.xxx/{}.json'
    tasks = []
    sem = asyncio.Semaphore(1000)   # Limit concurrency to 1000 requests at a time, staying below the max open sockets allowed
    async with ClientSession() as session:
        for n in range(minimum, maximum):
            task = asyncio.ensure_future(create_task(sem, url.format(n), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses


client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min_item = 1
max_item = 100

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(min_item, max_item))
loop.run_until_complete(future)
write_log.close()

1
Using asyncio works faster than multithreading here.
Tek Nath

Thanks for the feedback. Interesting results.
Frans

10

There are several things you could do:

  1. Reuse the connection. According to the benchmark below, it is about 3 times faster
  2. You can scrape in multiple processes in parallel

Here is the parallel code:

import sys
import requests
from threading import Thread
from queue import Queue   # was "from Queue import Queue" on Python 2

concurrent = 50            # number of worker threads

def doWork():
    # minimal worker (sketch): pull URLs off the queue and fetch each one,
    # using one Session per thread so the connection is reused
    s = requests.Session()
    while True:
        url = q.get()
        print(url, s.get(url).status_code)
        q.task_done()

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
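If you prefer separate processes over threads (point 2 above), a minimal sketch with multiprocessing.Pool could look like the following; the helper name, pool size, and placeholder URL are assumptions:

import json
import pymongo
import requests
from multiprocessing import Pool

def fetch(n):
    # hypothetical helper: download one document and return the parsed JSON (or None)
    r = requests.get("https://xx.xxx.xxx/{}.json".format(n))
    return json.loads(r.text) if r.status_code == 200 else None

if __name__ == "__main__":
    com = pymongo.MongoClient("mongodb://127.0.0.1:27017")["thread1"]["threadcol"]
    with Pool(processes=8) as pool:                          # 8 worker processes
        for doc in pool.imap_unordered(fetch, range(1, 1001)):
            if doc is not None:
                com.insert_one(doc)                          # insert in the parent process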

Timing of a reusable connection, taken from this question:

>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953
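Applied to the loop in the question, reusing the connection is just a matter of creating one Session up front; a minimal sketch with the placeholder URL from the question:

import requests

session = requests.Session()      # one session, so the underlying TCP/TLS connection is reused
for n in range(1, 1417749):
    response = session.get("https://xx.xxx.xxx/{}.json".format(n))
    # ... parse and insert into MongoDB as before ...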


4

What you are probably looking for is asynchronous scraping. I would recommend creating batches of URLs, e.g. 5 URLs each (try not to hammer the website), and scraping them asynchronously. If you don't know much about async, look up the library asyncio. Hope I can help you :)
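A minimal sketch of that batching idea with asyncio and aiohttp; the batch size, URL pattern, and range are assumptions:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text() if resp.status == 200 else None

async def main():
    urls = ["https://xx.xxx.xxx/{}.json".format(n) for n in range(1, 101)]
    batch_size = 5                     # small batches, to stay gentle with the site
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            results = await asyncio.gather(*(fetch(session, u) for u in batch))
            # parse each JSON body in results and insert it into MongoDB here

asyncio.run(main())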


1
Can you add more details?
Tek Nath

3

Try chunking your requests and combine that with MongoDB's bulk write operation.

  • Group the requests (100 requests per group)
  • Iterate over the groups
  • Use an asynchronous request model to fetch the data (the URLs in a group)
  • Update the database once a group is done (bulk write operation); see the sketch after the link below

This can save a lot of time on both the MongoDB write latency and the synchronous network-call latency.

But do not push the parallel request count (chunk size) too high; that increases the network load on the server, and the server may treat it as a DDoS attack.

  1. https://api.mongodb.com/python/current/examples/bulk.html
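A hedged sketch of that approach, combining the aiohttp pattern shown earlier with insert_many; the chunk size and placeholder URL are assumptions:

import asyncio
import json
import aiohttp
import pymongo

com = pymongo.MongoClient("mongodb://127.0.0.1:27017")["thread1"]["threadcol"]

async def fetch(session, n):
    async with session.get("https://xx.xxx.xxx/{}.json".format(n)) as resp:
        return await resp.text() if resp.status == 200 else None

async def main(first, last, chunk_size=100):
    async with aiohttp.ClientSession() as session:
        for start in range(first, last, chunk_size):
            ids = range(start, min(start + chunk_size, last))
            bodies = await asyncio.gather(*(fetch(session, n) for n in ids))
            docs = [json.loads(b) for b in bodies if b]
            if docs:
                com.insert_many(docs)   # one bulk write per chunk

asyncio.run(main(1, 1417750))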

1
Could you help with the code for chunking the requests and fetching the chunks?
Tek Nath

3

Assuming you do not get blocked by the API and that there are no rate limits, this code should make the process about 50 times faster (maybe more, since all requests are now sent over the same session).

import json
import time
import threading

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
logs = []

number_of_json_objects = 1417750
number_of_threads = 50

session = requests.Session()

def scrap_write_log(session, start, end):
    for n in range(start, end):
        response = session.get("https://xx.xxx.xxx/{}.json".format(n))
        if response.status_code == 200:
            try:
                inserted = com.insert_one(json.loads(response.text))
                logs.append(str(n) + "\t" + str(inserted) + "\n")
                print(str(n) + "\t" + str(inserted) + "\n")
            except Exception:
                logs.append(str(n) + "\t" + "Failed to insert" + "\n")
                print(str(n) + "\t" + "Failed to insert" + "\n")

# Give each thread its own contiguous chunk of ids
thread_ranges = [[x, x + number_of_json_objects // number_of_threads]
                 for x in range(0, number_of_json_objects, number_of_json_objects // number_of_threads)]

threads = [threading.Thread(target=scrap_write_log, args=(session, start_and_end[0], start_and_end[1]))
           for start_and_end in thread_ranges]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open("logging.log", "a") as f:
    for line in logs:
        f.write(line)

2

I happened to run into the same problem years ago. I was never satisfied with the Python-based answers, which were either too slow or too complicated. After switching to other mature tools it was fast, and I never looked back.

Recently I have been using steps like the following to speed up the process.

  1. Generate a batch of URLs in a txt file (see the sketch after this list)
  2. Use aria2c -x16 -d ~/Downloads -i /path/to/urls.txt to download those files
  3. Parse them locally
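A minimal sketch of steps 1 and 3, assuming the URL pattern from the question:

import glob
import json
import os

# Step 1: write every URL into urls.txt for aria2c to consume
with open("urls.txt", "w") as f:
    for n in range(1, 1417750):
        f.write("https://xxx.xxx.xxx/xxx/{}.json\n".format(n))

# Step 3: once aria2c has finished, parse the downloaded files locally
for path in glob.glob(os.path.expanduser("~/Downloads/*.json")):
    with open(path) as fh:
        doc = json.load(fh)
        # insert doc into MongoDB here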

This is the fastest process I have come up with so far.

When it comes to scraping web pages, I even download the necessary *.html files instead of visiting the pages one at a time, which actually makes no difference: whether you hit the page with Python tools like requests, scrapy or urllib, the whole web content gets downloaded for you anyway.


1

First create a list of all the links, since they all follow the same pattern and only the number changes; then just iterate over it.

import json
import threading
import requests

class Demo:
    def __init__(self, url):
        self.json_url = url

    def get_json(self):
        try:
            # your logic goes here, e.g. fetch and parse the JSON
            response = requests.get(self.json_url)
            if response.status_code == 200:
                parsed = json.loads(response.text)
                # insert parsed into MongoDB here
        except Exception as e:
            print(e)

list_of_links = []
for i in range(1, 1417749):
    list_of_links.append("https://xx.xxx.xxx/{}.json".format(str(i)))

t_no = 2
for i in range(0, len(list_of_links), t_no):
    all_t = []
    twenty_links = list_of_links[i:i + t_no]
    for link in twenty_links:
        obj_new = Demo(link)
        t = threading.Thread(target=obj_new.get_json)
        t.start()
        all_t.append(t)
    for t in all_t:
        t.join()

By simply increasing or decreasing t_no you can change the number of threads without any other change to the code.
