Multiprocessing - Pipe vs Queue


Answers:


281
  • A Pipe() can only have two endpoints.

  • A Queue() can have multiple producers and consumers.

When to use them

If you need more than two points to communicate, use a Queue().

If you need absolute performance, a Pipe() is much faster because Queue() is built on top of Pipe().

Performance benchmarking

Let's suppose you want to spawn two processes and send messages between them as quickly as possible. These are the timing results of a drag race between similar tests using Pipe() and Queue(). This is on a Thinkpad T61 running Ubuntu 11.10 and Python 2.7.2.

FYI, I threw in results for JoinableQueue() as a bonus; JoinableQueue() accounts for tasks when queue.task_done() is called (it doesn't even know about the specific task, it just counts unfinished tasks in the queue), so that queue.join() knows the work is finished.

The code for each is at the bottom of this answer...

mpenning@mpenning-T61:~$ python multi_pipe.py 
Sending 10000 numbers to Pipe() took 0.0369849205017 seconds
Sending 100000 numbers to Pipe() took 0.328398942947 seconds
Sending 1000000 numbers to Pipe() took 3.17266988754 seconds
mpenning@mpenning-T61:~$ python multi_queue.py 
Sending 10000 numbers to Queue() took 0.105256080627 seconds
Sending 100000 numbers to Queue() took 0.980564117432 seconds
Sending 1000000 numbers to Queue() took 10.1611330509 seconds
mpenning@mpenning-T61:~$ python multi_joinablequeue.py 
Sending 10000 numbers to JoinableQueue() took 0.172781944275 seconds
Sending 100000 numbers to JoinableQueue() took 1.5714070797 seconds
Sending 1000000 numbers to JoinableQueue() took 15.8527247906 seconds
mpenning@mpenning-T61:~$

In summary, a Pipe() is about three times faster than a Queue(). Don't even think about the JoinableQueue() unless you really must have the benefits.

Bonus material 2

Multiprocessing introduces subtle changes in information flow that make debugging hard unless you know some shortcuts. For instance, you might have a script that works fine when indexing through a dict under many conditions, but infrequently fails with certain inputs.

Normally we get clues about a failure when the entire python process crashes; however, unsolicited crash tracebacks don't get printed to the console when a multiprocessing function crashes. Tracking down unknown multiprocessing crashes is hard without a clue to what crashed the process.

The simplest way I have found to track down multiprocessing crash information is to wrap the entire multiprocessing function in a try / except and use traceback.print_exc():

import traceback
def run(self, args):
    try:
        # Insert stuff to be multiprocessed here
        return args[0]['that']
    except:
        print "FATAL: reader({0}) exited while multiprocessing".format(args) 
        traceback.print_exc()

Now, when you find a crash, you see something like:

FATAL: reader([{'crash': 'this'}]) exited while multiprocessing
Traceback (most recent call last):
  File "foo.py", line 19, in __init__
    self.run(args)
  File "foo.py", line 46, in run
    KeyError: 'that'

Source code:


"""
multi_pipe.py
"""
from multiprocessing import Process, Pipe
import time

def reader_proc(pipe):
    ## Read from the pipe; this will be spawned as a separate Process
    p_output, p_input = pipe
    p_input.close()    # We are only reading
    while True:
        msg = p_output.recv()    # Read from the output pipe and do nothing
        if msg=='DONE':
            break

def writer(count, p_input):
    for ii in xrange(0, count):
        p_input.send(ii)             # Write 'count' numbers into the input pipe
    p_input.send('DONE')

if __name__=='__main__':
    for count in [10**4, 10**5, 10**6]:
        # Each Pipe() has two connection endpoints; we use it one-way here:  p_input ------> p_output
        p_output, p_input = Pipe()  # writer() writes to p_input from _this_ process
        reader_p = Process(target=reader_proc, args=((p_output, p_input),))
        reader_p.daemon = True
        reader_p.start()     # Launch the reader process

        p_output.close()       # We no longer need this part of the Pipe()
        _start = time.time()
        writer(count, p_input) # Send a lot of stuff to reader_proc()
        p_input.close()
        reader_p.join()
        print("Sending {0} numbers to Pipe() took {1} seconds".format(count,
            (time.time() - _start)))

"""
multi_queue.py
"""

from multiprocessing import Process, Queue
import time
import sys

def reader_proc(queue):
    ## Read from the queue; this will be spawned as a separate Process
    while True:
        msg = queue.get()         # Read from the queue and do nothing
        if (msg == 'DONE'):
            break

def writer(count, queue):
    ## Write to the queue
    for ii in range(0, count):
        queue.put(ii)             # Write 'count' numbers into the queue
    queue.put('DONE')

if __name__=='__main__':
    pqueue = Queue() # writer() writes to pqueue from _this_ process
    for count in [10**4, 10**5, 10**6]:             
        ### reader_proc() reads from pqueue as a separate process
        reader_p = Process(target=reader_proc, args=(pqueue,))
        reader_p.daemon = True
        reader_p.start()        # Launch reader_proc() as a separate python process

        _start = time.time()
        writer(count, pqueue)    # Send a lot of stuff to reader()
        reader_p.join()         # Wait for the reader to finish
        print("Sending {0} numbers to Queue() took {1} seconds".format(count, 
            (time.time() - _start)))

"""
multi_joinablequeue.py
"""
from multiprocessing import Process, JoinableQueue
import time

def reader_proc(queue):
    ## Read from the queue; this will be spawned as a separate Process
    while True:
        msg = queue.get()         # Read from the queue and do nothing
        queue.task_done()

def writer(count, queue):
    for ii in xrange(0, count):
        queue.put(ii)             # Write 'count' numbers into the queue

if __name__=='__main__':
    for count in [10**4, 10**5, 10**6]:
        jqueue = JoinableQueue() # writer() writes to jqueue from _this_ process
        # reader_proc() reads from jqueue as a different process...
        reader_p = Process(target=reader_proc, args=(jqueue,))
        reader_p.daemon = True
        reader_p.start()     # Launch the reader process
        _start = time.time()
        writer(count, jqueue) # Send a lot of stuff to reader_proc() (in different process)
        jqueue.join()         # Wait for the reader to finish
        print("Sending {0} numbers to JoinableQueue() took {1} seconds".format(count, 
            (time.time() - _start)))

2
@Jonathan "In summary, a Pipe() is about three times faster than a Queue()"
James Brady

13
Excellent! Good answer, and nice that you provided benchmarks! I only have two minor quibbles: (1) "orders of magnitude faster" is a bit of an overstatement. The difference is 3x, which is about one third of one order of magnitude. Just saying. ;-); and (2) a fairer comparison would be running N workers, each communicating with the main thread via a point-to-point pipe, compared with the performance of running N workers all pulling from a single point-to-multipoint queue.
JJC 2012

3
To your "Bonus material"... Yeah. If you're subclassing Process, put the bulk of the "run" method in a try block. It is also a useful way to log exceptions. To replicate the normal exception output: sys.stderr.write(''.join(traceback.format_exception(*sys.exc_info())))
travc

2
@alexpinho98: but you are going to need some out-of-band data, and an associated signalling mode, to indicate that what you are sending is not regular data but error data. Given that the originating process is already in an unpredictable state, this may be too much to ask.
scytale

10
@JJC To quibble with your quibble, 3x is about half an order of magnitude, not a third: sqrt(10) ≈ 3.
jab

1

One other feature of Queue() that is worth noting is the feeder thread. The documentation notes: "When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe." An infinite number of items (or maxsize items) can be inserted into Queue() without any calls to queue.put() blocking. This allows you to store multiple items in a Queue() until your program is ready to process them.

Pipe(), on the other hand, has finite storage for items that have been sent to one connection but not yet received from the other connection. After this storage is used up, calls to connection.send() will block until there is space to write the entire item. This stalls the thread doing the writing until some other thread reads from the pipe. Connection objects give you access to the underlying file descriptor. On *nix systems, you can prevent connection.send() calls from blocking using the os.set_blocking() function. However, this will cause problems if you try to send a single item that does not fit in the pipe's buffer. Recent versions of Linux allow you to increase the size of that buffer, but the maximum size allowed varies with system configuration. You should therefore never rely on Pipe() to buffer data: calls to connection.send could block until data gets read from the pipe somewhere else.

In conclusion, Queue is a better choice than a pipe when you need to buffer data, even when you only need to communicate between two points.

Licensed under cc by-sa 3.0 with attribution required.