Python multiprocessing and a shared counter


72

I'm having trouble with the multiprocessing module. I'm using a pool of workers with its map method to load data from a large number of files, and for each file I analyze the data with a custom function. Each time a file is processed I would like to update a counter so that I can keep track of how many files remain to be processed. Here is sample code:

import os
from multiprocessing import Pool

def analyze_data(args):
    # do something with the file's data
    counter += 1
    print(counter)


if __name__ == '__main__':

    list_of_files = os.listdir(some_directory)

    global counter
    counter = 0

    p = Pool()
    p.map(analyze_data, list_of_files)

I can't find a solution for this.

Answers:


79

The problem is that the counter variable is not shared between your processes: each separate process is creating its own local instance and incrementing that.

See this section of the documentation for some techniques you can employ to share state between your processes. In your case you probably want to share a Value instance between your workers.

Here is a working version of your example (with some dummy input data). Note that it uses global values, which I would really try to avoid in practice:

from multiprocessing import Pool, Value

counter = None

def init(args):
    ''' store the counter for later use '''
    global counter
    counter = args

def analyze_data(args):
    ''' increment the global counter, do something with the input '''
    global counter
    # += operation is not atomic, so we need to get a lock:
    with counter.get_lock():
        counter.value += 1
    print(counter.value)
    return args * 10

if __name__ == '__main__':
    #inputs = os.listdir(some_directory)

    #
    # initialize a cross-process counter and the input lists
    #
    counter = Value('i', 0)
    inputs = [1, 2, 3, 4]

    #
    # create the pool of workers, ensuring each one receives the counter
    # as it starts.
    #
    p = Pool(initializer=init, initargs=(counter,))
    i = p.map_async(analyze_data, inputs, chunksize=1)
    i.wait()
    print(i.get())

3
@jkp, what about doing it without the global variable? I'm trying to use a class, but it's not as easy as it seems. See stackoverflow.com/questions/1816958/… –
Anna

26
Unfortunately, this example appears to be flawed, since counter.value += 1 is not atomic between processes, so the value will be wrong if you run it long enough with a few processes –
Eli Bendersky

2
In line with what Eli said, a Lock must surround the counter.value += 1 statement. See stackoverflow.com/questions/1233222/… –
Acumenus

3
Note that it should not be with counter.get_lock() but with counter.value.get_lock(): –
Jingjing Shi

1
Contrary to what @Jingjing Shi said, counter.value.get_lock() will produce AttributeError: 'int' object has no attribute 'get_lock'; @jkp's counter.get_lock() is correct –
Samuel
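
To see why the lock matters here, a minimal sketch (the worker names are made up for illustration) that increments without taking the lock; with several processes it will usually lose updates and print less than the expected total:

from multiprocessing import Pool, Value

counter = None

def init(shared):
    global counter
    counter = shared

def bump_unlocked(_):
    # no counter.get_lock() here: the read-modify-write of .value
    # can interleave between processes, so increments get lost
    counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)
    with Pool(4, initializer=init, initargs=(counter,)) as p:
        p.map(bump_unlocked, range(100000))
    print(counter.value)  # typically < 100000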

40

A Counter class without the race-condition bug:

import multiprocessing

class Counter(object):
    def __init__(self):
        self.val = multiprocessing.Value('i', 0)

    def increment(self, n=1):
        # the Value's built-in lock makes the read-modify-write atomic
        with self.val.get_lock():
            self.val.value += n

    @property
    def value(self):
        return self.val.value
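
For what it's worth, a usage sketch (my own addition; it assumes a fork start method such as the Linux default, so the module-level counter is inherited by the worker processes; on spawn-based platforms pass it through a Pool initializer as in jkp's answer):

from multiprocessing import Pool

counter = Counter()

def worker(x):
    counter.increment()
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(worker, range(100))
    print(counter.value)  # 100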

For similar code that works with joblib's Parallel (the code in this answer does not work with joblib), see github.com/davidheryanto/etc/blob/master/python-recipes/… –
Dan Nissenbaum

1
I would also add return self to the increment function to enable chaining –
Boris Gorelik

10

A very simple example, changed from jkp's answer:

from multiprocessing import Pool, Value
from time import sleep

counter = Value('i', 0)  # inherited by the workers when the pool forks

def f(x):
    global counter
    with counter.get_lock():
        counter.value += 1
    print("counter.value:", counter.value)
    sleep(1)
    return x

if __name__ == '__main__':
    with Pool(4) as p:
        r = p.map(f, range(1000*1000))

4

A faster Counter class that avoids taking Value's built-in lock twice (a synchronized Value re-acquires its lock on every .value read and write, so RawValue plus a separate Lock does less locking):

import multiprocessing

class Counter(object):
    def __init__(self, initval=0):
        # RawValue has no built-in lock, so we guard it with our own
        self.val = multiprocessing.RawValue('i', initval)
        self.lock = multiprocessing.Lock()

    def increment(self):
        with self.lock:
            self.val.value += 1

    @property
    def value(self):
        return self.val.value

https://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.sharedctypes.Value
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.sharedctypes.RawValue
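
A rough way to check the speed claim, as a sketch of my own using timeit (it measures single-process lock overhead only, and the numbers will vary by platform):

import multiprocessing
import timeit

locked = multiprocessing.Value('i', 0)  # synchronized, built-in RLock
raw = multiprocessing.RawValue('i', 0)  # unsynchronized
lock = multiprocessing.Lock()

def bump_value():
    # takes the built-in lock, then .value re-acquires it on get and set
    with locked.get_lock():
        locked.value += 1

def bump_rawvalue():
    # one plain Lock around an unsynchronized read-modify-write
    with lock:
        raw.value += 1

if __name__ == '__main__':
    print('Value   :', timeit.timeit(bump_value, number=100000))
    print('RawValue:', timeit.timeit(bump_rawvalue, number=100000))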


0

I'm working on a progress bar in PyQt5, so I use a thread and a pool together:

import threading
import multiprocessing as mp
from queue import Queue

def multi(x):
    return x*x

def pooler(q):
    # run the pool in a background thread and report progress via the queue
    with mp.Pool() as pool:
        count = 0
        for i in pool.imap_unordered(multi, range(100)):
            print(count, i)
            count += 1
            q.put(count)

def main():
    q = Queue()
    t = threading.Thread(target=pooler, args=(q,))
    t.start()
    print('start')
    process = 0
    while process < 100:
        process = q.get()
        print('p', process)

if __name__ == '__main__':
    main()

I put this in a QThread worker and it works with acceptable latency.

Licensed under cc by-sa 3.0 with attribution required.