Multiprocessing causes Python to crash with an error that something "may have been in progress in another thread when fork() was called"


81

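I am new to Python and am trying to use the multiprocessing module to parallelize my for loop.

I have an array of image URLs stored in img_urls. I need to download each image and run some Google Vision detection on it.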

if __name__ == '__main__':

    img_urls = [ALL_MY_Image_URLS]
    runAll(img_urls)
    print("--- %s seconds ---" % (time.time() - start_time)) 

Here is my runAll() method:

def runAll(img_urls):
    num_cores = multiprocessing.cpu_count()

    print("Image URLS  {}",len(img_urls))
    if len(img_urls) > 2:
        numberOfImages = 0
    else:
        numberOfImages = 1

    start_timeProcess = time.time()

    pool = multiprocessing.Pool()
    pool.map(annotate,img_urls)
    end_timeProcess = time.time()
    print('\n Time to complete ', end_timeProcess-start_timeProcess)

    print(full_matching_pages)


def annotate(img_path):
    file =  requests.get(img_path).content
    print("file is",file)
    """Returns web annotations given the path to an image."""
    print('Process Working under ',os.getpid())
    image = types.Image(content=file)
    web_detection = vision_client.web_detection(image=image).web_detection
    report(web_detection)

When I run it, Python crashes and I get the following warning:

objc[67570]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67570]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[67567]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67567]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[67568]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67568]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[67569]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67569]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[67571]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67571]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[67572]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[67572]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Are you on OSX? If so, this bug report might give you some hints.
IonicSolutions

Oh yes, I am on OSX. Thank you for pointing me to that link.
SriTeja Chilakamarri

Still no luck. I tried setting OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES as described there and I still get the same error. @IonicSolutions
SriTeja Chilakamarri

Unfortunately, I have no specific knowledge of this topic. All I can do is use Google to look for related issues, for example this possible workaround.
IonicSolutions

1
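This is due to Apple changing the macOS fork() behavior since High Sierra. The OBJC_DISABLE_INITIALIZE_FORK_SAFETY=yes variable turns off the immediate-crash behavior that the newer Objective-C framework now enforces by default. This can affect any language that does multithreading / multiprocessing using fork() on macOS >= 10.13, especially when "native extensions" / C code extensions are used.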
TrinitronX
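Since the comment above traces the crash to workers created with fork(), one commonly used alternative is to ask multiprocessing for the "spawn" start method, which starts each worker as a fresh interpreter instead of continuing to run a forked copy of the parent. The sketch below is not from the thread: `square` and `map_with_spawn` are hypothetical names used only for illustration, and whether this avoids the crash in a given program is an assumption to verify.

```python
import multiprocessing


def square(x):
    # Defined at module level so that spawned workers can import it.
    return x * x


def map_with_spawn(func, items, processes=2):
    """Map func over items using workers started with the 'spawn' method."""
    # get_context returns an object with the same API as the multiprocessing
    # module, but bound to the requested start method.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    # The guard is required with 'spawn': workers re-import this module,
    # and must not start pools of their own while doing so.
    print(map_with_spawn(square, [1, 2, 3]))
```

From Python 3.8 onwards, "spawn" is already the default start method on macOS, so this mainly matters on older interpreters or when "fork" has been selected explicitly.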

Answers:


203

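This error occurs because of safety checks added in Mac OS High Sierra that restrict forking in multithreaded processes. I know this answer is a bit late, but I solved the problem as follows:

Set an environment variable in .bash_profile to allow multithreaded applications or scripts under the new Mac OS High Sierra security rules.

Open a terminal: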

$ nano .bash_profile

Add the following line to the end of the file:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Save, exit, close the terminal, and reopen it. Check that the environment variable is now set:

$ env

You will see output similar to the following:

TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/pn/vasdlj3ojO#OOas4dasdffJq/T/
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.E7qLFJDSo/Render
TERM_PROGRAM_VERSION=404
TERM_SESSION_ID=NONE
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

You should now be able to run your Python script with multithreading.
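As a complement to inspecting the output of `env` by eye, the same check can be done from inside Python. This is a minimal sketch: `fork_safety_disabled` is a hypothetical helper name, and it only reports whether the variable is visible to the current process with the exact value used above.

```python
import os

FORK_SAFETY_VAR = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def fork_safety_disabled(env=None):
    """Return True if the variable is set to YES in the given environment."""
    # Default to the environment of the running process.
    env = os.environ if env is None else env
    return env.get(FORK_SAFETY_VAR) == "YES"


if __name__ == "__main__":
    print(FORK_SAFETY_VAR, "set to YES:", fork_safety_disabled())
```

If this prints False in a freshly opened terminal, the export line was probably not picked up by the shell that launched Python.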


10
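This actually fixed it for me. I wanted to iterate over a large pandas DataFrame across multiple threads and ran into the same problem the OP described. The only difference is that I set the env variable on the command that runs the script: OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python my-script.py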
rodrigo-silveira
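The per-command form in the comment above can also be reproduced from a Python launcher, by passing a modified environment to the child process. This is only a sketch: `run_with_fork_safety_disabled` is a hypothetical helper, and the child command here is a stand-in for `python my-script.py` that simply echoes the variable.

```python
import os
import subprocess
import sys


def run_with_fork_safety_disabled(args):
    """Run a command with OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES set for it only."""
    # Copy the current environment and add the variable for the child process.
    env = dict(os.environ, OBJC_DISABLE_INITIALIZE_FORK_SAFETY="YES")
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in for `python my-script.py`: the child prints the variable it sees.
    child_code = "import os; print(os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'])"
    result = run_with_fork_safety_disabled([sys.executable, "-c", child_code])
    print(result.stdout.strip())
```

The variable is set only for the launched process, so the rest of the shell session is left unchanged.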

3
Thank you very much! For those interested, this worked for me on macOS Mojave.
nmetts

This solved my problem, but my script was using multiprocessing.
lollerskates

It works on my machine with macOS Mojave, but then my pytest tests no longer run in parallel. Before, it crashed, but at least it was fast...
onekiloparsec

1
This environment variable solved my problem running ansible locally on a Mac (Catalina).
itsjwala

0

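A solution that works for me without the OBJC_DISABLE_INITIALIZE_FORK_SAFETY flag in the environment is to initialize the multiprocessing.Pool class right after the main() program starts.

This is most likely not the fastest solution, and I am not sure whether it works in all situations. However, warming up the worker processes early enough, before the program does anything else, does not result in any "... may have been in progress in another thread when fork() was called" errors, and I do get a significant performance boost compared with the non-parallelized code.

I have created a convenience class, Parallelizer, which I start very early and then use throughout the lifetime of my program.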

# entry point to my program
def main():
    parallelizer = Parallelizer()
    ...

Then, whenever you want to parallelize something:

# this function is parallelized. it is run by each child process.
def processing_function(input):
    ...
    return output

...
inputs = [...]
results = parallelizer.map(
    inputs,
    processing_function
)

And the Parallelizer class:

import multiprocessing


class Parallelizer:
    def __init__(self):
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(multiprocessing.cpu_count(),
                                         Parallelizer._run,
                                         (self.input_queue, self.output_queue,))

    def map(self, contents, processing_func):
        size = 0
        for content in contents:
            self.input_queue.put((content, processing_func))
            size += 1
        results = []
        while size > 0:
            result = self.output_queue.get(block=True)
            results.append(result)
            size -= 1
        return results

    @staticmethod
    def _run(input_queue, output_queue):
        while True:
            content, processing_func = input_queue.get(block=True)
            result = processing_func(content)
            output_queue.put(result)

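One caveat: parallelized code can be difficult to debug, so I have also prepared a non-parallelized version of my class, which can be enabled when something goes wrong in the child processes: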

class NullParallelizer:
    @staticmethod
    def map(contents, processing_func):
        results = []
        for content in contents:
            results.append(processing_func(content))
        return results
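One possible way to switch between the two classes is sketched below. It is an illustration under stated assumptions, not part of the answer: the `DEBUG_SERIAL` variable and the `choose_parallelizer` helper are hypothetical, and the parallel class is passed in as a factory so that no worker processes are started in debug mode. Note that `map` here takes the contents first and the function second, the reverse of the built-in `map`.

```python
import os


class NullParallelizer:
    """Serial stand-in with the same map(contents, processing_func) signature."""

    @staticmethod
    def map(contents, processing_func):
        return [processing_func(content) for content in contents]


def choose_parallelizer(parallel_factory, env=None):
    """Return a serial mapper when DEBUG_SERIAL=1, else call parallel_factory()."""
    # DEBUG_SERIAL is a hypothetical switch chosen for this sketch.
    env = os.environ if env is None else env
    if env.get("DEBUG_SERIAL") == "1":
        return NullParallelizer()
    return parallel_factory()


if __name__ == "__main__":
    # In the real program the factory would be the Parallelizer class itself.
    mapper = choose_parallelizer(NullParallelizer, env={"DEBUG_SERIAL": "1"})
    print(mapper.map([1, 2, 3], lambda x: x * 2))
```

Because both classes expose the same `map` signature, the rest of the program does not need to know which one it received.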