The proper way to create dynamic workflows in Airflow


96

Question

Is there any way in Airflow to create a workflow such that the number of tasks B.* is unknown until task A has completed? I have looked at subdags, but it looks like they can only work with a static set of tasks that has to be determined at DAG creation time.

Would DAG triggers work? If so, please provide an example.

I have a problem where it is impossible to know the number of tasks B that will be needed to compute task C until task A has completed. Each task B.* will take several hours to compute and cannot be combined.

              |---> Task B.1 --|
              |---> Task B.2 --|
 Task A ------|---> Task B.3 --|-----> Task C
              |       ....     |
              |---> Task B.N --|

Idea #1

I don't like this solution because I would have to create a blocking ExternalTaskSensor, and all the tasks B.* would take between 2 and 24 hours to complete. So I don't consider it a viable solution. Surely there is an easier way? Or was Airflow not designed for this?

Dag 1
Task A -> TriggerDagRunOperator(Dag 2) -> ExternalTaskSensor(Dag 2, Task Dummy B) -> Task C

Dag 2 (Dynamically created DAG through python_callable in TriggerDagRunOperator)
               |-- Task B.1 --|
               |-- Task B.2 --|
Task Dummy A --|-- Task B.3 --|-----> Task Dummy B
               |     ....     |
               |-- Task B.N --|

Edit 1:

So far this question still doesn't have a great answer. I have been contacted by several people looking for a solution.


Are all the tasks B.* similar enough that they can be created in a loop?
Daniel Lee

Yes, all the B.* tasks can be created quickly in a loop once task A has completed. Task A itself takes about 2 hours to complete.
costrouc

Did you find a solution to this problem? Would you mind posting it?
Daniel Dubovski

A useful resource for Idea #1: linkedin.com/pulse/...
Juan Riaza

Here is an article I wrote explaining how to do this: linkedin.com/pulse/dynamic-workflows-airflow-kyle-bridenstine
Kyle Bridenstine

Answers:


30

Here is how I did it for a similar requirement without any subdags:

First create a method that returns the required values:

def values_function():
     return values

Next, create the method that will generate the jobs dynamically:

def group(number, **kwargs):
        #load the values if needed in the command you plan to execute
        dyn_value = "{{ task_instance.xcom_pull(task_ids='push_func') }}"
        return BashOperator(
                task_id='JOB_NAME_{}'.format(number),
                bash_command='script.sh {} {}'.format(dyn_value, number),
                dag=dag)

Then combine them:

push_func = PythonOperator(
        task_id='push_func',
        provide_context=True,
        python_callable=values_function,
        dag=dag)

complete = DummyOperator(
        task_id='All_jobs_completed',
        dag=dag)

for i in values_function():
        push_func >> group(i) >> complete
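
The snippets above leave out the imports and the DAG object itself. A minimal sketch of that scaffolding might look like the following; the dag_id, schedule and start_date are assumptions, not part of the original answer:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

# Illustrative DAG definition that the snippets above attach their tasks to
dag = DAG('dynamic_bash_jobs',
          schedule_interval=None,
          start_date=datetime(2018, 1, 1))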

Where are the values defined?
Monk

Instead of for i in values_function() I would expect something like for i in push_func_output. The problem is that I cannot find a way to get that output dynamically. The output of the PythonOperator will be in XCom after execution, but I don't know whether it can be referenced from the DAG definition.
Ena

@Ena Did you find a way to achieve this?
eldos

@eldos See my answer below.
Ena

What if we have to perform a series of dependent steps inside the loop? Would there be a second dependency chain inside the group function?
CodingInCircles

12

I have figured out a way to create workflows based on the result of previous tasks.
Basically what you want to do is have two subdags with the following:

  1. Xcom push a list (or whatever you need to create the dynamic workflow later) in the subdag that gets executed first (see test1.py, def return_list())
  2. Pass the main dag object as a parameter to your second subdag
  3. Now if you have the main dag object, you can use it to get a list of its task instances. From that list of task instances, you can filter out a task of the current run with parent_dag.get_task_instances(settings.Session, start_date=parent_dag.get_active_runs()[-1])[-1]; one could probably add more filters here
  4. With that task instance, you can use xcom pull to get the value you need by specifying the dag_id of the first subdag: dag_id='%s.%s' % (parent_dag_name, 'test1')
  5. Use the list/value to create your tasks dynamically

I have tested this in my local Airflow installation and it works fine. I don't know whether the xcom pull part will have any problems if more than one instance of the dag is running at the same time, but then you would probably either use a unique key or something similar to uniquely identify the xcom value you want. One could probably optimise step 3 to be 100% sure of getting a specific task of the current main dag, but for my usage this performs well enough; I think one only needs one task_instance object to use xcom_pull.

Also, I clean the xcoms for the first subdag before every execution, just to make sure that I don't accidentally get any wrong value.

I am pretty bad at explaining, so I hope the following code makes everything clear:

test1.py

from airflow.models import DAG
import logging
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator

log = logging.getLogger(__name__)


def test1(parent_dag_name, start_date, schedule_interval):
    dag = DAG(
        '%s.test1' % parent_dag_name,
        schedule_interval=schedule_interval,
        start_date=start_date,
    )

    def return_list():
        return ['test1', 'test2']

    list_extract_folder = PythonOperator(
        task_id='list',
        dag=dag,
        python_callable=return_list
    )

    clean_xcoms = PostgresOperator(
        task_id='clean_xcoms',
        postgres_conn_id='airflow_db',
        sql="delete from xcom where dag_id='{{ dag.dag_id }}'",
        dag=dag)

    clean_xcoms >> list_extract_folder

    return dag

test2.py

from airflow.models import DAG, settings
import logging
from airflow.operators.dummy_operator import DummyOperator

log = logging.getLogger(__name__)


def test2(parent_dag_name, start_date, schedule_interval, parent_dag=None):
    dag = DAG(
        '%s.test2' % parent_dag_name,
        schedule_interval=schedule_interval,
        start_date=start_date
    )

    if len(parent_dag.get_active_runs()) > 0:
        test_list = parent_dag.get_task_instances(settings.Session, start_date=parent_dag.get_active_runs()[-1])[-1].xcom_pull(
            dag_id='%s.%s' % (parent_dag_name, 'test1'),
            task_ids='list')
        if test_list:
            for i in test_list:
                test = DummyOperator(
                    task_id=i,
                    dag=dag
                )

    return dag

And the main workflow:

test.py

from datetime import datetime
from airflow import DAG
from airflow.operators.subdag_operator import SubDagOperator
from subdags.test1 import test1
from subdags.test2 import test2

DAG_NAME = 'test-dag'

dag = DAG(DAG_NAME,
          description='Test workflow',
          catchup=False,
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 8, 24))

test1 = SubDagOperator(
    subdag=test1(DAG_NAME,
                 dag.start_date,
                 dag.schedule_interval),
    task_id='test1',
    dag=dag
)

test2 = SubDagOperator(
    subdag=test2(DAG_NAME,
                 dag.start_date,
                 dag.schedule_interval,
                 parent_dag=dag),
    task_id='test2',
    dag=dag
)

test1 >> test2

On Airflow 1.9 these did not load when added to the DAG folder; am I missing something?
Anthony Keane

@AnthonyKeane Did you put test1.py and test2.py into a folder called subdags in your dag folder?
Christopher Beck

I did, yes. I copied both files to subdags and put test.py in the dag folder, and I still get this error: Broken DAG: [/home/airflow/gcs/dags/test.py] No module named subdags.test1. Note that I am using Google Cloud Composer (Google-managed Airflow 1.9.0).
Anthony Keane

@AnthonyKeane Is this the only error you see in the logs? A broken DAG could be caused by the subdag having a compilation error.
Christopher Beck

Hi @Christopher Beck, I found my mistake: I needed to add _ _init_ _.py to the subdags folder. Rookie error.
Anthony Keane

8

Yes, this is possible; I have created an example DAG that demonstrates it.

import airflow
from airflow.operators.python_operator import PythonOperator
import os
from airflow.models import Variable
import logging
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
from airflow import DAG, settings
from airflow.operators.bash_operator import BashOperator

main_dag_id = 'DynamicWorkflow2'

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True
}

dag = DAG(
    main_dag_id,
    schedule_interval="@once",
    default_args=args)


def start(*args, **kwargs):

    value = Variable.get("DynamicWorkflow_Group1")
    logging.info("Current DynamicWorkflow_Group1 value is " + str(value))


def resetTasksStatus(task_id, execution_date):
    logging.info("Resetting: " + task_id + " " + execution_date)

    dag_folder = conf.get('core', 'DAGS_FOLDER')
    dagbag = DagBag(dag_folder)
    check_dag = dagbag.dags[main_dag_id]
    session = settings.Session()

    my_task = check_dag.get_task(task_id)
    ti = TaskInstance(my_task, execution_date)
    state = ti.current_state()
    logging.info("Current state of " + task_id + " is " + str(state))
    ti.set_state(None, session)
    state = ti.current_state()
    logging.info("Updated state of " + task_id + " is " + str(state))


def bridge1(*args, **kwargs):

    # You can set this value dynamically e.g., from a database or a calculation
    dynamicValue = 2

    variableValue = Variable.get("DynamicWorkflow_Group2")
    logging.info("Current DynamicWorkflow_Group2 value is " + str(variableValue))

    logging.info("Setting the Airflow Variable DynamicWorkflow_Group2 to " + str(dynamicValue))
    os.system('airflow variables --set DynamicWorkflow_Group2 ' + str(dynamicValue))

    variableValue = Variable.get("DynamicWorkflow_Group2")
    logging.info("Current DynamicWorkflow_Group2 value is " + str(variableValue))

    # Below code prevents this bug: https://issues.apache.org/jira/browse/AIRFLOW-1460
    for i in range(dynamicValue):
        resetTasksStatus('secondGroup_' + str(i), str(kwargs['execution_date']))


def bridge2(*args, **kwargs):

    # You can set this value dynamically e.g., from a database or a calculation
    dynamicValue = 3

    variableValue = Variable.get("DynamicWorkflow_Group3")
    logging.info("Current DynamicWorkflow_Group3 value is " + str(variableValue))

    logging.info("Setting the Airflow Variable DynamicWorkflow_Group3 to " + str(dynamicValue))
    os.system('airflow variables --set DynamicWorkflow_Group3 ' + str(dynamicValue))

    variableValue = Variable.get("DynamicWorkflow_Group3")
    logging.info("Current DynamicWorkflow_Group3 value is " + str(variableValue))

    # Below code prevents this bug: https://issues.apache.org/jira/browse/AIRFLOW-1460
    for i in range(dynamicValue):
        resetTasksStatus('thirdGroup_' + str(i), str(kwargs['execution_date']))


def end(*args, **kwargs):
    logging.info("Ending")


def doSomeWork(name, index, *args, **kwargs):
    # Do whatever work you need to do
    # Here I will just create a new file
    os.system('touch /home/ec2-user/airflow/' + str(name) + str(index) + '.txt')


starting_task = PythonOperator(
    task_id='start',
    dag=dag,
    provide_context=True,
    python_callable=start,
    op_args=[])

# Used to connect the stream in the event that the range is zero
bridge1_task = PythonOperator(
    task_id='bridge1',
    dag=dag,
    provide_context=True,
    python_callable=bridge1,
    op_args=[])

DynamicWorkflow_Group1 = Variable.get("DynamicWorkflow_Group1")
logging.info("The current DynamicWorkflow_Group1 value is " + str(DynamicWorkflow_Group1))

for index in range(int(DynamicWorkflow_Group1)):
    dynamicTask = PythonOperator(
        task_id='firstGroup_' + str(index),
        dag=dag,
        provide_context=True,
        python_callable=doSomeWork,
        op_args=['firstGroup', index])

    starting_task.set_downstream(dynamicTask)
    dynamicTask.set_downstream(bridge1_task)

# Used to connect the stream in the event that the range is zero
bridge2_task = PythonOperator(
    task_id='bridge2',
    dag=dag,
    provide_context=True,
    python_callable=bridge2,
    op_args=[])

DynamicWorkflow_Group2 = Variable.get("DynamicWorkflow_Group2")
logging.info("The current DynamicWorkflow value is " + str(DynamicWorkflow_Group2))

for index in range(int(DynamicWorkflow_Group2)):
    dynamicTask = PythonOperator(
        task_id='secondGroup_' + str(index),
        dag=dag,
        provide_context=True,
        python_callable=doSomeWork,
        op_args=['secondGroup', index])

    bridge1_task.set_downstream(dynamicTask)
    dynamicTask.set_downstream(bridge2_task)

ending_task = PythonOperator(
    task_id='end',
    dag=dag,
    provide_context=True,
    python_callable=end,
    op_args=[])

DynamicWorkflow_Group3 = Variable.get("DynamicWorkflow_Group3")
logging.info("The current DynamicWorkflow value is " + str(DynamicWorkflow_Group3))

for index in range(int(DynamicWorkflow_Group3)):

    # You can make this logic anything you'd like
    # I chose to use the PythonOperator for all tasks
    # except the last task will use the BashOperator
    if index < (int(DynamicWorkflow_Group3) - 1):
        dynamicTask = PythonOperator(
            task_id='thirdGroup_' + str(index),
            dag=dag,
            provide_context=True,
            python_callable=doSomeWork,
            op_args=['thirdGroup', index])
    else:
        dynamicTask = BashOperator(
            task_id='thirdGroup_' + str(index),
            bash_command='touch /home/ec2-user/airflow/thirdGroup_' + str(index) + '.txt',
            dag=dag)

    bridge2_task.set_downstream(dynamicTask)
    dynamicTask.set_downstream(ending_task)

# If you do not connect these then in the event that your range is ever zero you will have a disconnection between your stream
# and your tasks will run simultaneously instead of in your desired stream order.
starting_task.set_downstream(bridge1_task)
bridge1_task.set_downstream(bridge2_task)
bridge2_task.set_downstream(ending_task)

Before running the DAG, create these three Airflow Variables:

airflow variables --set DynamicWorkflow_Group1 1

airflow variables --set DynamicWorkflow_Group2 0

airflow variables --set DynamicWorkflow_Group3 0

You will see that the DAG goes from this:

[screenshot: the DAG before it has run]

to this after it has run:

[screenshot: the DAG after it has run, with the dynamically created tasks]

You can see more information about this DAG in my article on creating dynamic workflows on Airflow.


But what happens if you have a DAG with multiple DagRuns? Do they all share the same Variables?

Yes, they would use the same Variable; I address this in my article at the very end. You would need to create the Variables dynamically and use the dag run id in the variable name. My example is simple just to demonstrate the dynamic possibilities, but you will need to make it production quality :)
Kyle Bridenstine

Are the bridges needed when creating dynamic tasks? I will read your article fully in a moment, but wanted to ask. I am struggling with creating a dynamic task based on an upstream task right now, and am trying to figure out where I've gone wrong. My current issue is that for some reason I cannot get the DAG to sync into the DagBag. My DAG synced when I used a static list in the module, but stopped when I switched that static list to be built from an upstream task.
lucid_goose

6

OP: "Is there any way in Airflow to create a workflow such that the number of tasks B.* is unknown until completion of Task A?"

The short answer is no. Airflow builds the DAG flow before it starts running it.

That said, we came to a simple conclusion: we don't have that need. When you want to parallelise some work, you should evaluate the resources you have available, not the number of items to process.

We did it like this: we dynamically generate a fixed number of tasks (say, 10) that split the job between them. For example, if we need to process 100 files, each task will process 10 of them. I will post the code later today.

Update

Here is the code, sorry for the delay.

from datetime import datetime, timedelta

import airflow
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 8),
    'email': ['myemail@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = airflow.DAG(
    'parallel_tasks_v1',
    schedule_interval="@daily",
    catchup=False,
    default_args=args)

# You can read this from variables
parallel_tasks_total_number = 10

start_task = DummyOperator(
    task_id='start_task',
    dag=dag
)


# Creates the tasks dynamically.
# Each one will elaborate one chunk of data.
def create_dynamic_task(current_task_number):
    return PythonOperator(
        provide_context=True,
        task_id='parallel_task_' + str(current_task_number),
        # parallelTask is your own worker function (not shown here)
        python_callable=parallelTask,
        # your task will take as input the current number and the total number
        # of tasks in order to elaborate a chunk of the total elements
        op_args=[current_task_number, int(parallel_tasks_total_number)],
        dag=dag)


end = DummyOperator(
    task_id='end',
    dag=dag)

for page in range(int(parallel_tasks_total_number)):
    created_task = create_dynamic_task(page)
    start_task >> created_task
    created_task >> end

Code explanation:

Here we have a start task and an end task (both dummy operators).

Then from the start task, we create 10 tasks in a for loop, all using the same python callable. The tasks are created in the function create_dynamic_task.

To each python callable we pass as arguments the total number of parallel tasks and the current task index.

Suppose you have 1000 items to process: the first task will receive as input that it should process the first chunk out of 10. It will divide the 1000 items into 10 chunks and process the first one.
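
The parallelTask callable itself is not shown in the answer. A minimal sketch of what it could look like, assuming the chunking scheme described above (the item list and process_item() are placeholders, not part of the original answer):

def parallelTask(current_task_number, parallel_tasks_total_number, **context):
    items = list(range(1000))  # placeholder for the real list of items to elaborate
    chunk_size = len(items) // parallel_tasks_total_number

    start_index = current_task_number * chunk_size
    # the last task also picks up any remainder
    if current_task_number == parallel_tasks_total_number - 1:
        end_index = len(items)
    else:
        end_index = start_index + chunk_size

    for item in items[start_index:end_index]:
        process_item(item)  # placeholder for the per-item work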


This is a nice solution as long as you don't need anything specific per task (like progress, result, success/failure, retries, etc.).
Alonzzo2

@Ena parallelTask is not defined: am I missing something?
Anthony Keane

@AnthonyKeane It is the python function you should call in order to actually do something. As commented in the code, it takes as input the current index and the total number of tasks in order to elaborate a chunk of the total elements.
Ena

4

I think what you are looking for is creating DAGs dynamically. I came across this kind of situation a few days ago, and after some searching I found this blog.

Dynamic Task Generation

start = DummyOperator(
    task_id='start',
    dag=dag
)

end = DummyOperator(
    task_id='end',
    dag=dag)

def createDynamicETL(task_id, callableFunction, args):
    task = PythonOperator(
        task_id = task_id,
        provide_context=True,
        #Eval is used since the callableFunction var is of type string
        #while the python_callable argument for PythonOperators only receives objects of type callable not strings.
        python_callable = eval(callableFunction),
        op_kwargs = args,
        xcom_push = True,
        dag = dag,
    )
    return task

Setting up the DAG workflow

with open('/usr/local/airflow/dags/config_files/dynamicDagConfigFile.yaml') as f:
    # Use safe_load instead to load the YAML file
    configFile = yaml.safe_load(f)

    # Extract table names and fields to be processed
    tables = configFile['tables']

    # In this loop tasks are created for each table defined in the YAML file
    for table in tables:
        for table, fieldName in table.items():
            # In our example, first step in the workflow for each table is to get SQL data from db.
            # Remember task id is provided in order to exchange data among tasks generated in dynamic way.
            get_sql_data_task = createDynamicETL('{}-getSQLData'.format(table),
                                                 'getSQLData',
                                                 {'host': 'host', 'user': 'user', 'port': 'port', 'password': 'pass',
                                                  'dbname': configFile['dbname']})

            # Second step is upload data to s3
            upload_to_s3_task = createDynamicETL('{}-uploadDataToS3'.format(table),
                                                 'uploadDataToS3',
                                                 {'previous_task_id': '{}-getSQLData'.format(table),
                                                  'bucket_name': configFile['bucket_name'],
                                                  'prefix': configFile['prefix']})

            # This is where the magic lies. The idea is that
            # once tasks are generated they should linked with the
            # dummy operators generated in the start and end tasks. 
            # Then you are done!
            start >> get_sql_data_task
            get_sql_data_task >> upload_to_s3_task
            upload_to_s3_task >> end

This is what the DAG looks like after putting the code together:

[screenshot of the resulting DAG]

import yaml
import airflow
from airflow import DAG
from datetime import datetime, timedelta, time
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

# A DAG object is needed before operators can reference it
# (the dag_id, schedule_interval and start_date below are illustrative)
dag = DAG('dynamic_yaml_dag',
          schedule_interval=None,
          start_date=datetime(2018, 1, 1))

start = DummyOperator(
    task_id='start',
    dag=dag
)


def createDynamicETL(task_id, callableFunction, args):
    task = PythonOperator(
        task_id=task_id,
        provide_context=True,
        # Eval is used since the callableFunction var is of type string
        # while the python_callable argument for PythonOperators only receives objects of type callable not strings.
        python_callable=eval(callableFunction),
        op_kwargs=args,
        xcom_push=True,
        dag=dag,
    )
    return task


end = DummyOperator(
    task_id='end',
    dag=dag)

with open('/usr/local/airflow/dags/config_files/dynamicDagConfigFile.yaml') as f:
    # use safe_load instead to load the YAML file
    configFile = yaml.safe_load(f)

    # Extract table names and fields to be processed
    tables = configFile['tables']

    # In this loop tasks are created for each table defined in the YAML file
    for table in tables:
        for table, fieldName in table.items():
            # In our example, first step in the workflow for each table is to get SQL data from db.
            # Remember task id is provided in order to exchange data among tasks generated in dynamic way.
            get_sql_data_task = createDynamicETL('{}-getSQLData'.format(table),
                                                 'getSQLData',
                                                 {'host': 'host', 'user': 'user', 'port': 'port', 'password': 'pass',
                                                  'dbname': configFile['dbname']})

            # Second step is upload data to s3
            upload_to_s3_task = createDynamicETL('{}-uploadDataToS3'.format(table),
                                                 'uploadDataToS3',
                                                 {'previous_task_id': '{}-getSQLData'.format(table),
                                                  'bucket_name': configFile['bucket_name'],
                                                  'prefix': configFile['prefix']})

            # This is where the magic lies. The idea is that
            # once tasks are generated they should linked with the
            # dummy operators generated in the start and end tasks. 
            # Then you are done!
            start >> get_sql_data_task
            get_sql_data_task >> upload_to_s3_task
            upload_to_s3_task >> end
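
For reference, the structure of dynamicDagConfigFile.yaml is not shown in the answer. Based on the keys the code reads (tables, dbname, bucket_name, prefix), the loaded configFile would presumably be a dict along these lines; the table and field names here are made up:

# Hypothetical example of what yaml.safe_load() would return for the config file
configFile = {
    'dbname': 'my_database',
    'bucket_name': 'my-s3-bucket',
    'prefix': 'exports/',
    # each entry maps a table name to the fields to be processed
    'tables': [
        {'table_one': ['id', 'name', 'created_at']},
        {'table_two': ['id', 'amount']},
    ],
}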

Hopefully this helps, and will help someone else as well.


Did you implement this yourself? I am tired. But I failed.
Newt

Yes, it worked for me. What issue are you facing?
Mohamed Ben Ali

I figured it out. My issue has been resolved. Thanks. I just wasn't reading environment variables in the Docker image the right way.
Newt

What if the table entries may change, so we cannot put them in a static yaml file?
FrankZhu

3

I think I have found a nicer solution at https://github.com/mastak/airflow_multi_dagrun, which uses simple enqueuing of DagRuns by triggering multiple dagruns, similar to TriggerDagRuns. Most of the credit goes to https://github.com/mastak, although I had to patch some details to make it work with the latest Airflow.

The solution uses a custom operator that triggers multiple DagRuns:

from airflow import settings
from airflow.models import DagBag
from airflow.operators.dagrun_operator import DagRunOrder, TriggerDagRunOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.state import State
from airflow.utils import timezone


class TriggerMultiDagRunOperator(TriggerDagRunOperator):
    CREATED_DAGRUN_KEY = 'created_dagrun_key'

    @apply_defaults
    def __init__(self, op_args=None, op_kwargs=None,
                 *args, **kwargs):
        super(TriggerMultiDagRunOperator, self).__init__(*args, **kwargs)
        self.op_args = op_args or []
        self.op_kwargs = op_kwargs or {}

    def execute(self, context):

        context.update(self.op_kwargs)
        session = settings.Session()
        created_dr_ids = []
        for dro in self.python_callable(*self.op_args, **context):
            if not dro:
                break
            if not isinstance(dro, DagRunOrder):
                dro = DagRunOrder(payload=dro)

            now = timezone.utcnow()
            if dro.run_id is None:
                dro.run_id = 'trig__' + now.isoformat()

            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                execution_date=now,
                state=State.RUNNING,
                conf=dro.payload,
                external_trigger=True,
            )
            created_dr_ids.append(dr.id)
            self.log.info("Created DagRun %s, %s", dr, now)

        if created_dr_ids:
            session.commit()
            context['ti'].xcom_push(self.CREATED_DAGRUN_KEY, created_dr_ids)
        else:
            self.log.info("No DagRun created")
        session.close()

You can then submit dagruns from the callable function in your PythonOperator, for example:

from airflow.operators.dagrun_operator import DagRunOrder
from airflow.models import DAG
from airflow.operators import TriggerMultiDagRunOperator
from airflow.utils.dates import days_ago


def generate_dag_run(**kwargs):
    for i in range(10):
        order = DagRunOrder(payload={'my_variable': i})
        yield order

args = {
    'start_date': days_ago(1),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='simple_trigger',
    max_active_runs=1,
    schedule_interval='@hourly',
    default_args=args,
)

gen_target_dag_run = TriggerMultiDagRunOperator(
    task_id='gen_target_dag_run',
    dag=dag,
    trigger_dag_id='common_target',
    python_callable=generate_dag_run
)

I created a fork with this code at https://github.com/flinz/airflow_multi_dagrun
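
The target DAG itself is not shown above. A minimal sketch of what common_target could look like, reading the my_variable payload from dag_run.conf (the task and function names here are illustrative assumptions):

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago


def process_payload(**context):
    # Each DagRun created by TriggerMultiDagRunOperator carries its own conf payload
    my_variable = context['dag_run'].conf.get('my_variable')
    print('Processing payload value: {}'.format(my_variable))


target_dag = DAG(
    dag_id='common_target',
    schedule_interval=None,  # only runs when triggered
    start_date=days_ago(1),
)

process = PythonOperator(
    task_id='process_payload',
    provide_context=True,
    python_callable=process_payload,
    dag=target_dag,
)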


3

The job graph is not generated at run time. Rather, the graph is built when it is picked up by Airflow from your dags folder. Therefore it isn't really possible to have a different graph for the job every time it runs. You can configure a job to build a graph based on a query at load time. That graph will then stay the same for every run after that, which is probably not very useful.

You can design a graph which executes different tasks on every run based on query results, by using a branch operator.

What I've done is to pre-configure a set of tasks and then take the query results and distribute them across those tasks. This is probably better anyway, because if your query returns a lot of results, you probably don't want to flood the scheduler with a lot of concurrent tasks. To be even safer, I also used a pool to ensure my concurrency doesn't get out of hand with an unexpectedly large query.

"""
 - This is an idea for how to invoke multiple tasks based on the query results
"""
import logging
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.mysql_operator import MySqlOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from include.run_celery_task import runCeleryTask

########################################################################

default_args = {
    'owner': 'airflow',
    'catchup': False,
    'depends_on_past': False,
    'start_date': datetime(2019, 7, 2, 19, 50, 00),
    'email': ['rotten@stackoverflow'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'max_active_runs': 1
}

dag = DAG('dynamic_tasks_example', default_args=default_args, schedule_interval=None)

totalBuckets = 5

get_orders_query = """
select 
    o.id,
    o.customer
from 
    orders o
where
    o.created_at >= current_timestamp at time zone 'UTC' - '2 days'::interval
    and
    o.is_test = false
    and
    o.is_processed = false
"""

###########################################################################################################

# Generate a set of tasks so we can parallelize the results
def createOrderProcessingTask(bucket_number):
    return PythonOperator( 
                           task_id=f'order_processing_task_{bucket_number}',
                           python_callable=runOrderProcessing,
                           pool='order_processing_pool',
                           op_kwargs={'task_bucket': f'order_processing_task_{bucket_number}'},
                           provide_context=True,
                           dag=dag
                          )


# Fetch the order arguments from xcom and doStuff() to them
def runOrderProcessing(task_bucket, **context):
    # pull from the task defined below with task_id='get_orders'
    orderList = context['ti'].xcom_pull(task_ids='get_orders', key=task_bucket)

    if orderList is not None:
        for order in orderList:
            logging.info(f"Processing Order with Order ID {order['order_id']}, customer ID {order['customer_id']}")
            doStuff(**order)  # doStuff() is your own per-order processing logic


# Discover the orders we need to run and group them into buckets for processing
def getOpenOrders(**context):
    myDatabaseHook = PostgresHook(postgres_conn_id='my_database_conn_id')

    # initialize the task list buckets
    tasks = {}
    for task_number in range(0, totalBuckets):
        tasks[f'order_processing_task_{task_number}'] = []

    # populate the task list buckets
    # distribute them evenly across the set of buckets
    resultCounter = 0
    for record in myDatabaseHook.get_records(get_orders_query):

        resultCounter += 1
        bucket = (resultCounter % totalBuckets)

        tasks[f'order_processing_task_{bucket}'].append({'order_id': str(record[0]), 'customer_id': str(record[1])})

    # push the order lists into xcom
    for task in tasks:
        if len(tasks[task]) > 0:
            logging.info(f'Task {task} has {len(tasks[task])} orders.')
            context['ti'].xcom_push(key=task, value=tasks[task])
        else:
            # if we didn't have enough tasks for every bucket
            # don't bother running that task - remove it from the list
            logging.info(f"Task {task} doesn't have any orders.")
            del(tasks[task])

    return list(tasks.keys())

###################################################################################################


# this just makes sure that there aren't any dangling xcom values in the database from a crashed dag
clean_xcoms = MySqlOperator(
    task_id='clean_xcoms',
    mysql_conn_id='airflow_db',
    sql="delete from xcom where dag_id='{{ dag.dag_id }}'",
    dag=dag)


# Ideally we'd use BranchPythonOperator() here instead of PythonOperator so that if our
# query returns fewer results than we have buckets, we don't try to run them all.
# Unfortunately I couldn't get BranchPythonOperator to take a list of results like the
# documentation says it should (Airflow 1.10.2). So we call all the bucket tasks for now.
get_orders_task = PythonOperator(
                                 task_id='get_orders',
                                 python_callable=getOpenOrders,
                                 provide_context=True,
                                 dag=dag
                                )
get_orders_task.set_upstream(clean_xcoms)

# set up the parallel tasks -- these are configured at compile time, not at run time:
for bucketNumber in range(0, totalBuckets):
    taskBucket = createOrderProcessingTask(bucketNumber)
    taskBucket.set_upstream(get_orders_task)


###################################################################################################

Note that it does appear to be possible to create subdags on the fly as the result of a task; however, most of the documentation I've found on subdags strongly recommends staying away from that feature, as it causes more problems than it solves in most cases. I have seen suggestions that subdags may be removed as a built-in feature soon.
rotten

Also note that in the for task in tasks loop in my example, I delete the object I am iterating over. That's a bad idea. Instead, get the list of keys and iterate over that, or skip the deletions. Similarly, if xcom_pull returns None (rather than a list or an empty list), the for loop fails as well. One might want to run xcom_pull before the 'for' and then check whether it is None, or make sure there is at least an empty list there. YMMV. Good luck!
rotten
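
Following that advice, a rough sketch of how those two spots could be hardened (guarding against a None result from xcom_pull, and iterating over a copy of the keys); this is illustrative, not the original code:

# Guard against xcom_pull returning None before looping
def runOrderProcessing(task_bucket, **context):
    orderList = context['ti'].xcom_pull(task_ids='get_orders', key=task_bucket)
    if not orderList:
        return
    for order in orderList:
        doStuff(**order)  # doStuff() is the per-order processing logic


# Inside getOpenOrders(): iterate over a copy of the keys so that
# deleting entries does not modify the dict being iterated over
def push_buckets(tasks, context):
    for task in list(tasks.keys()):
        if len(tasks[task]) > 0:
            context['ti'].xcom_push(key=task, value=tasks[task])
        else:
            del tasks[task]
    return list(tasks.keys())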

What is open_order_task?
alltej

You're right, that's a typo in my example. It should be get_orders_task.set_upstream(). I will fix it.
rotten

0

I don't understand what the problem is.

Here is a standard example. Now, if in the subdag function you replace for i in range(5): with for i in range(random.randint(0, 10)):, then everything will work. Now imagine that the 'start' operator puts the data into a file and, instead of a random value, the function reads that data. Then the 'start' operator will affect the number of tasks.

The problem will only be in the display in the UI, since on entering the subdag the number of tasks shown will be equal to whatever was last read from the file/database/XCom at that moment. Which automatically places a restriction on several launches of one dag at the same time.
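
The standard example being referred to is not included here; a minimal sketch of such a subdag factory, assuming the usual SubDagOperator pattern, could look like this:

import random
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


def subdag(parent_dag_name, child_dag_name, args):
    dag_subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval='@daily',
    )
    # Replacing a fixed range(5) with a number determined at parse time
    # (here a random value; in practice something read from a file,
    # a database or XCom) changes how many tasks the subdag contains.
    for i in range(random.randint(0, 10)):
        DummyOperator(
            task_id='{}-task-{}'.format(child_dag_name, i),
            default_args=args,
            dag=dag_subdag,
        )
    return dag_subdag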


-1

I found this Medium post which is very similar to this question. However, it is full of typos and did not work when I tried to implement it.

My answer to the above is as follows:

If you want to create tasks dynamically, you have to do it by iterating over something that is not created by an upstream task, or that can be defined independently of that task. I have learned that you cannot pass execution dates or other Airflow variables to something outside of a template (e.g., a task), as many others have pointed out before. See also this post.


If you look at my comments, you will see that it actually is possible to create tasks based on the result of upstream tasks.
Christopher Beck '18