Skip to main content

Airflow

Apache Airflow is an open-source workflow automation and orchestration platform that helps organizations schedule, monitor, and manage complex data workflows. It is a popular choice for organizations of all sizes to handle data pipelines, ETL processes, and task automation.

Features

FeatureDescription
Directed acyclic graphs (DAGs)Each job in Prophecy represents an Airflow DAG, and contains gems that represent various tasks.
Dynamic workflow schedulingIn the job settings, you can configure complex dependencies, retries, and time-based triggers for tasks.
ExtensibilityCustom gems can be written and used within Airflow jobs. These gems can be shared via the Package Hub.
Monitoring and loggingTask progress, logging, and alerting can be monitored from the Airflow environment or the Prophecy UI.
Parallel execution and scalabilityAirflow can execute multiple tasks in parallel and scale horizontally to accommodate high workloads.
Visual and code viewsVisually-developed Airflow jobs are automatically compiled into high-quality and open-source Airflow code.

Airflow Gems

Prophecy provides a set of gems that you can use in your Airflow jobs. They are listed below and grouped by category.

Sensor

GemDescription
EMRClusterSensorWaits for a specific EMR cluster state
EMRPipelineSensorWaits for a specific EMR pipeline state
HTTPSensorWaits for a specific HTTP endpoint
S3FileSensorWaits for a specific S3 file
SFTPSensorWaits for a specific file on an SFTP path

Data Transfer

GemDescription
SFTPToS3Transfer files from an SFTP server to Amazon S3
SFTPToSnowflakeCopies data from SFTP to Snowflake and creates a table
TableauExtractGenerates Tableau Hyper files

Spark/SQL

GemDescription
DatabricksPipelineRuns a Databricks pipeline (Spark or SQL)
DataprocPipelineRuns a pipeline on Dataproc
DataprocServerlessPipelineRuns a pipeline on Dataproc serverless
DataprocTestingRuns a pipeline on Dataproc
EMRCreateClusterCreates an EMR cluster
EMRPipelineRuns an pipeline on EMR
ModelRuns a SQL model
OnPremPipelineRuns a pipeline on a Hadoop cluster
SnowflakeSQLRuns a SQL query on Snowflake

Trigger/Notify

GemDescription
EmailSends a notification via email
SlackSends a notification via Slack
TriggerDagTrigger another Airflow job via its name or DAG ID

Custom

GemDescription
BranchInsert a Python lambda or function
PythonInsert Python code
ScriptInsert a Shell script

Subgraph

GemDescription
ForEachLoopIterate through a set of gems depending on code, a JSON list, or gem output
TaskGroupGroup a set of gems in your Airflow job