
Build an Airflow job

Let's build an Airflow job (an Airflow DAG) to schedule and run your Spark and SQL pipelines. In this tutorial, you will:

  • Create a Prophecy-managed Airflow fabric and connections to AWS, Email, and Databricks.
  • Configure an Airflow job to run when a new file is uploaded to an S3 bucket.
  • Configure the same job to send an email notification prior to pipeline execution.
  • Run, debug, and enable the job.

Requirements

This tutorial requires:

  • A Databricks account
  • A Prophecy project with a Spark pipeline
  • An AWS account with an S3 bucket
  • An email account and access to SMTP settings

Create Airflow fabric

To run an Airflow job, you first need to define the environment where it will run. In Prophecy, you do this by creating an Airflow fabric.

  1. Click the Create Entity button.
  2. Choose the Create Fabric option.
  3. On the Basic Info page:
    • Enter a Fabric Name.
    • Enter a Fabric Description.
    • Choose the Team that will own the fabric.
  4. Click Continue.
  5. Configure the Provider information.
    • Choose Prophecy Managed.
  6. Click Continue.

Add an AWS connection

To trigger your Airflow job using an S3 File Sensor, you need to add an AWS connection to your Airflow fabric. In the Connections tab of the fabric setup:

  1. Click on the + Add Connection button.
  2. Select AWS in Connection Type.
  3. Provide a Connection Name to identify your connection.
  4. Add a Description of your choice.
  5. Provide the AWS Access Key ID and AWS Secret Access Key (see the sketch after these steps). Visit the AWS site to learn how to access these keys.
  6. Click Save to establish the connection.
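These form fields map onto a standard Airflow connection. Below is a minimal sketch, using plain Airflow's Python API, of what such a connection looks like; the connection id and credentials are placeholders, and in a Prophecy-managed fabric you only need to fill in the form above.

```python
# Minimal sketch: what the AWS connection form corresponds to in plain Airflow.
# The connection id and credentials below are placeholders, not real values.
from airflow.models.connection import Connection

aws_conn = Connection(
    conn_id="my_aws_connection",             # hypothetical id; use your Connection Name
    conn_type="aws",
    login="YOUR_AWS_ACCESS_KEY_ID",          # placeholder
    password="YOUR_AWS_SECRET_ACCESS_KEY",   # placeholder
)

# The URI form is handy if you ever define the same connection through an
# AIRFLOW_CONN_* environment variable instead of a UI form.
print(aws_conn.get_uri())
```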


Add an email connection

Next, we'll add an email connection that will be used to send emails on your behalf.

  1. Click on the + Add Connection button.
  2. Select Email in Connection Type.
  3. Provide a Connection Name to identify your connection.
  4. Add a Description of your choice.
  5. Provide the Host as your SMTP host, for example smtp.gmail.com.
  6. Provide the login credentials for this server in Login and Password.
  7. Add your SMTP port in the Port field.
  8. Click Save to establish the connection.


Add a Databricks Spark connection

To run your Databricks pipelines, you need to make a connection to your Databricks environment.

  1. Click on the + Add Connection button.
  2. Select Databricks (Spark) in Connection Type.
  3. Under Fabric, select a Databricks Spark fabric. This must be created beforehand.
  4. Optionally, provide a description in the Description box.
  5. Click Save to establish the connection.


Create an Airflow job

A job contains gems that perform various tasks (running pipelines, models, scripts, and so on) that you can schedule to run at a certain frequency. Every job corresponds to an Airflow DAG (see the sketch after the steps below). Let's create a job.

  1. Click the Create Entity button.
  2. Choose the Job option.
  3. Select the project in which you want to create the job. This project should contain at least one Spark pipeline.
  4. Pick your development branch. Here you can pick an existing branch for development, or create a new one.
  5. Provide a Name.
  6. Pick Airflow as the scheduler.
  7. Select the fabric that you just created.
  8. Fill in a Schedule that defines the job frequency. You can modify this at any time.
  9. Add a Description about the job you are creating.
  10. Click Create New.


This will take you to the job editor where you can start building the job.
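Before adding gems, it helps to keep in mind what the job compiles into. The snippet below is a minimal sketch of the kind of Airflow DAG a job corresponds to, assuming Airflow 2.x; the DAG id, schedule, and start date are illustrative assumptions, not the exact code Prophecy generates.

```python
# Minimal sketch of the Airflow DAG that a Prophecy job corresponds to.
# The dag_id, schedule, and start_date are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="my_prophecy_job",          # hypothetical name
    schedule_interval="0 6 * * *",     # the Schedule you picked, as a cron expression
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Each gem you drop on the canvas becomes a task like this one.
    placeholder = EmptyOperator(task_id="placeholder")
```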

Add an S3 File Sensor gem

In the job editor canvas:

  1. Click on Sensors.
  2. Drag the S3FileSensor gem from the dropdown to the canvas.
  3. Click and open the newly added gem to view the gem configurations.
  4. Specify the S3 bucket/path that Airflow will watch to trigger the job.
  5. In S3 Path(s), specify the complete path of the file in your bucket. Airflow periodically checks whether this file exists in the bucket and triggers the job when it arrives (see the sketch after these steps).
  6. In Connection name, select the AWS connection you created earlier.
  7. Click Save.
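For reference, the sensor task this gem configures is similar to the standard Amazon provider's S3KeySensor. The sketch below is illustrative only; the bucket, key, connection id, and polling settings are assumptions, and Prophecy generates the actual task code when you release the job.

```python
# Illustrative sketch of an S3 file sensor task; names and values are placeholders.
# (Define it inside a `with DAG(...)` block, as in the earlier sketch.)
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_name="my-bucket",            # hypothetical bucket
    bucket_key="incoming/orders.csv",   # the S3 Path(s) you configured
    aws_conn_id="my_aws_connection",    # the AWS connection added to the fabric
    poke_interval=60,                   # check for the file every 60 seconds
    timeout=6 * 60 * 60,                # stop waiting after 6 hours
)
```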

Add an email gem

  1. Click on Trigger/Notify.
  2. Drag the Email gem from the dropdown to the canvas.
  3. Connect the sensor gem output to the email gem input.
  4. Click and open the newly added gem to view the gem configurations.
  5. In To, add the email address where you want to receive the notification when the job is triggered.
  6. In Connection name, select the Email connection you created earlier.
  7. Optionally, provide a Subject for the email.
  8. Add Email content that will populate the body of the email. For example, you can use an Airflow template parameter to include the job's execution time (see the sketch after these steps).
  9. Optionally, add CC and BCC email addresses.
  10. Click Save.
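For reference, the notification task is similar to Airflow's built-in EmailOperator. The sketch below is illustrative; the address, subject, body, and connection id are assumptions, and `{{ ds }}` is a standard Airflow template variable that resolves to the run's logical date.

```python
# Illustrative sketch of an email notification task; names and values are placeholders.
# (Define it inside a `with DAG(...)` block, as in the earlier sketch.)
from airflow.operators.email import EmailOperator

notify = EmailOperator(
    task_id="notify",
    to="you@example.com",
    subject="Prophecy pipeline run triggered",
    html_content="A new file arrived; the run for {{ ds }} is starting.",
    conn_id="my_smtp_connection",   # the Email connection added to the fabric (recent Airflow versions)
)
```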

Add a Spark pipeline gem

  1. Click on Spark/SQL.
  2. Drag the DatabricksPipeline gem from the dropdown to the canvas.
  3. Connect the email gem output to the pipeline gem input.
  4. Click and open the newly added gem to view the gem configurations.
  5. Select the Pipeline you want to schedule.
  6. Select the configuration to use during the scheduled pipeline run.
  7. Pick the Fabric and Cluster size for running this pipeline in Databricks (see the sketch after these steps).
  8. Click Save.
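For reference, the task this gem corresponds to is similar to the Databricks provider's DatabricksSubmitRunOperator. The sketch below is purely illustrative; Prophecy packages your pipeline and generates the real task code, and the cluster spec, notebook path, and connection id here are assumptions.

```python
# Illustrative sketch of a Databricks run-submit task; values are placeholders.
# (Define it inside a `with DAG(...)` block, as in the earlier sketch.)
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_pipeline = DatabricksSubmitRunOperator(
    task_id="run_pipeline",
    databricks_conn_id="my_databricks_connection",  # the Databricks (Spark) connection
    new_cluster={
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    notebook_task={"notebook_path": "/Pipelines/my_pipeline"},  # hypothetical path
)
```

In plain Airflow terms, the three tasks would then be chained as `wait_for_file >> notify >> run_pipeline`, which is exactly what connecting the gems on the canvas expresses.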

Run and debug

Now that your job is ready, you can go ahead and run it.

  1. Click the Run button to trigger an on-demand run. This creates a temporary DAG and uploads it to Airflow.
  2. When you click Run, you will see a Job Details Toaster.
  3. Click Details to open the detailed logs of the run.
  4. Review all the steps involved in the run, along with detailed logs for each step.
  5. Click the + button to open the logs for a particular step.

Release and deploy

Once you have developed and tested the job, it’s time to commit and push the code to the project Git repository and release the job to Airflow.

Open Git workflow

  1. Start by toggling your job to Enabled. Once you release the project, Airflow will recognize that this job is enabled.
  2. Open the Git workflow by clicking the Release button in the project header or clicking the Commit button in the project footer.
  3. Commit your changes. This saves the current state of your project to your branch.
  4. Pull any upstream or remote changes. This ensures that your branch is up-to-date with collaborators' changes.
  5. Merge your changes into the main project branch. The main branch should contain the "official" version of your project. Use a Pull Request to merge your changes if the main branch is protected.
  6. Release your changes. When you release your project, you create a new version of it. At this point, your job will be built and deployed on Airflow.


Monitor

During the release process, Prophecy automatically packages, tests, and deploys your jobs to Airflow. Once the process is finished, you can see the deployed and running job within your Airflow environment.

What’s next

Great work! You've created an Airflow job in Prophecy, run it successfully, and released, scheduled, and monitored it.

If you ever encounter any difficulties, don't hesitate to reach out to us (contact.us@Prophecy.io) or join our Slack community for assistance. We're here to help!