
Databricks Jobs

Once you have developed a Spark Pipeline using Prophecy, you will want to schedule it to run at some frequency. To support this, Prophecy provides a visual layer on top of Databricks Jobs for easy orchestration.

Development

Your first Job

You can create a Job from two places. If you're going to schedule only a single Pipeline, the easiest way is to build the Job directly from the Pipeline editor screen. This way, your Job is automatically initialized with the Pipeline you create it from.

Databricks Job Creation From Pipeline

To do that, simply navigate to your Pipeline and click the Schedule button (1). This opens a modal that shows all the Jobs that refer to this Pipeline, or allows you to create a completely new Job from scratch. Upon clicking Create New (2), you are redirected to the Job building page.

Databricks Job Creation

Alternatively, if you'd like to create a new Job completely from scratch, you can do that directly from the entity creation page (1). There you can choose the Job tile (2) and that opens a similar modal where you can define your Job details (3).

Whenever you create a new Job, you're asked for the following details:

| Field Name | Description |
| --- | --- |
| Project | Which Project to create the Job in. This controls who has access to the Job, groups Jobs together for lineage, and allows you to use Pipelines already published within that Project. |
| Branch | Which Git branch to use when developing this Job. |
| Name | Unique Job name. |
| Scheduler | The underlying engine that's going to execute your Job. Databricks is recommended. |
| Fabric | The execution Fabric to which the Job is going to be deployed. |
| Job Size | The default size of the cluster that's going to be created for the Job to run. |
| Schedule Interval | Defines how often your Job is going to run. The interval is defined using the Quartz cron format (see the example below this table). You can click on the clock icon to select the interval. |
| Alerts email | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success). |
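
For intuition, a Quartz cron expression has six required fields (seconds, minutes, hours, day of month, month, day of week) plus an optional year. The sketch below is illustrative only: it shows, as a Python dict, what a Databricks Jobs API-style schedule block for a Job that runs daily at 06:00 UTC might look like. The field names come from the Databricks Jobs API; the Job name is hypothetical, and Prophecy builds the real definition for you when you pick an interval with the clock icon.

```python
# Illustrative only: a Databricks Jobs API-style schedule block for a Job
# that runs daily at 06:00 UTC. The Job name is hypothetical; Prophecy
# generates the actual definition from the interval you select.
job_schedule = {
    "name": "customers_daily",  # hypothetical Job name
    "schedule": {
        # Quartz fields: seconds minutes hours day-of-month month day-of-week
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",  # a disabled Job would be "PAUSED"
    },
}
```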

Building the Job

Example Databricks Job

Now that you've created your first Job, you can start adding Gems to the canvas to define which Pipelines will run during the Job. To define dependencies between the Pipelines within the Job, simply connect them by dragging and dropping edges between Gems.

Two Gem types are available when defining Databricks Jobs:

Pipeline Gem

The Pipeline Gem triggers a Spark Pipeline developed in Prophecy.

Pipeline Component

Settings for the Pipeline component can be inherited from the overall Job configuration or set inside the component itself.

Script Gem

The Script Gem can be used to write any ad-hoc code.

Script Component

Settings for the Script component can be inherited from the overall Job configuration or set inside the component itself.

Visual == Code

The visual graph created on the Jobs page is automatically converted to code (JSON) in the backend, which is then committed to Git.

Code View
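
Prophecy's generated JSON schema isn't reproduced here, but conceptually a dependency graph of two Pipelines maps to a multi-task Databricks job in which the downstream task declares a depends_on edge. The sketch below is a minimal, hypothetical Databricks Jobs API-style payload, written as a Python dict so the assumptions can be annotated; the task types, names, and paths are placeholders, not Prophecy's actual output.

```python
# Hypothetical sketch: a two-Pipeline dependency graph as a multi-task
# Databricks job. Field names follow the Databricks Jobs API; the task
# types, names, and paths are placeholders, not Prophecy's generated code.
job_definition = {
    "name": "customers_daily",
    "tasks": [
        {
            "task_key": "ingest_customers",  # first Pipeline Gem
            "notebook_task": {"notebook_path": "/Pipelines/ingest_customers"},
        },
        {
            "task_key": "report_customers",  # second Pipeline Gem
            # The edge drawn on the canvas becomes a depends_on entry.
            "depends_on": [{"task_key": "ingest_customers"}],
            "notebook_task": {"notebook_path": "/Pipelines/report_customers"},
        },
    ],
}
```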

Job Configuration

Example Configuration


| Field Name | Description |
| --- | --- |
| Scheduler | The underlying engine that's going to execute your Job. Databricks is recommended. |
| Fabric | The execution Fabric to which the Job is going to be deployed. |
| Cluster Size | The default size of the cluster that's going to be created for the Job to run. |
| Cluster Mode | Either Single (all Gems within the Job reuse the same cluster) or Multi (each Gem within the Job runs on its own new cluster). |
| Schedule Interval | Defines how often your Job is going to run. The interval is defined using the Quartz cron format. You can click on the clock icon to select the interval. |
| Pipeline level Timeout | Timeout for the entire Pipeline. |
| Alerts Email for Pipeline | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success) for the entire Pipeline. |
| Per Gem Timeout | Timeout for each Gem in the Job's Pipeline. |
| Number of retries per Gem | Number of retries for each Gem in the Job's Pipeline. |
| Alerts Email per Gem | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success) for each Gem in the Job's Pipeline. |
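
As a rough guide to where these settings land in the underlying Databricks job definition, the Pipeline-level timeout and alert emails correspond to job-level fields, while the per-Gem timeout, retries, and alert emails correspond to task-level fields. The sketch below uses Databricks Jobs API field names with hypothetical values; it is not Prophecy's exact output.

```python
# Hypothetical values showing job-level vs. task-level placement of the
# settings above, using Databricks Jobs API field names.
job_config = {
    "timeout_seconds": 7200,                 # Pipeline level Timeout
    "email_notifications": {                 # Alerts Email for Pipeline
        "on_start": ["data-team@example.com"],
        "on_failure": ["data-team@example.com"],
        "on_success": ["data-team@example.com"],
    },
    "tasks": [
        {
            "task_key": "ingest_customers",  # one Gem in the Job
            "timeout_seconds": 1800,         # Per Gem Timeout
            "max_retries": 2,                # Number of retries per Gem
            "email_notifications": {         # Alerts Email per Gem
                "on_failure": ["oncall@example.com"],
            },
        },
    ],
}
```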

To change the Job name itself, go to Prophecy's metadata page. Locate the Job within a Project, and click the pencil icon.

Deployment Modes

To deploy a Job to Databricks, release the project from the Prophecy UI, as shown in the example below. As soon as the project is released, the Job starts appearing on the Databricks Jobs page as well.


info

Make sure to enable the Job before creating a Release. If it is not enabled, the Job will not run on the specified schedule.

If a Job's selected Fabric is changed, a separate Databricks job definition is created. The previous Job (on the previous Fabric) is paused automatically and the new version is scheduled.

Prophecy supports two different Job deployment models. Each has different impacts on Job cost and parallelism.

Multi Job Cluster Mode

In this mode, each component of the Job spawns a separate cluster of its own.

Here's how the Databricks UI looks for Prophecy's Multi Cluster Mode.

Multi Job Cluster

Single Job Cluster Mode

In this mode, all components of the Job run on the same cluster.

info

To use single cluster mode, the package name of each Pipeline in the Job must be unique. This ensures that the generated folder structure for one Pipeline does not overwrite another. Refer to the steps below, which continue the earlier example, for how to configure a Pipeline's package name.


Here's how the Databricks UI looks for Prophecy's Single Cluster Mode.

Single Job Cluster
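
To make the difference concrete, the sketch below contrasts the two modes using Databricks Jobs API field names: in single cluster mode every task references one shared job cluster via job_cluster_key, while in multi cluster mode each task declares its own new_cluster. Cluster specs, names, and task keys are hypothetical, and task types are omitted for brevity.

```python
# Hypothetical contrast of the two cluster modes, using Databricks Jobs API
# field names. Cluster specs and task keys are made up; task types omitted.

# Single Job Cluster Mode: one shared cluster referenced by every task.
single_cluster_job = {
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {"task_key": "ingest_customers", "job_cluster_key": "shared_cluster"},
        {
            "task_key": "report_customers",
            "job_cluster_key": "shared_cluster",
            "depends_on": [{"task_key": "ingest_customers"}],
        },
    ],
}

# Multi Job Cluster Mode: each task spins up its own cluster.
multi_cluster_job = {
    "tasks": [
        {
            "task_key": "ingest_customers",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
        {
            "task_key": "report_customers",
            "depends_on": [{"task_key": "ingest_customers"}],
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
    ],
}
```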

Job Monitoring

Prophecy provides a monitoring page that shows the status (enabled/disabled) of all Jobs deployed via Prophecy, as well as the status of historic and current runs (success, failure, or in progress), for quick reference.

Guides

  1. How to trigger a job from another job?
  2. How to design a reliable CI/CD process?