Databricks Jobs
Once you have developed a Spark Pipeline using Prophecy, you will want to schedule it to run at some frequency. To support this, Prophecy provides a visual layer on top of Databricks Jobs for easy orchestration.
Development
Your first Job
You can create a Job from two places. If you're going to schedule only a single Pipeline, the easiest way is to build the Job directly from the Pipeline editor screen. This way your Job is automatically initialized with the Pipeline you create it from.
To do that, simply navigate to your Pipeline and click the Schedule button (1). That opens a modal that shows all the Jobs that refer to this Pipeline, or allows you to create a completely new Job from scratch. Upon clicking Create New (2) you are redirected to the Job building page.
Alternatively, if you'd like to create a new Job completely from scratch, you can do that directly from the entity creation page (1). There you can choose the Job tile (2), which opens a similar modal where you can define your Job details (3).
Whenever you create a new Job, you're asked for the following details:
Field Name | Description |
---|---|
Project | Which Project to create the Job in. This controls who has access to the Job, groups Jobs together for lineage, and allows you to use Pipelines already published within that Project. |
Branch | Which Git branch to use when developing this Job. |
Name | Unique Job name. |
Scheduler | The underlying engine that's going to execute your Job. Databricks is recommended. |
Fabric | The execution Fabric to which the Job is going to be deployed. |
Job Size | The default size of the cluster that's going to be created for the Job to run. |
Schedule Interval | Defines how often your Job is going to run. The interval is defined using the Quartz format (see the examples after this table). You can click on the clock icon to select the interval. |
Alerts email | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success). |
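A Quartz cron expression has six or seven space-separated fields: second, minute, hour, day-of-month, month, day-of-week, and an optional year, where `?` means "no specific value". The snippet below lists a few illustrative expressions; the schedules themselves are made-up examples, not Prophecy defaults.

```python
# Illustrative Quartz cron expressions (hypothetical schedules, not Prophecy defaults).
# Field order: second minute hour day-of-month month day-of-week [year]; "?" = no specific value.
EXAMPLE_SCHEDULES = {
    "0 0 6 * * ?": "every day at 06:00",
    "0 0/30 * * * ?": "every 30 minutes",
    "0 0 2 ? * MON-FRI": "weekdays at 02:00",
    "0 0 0 1 * ?": "midnight on the first day of every month",
}

for expression, meaning in EXAMPLE_SCHEDULES.items():
    print(f"{expression:20s} -> {meaning}")
```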
Building the Job
Now that you've created your first Job, you can start adding Gems to the canvas to define which Pipelines will run during the Job. To define dependencies between the Pipelines within the Job, simply connect the Gems by dragging edges between them.
Two Gem types are available when defining Databricks Jobs:
Pipeline Gem
The Pipeline Gem triggers a Spark Pipeline developed in Prophecy.
Settings for the Pipeline Gem can be inherited from the overall Job configuration or set on the Gem itself.
Script Gem
The Script Gem can be used to run any ad-hoc code.
Settings for the Script Gem can be inherited from the overall Job configuration or set on the Gem itself.
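For example, a Script Gem might run a small validation or housekeeping step between Pipelines. The snippet below is a hypothetical illustration (the table name and check are made up), not code generated by Prophecy; it assumes it runs on a Databricks cluster where `spark` is already in scope.

```python
# Hypothetical ad-hoc script: fail the Job early if yesterday's load produced no rows.
# Assumes the Databricks-provided `spark` session is available, as in any job task.
row_count = spark.sql(
    "SELECT COUNT(*) AS cnt FROM analytics.daily_orders "
    "WHERE order_date = current_date() - 1"
).collect()[0]["cnt"]

if row_count == 0:
    raise ValueError("No rows loaded for yesterday; failing so downstream Pipelines do not run.")

print(f"Row count check passed: {row_count} rows")
```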
Visual == Code
The visual graph created on the Jobs page is automatically converted to code (JSON) in the backend, and that code gets committed to Git.
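As a rough illustration of the idea, a two-Pipeline Job with one dependency edge compiles down to a definition shaped roughly like a Databricks Jobs API payload. The sketch below is a simplified, hypothetical example (names are made up); it is not Prophecy's exact schema, which you can inspect in your Project's Git repository.

```python
# Minimal sketch of a two-task Job definition, shaped like a Databricks Jobs API 2.1
# payload. Prophecy's committed JSON is richer; check your Project's Git repo for it.
job_definition = {
    "name": "daily_ingest_job",                       # hypothetical Job name
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",      # every day at 06:00
        "timezone_id": "UTC",
    },
    "tasks": [
        {"task_key": "ingest_pipeline"},              # first Pipeline Gem
        {
            "task_key": "report_pipeline",            # second Pipeline Gem
            "depends_on": [{"task_key": "ingest_pipeline"}],  # edge drawn on the canvas
        },
    ],
}
```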
Job Configuration
Field Name | Description |
---|---|
Scheduler | The underlying engine that's going to execute your Job. Databricks is recommended. |
Fabric | The execution Fabric to which the Job is going to be deployed. |
Cluster Size | The default size of the cluster that's going to be created for the Job to run. |
Cluster Mode | Can be set to Single (all Gems within the Job reuse the same cluster) or Multi (each Gem within the Job runs on its own new cluster). |
Schedule Interval | Defines how often your Job is going to run. The interval is defined using the Quartz format. You can click on the clock icon to select the interval. |
Pipeline level Timeout | Timeout for the entire Pipeline run. |
Alerts Email for Pipeline | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success) for the entire Pipeline. |
Per Gem Timeout | Timeout for each individual Gem in the Job. |
Number of retries per Gem | Number of retries for each individual Gem in the Job. |
Alerts Email per Gem | Comma-separated list of emails that will receive notifications on specific Job status events (start, failure, or success) for each individual Gem in the Job. |
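Several of the per-Gem fields in the table above correspond naturally to task-level settings in the Databricks Jobs API. The sketch below shows the general shape of those settings for a single task; it is a hedged illustration of the Databricks side (with made-up values and addresses), not the code Prophecy emits.

```python
# One task from a Jobs API 2.1-style payload, showing where per-Gem timeout,
# retries, and alert emails typically land on the Databricks side.
task_settings = {
    "task_key": "ingest_pipeline",             # hypothetical Gem name
    "timeout_seconds": 3600,                   # "Per Gem Timeout"
    "max_retries": 2,                          # "Number of retries per Gem"
    "email_notifications": {                   # "Alerts Email per Gem"
        "on_start": ["data-team@example.com"],
        "on_success": ["data-team@example.com"],
        "on_failure": ["oncall@example.com"],
    },
}
```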
To change the Job name itself, go to Prophecy's metadata page. Locate the Job within a Project, and click the pencil icon.
Deployment Modes
To deploy a Job to Databricks, release the Project from the Prophecy UI as shown in the example below. As soon as the Project is released, the Job also starts appearing on the Databricks Jobs page.
Make sure to enable the Job before creating a Release. If it is not enabled, the Job will not run on the specified schedule.
If a Job's selected Fabric is changed, a separate Databricks job definition is created. The previous Job (on the previous Fabric) is paused automatically and the new version is scheduled.
Prophecy supports two different Job deployment modes. Each has a different impact on Job cost and parallelism.
Multi Job Cluster Mode
In this mode, each Gem in the Job spawns a separate cluster of its own.
Here's how the Databricks UI looks for Prophecy's Multi Cluster Mode.
Single Job Cluster Mode
In this mode, all Gems in the Job run on the same cluster.
To use Single Cluster Mode, the package name of each Pipeline in the Job must be unique. This ensures that the folder structure of one Pipeline does not overwrite another's. Refer to the steps below, which continue our earlier example, on how to configure the package name of a Pipeline.
Here's how the Databricks UI looks for Prophecy's Single Cluster Mode.
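In Databricks Jobs API terms, the difference between the two modes comes down to whether tasks share one job cluster or each declare their own. The sketch below is a simplified illustration of that distinction with made-up names and trimmed cluster specs; it is not Prophecy's generated definition.

```python
# Simplified contrast between the two cluster modes in Jobs API 2.1 terms.
# Cluster specs are reduced to a couple of fields for brevity.

# Single Cluster Mode: one shared job cluster, referenced by every task.
single_cluster_job = {
    "job_clusters": [
        {"job_cluster_key": "shared_cluster",
         "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2}},
    ],
    "tasks": [
        {"task_key": "ingest_pipeline", "job_cluster_key": "shared_cluster"},
        {"task_key": "report_pipeline", "job_cluster_key": "shared_cluster",
         "depends_on": [{"task_key": "ingest_pipeline"}]},
    ],
}

# Multi Cluster Mode: every task spins up its own new cluster.
multi_cluster_job = {
    "tasks": [
        {"task_key": "ingest_pipeline",
         "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2}},
        {"task_key": "report_pipeline",
         "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2},
         "depends_on": [{"task_key": "ingest_pipeline"}]},
    ],
}
```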
Job Monitoring
Prophecy provides a monitoring page that shows the status (enabled/disabled) of all Jobs deployed via Prophecy, as well as the status of historic and current runs (success, failure, or in progress), for quick reference.