Execution on Databricks

Cluster Types

Databricks clusters come with various access modes. Before using Unity Catalog Shared clusters, check which features are supported here.

Interims

During development, you will often want to inspect your data to make sense of it and to check whether a transformation produces the expected output. Prophecy generates these data samples as Interims, which are temporarily cached previews of the data after each Gem. To learn more about Interims, please refer here.

Depending on the cluster type, Interims work in one of two modes.

Regular Interims

For Single User clusters and No Isolation Shared clusters, Interims are available after each Gem of the Pipeline. They are available on both Unity Catalog and non-Unity Catalog workspaces.

Regular Interims

Vanilla Interims

For Shared mode clusters, Interims are added on all tail nodes of the Pipeline. These Interims appear just before Target Gems or, if there is no Target Gem, as a dangling edge after the last Gem. The images below show both cases.

Vanilla Interims

Vanilla Interims
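
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the placement rules described above, not Prophecy's implementation: the `AccessMode` enum, the `interim_gems` function, and the example Gem names are all hypothetical, and a Pipeline is modeled simply as a list of Gem names plus directed edges.

```python
from enum import Enum


class AccessMode(Enum):
    # Hypothetical labels for the cluster access modes named in this page.
    SINGLE_USER = "Single User"
    NO_ISOLATION_SHARED = "No Isolation Shared"
    SHARED = "Shared"


def interim_gems(gems, edges, targets, access_mode):
    """Return the Gems whose output would get an Interim.

    gems:    Gem names in Pipeline order
    edges:   (source_gem, destination_gem) pairs
    targets: names of the Target Gems
    """
    targets = set(targets)
    if access_mode is not AccessMode.SHARED:
        # Regular Interims: after each Gem (Target Gems produce no output).
        return [g for g in gems if g not in targets]

    # Vanilla Interims: only on tail nodes of the Pipeline.
    has_output = {src for src, _ in edges}
    feeds_target = {src for src, dst in edges if dst in targets}
    return [
        g for g in gems
        if g not in targets
        and (g in feeds_target       # just before a Target Gem
             or g not in has_output)  # dangling edge after the last Gem
    ]


# Example Pipeline: one branch ends in a Target Gem, the other does not.
example_gems = ["source", "filter", "aggregate", "target", "reformat"]
example_edges = [
    ("source", "filter"),
    ("filter", "aggregate"),
    ("aggregate", "target"),
    ("filter", "reformat"),
]
example_targets = ["target"]

regular = interim_gems(example_gems, example_edges, example_targets,
                       AccessMode.SINGLE_USER)
vanilla = interim_gems(example_gems, example_edges, example_targets,
                       AccessMode.SHARED)
print(regular)
print(vanilla)
```

In this example, a Single User cluster would show an Interim after every non-Target Gem, while a Shared mode cluster would show one only after `aggregate` (just before the Target Gem) and after `reformat` (the end of a branch with no Target Gem).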

Execution Metrics

When running Pipelines and Jobs, you may want to know a few execution metrics, such as records read/written, bytes read/written, total time taken, and data samples between components. These Dataset, Pipeline-run, and Job-run metrics are accumulated and stored on your data plane and can be viewed later from the Prophecy UI. For more details, refer here.

caution

These metrics are not available for Shared mode clusters (in both regular workspaces and Unity Catalog workspaces). You will see an error when trying to view historical runs of Pipelines or Jobs executed on Shared mode clusters.

The images below show Execution Metrics on the Pipelines page.

Pipeline_Execution_Metrics

Each row here is one run of the Pipeline. Click a particular run to see the Interims for that run, or metrics such as rows read/written and time taken.

Execution_Metrics

You can also see Execution Metrics for each Dataset in the Pipeline.

Dataset_metrics

Each row here is one run in which this Dataset was used. Click a particular run to see more detailed insights on your data, along with a preview.

Dataset_stats

info

When using High Concurrency or Shared mode Databricks clusters, you may notice a delay when running the first command or when your cluster is scaling up to meet demand. This delay occurs because Prophecy and Pipeline dependencies (Maven or Python packages) are being installed. For the best performance, we recommend caching packages in Artifactory or on DBFS. Please contact us to learn more about this.