Interactive Execution
Running a Pipeline
There are two ways to run a Pipeline interactively:
- Using the play button at the bottom right. This executes the entire Pipeline.
- Using the play button on a particular Gem. This executes the flow in the Pipeline only up to and including that Gem. This comes in handy during development, so that we don't have to run the entire Pipeline to debug or change a particular Gem.
Interims
During development, users often want to inspect their data to make sense of it and to check whether each transformation generates the expected output. Prophecy produces these data samples as Interims: temporarily cached previews of the data after each Gem.
The Pipeline settings control which Gems automatically get Interims, as shown below.
From the Pipeline, select the (1) dropdown and (2) Pipeline Settings. Select (3) Job Sampling to generate interim samples for scheduled jobs. Select (4) Sampling Mode to choose the level of data sampling. Select (5) Data Sampling to generate interim samples during interactive runs, and select the Sampling Mode accordingly. The Job sampling and Data sampling options are independent; one does not affect the other. For Job sampling, the interim metrics are stored in the compute cluster, such as the Databricks workspace, and are visible in the execution metrics.
Advanced Data sampling setting
There is also a global-level Development Settings flag that admins can use to disable Data sampling for a given Fabric. This flag overrides the Pipeline-level Data sampling settings. When disabled, you won't be able to see production data in the interims when you run the Pipeline.
From the Metadata page, click the Fabrics tab and select the Fabric whose Data sampling setting you want to change. Click the Advanced tab and use the Allow for data sampling toggle to turn the flag on or off.
Data sampling is enabled by default. When left enabled, Data sampling uses the Pipeline's Data sampling settings, and Prophecy samples data during interactive runs to provide the best debugging experience.
Data sampling modes
Toggle the images below to view the various modes (or levels) of data sampling. By default, data sampling is enabled for all components during interactive runs. Note: Vanilla is an interim sampling mode reserved for shared Databricks clusters.
Data Sampling Mode - All
Execution
Once we run a Pipeline, we have several ways to better understand its execution:
Execution Code
Once we run a Pipeline interactively, Prophecy generates the execution code in the backend, which is then executed on the selected Fabric. This execution code can also be copied into a Databricks notebook or shell and run directly for debugging.
Execution Errors
If there are any errors in the Pipeline, a pop-up window opens showing the execution errors.
The errors can also be seen in the runtime logs:
Runtime Logs
Overall progress with associated timestamps can be monitored from the Runtime Logs as shown here:
Runtime Metrics
Various Spark metrics collected during runtime can be monitored as shown here:
Execution Metrics
For interactive runs, execution metrics are collected to make development easier and performance tuning more intuitive. These can be accessed from the Metadata page, inside the Run tab of the Pipeline.
Shell
The Prophecy IDE comes with a built-in interactive Spark shell that supports both Python and Scala. The shell is an easy way to quickly analyze data or test Spark commands. The Spark context and session are available within the shell as the variables sc and spark, respectively.
Examples
You need to be connected to a cluster to access the interactive shell.