
Gems

If a Pipeline is a roadmap of the journey your data will take from start to finish, the Gems are the stops to make along the way. Each Gem instance comes with its own configuration, and each produces its own block of output code. Each Gem instance can be seen as just another Spark DataFrame.
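To make that concrete, here is a minimal sketch of the shape of the code a Python Pipeline generates. The Gem names are hypothetical, and plain lists of dicts stand in for Spark DataFrames so the sketch runs without a cluster; real generated code passes pyspark.sql.DataFrame objects between Gem functions.

```python
# Illustrative only: each Gem instance becomes a function that takes and
# returns a "DataFrame" (here, a plain list of dicts instead of a
# pyspark.sql.DataFrame), and the Pipeline chains them together.

def my_orders(spark):
    # Source gem: produces the initial "DataFrame"
    return [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.5}]

def Filter_1(spark, in0):
    # Transform gem: one input port (in0), one output "DataFrame"
    return [row for row in in0 if row["amount"] > 15]

def pipeline(spark):
    df_my_orders = my_orders(spark)              # each gem instance = a DataFrame
    df_Filter_1 = Filter_1(spark, df_my_orders)  # connections = function arguments
    return df_Filter_1
```

The connections you draw between Gems in the editor become the arguments passed between these functions.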

Gem UI

Since Gems are so integral to working with Prophecy, a number of UI components are dedicated to working with them.

Gem Drawer


In the Pipeline editor you'll find the Gem Drawer. This organizes the Gems into one of several categories:

info

The Gem list depends on two factors: your project language (Python or Scala), and whether you use SaaS Prophecy or deploy Prophecy in your own infrastructure.

| Gem category | Definition |
| --- | --- |
| Source/Target | Gems related to Sources, Targets, and Lookups |
| Transform | Gems related to the transformation of your data |
| Custom | Custom Gems and other Gems that don't fit into the other categories |
| Join/Split | Gems related to splitting or joining datasets |
| Subgraph | Use published Subgraphs in your Pipeline |

Gem Instance

Once you've selected which Gem you want to use in your pipeline from the Drawer, an Instance of the Gem will appear in the Pipeline Editor.

Gem Instance

| # | UI element name | Description |
| --- | --- | --- |
| 1 | Gem Instance name | The name of this particular instance. It must be unique within a given Pipeline. You can click this label to edit it. |
| 2 | Gem Type name | The type of the Gem. |
| 3 | Error indicator | The error state of the Gem. If there's something wrong with the configuration of the Gem, this indicator appears. |
| 4 | Input ports | Input ports that accept connections from upstream Gems. If this Gem type supports multiple or editable inputs, more connections appear here. |
| 5 | Output ports | Output ports that can be used by downstream Gems. If this Gem type supports multiple or editable outputs, more connections appear here. |
| 6 | Gem Phase | The Phase of this instance, used to define the order in which Gem instances are executed. |
| 7 | Open | Open the configuration UI for this Gem instance. |
| 8 | Run button | Runs this Gem, including all upstream Gems that it requires. |
| 9 | Action menu | Shows the Quick Action menu, which allows you to change the Phase or caching, or to delete the instance. |

Gem Configuration

Gem instances can be configured by hovering over their icons in the Pipeline Editor and clicking Open.

Gem instance configuration

| # | UI element name | Description |
| --- | --- | --- |
| 1 | Gem Instance name | The name of this particular instance. It must be unique within a given Pipeline. |
| 2 | Inputs/Outputs | Inputs and outputs for this Gem instance. See Input/Output ports below for more information. |
| 3 | Gem configuration | Configuration for this instance. Each Gem type has a different UI; see the documentation for each Gem type for more information. |
| 4 | Diagnostics | If there's a problem with the configuration of this Gem instance, clicking here shows a list of configuration errors. |
| 5 | Unit tests | Each Gem instance can have its own set of unit tests. |

Input/Output ports

Inputs and outputs define the connections going into and coming out of a particular Gem instance. Some Gem types support multiple inputs or multiple outputs, depending on their configuration.

Inputs

Inputs define the incoming connections accepted by the Gem. Most Gem types accept only one connection, but some (such as Join) allow as many inputs as you want.

Input UI

| # | UI element name | Description |
| --- | --- | --- |
| 1 | Search | Filter fields across all inputs. |
| 2 | Add Input | If the Gem type supports it, click this button to add more input ports to this instance. |
| 3 | Input name | If the Gem type supports it, click the pencil icon to rename this port. Some Gem types use this name as part of their configuration; for example, a port named input0 can be used in a Join's join conditions. |
| 4 | Port schema | The fields and schema types of the port. Only appears when an upstream Gem instance is connected. |
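A multi-input Gem such as Join generates a function with one parameter per input port. The sketch below is hypothetical, not Prophecy's actual generated code: lists of tuples stand in for Spark DataFrames so it runs without a cluster, and the join-on-first-field logic is an illustrative stand-in for a configured join condition.

```python
# Illustrative only: a three-input Join gem, with one parameter per input
# port (in0, in1, in2). Real generated code takes pyspark.sql.DataFrame
# arguments and uses the join conditions configured in the gem.
def Join_1(spark, in0, in1, in2):
    # Key the secondary inputs by their first field (a stand-in join key).
    keyed1 = {row[0]: row for row in in1}
    keyed2 = {row[0]: row for row in in2}
    # Inner-join in0 against both keyed inputs, concatenating the rows.
    return [
        row + keyed1[row[0]][1:] + keyed2[row[0]][1:]
        for row in in0
        if row[0] in keyed1 and row[0] in keyed2
    ]
```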

Outputs

Outputs define the outgoing schema(s) that will be available to downstream Gem instances. In some cases the Prophecy compiler can't infer the output schema automatically, so we provide the option to infer the schema using your connected Fabric or to specify it manually.

Output port definition

| # | UI element name | Description |
| --- | --- | --- |
| 1 | Output schema | The output schema for this Gem instance. This will be the schema of the data that downstream Gem instances will use. |
| 2 | Custom schema | Toggle this to enable custom output schema editing. |
| 3 | Infer from cluster | Run the Gem code on the connected cluster and infer the schema from the result. |
| 4 | Edit schema | Edit the output schema manually. |

Port renaming

Most Gem types allow Inputs and Outputs to be renamed, which has at least two effects: renaming the input variable in the generated code and changing the port name in the Pipeline Editor.

Port rename

Port name in pipeline editor

```python
def Join_1(spark: SparkSession, input_one: DataFrame, in1: DataFrame) -> DataFrame:
    ...
```

Phase

Gem phase indicator

A Gem's Phase in a Pipeline controls the order of the code generated for the Pipeline: Gem A with a Phase of 0 runs before Gem B with a Phase of 1. The Phase can be any integer, positive or negative. Let's see an example in action.

Example

Gem phase example 1

Here we have a Pipeline with a number of Gems, each with the default Phase of 0. Let's look at what the generated code is for this version of the Pipeline:

```scala
def apply(spark: SparkSession): Unit = {
  val df_my_orders = my_orders(spark).cache()
  val df_Repartition_1 = Repartition_1(spark, df_my_orders)
  Write_CSV(spark, df_Repartition_1)
  val df_SchemaTransform_1 = SchemaTransform_1(spark, df_my_orders)
}
```

So the order of operations is my_orders, Repartition_1 (and its downstream Gem Write_CSV), then SchemaTransform_1. If we wanted to run SchemaTransform_1 first instead, we could change Repartition_1's Phase to a higher number than SchemaTransform_1's Phase. The Change Phase button can be found under the ... menu that appears when a Gem is selected:

Gem change phase button

Let's see how the code has changed.

```scala
def apply(spark: SparkSession): Unit = {
  val df_my_orders = my_orders(spark).cache()
  val df_Repartition_1 = Repartition_1(spark, df_my_orders)
  Write_CSV(spark, df_Repartition_1)
  val df_SchemaTransform_1 = SchemaTransform_1(spark, df_my_orders)
}
```

Not much has changed, because Write_CSV still has a Phase of 0, and to complete that step of the Pipeline, all of the upstream steps that Write_CSV requires (in this case, Repartition_1) have to be completed first. Let's change the Phase of Write_CSV.

Write_CSV with a phase of 1

And the new code:

```scala
def apply(spark: SparkSession): Unit = {
  val df_my_orders = my_orders(spark).cache()
  val df_SchemaTransform_1 = SchemaTransform_1(spark, df_my_orders)
  val df_Repartition_1 = Repartition_1(spark, df_my_orders)
  Write_CSV(spark, df_Repartition_1)
}
```

Much better!

So, in summary: the Phase of the leaf nodes (that is, the final Gem in each branch of a Pipeline) dictates the order of the generated code.
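The rule above can be modeled in a few lines. This is an illustrative sketch, not Prophecy's actual compiler: leaf Gems are scheduled by Phase, each leaf first pulls in the upstream Gems it requires, and every Gem is emitted only once.

```python
# Illustrative model of phase-ordered code generation (NOT Prophecy's
# actual compiler). Leaf gems run in Phase order; each leaf first emits
# the upstream gems it depends on, and each gem is emitted only once.
def execution_order(gems):
    """gems: {name: {"phase": int, "upstream": [names]}} -> ordered names."""
    downstream = {name: 0 for name in gems}
    for spec in gems.values():
        for up in spec["upstream"]:
            downstream[up] += 1
    # Leaves (gems nothing depends on) are scheduled by Phase.
    leaves = sorted((n for n, c in downstream.items() if c == 0),
                    key=lambda n: gems[n]["phase"])
    order, seen = [], set()

    def emit(name):
        if name in seen:
            return
        for up in gems[name]["upstream"]:  # dependencies come first
            emit(up)
        seen.add(name)
        order.append(name)

    for leaf in leaves:
        emit(leaf)
    return order
```

Running this on the example Pipeline with Write_CSV moved to Phase 1 yields my_orders, SchemaTransform_1, Repartition_1, Write_CSV, matching the final generated code above.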

Caching