
Gems

Gems are functional units in a pipeline that perform tasks such as reading, transforming, writing, or handling other data operations.

Categories

The table below outlines the different Spark gem categories.

Category | Description
Source and Target | The set of gems that help with loading and saving data.
Transform | The set of gems that help with transforming data.
Join and Split | The set of gems that help with the process of merging or splitting DataFrame(s) to create new DataFrame(s).
Custom | The set of gems that our creative teams build using Expression Builder to extend the Prophecy capabilities.
Machine Learning | The set of gems that prepare data or use data for Machine Learning.
Subgraph | A gem that can contain many other gems within it.

Gem instance

When you click on a gem from the gem drawer, an instance of that gem gets added to your pipeline canvas. Use the image and the table below to understand the UI of a gem.

[Image: Gem instance]

Callout | UI element | Description
1 | Gem label | The name of this particular gem instance. It must be unique within a given pipeline.
2 | Gem type name | The type of gem.
3 | Input ports | One or more ports that accept connections from upstream gems.
4 | Output ports | One or more ports that connect to downstream gems.
5 | Gem phase | The phase for this gem instance, which defines the order in which gem instances are executed.
6 | Open | The button that lets you open the gem configuration.
7 | Run button | A button that runs the pipeline up to and including the gem.
8 | Action menu | A menu that includes options to change the phase of the gem, add run conditions, delete the gem, and more.

Gem configuration

When you open a gem, you can configure how the gem will work.

[Image: Gem instance configuration]

Callout | UI element | Description
1 | Gem instance name | The name of this particular gem instance.
2 | Inputs and outputs | The inputs and outputs for this gem instance.
3 | Gem configuration | The configuration for this instance. Each gem type has different options. See the documentation for individual gems for more information.
4 | Run button | A button that runs the pipeline up to and including the gem.
5 | Diagnostics | A diagnostics window that lists configuration errors, if any exist.
6 | Data | A preview of the output table, available after you run the gem.
7 | Unit tests | A set of unit tests. See the unit test documentation for more details.

Input ports

The Input tab defines the incoming connections accepted by the gem. Most gem types only accept one connection, but some (such as Join) allow for multiple inputs.

[Image: Input UI]

Callout | UI element | Description
1 | Search | A field that will filter your input columns.
2 | Add Input | If the gem supports multiple inputs, you can click this button to add more input ports to this instance.
3 | Input | The name of the input port and the name of the input gem instance.
4 | Port schema | The schema of the port (columns and column types). The schema will only appear when an input port is present.
5 | Edit ports | A button that lets you edit or delete ports.

Output ports

The Output tab defines the outgoing schemas that will be available to downstream gems. In some cases, the Prophecy compiler can't infer the output schema automatically, so you have the option to infer the schema using your connected fabric or specify it manually.

Callout | UI element | Description
1 | Output schema | The output schema for this gem instance. Downstream gem instances use this schema.
2 | Custom schema | A toggle that enables custom output schema editing.
3 | Infer from cluster | A button that runs the gem code on the connected cluster and infers the schema from the result.
4 | Edit schema | An option to edit the output schema manually.

Action menu

The action menu gives you more granular control over individual gems. When you expand the action menu, you see the following options:

Action | Description
Explain | Copilot adds a comment to the gem that explains what it does.
Fix | Copilot resolves an error in the gem configuration.
Label | Copilot renames the gem.
Add Comment | Manually write a comment that explains the gem.
Add Condition | Write a condition that will restrict gem execution during pipeline runs.
Change Phase | Change the phase of the gem.
Delete | Remove the gem from the pipeline.
Cache | Cache the entire gem output dataset. This is useful if you have multiple pipeline branches from one gem. If you cache the dataset from that gem, Spark will not have to compute upstream gems multiple times for each branch.
Data Preview | If selective data sampling is enabled, use this checkbox to enable or disable the generation of the data sample for the gem.
Data Profile | Enable or disable data profiling for the gem output.
Data Diff | If Data Diff is configured for a Target gem, use this checkbox to enable or disable the data diff output.

[Image: Gem action menu]
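
The effect of the Cache action can be illustrated with a minimal plain-Scala analogy. This sketch does not use Spark or any Prophecy API: `loadOrders`, `branchTotal`, and `branchCount` are hypothetical stand-ins for an upstream gem and two downstream branches, and the counter only exists to show how often the upstream work is repeated. In generated Spark code, the corresponding step is a `.cache()` call on the gem's output DataFrame.

```scala
object CacheSketch {
  // Counts how many times the (hypothetical) upstream computation runs.
  var upstreamRuns = 0

  // Stand-in for an expensive upstream gem; not a Prophecy or Spark API.
  def loadOrders(): Vector[Int] = {
    upstreamRuns += 1
    Vector(10, 20, 30)
  }

  // Two hypothetical branches that read the same upstream output.
  def branchTotal(orders: Vector[Int]): Int = orders.sum
  def branchCount(orders: Vector[Int]): Int = orders.size

  // Without caching: each branch triggers the upstream computation again.
  // Returns (total, count, number of upstream runs).
  def runWithoutCache(): (Int, Int, Int) = {
    upstreamRuns = 0
    val total = branchTotal(loadOrders())
    val count = branchCount(loadOrders())
    (total, count, upstreamRuns)
  }

  // With caching: the upstream output is computed once and shared by both branches.
  def runWithCache(): (Int, Int, Int) = {
    upstreamRuns = 0
    val cached = loadOrders()
    val total = branchTotal(cached)
    val count = branchCount(cached)
    (total, count, upstreamRuns)
  }
}
```

Both functions return the same branch results, but `runWithoutCache()` reports two upstream runs while `runWithCache()` reports one. The saving grows with the number of branches and the cost of the upstream gems.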

Gem phase

In a data pipeline, the phase of a gem determines when it runs relative to other gems. Prophecy applies this sequencing by adjusting the order of statements in the generated pipeline code, as in the following generated Scala function:

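def apply(spark: SparkSession): Unit = {
  // Source gem output, cached so that both branches below reuse it.
  val df_my_orders = my_orders(spark).cache()
  // First branch: repartition the orders, then write them as CSV.
  val df_Repartition_1 = Repartition_1(spark, df_my_orders)
  Write_CSV(spark, df_Repartition_1)
  // Second branch: placed after the Write_CSV branch in the generated code.
  val df_SchemaTransform_1 = SchemaTransform_1(spark, df_my_orders)
}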

Here’s how it works:

  • Gems are assigned a numerical phase (e.g., 0, 1, -1), where lower values run first. For example, a gem with phase 0 will execute before a gem with phase 1.
  • When a gem runs, all its upstream gems must also run. This means that if a downstream gem has phase 0 and an upstream gem has phase 1, the upstream gem will be grouped into phase 0 to ensure proper execution.
  • Because of this dependency, the phase assigned to the last gem in a branch determines the phase of the entire branch. When configuring gem phases, you therefore only need to set the phase on the leaf nodes, which are the final gems in each branch of the pipeline.
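
The rules above can be sketched in plain Scala. This is a simplified model under stated assumptions, not Prophecy's actual compiler logic: the `Gem` case class and `executionOrder` function are hypothetical, the phase values are chosen for illustration, and leaves that share a phase are assumed to keep their declaration order. The sketch sorts the leaf gems by phase and emits each leaf's upstream gems before the leaf itself, skipping gems that have already been emitted.

```scala
object GemPhaseSketch {
  // Hypothetical model of a gem: its name, assigned phase, and direct upstream gems.
  final case class Gem(name: String, phase: Int, upstream: Seq[String] = Seq.empty)

  // Returns gem names in the order they would run under the rules described above.
  def executionOrder(gems: Seq[Gem]): Seq[String] = {
    val byName = gems.map(g => g.name -> g).toMap

    // Leaves are gems that no other gem lists as upstream.
    val hasDownstream = gems.flatMap(_.upstream).toSet
    // sortBy is stable, so leaves with equal phases keep declaration order.
    val leaves = gems.filterNot(g => hasDownstream.contains(g.name)).sortBy(_.phase)

    // Depth-first walk: emit upstream gems first, then the gem itself.
    // Gems that were already emitted for an earlier leaf are skipped.
    def visit(name: String, emitted: Vector[String]): Vector[String] =
      if (emitted.contains(name)) emitted
      else byName(name).upstream.foldLeft(emitted)((acc, up) => visit(up, acc)) :+ name

    // Only the leaf phases drive the ordering; upstream phases are ignored.
    leaves.foldLeft(Vector.empty[String])((acc, leaf) => visit(leaf.name, acc))
  }

  // Example pipeline with the same shape as the generated code shown earlier.
  val pipeline: Seq[Gem] = Seq(
    Gem("my_orders", phase = 0),
    Gem("Repartition_1", phase = 0, upstream = Seq("my_orders")),
    Gem("Write_CSV", phase = 0, upstream = Seq("Repartition_1")),
    Gem("SchemaTransform_1", phase = 1, upstream = Seq("my_orders"))
  )
  // executionOrder(pipeline) yields:
  // my_orders, Repartition_1, Write_CSV, SchemaTransform_1
}
```

In this model, giving `SchemaTransform_1` phase 0 and `Write_CSV` phase 1 moves `SchemaTransform_1` ahead of `Repartition_1` and `Write_CSV`, while `my_orders` still runs first because both branches depend on it.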