Configuration
Prophecy allows you to define configurations that control various aspects of your Pipeline.
Prophecy IDE allows you to define three kinds of configurations:
Spark Configuration
Set runtime Spark configurations as name-value pairs. The name-value pairs are applied to the Spark runtime configuration as spark.conf.set(name, value).
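For example, a hypothetical pair such as spark.sql.shuffle.partitions = 8 entered in the Spark Configuration tab has the same effect as the following call (a minimal sketch; the name and value are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical name-value pair; every pair in the Spark Configuration tab is applied this way.
spark.conf.set("spark.sql.shuffle.partitions", "8")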
Hadoop Configuration
Set Hadoop configurations as name-value pairs. The name-value pairs are applied to the Hadoop configuration as spark.sparkContext.hadoopConfiguration.set(name, value).
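Likewise, a hypothetical Hadoop pair such as fs.s3a.connection.maximum = 100 is applied to the Hadoop configuration. In PySpark the equivalent call reaches the Hadoop configuration through the underlying Java SparkContext (a sketch with an illustrative key and value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical Hadoop name-value pair; in Scala this is
# spark.sparkContext.hadoopConfiguration.set(name, value), as noted above.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.connection.maximum", "100")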
Pipeline Configuration
Config values that can be set at the Pipeline level and then accessed inside any component of the Pipeline. Multiple configuration instances can be created per Pipeline.
Syntax for using configuration inside Gems:
- Visual language SQL: '$config_name'
- Visual language Scala/Python: Config.config_name
- Spark expression with visual language SQL: expr('$config_name')
- Spark expression with visual language Scala/Python: expr(Config.config_name)
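As a quick sketch of the Scala/Python form (report_name and test_expression are the configs used in the examples below, Config is the generated configuration object shown at the end of this page, and the function name is only illustrative), a Gem expression can mix a plain string config with a Spark-expression config:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import expr, lit

def Add_config_columns(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.select(
        lit(Config.report_name).alias("report_name"),  # string config used as a literal value
        expr(Config.test_expression).alias("amounts")  # Spark-expression config evaluated with expr()
    )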
Examples for Pipeline level configurations
Now let's use the configurations defined above in the Pipeline below.
Using Config in limit Gem
SQL Visual Language
In the image below, '$num_top_customers' fetches the integer value defined in the configuration.
Scala/Python Visual Language
In the image below, Config.num_top_customers fetches the integer value defined in the configuration.
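As a rough sketch (not necessarily Prophecy's exact generated code), the Python component code for such a Limit Gem simply passes the configured integer through; num_top_customers is the config referenced above and Config is the generated configuration object:

from pyspark.sql import SparkSession, DataFrame

def Limit(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # The Limit Gem passes the configured integer straight to DataFrame.limit.
    return in0.limit(Config.num_top_customers)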
Using Spark-expression Config type in Gem
Here we use a Spark expression taken directly from a config value to populate a column.
SQL Visual Language
In the image below:
(1) amounts -> expr('$test_expression') comes from a configuration of type Spark-expression
(2) report_name -> '$report_name' comes from a configuration of type string
Scala/Python Visual Language
In the image below:
(1) amounts -> expr(Config.test_expression) comes from a configuration of type Spark-expression
(2) report_name -> Config.report_name comes from a configuration of type string
Similarly, configurations defined with type Spark-expression can be used directly in Filter, Join, Reformat, and other Gems.
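For instance, a Filter Gem could take its condition from a hypothetical Spark-expression config named order_filter (holding, say, orders > 10); a minimal sketch of the resulting Python component code:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import expr

def Filter(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # order_filter is a hypothetical Spark-expression config, e.g. "orders > 10";
    # switching configuration instances changes the condition without editing the Pipeline.
    return in0.filter(expr(Config.order_filter))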
Using config in paths for Source/Target Gems
Config can also be used to refer to paths. This comes in handy when you have separate DEV, QA, and PROD data and want to configure the Dataset (or, more generally, the Job runs) based on the environment in which the Pipeline is running.
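A minimal sketch, assuming a hypothetical string config base_path that holds the environment-specific location (for example a DEV, QA, or PROD bucket prefix) and a hypothetical dataset folder name:

from pyspark.sql import SparkSession, DataFrame

def Source_customers(spark: SparkSession) -> DataFrame:
    # The same Source Gem reads DEV, QA, or PROD data depending on the
    # base_path value in the selected configuration instance.
    return spark.read.parquet(Config.base_path + "/customers")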
Pipeline Configuration instances
Different configuration instances can be defined as required. This comes in handy when the Pipeline needs to run with different configurations in different environments or for different users.
New instances can be configured to override the default values, as shown in the image below:
Using a particular configuration instance for interactive runs
For interactive runs, the configuration instance can be selected as shown in the image below.
Using configuration instances in Jobs
Particular instances can also be configured in Databricks Jobs.
Overriding configuration values in Jobs
Specific values from a configuration instance can be overridden as shown in the images below:
Code
All configuration instances and values are automatically converted to code as well. Default configurations are stored as code, and specific instance overrides are stored as JSON files, as shown in the image below.
Scala Config code
Python Config code
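As a simplified, hedged sketch of what the generated Python Config code looks like (field names follow this page's examples, the default values are hypothetical, and the real generated code contains additional scaffolding), it is essentially a class with one attribute per configuration value:

class Config:
    def __init__(self, num_top_customers: int = 10, report_name: str = "test_report",
                 test_expression: str = "orders * 0.1"):
        # One attribute per Pipeline configuration value; configuration-instance
        # JSON files override these defaults for a specific instance.
        self.num_top_customers = num_top_customers
        self.report_name = report_name
        self.test_expression = test_expression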
Component code
- Python
- Scala
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, expr, lit

def Reformat(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.select(
        col("customer_id"),
        col("orders"),
        col("account_length_days"),
        expr(Config.test_expression).alias("amounts"),
        lit(Config.report_name).alias("report_name")
    )
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, lit}

object Reformat {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.select(
      col("customer_id"),
      col("orders"),
      col("account_length_days"),
      expr(Config.test_expression).as("amounts"),
      lit(Config.report_name).as("report_name")
    )
}