Skip to main content

SchemaTransform

Spark Gem

Use the SchemaTransform gem to apply transformations to columns from the incoming DataFrame. This page describes the different transformations—or operations—that you can use in this gem.

Example usage of SchemaTransform

info

Unlike Reformat which is a set operation where all the transforms are applied in parallel, transformations here are applied in order. Reformat is a SQL select and is preferable when making many changes.

Operations

Add/Replace Expression

Add a new column or replace an existing one based on an expression.

ParameterDescription
New ColumnOutput column name
ExpressionSQL expression to generate values in the new column

Drop Column

Remove a column from next stage of the Pipeline.

ParameterDescription
Column to dropColumn to be dropped

Rename Column

Rename an existing column downstream in the Pipeline.

ParameterDescription
Old Column NameColumn to be renamed
New Column NameOutput column name

Add If Missing

Provide a default value for a column if it is missing from the source.

ParameterDescription
Source Column NameColumn that contains missing values
Default Value (if missing)The value that will replace missing values

Add Rule

Use a business rule in your Pipeline. Visit the Business rules engine page to learn about business rules.

ParameterDescription
New ColumnThe column that the business rule will apply to
RuleThe business rule that contains the logic that will populate the new column values

Spark Code

def transform(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn("business_date", to_date(lit("2022-05-05"), "yyyy-MM-dd"))\
.withColumnRenamed("bonus_rate", "bonus")\
.drop("slug")

Advanced Import

The Advanced Import feature allows you to bulk import statements that are structured similarly to CSV/TSV files. This can be useful if you have your expressions/transformation logic in another format and just want to quickly configure a SchemaTransform gem based on existing logic.

Using Advanced Import

  1. Click the Advanced tab in the SchemaTransform gem.
  2. Enter the expressions into the text area.

Advanced import mode

Format

The format of these expressions is op_type,target_name,target_expr, where op_type is the type of operation (see below); target_name is the desired new column name and target_expr is the Spark expression that will be used to generate the new column. Each op_type has a different number of extra columns that have to be provided, see below for more details.

caution

For target_expr values that contain a comma , or span multiple lines, you must surround them by `` on either side. For example:

addrep,customer_id,customer_id
addrep,full_name,``concat(first_name, ' ', last_name)``

Advanced Import Operation types

Operation TypeAdvanced Import name:Arguments:Example
Add/Replaceaddrep2addrep,foo,CAST(NULL as int)
Dropdrop1drop bar
Renamerename2rename,foo,bar
Add if missingmissing2missing,foo,current_timestamp()
Add ruleaddrule1addrule, PromoCodeRule()