SchemaTransform

Spark Gem

SchemaTransform is used to add, edit, rename or drop columns from the incoming DataFrame.

info

Unlike Reformat, which is a set operation where all the transforms are applied in parallel, the transformations in SchemaTransform are applied in order. Reformat is a SQL SELECT and is preferable when making many changes.
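Because the transforms run in order, a later expression can reference a column created by an earlier one. A minimal PySpark sketch of this behavior (the column names are illustrative, not the Gem's generated code):

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, expr

def transform(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Applied sequentially: "total" already exists when "total_with_tax" is computed.
    return in0 \
        .withColumn("total", col("price") * col("quantity")) \
        .withColumn("total_with_tax", expr("total * 1.08"))
```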

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Operation | Add/Replace Column, Rename Column, or Drop Column | Required if a transformation is added |
| New Column | Output column name (when the Add/Replace operation is selected) | Required if Add/Replace Column is selected |
| Expression | Expression to generate the new column (when the Add/Replace operation is selected) | Required if Add/Replace Column is selected |
| Old Column Name | Column to be renamed (when the Rename operation is selected) | Required if Rename Column is selected |
| New Column Name | Output column name (when the Rename operation is selected) | Required if Rename Column is selected |
| Column to drop | Column to be dropped (when the Drop operation is selected) | Required if Drop Column is selected |

Operation types

| Operation Type | Description |
|---|---|
| Add/Replace | Adds a new column or replaces an existing one based on an expression. |
| Drop | Removes a single column from the next stages of the pipeline. This is useful if you need 9 out of 10 columns, for example. |
| Rename | Renames an existing column. |
| Add if Missing | Provides a default value for a column if it's missing from the source. For example, if you read from a CSV file daily and want to ensure a column has a value even when it's absent from the source files, use this option. |
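Add if Missing only creates the column when it is absent from the incoming schema. A minimal PySpark sketch of that behavior (the column name and default value are illustrative):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import current_timestamp

def add_if_missing(df: DataFrame) -> DataFrame:
    # Add "updated_at" only when the source did not provide it; existing values are kept as-is.
    if "updated_at" not in df.columns:
        df = df.withColumn("updated_at", current_timestamp())
    return df
```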

Example

Example usage of SchemaTransform

Spark Code

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import lit, to_date

def transform(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0 \
        .withColumn("business_date", to_date(lit("2022-05-05"), "yyyy-MM-dd")) \
        .withColumnRenamed("bonus_rate", "bonus") \
        .drop("slug")
```
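As a rough usage sketch (the input columns are hypothetical, chosen to match the snippet above, and an active SparkSession named spark is assumed), the generated function can be applied to any DataFrame:

```python
df = spark.createDataFrame(
    [("a1", 0.10, "emp-a1"), ("b2", 0.20, "emp-b2")],
    ["employee_id", "bonus_rate", "slug"],
)
out = transform(spark, df)
out.printSchema()  # columns: employee_id, bonus, business_date ("slug" is dropped)
```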

Advanced Import

The Advanced Import feature lets you bulk-import statements structured like CSV/TSV rows. This can be useful if you keep your expressions or transformation logic in another format and want to quickly configure a SchemaTransform Gem from that existing logic.

Using Advanced Import

  1. Click the Advanced button in the SchemaTransform Gem UI

Advanced import toggle

  2. Enter the expressions into the text area using the format described below:

Advanced import mode

  3. Use the button at the top (labeled Expressions) to switch back to the expressions view. This will translate the expressions from the CSV format to the table format and show any errors detected.

Format

The format of these expressions is op_type,target_name,target_expr, where op_type is the type of operation (see below), target_name is the desired new column name, and target_expr is the Spark expression used to generate the new column. Each op_type takes a different number of arguments; see the table below for details.

caution

For target_expr values that contain a comma or span multiple lines, you must surround them with backticks (``) on either side. For example:

addrep,customer_id,customer_id
addrep,full_name,``concat(first_name, ' ', last_name)``

Advanced Import Operation types

| Operation Type | Advanced Import name | Arguments | Example |
|---|---|---|---|
| Add/Replace | addrep | 2 | addrep,foo,CAST(NULL as int) |
| Drop | drop | 1 | drop,bar |
| Rename | rename | 2 | rename,foo,bar |
| Add if missing | missing | 2 | missing,foo,current_timestamp() |
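Putting this together, the Spark code example above could be expressed in the Advanced Import format roughly as follows (assuming rename takes the old name followed by the new name, and using backticks around the first expression because it contains a comma):

```text
addrep,business_date,``to_date('2022-05-05', 'yyyy-MM-dd')``
rename,bonus_rate,bonus
drop,slug
```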