Flatten Schema

When processing raw data, it can be useful to flatten complex data types like Structs and Arrays into simpler, flatter schemas.

The FlattenSchema gem

The Input

FlattenSchema works on DataFrames that have nested columns that you'd like to extract into a flat schema.

For example, consider an input schema like this:

Input schema

The data looks like this:

Input data

We want to extract count and all of the columns from the structs in events into a flattened schema.
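To make the goal concrete, here is a minimal plain-Python sketch of the transformation. It is not the Gem's implementation. It assumes that events is an array of structs and that each array element becomes its own output row; the field names event_type and timestamp, the sample values, and the helper flatten_events are hypothetical and used only for illustration.

```python
def flatten_events(records):
    """Yield one flat row per element of each record's `events` array."""
    for record in records:
        for event in record["events"]:
            # Top-level columns are copied onto every output row.
            row = {"count": record["count"]}
            # Each struct field becomes its own top-level column.
            # The "events_" prefix is an assumed naming choice.
            for field, value in event.items():
                row[f"events_{field}"] = value
            yield row


# Hypothetical input: one record holding two nested events.
records = [
    {
        "count": 2,
        "events": [
            {"event_type": "click", "timestamp": 1},
            {"event_type": "view", "timestamp": 2},
        ],
    },
]

rows = list(flatten_events(records))
for row in rows:
    print(row)
```

Each printed row contains only scalar values, which is the shape the flattened schema is meant to have.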

The Expressions

After adding a FlattenSchema Gem to your Pipeline, click the column names you wish to extract and they'll be added to the Expressions section. Once a column is added, you can edit the Target Column for its row to rename the column in the output.

Adding Expressions

The Columns Delimiter dropdown controls how the names of the new columns are derived. Currently, dashes and underscores are supported.
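As a rough illustration of what a delimiter-based naming rule could look like, the sketch below joins the parts of a nested column path with the chosen delimiter. The exact rule the Gem applies is not specified here, so treat this as an assumption; the helper derive_column_name and the path events.event_type are hypothetical.

```python
SUPPORTED_DELIMITERS = ("-", "_")


def derive_column_name(path, delimiter="_"):
    """Join the parts of a dotted column path with the given delimiter."""
    if delimiter not in SUPPORTED_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    return delimiter.join(path.split("."))


dash_name = derive_column_name("events.event_type", "-")
underscore_name = derive_column_name("events.event_type", "_")
print(dash_name)
print(underscore_name)
```

A column that is already at the top level, such as count, has no path parts to join and keeps its name under either delimiter.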

The Output

If you check the Output tab in the Gem, you'll see the schema created from the selected columns.

Output schema

And here's what the output data looks like:

Output interim

No more nested structures!

Info: For more advanced use cases, the Spark explode function is available to use in the Reformat Gem, the Custom SQL Gem, or anywhere else that accepts Spark expressions.
