Reformat

Reformat is often used to edit the values of one or more columns, typically with simple expressions and functions. Common cases include using substring to extract part of an incoming string, and using a case when statement to add columns that summarize conditions on the incoming data. It is also used to rename columns and to select a subset of columns.
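As a rough illustration of these uses, the sketch below expresses a pass-through, a rename, a substring extraction, and a case when as a single Spark select. The object name, the column names (customer_id, phone, amount), and the 1000 threshold are assumptions made up for this example; they do not come from a real pipeline.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, substring, when}

object ReformatSketch {
  // Column names and the 1000 threshold are illustrative assumptions.
  def apply(in: DataFrame): DataFrame =
    in.select(
      col("customer_id"),                             // pass a column through unchanged
      col("phone").as("phone_number"),                // rename a column
      substring(col("phone"), 1, 3).as("area_code"),  // extract part of a string (positions are 1-based)
      when(col("amount") > 1000, "high")              // CASE WHEN summarizing a condition
        .otherwise("standard")
        .as("amount_band")
    )
}
```

Only the four listed columns appear in the output; every other incoming column is dropped by the projection.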

Note: Any columns not selected explicitly are removed. Use Select All in the left bar to add all columns by default.

When there are no columns selected, all columns are passed through to the output. However, once you start selecting some columns, only the selected columns are output.
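The pass-through rule can be pictured with a small sketch. The helper `project` below is hypothetical, written only to mirror the behavior described here; it is not part of the Spark API and is not claimed to be how Reformat is implemented internally.

```scala
import org.apache.spark.sql.{Column, DataFrame}

object SelectionSketch {
  // `project` is a hypothetical helper used only to illustrate the rule.
  def project(in: DataFrame, selected: Seq[Column]): DataFrame =
    if (selected.isEmpty) in        // nothing selected: every column passes through
    else in.select(selected: _*)    // otherwise only the selected columns are output
}
```

For example, `SelectionSketch.project(df, Seq(col("order_id")))` would keep only order_id, while `SelectionSketch.project(df, Seq.empty)` would return df unchanged.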

Note (implementation): Reformat converts to a SQL Select or, in relational terms, a projection. When changing (adding, editing, or deleting) only one or two columns, you can instead use SchemaTransformer, which uses the underlying withColumn construct. However, Reformat is best for performance. For more details, refer to the Spark APIs.
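To make the difference between the two constructs concrete, here is a minimal sketch in plain Spark. The object and method names are invented for this example, and the column names are assumptions; the sketch shows the general select and withColumn calls, not the exact code either component generates.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}

object ProjectionVsWithColumn {
  // select: one projection that lists every output column; unlisted columns are dropped.
  def viaSelect(in: DataFrame): DataFrame =
    in.select(col("order_id"), upper(col("country_code")).as("country_code"))

  // withColumn: replaces (or adds) one column and keeps all the others.
  def viaWithColumn(in: DataFrame): DataFrame =
    in.withColumn("country_code", upper(col("country_code")))
}
```

The Spark documentation for withColumn notes that each call introduces a projection internally, so chaining many calls can produce large query plans. That is one reason a single select tends to be preferred when many columns change.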

Example

Reformat gives you a spreadsheet-like interface where you can write expressions based on incoming columns to produce new columns.

Expression Builder

The expression builder helps you write expressions quickly. It suggests:

  • Incoming columns
  • Built-in operators
  • Built-in functions
  • User-defined functions
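The four kinds of suggestion map onto ordinary Spark expressions. The sketch below uses one of each in a single select. The UDF `normalizeEmail`, the object name, and the derived column names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, udf}

object ExpressionKindsSketch {
  // Hypothetical user-defined function: trims and lower-cases an email address.
  val normalizeEmail = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

  def apply(in: DataFrame): DataFrame =
    in.select(
      col("first_name"),                                                    // incoming column
      (col("amount") * 100).as("amount_cents"),                             // built-in operator
      concat_ws(" ", col("first_name"), col("last_name")).as("full_name"),  // built-in function
      normalizeEmail(col("email")).as("email")                              // user-defined function
    )
}
```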

The following image shows an expression for the column first_name being written, with the expression builder offering suggestions as you type. This makes it easy to see which functions are available.

Code

This component generates a simple select statement. However, the expressions can become very large: we frequently see tables with 1000+ columns, where the code for this component is hundreds of lines long.


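import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, current_date, datediff}

object PrepareComponent {

  // The result of in.select is a DataFrame, so that is the declared return type.
  def apply(spark: SparkSession, in: DataFrame): DataFrame = {
    import spark.implicits._

    // Project the output columns; any incoming column not listed here is dropped.
    val out = in.select(
      // Derived column: number of days since the account was opened.
      datediff(current_date(), col("account_open_date")).as("account_length_days"),
      // Columns passed through unchanged.
      col("order_id"),
      col("customer_id"),
      col("amount"),
      col("first_name"),
      col("last_name"),
      col("phone"),
      col("email"),
      col("country_code"),
      col("account_flags")
    )

    out

  }

}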
    
