Low Code Spark

Aggregate

The Aggregate component lets you group data by one or more columns and aggregate the rows within each group, i.e. the rows for which the group-by columns have the same values.

  • Group by with Aggregate Expressions
  • Pivots (documentation coming soon!)
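
As a minimal sketch of the underlying Spark pattern (the object, DataFrame, and column names below are illustrative, not taken from Prophecy), a group-by with an aggregate expression boils down to groupBy followed by agg:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupByAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupByAggregateSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Illustrative input: one row per sale
    val sales = Seq(
      ("Brand A", 10.0),
      ("Brand A", 5.0),
      ("Brand B", 7.5)
    ).toDF("brand", "sales_price")

    // Rows with the same value of the group-by column form one group;
    // the aggregate expression is computed once per group
    val perBrand = sales
      .groupBy(col("brand"))
      .agg(sum(col("sales_price")).as("total_sales"))

    perBrand.show()
    spark.stop()
  }
}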

Example

We’ll use TPC-DS query 19 as an example of a query that aggregates data (in its GROUP BY). Note that it groups the data by brand and manufacturer (i_brand, i_brand_id, i_manufact_id, i_manufact), and that it computes the sum of the extended sales prices for each group: Sum(ss_ext_sales_price).

SELECT i_brand_id              brand_id,
       i_brand                 brand,
       i_manufact_id,
       i_manufact,
       Sum(ss_ext_sales_price) ext_price
FROM   date_dim, 
       store_sales, 
       item, 
       customer, 
       customer_address, 
       store 
WHERE  d_date_sk = ss_sold_date_sk 
       AND ss_item_sk = i_item_sk 
       AND i_manager_id = 38 
       AND d_moy = 12 
       AND d_year = 1998 
       AND ss_customer_sk = c_customer_sk 
       AND c_current_addr_sk = ca_address_sk 
       AND Substr(ca_zip, 1, 5) <> Substr(s_zip, 1, 5) 
       AND ss_store_sk = s_store_sk 
GROUP  BY i_brand, 
          i_brand_id, 
          i_manufact_id, 
          i_manufact 
ORDER  BY ext_price DESC, 
          i_brand, 
          i_brand_id, 
          i_manufact_id, 
          i_manufact
LIMIT 100; 
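
The Aggregate component itself only covers the GROUP BY and the aggregate expression; in the visual pipeline, the joins and filters from the WHERE clause come from upstream components. As a rough sketch of what that upstream preparation could look like with the DataFrame API (PrepareSales is an illustrative name, and we assume the TPC-DS tables can be read with spark.table):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Sketch only: assumes the TPC-DS tables are registered and readable via spark.table.
object PrepareSales {

  def apply(spark: SparkSession): DataFrame =
    spark.table("store_sales")
      // Joins corresponding to the equality predicates in the WHERE clause
      .join(spark.table("date_dim"), col("d_date_sk") === col("ss_sold_date_sk"))
      .join(spark.table("item"), col("ss_item_sk") === col("i_item_sk"))
      .join(spark.table("customer"), col("ss_customer_sk") === col("c_customer_sk"))
      .join(spark.table("customer_address"), col("c_current_addr_sk") === col("ca_address_sk"))
      .join(spark.table("store"), col("ss_store_sk") === col("s_store_sk"))
      // Remaining filters from the WHERE clause
      .filter(col("i_manager_id") === 38 && col("d_moy") === 12 && col("d_year") === 1998)
      .filter(substring(col("ca_zip"), 1, 5) =!= substring(col("s_zip"), 1, 5))

}

The ORDER BY and LIMIT at the end of the query would likewise be handled by components downstream of the Aggregate.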

The equivalent query is built visually in Prophecy and looks like this:

Let’s open the SalesForBrands component and look at the Group By tab:


Here is the Aggregate tab:


Code

Here is the Spark code generated for the Aggregate component:


import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SalesForBrands {

  def apply(spark: SparkSession, in: DataFrame): DataFrame = {
    import spark.implicits._

    // Group by the brand and manufacturer columns, mirroring the
    // GROUP BY clause (and the SELECT aliases) of the SQL query
    val dfGroupBy = in.groupBy(col("i_brand").as("brand"),
                               col("i_brand_id").as("brand_id"),
                               col("i_manufact_id").as("i_manufact_id"),
                               col("i_manufact").as("i_manufact")
    )
    // Sum(ss_ext_sales_price) ext_price
    val out = dfGroupBy.agg(sum(col("ss_ext_sales_price")).as("ext_price"))

    out
  }

}
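
As a usage sketch (PrepareSales is the illustrative helper from the earlier sketch, and a SparkSession named spark is assumed to be in scope), the object is called like any other function that returns a DataFrame:

import org.apache.spark.sql.functions._

// Usage sketch: PrepareSales stands in for the output of the upstream components.
val joinedSales = PrepareSales(spark)
val brandSales  = SalesForBrands(spark, joinedSales)

// The SQL's ORDER BY ext_price DESC ... LIMIT 100, applied downstream of the Aggregate
brandSales.orderBy(col("ext_price").desc).show(100, truncate = false)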
    

Coming soon!