Aggregate

Allows you to group the data, apply aggregation methods, and perform a pivot operation.

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Target column (Aggregate tab) | Output column name of the aggregated column | True |
| Expression (Aggregate tab) | Aggregate function expression, e.g. `sum("amount")`, `count(*)`, `avg("amount")` | True |
| Target column (Group By tab) | Output column name of the grouped column | Required if a Pivot column is present |
| Expression (Group By tab) | Column expression to group on, e.g. `col("id")`, `month(col("order_date"))` | Required if a Target column (Group By) is present |
| Pivot column | Column name to pivot | False |
| Unique values | List of values in the Pivot column that will be translated to columns in the output DataFrame | False |
info

Providing Unique values while performing a pivot operation improves performance, since Spark does not have to first compute the list of distinct values of the Pivot column internally.

Examples

Aggregation without Grouping

Example usage of Aggregate - Aggregation without Grouping

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import count, lit

def total_orders(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Count all rows without any grouping.
    return in0.agg(count(lit(1)).alias("number_of_orders"))

Aggregation with Grouping

Example usage of Aggregate - Aggregation with Grouping

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, concat, count, lit, month, year

def orders_by_date(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Group by a derived "MM/YYYY" key built from the order date.
    df1 = in0.groupBy(
        concat(month(col("order_date")), lit("/"), year(col("order_date")))
        .alias("order_month(MM/YYYY)")
    )
    return df1.agg(count(lit(1)).alias("number_of_orders"))

Pivot

Example usage of Aggregate - Pivoting

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, concat, count, lit, month, year

def orders_by_date_N_status(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Group by a derived "MM/YYYY" key built from the order date.
    df1 = in0.groupBy(
        concat(month(col("order_date")), lit("/"), year(col("order_date")))
        .alias("order_month(MM/YYYY)")
    )
    # Pivot on order_status, supplying the unique values up front.
    df2 = df1.pivot("order_status", ["Approved", "Finished", "Pending", "Started"])
    return df2.agg(count(lit(1)).alias("number_of_orders"))
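Conceptually, the pivot step translates each listed status value into its own column of counts, one row per month. A plain-Python sketch of that reshaping, using hypothetical sample rows (the data below is an assumption for illustration, not part of the example above):

```python
from collections import Counter

# Hypothetical (order_month, order_status) pairs standing in for the input rows.
rows = [
    ("1/2023", "Approved"),
    ("1/2023", "Pending"),
    ("1/2023", "Approved"),
    ("2/2023", "Finished"),
]

# The "Unique values" list: each entry becomes an output column.
statuses = ["Approved", "Finished", "Pending", "Started"]

# Count orders per (month, status) pair...
counts = Counter(rows)

# ...then, for each month, translate every status into its own column.
pivoted = {
    m: {status: counts[(m, status)] for status in statuses}
    for m in {month for month, _ in rows}
}
```

Each dictionary in `pivoted` mirrors one output row of the pivoted DataFrame, with statuses that never occur for a month counted as zero (Spark would emit null there instead).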