
# Deduplicate

Spark Gem

Removes rows that have duplicate values in the specified columns.

## Parameters

| Parameter | Description | Required |
| --- | --- | --- |
| Dataframe | Input DataFrame | True |
| Row to keep | Which row to keep among duplicates (see the options below). Default is `Any`. | True |
| Deduplicate columns | Columns to consider when removing duplicate rows (not required for `Distinct Rows`) | True |
| Order columns | Columns to sort the DataFrame on before de-duplicating; used only when Row to keep is `First` or `Last` | False |

Options for Row to keep:

- `Any`: Keeps any one row among duplicates. Uses the underlying `dropDuplicates` construct.
- `First`: Keeps the first occurrence of the duplicate row.
- `Last`: Keeps the last occurrence of the duplicate row.
- `Unique Only`: Keeps only the rows that have no duplicates.
- `Distinct Rows`: Keeps all distinct rows. This is equivalent to performing a `df.distinct()` operation.
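To make the examples concrete, the snippets below can be read against a small, hypothetical input. The column names (`tran_id`, `customer_id`, `order_dt`) match the generated code that follows; the values are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: customer 101 has three rows, one of them an exact
# duplicate; customer 102 has a single row.
orders = spark.createDataFrame(
    [
        (1, 101, "2023-01-01"),
        (1, 101, "2023-01-01"),  # exact duplicate of the previous row
        (2, 101, "2023-02-15"),
        (3, 102, "2023-03-10"),
    ],
    ["tran_id", "customer_id", "order_dt"],
)
```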

## Examples


### Rows to keep - Any

Example usage of Deduplicate

```python
from pyspark.sql import SparkSession, DataFrame

def dedup(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.dropDuplicates(["tran_id"])
```
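Run against the hypothetical `orders` DataFrame above, this keeps one row per `tran_id`. Note that `dropDuplicates` makes no guarantee about which of the duplicate rows survives; when that matters, prefer `First` or `Last` together with Order columns.

```python
deduped = dedup(spark, orders)
deduped.count()  # 3: one of the two tran_id 1 rows is dropped, arbitrarily
```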

### Rows to keep - First

Example usage of Deduplicate - First

```python
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import col, lit, row_number

def earliest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's rows from the earliest order_dt, then keep row 1.
    return (
        in0.withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            ),
        )
        .filter(col("row_number") == lit(1))
        .drop("row_number")
    )
```
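A quick check against the hypothetical `orders` above. One caveat: `row_number` breaks ties on `order_dt` arbitrarily, so consider adding a tie-breaking column to the `orderBy` when the order columns are not unique within a partition.

```python
first_orders = earliest_cust_order(spark, orders)
first_orders.count()  # 2: the earliest row per customer_id
```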

### Rows to keep - Last

Example usage of Deduplicate - Last

```python
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import col, count, row_number

def latest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's rows from the earliest order_dt, count the rows
    # per customer, and keep the row whose number equals the count,
    # i.e. the last occurrence.
    return (
        in0.withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            ),
        )
        .withColumn("count", count("*").over(Window.partitionBy("customer_id")))
        .filter(col("row_number") == col("count"))
        .drop("row_number")
        .drop("count")
    )
```
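An equivalent single-window formulation, sketched here as an alternative rather than as what the Gem generates: sort descending and keep row 1. The helper name `latest_cust_order_desc` is hypothetical.

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, row_number

def latest_cust_order_desc(in0: DataFrame) -> DataFrame:
    # Same result with one window: number rows from the latest order_dt down.
    w = Window.partitionBy("customer_id").orderBy(col("order_dt").desc())
    return (
        in0.withColumn("row_number", row_number().over(w))
        .filter(col("row_number") == lit(1))
        .drop("row_number")
    )
```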

### Rows to keep - Unique Only

Example usage of Deduplicate - Unique

```python
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql.functions import col, count, lit

def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Count each customer's rows and keep only customers with exactly one.
    return (
        in0.withColumn("count", count("*").over(Window.partitionBy("customer_id")))
        .filter(col("count") == lit(1))
        .drop("count")
    )
```
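Against the hypothetical `orders` above, customer 101 appears three times and is therefore dropped entirely; only customer 102's single row remains.

```python
uniques = single_order_customers(spark, orders)
uniques.count()  # 1: only customer 102 has exactly one row
```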

### Rows to keep - Distinct Rows

Example usage of Deduplicate - Distinct

```python
from pyspark.sql import SparkSession, DataFrame

def distinct_rows(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.distinct()
```
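`distinct()` compares all columns, which is why no Deduplicate columns are needed for this mode; it behaves like `dropDuplicates()` called with no column list. Against the hypothetical `orders` above:

```python
distinct = distinct_rows(spark, orders)
distinct.count()  # 3: only the exact duplicate row is removed
```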