
Repartition

Spark Gem

This will repartition or coalesce the input DataFrame based on the specified configuration. There are four different repartitioning options:

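Hash Repartitioning

Repartitions the data across partitions based on the hash value of the specified key, so rows with the same key land in the same partition. The distribution is even only when the key values themselves are well spread.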

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Overwrite default partitions | Flag to overwrite default partitions | False |
| Number of partitions | Integer value specifying number of partitions | False |
| Repartition expression(s) | List of expressions to repartition by | True |

Generated Code

def hashRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5, col("customer_id"))
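
To make hash-based placement concrete, here is a minimal pure-Python sketch; it is not Spark's implementation. The helper `hash_partition` and the use of `zlib.crc32` as the hash are assumptions for illustration only, since Spark applies its own hash function internally. The sketch shows that rows sharing a key always end up in the same partition.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Group rows into buckets by a hash of row[key] (illustrative helper)."""
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash (assumption); Spark uses its own function.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


rows = [
    {"customer_id": cid, "amount": amount}
    for cid, amount in [(1, 10), (2, 20), (1, 30), (3, 40), (2, 50)]
]
parts = hash_partition(rows, "customer_id", 5)
```

Two different keys can still hash to the same bucket, which is one reason a skewed or low-cardinality key produces uneven partitions.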

Random Repartitioning

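Repartitions the data into the specified number of partitions without a partitioning key. Rows are spread across partitions in round-robin fashion rather than by value.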

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Number of partitions | Integer value specifying number of partitions | True |

Generated Code

def randomRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5)
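
As a rough illustration of key-less repartitioning, the following pure-Python sketch deals rows out to partitions in turn. The helper `round_robin_partition` is hypothetical, and the exact order in which Spark assigns rows may differ; the sketch only shows why partition sizes come out nearly equal regardless of the row values.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions one at a time (illustrative helper)."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        # The target depends only on the row's position, never on its value.
        partitions[index % num_partitions].append(row)
    return partitions


sizes = [len(p) for p in round_robin_partition(list(range(12)), 5)]
# sizes == [3, 3, 2, 2, 2]
```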

Range Repartitioning

Repartitions the data so that rows whose keys fall within the same range are placed in the same partition.

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Overwrite default partitions | Flag to overwrite default partitions | False |
| Number of partitions | Integer value specifying number of partitions | False |
| Repartition expression(s) with sorting | List of expressions to repartition by with corresponding sorting order | True |

Generated Code

def RepartitionByRange(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartitionByRange(5, col("customer_id").asc())
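
The idea behind range-based placement can be sketched in plain Python as below. This is an assumption-laden illustration: the helper `range_partition` is hypothetical and the boundaries are supplied by hand, whereas Spark estimates range boundaries by sampling the data. It shows how each row is routed to the partition whose key range contains it.

```python
import bisect


def range_partition(rows, key, boundaries):
    """Route rows by key range (illustrative helper).

    boundaries is a sorted list of upper bounds, giving
    len(boundaries) + 1 partitions in ascending key order.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_left finds the first boundary >= the key.
        partitions[bisect.bisect_left(boundaries, row[key])].append(row)
    return partitions


rows = [{"customer_id": cid} for cid in [5, 12, 27, 3, 18, 30]]
parts = range_partition(rows, "customer_id", [10, 20])
# Keys up to 10 go to partition 0, 11 to 20 to partition 1, the rest to 2.
```

Because the partitions are ordered by key range, sorting within each partition yields a globally sorted dataset.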

Coalesce

Reduces the number of partitions without shuffling the dataset.

Parameters

| Parameter | Description | Required |
|---|---|---|
| DataFrame | Input DataFrame | True |
| Number of partitions | Integer value specifying number of partitions | True |

Generated Code

def Coalesce(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.coalesce(5)
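
A simplified way to picture coalescing is as merging whole existing partitions instead of redistributing individual rows. The pure-Python sketch below assumes a hypothetical helper `coalesce` that groups neighbouring partitions; how Spark actually chooses which partitions to combine is not modelled here.

```python
def coalesce(partitions, num_partitions):
    """Merge whole partitions into fewer ones (illustrative helper)."""
    if num_partitions >= len(partitions):
        # Asking for more partitions than exist leaves the count unchanged.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Each source partition moves as a unit; its rows are never split up.
        merged[index * num_partitions // len(partitions)].extend(part)
    return merged


before = [[1, 2], [3], [4, 5, 6], [7], [8, 9], [10]]
merged = coalesce(before, 3)
# merged == [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
```

The merged partitions can be uneven in size, which is the trade-off for avoiding a shuffle.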
