Repartition gem

ProphecySparkBasicsPython 0.0.1+ | ProphecySparkBasicsScala 0.0.1+ | UC Dedicated Cluster 14.3+ | UC Standard Cluster 14.3+ | Livy 3.0.1+

The Repartition gem repartitions or coalesces the input DataFrame based on the specified configuration. It offers four repartitioning options:

Hash Repartitioning

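Distributes the rows across partitions based on the hash value of the specified key expressions, so rows with the same key land in the same partition. How even the result is depends on how evenly the key values themselves are distributed.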

Parameters

Parameter	Description	Required
DataFrame	Input DataFrame	True
Overwrite default partitions	Flag to overwrite default partitions	False
Number of partitions	Integer value specifying number of partitions	False
Repartition expression(s)	List of expressions to repartition by	True

Compiled code

Python

def hashRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5, col("customer_id"))

Scala

object hashRepartition {

  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5, col("customer_id"))

}
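
To make the idea behind hash repartitioning concrete, here is a minimal pure-Python sketch, assuming a list of dictionaries stands in for the DataFrame. The helper name `hash_partition` is hypothetical, and CRC32 is used only as a convenient stable hash; Spark uses its own internal hash function, so actual partition assignments will differ.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Group rows by hash(key) modulo num_partitions (illustrative only)."""
    partitions = defaultdict(list)
    for row in rows:
        # CRC32 is an assumption for this sketch, not the hash Spark uses.
        digest = crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return dict(partitions)


# Twelve rows spread over four distinct customer_id values.
rows = [{"customer_id": i % 4, "amount": i} for i in range(12)]
parts = hash_partition(rows, "customer_id", 5)
```

Every row with the same `customer_id` ends up in the same partition, which is the property that matters for key-based operations downstream. Some of the five partitions can stay empty when there are fewer distinct keys than partitions.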

Random Repartitioning

Repartitions the data into the specified number of partitions without a partitioning key, so rows are distributed regardless of their values.

Parameters

Parameter	Description	Required
DataFrame	Input DataFrame	True
Number of partitions	Integer value specifying number of partitions	True

Compiled code

Python

def randomRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5)

Scala

object randomRepartition {

  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5)

}
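
When no key is given, Spark spreads rows across partitions in a round-robin fashion. The sketch below illustrates that idea in plain Python, assuming a list stands in for the DataFrame; the helper name `round_robin_partition` is hypothetical and this is not Spark's implementation.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions one at a time, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


parts = round_robin_partition(list(range(12)), 5)
sizes = [len(p) for p in parts]
```

Partition sizes differ by at most one row, but rows that share a value are not guaranteed to sit together.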

Range Repartitioning

Repartitions the data so that rows whose keys fall within the same range are placed in the same partition.

Parameters

Parameter	Description	Required
DataFrame	Input DataFrame	True
Overwrite default partitions	Flag to overwrite default partitions	False
Number of partitions	Integer value specifying number of partitions	False
Repartition expression(s) with sorting	List of expressions to repartition by with corresponding sorting order	True

Compiled code

Python

def RepartitionByRange(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartitionByRange(5, col("customer_id").asc())

Scala

object RepartitionByRange {

  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartitionByRange(5, col("customer_id").asc())

}
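
The following pure-Python sketch shows what range repartitioning means, assuming a list of dictionaries stands in for the DataFrame and ascending sort order. The helper name `range_partition` is hypothetical. It derives the range boundaries from the full sorted key list, whereas Spark estimates them by sampling, so real partition boundaries can vary between runs.

```python
from bisect import bisect_right


def range_partition(rows, key, num_partitions):
    """Place each row in the partition whose key range contains its key."""
    partitions = [[] for _ in range(num_partitions)]
    if not rows:
        return partitions
    keys = sorted(row[key] for row in rows)
    # num_partitions - 1 boundaries taken at evenly spaced positions.
    bounds = [keys[(i * len(keys)) // num_partitions]
              for i in range(1, num_partitions)]
    for row in rows:
        # bisect_right yields an index in 0..num_partitions - 1.
        partitions[bisect_right(bounds, row[key])].append(row)
    return partitions


# Twenty rows whose customer_id values 0..19 arrive in scrambled order.
rows = [{"customer_id": (i * 7) % 20} for i in range(20)]
parts = range_partition(rows, "customer_id", 5)
ids_per_partition = [sorted(r["customer_id"] for r in p) for p in parts]
```

Each partition holds a contiguous slice of the key space, so every key in one partition is smaller than every key in the next.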

Coalesce

Reduces the number of partitions without shuffling the dataset.

Parameters

Parameter	Description	Required
DataFrame	Input DataFrame	True
Number of partitions	Integer value specifying number of partitions	True

Compiled code

Python

def Coalesce(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.coalesce(5)

Scala

object Coalesce {

  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.coalesce(5)

}
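
As a rough illustration of why coalescing avoids a shuffle, the sketch below merges whole existing partitions into fewer groups instead of redistributing individual rows. It assumes a list of lists stands in for the partitioned DataFrame; the helper name `coalesce_partitions` and the grouping rule are hypothetical and do not reflect how Spark chooses which partitions to merge.

```python
def coalesce_partitions(partitions, num_partitions):
    """Merge neighboring partitions until at most num_partitions remain."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Coalescing never increases the partition count.
    target = min(num_partitions, len(partitions))
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        # Each source partition moves as a whole into one target partition.
        merged[i * target // len(partitions)].extend(part)
    return merged


parts = [[i] for i in range(8)]          # eight single-row partitions
merged = coalesce_partitions(parts, 3)   # reduce to three partitions
unchanged = coalesce_partitions(parts, 10)
```

Because partitions are only concatenated, the resulting partitions can be uneven in size, and asking for more partitions than currently exist leaves the count unchanged.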
