Repartition
Spark Gem
This Gem repartitions or coalesces the input DataFrame based on the specified configuration. There are four repartitioning options:
Hash Repartitioning
Repartitions the data evenly across partitions based on the hash of the specified key expression(s), so rows with the same key land in the same partition.
Parameters
Parameter | Description | Required |
---|---|---|
DataFrame | Input DataFrame | True |
Overwrite default partitions | Flag to specify an explicit number of partitions instead of the Spark default | False |
Number of partitions | Integer value specifying number of partitions | False |
Repartition expression(s) | List of expressions to repartition by | True |
Generated Code
- Python

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def hashRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5, col("customer_id"))
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object hashRepartition {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5, col("customer_id"))
}
```
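A minimal PySpark sketch of the effect (assuming a local SparkSession; the sample data and the `customer_id` column are illustrative, matching the generated code above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, spark_partition_id

spark = SparkSession.builder.getOrCreate()
# Illustrative data; the column name mirrors the generated code.
df = spark.createDataFrame(
    [(i, f"name_{i}") for i in range(100)],
    ["customer_id", "name"],
)

out = df.repartition(5, col("customer_id"))
print(out.rdd.getNumPartitions())  # 5

# Rows with the same customer_id hash always land in the same partition.
out.groupBy(spark_partition_id().alias("pid")).count().show()
```

Hash partitioning keeps all rows for a given key together, at the cost of possible skew when a few keys dominate the data.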
Random Repartitioning
Distributes the data evenly across the given number of partitions using round-robin assignment, without a partitioning key.
Parameters
Parameter | Description | Required |
---|---|---|
DataFrame | Input DataFrame | True |
Number of partitions | Integer value specifying number of partitions | True |
Generated Code
- Python

```python
from pyspark.sql import DataFrame, SparkSession

def randomRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5)
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object randomRepartition {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5)
}
```
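As a rough illustration (a sketch, assuming a local SparkSession), round-robin repartitioning rebalances even a deliberately skewed input:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()
skewed = spark.range(100).coalesce(1)  # all rows in a single partition

balanced = skewed.repartition(5)  # round-robin shuffle, no key required
print(balanced.rdd.getNumPartitions())  # 5

# Each partition receives roughly 100 / 5 = 20 rows.
balanced.groupBy(spark_partition_id().alias("pid")).count().show()
```

Round-robin distribution yields near-uniform partition sizes regardless of the key distribution, which makes this option a good fit for fixing skew before an expensive stage.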
Range Repartitioning
Range-partitions the data so that rows whose keys fall within the same range are placed in the same partition (and therefore on the same worker).
Parameters
Parameter | Description | Required |
---|---|---|
DataFrame | Input DataFrame | True |
Overwrite default partitions | Flag to specify an explicit number of partitions instead of the Spark default | False |
Number of partitions | Integer value specifying number of partitions | False |
Repartition expression(s) with sorting | List of expressions to repartition by, each with its sort order | True |
Generated Code
- Python

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def RepartitionByRange(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartitionByRange(5, col("customer_id").asc())
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionByRange {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartitionByRange(5, col("customer_id").asc())
}
```
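A sketch of how the resulting layout can be inspected (assuming a local SparkSession; the `customer_id` column is illustrative). Each partition should hold a contiguous, approximately equal range of keys, since Spark derives the range boundaries by sampling:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "customer_id")

ranged = df.repartitionByRange(5, F.col("customer_id").asc())

# Per-partition min/max shows contiguous key ranges, e.g. 0-19, 20-39, ...
# (boundaries come from sampling, so the splits are approximate).
(ranged.withColumn("pid", F.spark_partition_id())
    .groupBy("pid")
    .agg(F.min("customer_id"), F.max("customer_id"))
    .orderBy("pid")
    .show())
```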
Coalesce
Reduces the number of partitions without shuffling the dataset.
Parameters
Parameter | Description | Required |
---|---|---|
DataFrame | Input DataFrame | True |
Number of partitions | Integer value specifying number of partitions | True |
Generated Code
- Python

```python
from pyspark.sql import DataFrame, SparkSession

def Coalesce(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.coalesce(5)
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Coalesce {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.coalesce(5)
}
```
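A short sketch of the no-shuffle behavior (assuming a local SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).repartition(10)

out = df.coalesce(5)               # merges existing partitions; no shuffle
print(out.rdd.getNumPartitions())  # 5

# coalesce can only reduce the partition count: asking for more than
# the current number leaves the partitioning unchanged.
print(df.coalesce(20).rdd.getNumPartitions())  # still 10
```

Because existing partitions are merged in place rather than shuffled, coalesce is cheaper than a full repartition when all you need is fewer partitions; the merged partitions can, however, end up unevenly sized.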