Skip to main content

Join

Spark Gem

Joins 2 or more DataFrames based on the given configuration.

Parameters

ParameterDescriptionRequired
DataFrame 1First input DataFrameTrue
DataFrame 2Second input DataFrameTrue
DataFrame NNth input DataFrameFalse
Join Condition (Conditions tab)The join condition specifies how the rows will be combined.True
Type (Conditions tab)The type of JOIN (Inner, Full Outer, Left , Right , Left Semi, Left Anti)True
Where Clause (Conditions tab)Filter applied after the Join operationFalse
Target column (Expressions)Output column nameFalse
Expression (Expressions)Expression to compute target column. If no expression is given, then all columns from all DataFrames would reflect in output.False
Hint Type (Advanced)The type of Join Hint (Broadcast, Merge, Shuffle Hash, Shuffle Replicate NL or None). To read more about join hints click hereFalse
Propagate All Columns (Advanced)If true, all columns from that DataFrame would be propagated to output DataFrame. Equivalent to selecting df.* for the selected DataFrame.False

Adding a new input

  1. Click on the plus icon to add a new input.
  2. Then add your condition expression for the newly added input. Example usage of Join - Add new input to join gem

Examples

Example 1 - Join with three DataFrame inputs

Example usage of Join - Join three DataFrame inputs

def Join_1(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.alias("in0")\
.join(in1.alias("in1"), (col("in0.customer_id") == col("in1.customer_id")), "inner")\
.join(in2.alias("in2"), (col("in1.customer_id") == col("in2.customer_id")), "inner")

Example 2 - Join with Hints

Join hints allow users to suggest the join strategy that Spark should use. For a quick overview, see Spark's Join Hints documentation.

Example usage of Join - Join with hints

def Join_1(spark: SparkSession, in0: DataFrame, in1: DataFrame, in2: DataFrame) -> DataFrame:
df1 = in1.hint("merge")

return in0\
.alias("in0")\
.hint("broadcast")\
.join(df1.alias("in1"), col("in0.customer_id") == col("in1.customer_id"), "inner")\
.join(in2.alias("in2"), col("in0.customer_id") == col("in1.customer_id"), "inner")
object Join_1 {
def apply(spark: SparkSession, in0: DataFrame, in1: DataFrame, in2: DataFrame): DataFrame =
in0
.as("in0")
.hint("broadcast")
.join(in1.as("in1").hint("merge"), col("in0.customer_id") === col("in1.customer_id"), "inner")
.join(in2.as("in2"), col("in1.customer_id") === col("in2.customer_id"), "inner")
}

Example 3 - Join with Propagate Columns

Step 1 - Specify join condition

def Join_1(spark: SparkSession, in0: DataFrame, in1: DataFrame, ) -> DataFrame:
return in0\
.alias("in0")\
.join(in1.alias("in1"), (col("in0.customer_id") == col("in1.customer_id")), "inner")\
.select(*[col("in1.email").alias("email"), col("in1.phone").alias("phone")], col("in0.*"))
object Join_1 {

def apply(spark: SparkSession, in0: DataFrame, in1: DataFrame): DataFrame =
in0
.as("in0")
.join(in1.as("in1"), col("in0.customer_id") === col("in1.customer_id"), "inner")
.select(col("in1.phone").as("phone"), col("in1.email").as("email"), col("in0.*"))

}

Types of Join

Suppose there are 2 tables TableA and TableB with only 2 columns (Ref, Data) and following contents:

Table A

RefData
1Data_A11
1Data_A12
1Data_A13
2Data_A21
3Data_A31

Table B

RefData
1Data_B11
2Data_B21
2Data_B22
2Data_B23
4Data_B41

INNER JOIN

Inner Join on column Ref will return columns from both the tables and only the matching records as long as the condition is satisfied:

RefDataRefData
1Data_A111Data_B11
1Data_A121Data_B11
1Data_A131Data_B11
2Data_A212Data_B21
2Data_A212Data_B22
2Data_A212Data_B23

LEFT JOIN

Left Join (or Left Outer join) on column Ref will return columns from both the tables and match records with records from the left table. The result-set will contain null for the rows for which there is no matching row on the right side.

RefDataRefData
1Data_A111Data_B11
1Data_A121Data_B11
1Data_A131Data_B11
2Data_A212Data_B21
2Data_A212Data_B22
2Data_A212Data_B23
3Data_A31NULLNULL

RIGHT JOIN

Right Join (or Right Outer join) on column Ref will return columns from both the tables and match records with records from the right table. The result-set will contain null for the rows for which there is no matching row on the left side.

RefDataRefData
1Data_A111Data_B11
1Data_A121Data_B11
1Data_A131Data_B11
2Data_A212Data_B21
2Data_A212Data_B22
2Data_A212Data_B23
NULLNULL4Data_B41

FULL OUTER JOIN

Full Outer Join on column Ref will return columns from both the tables and matching records with records from the left table and records from the right table . The result-set will contain NULL values for the rows for which there is no matching.

RefDataRefData
1Data_A111Data_B11
1Data_A121Data_B11
1Data_A131Data_B11
2Data_A212Data_B21
2Data_A212Data_B22
2Data_A212Data_B23
3Data_A31NULLNULL
NULLNULL4Data_B41

LEFT SEMI JOIN

Left Semi Join on column Ref will return columns only from left table and matching records only from left table.

RefData
1Data_B11
1Data_B21
1Data_B22
2Data_B23
3Data_B41

LEFT ANTI JOIN

Left anti join on column Ref will return columns from the left for non-matched records :

RefDataRefData
3Data_A31NULLNULL