Developer

RowDistributor

RowDistributor is used to split incoming data into multiple streams. Often incoming data has to be duplicated, or data is to be split into multiple streams - perhaps for different busines units or geographies.

.

Example Usage

You can create multiple outputs, where each output is represented by a tab. The tab name is the name of the output port. In each tab, you put an expression that filters the rows that should go to that port. There might be rows that go to multiple ports based on the conditions.

Let’s look at example data about customers of an e-commerce company:

Incoming data

Now, we see that there are various values of country_code, and we say we want to split out US, Canada, Russia and put all the others in the other bucket. For this we’d create a structure like this.

.

.

To create this, we’ll create a RowDistributor with 4 outputs like this:

.

.

Here, in the first three tabs, we’re picking one country each. However, in the last we want to add a catch all statement for the other countries:

.

.

Example Code

Here is the generated code for this component:


object SplitByCountry {

  def apply(spark: SparkSession, in: DataFrame): RowDistributor4 = {
    import spark.implicits._

    val out_cn     = in.filter(col("country_code") === lit("CN"))
    val out_us     = in.filter(col("country_code") === lit("US"))
    val out_ru     = in.filter(col("country_code") === lit("RU"))
    val out_others = in.filter(!col("country_code").isin("US", "RU", "CN"))

    (out_cn, out_us, out_ru, out_others)

  }

}    

# Python code coming soon!