The Limit component restricts the number of rows in the output. It is often used in reports that involve aggregation and sorting, where we might only be interested in knowing which regions or product lines are bringing us the most sales.

Note: When developing a workflow, you’ll see the Limit component reduce the row count at the point where it is placed in the workflow. But when you’re running in production, the optimizer will often push the limit much earlier in the pipeline, frequently moving it all the way down to the source components so that very few rows are read to begin with.
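One way to observe this pushdown is to inspect the physical plan with explain(). The sketch below assumes a local SparkSession and a hypothetical Parquet input path; Spark typically shows limit-related operators (such as CollectLimit or GlobalLimit/LocalLimit) placed close to the scan:

import org.apache.spark.sql.SparkSession

object LimitPushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-pushdown-demo")
      .master("local[*]")
      .getOrCreate()

    // "sales.parquet" is a hypothetical input path
    val df = spark.read.parquet("sales.parquet")

    // Print the physical plan; the limit node usually sits near the scan,
    // so only a small number of rows are actually read.
    df.limit(100).explain()

    spark.stop()
  }
}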


In this example, we’ll read some data, take a few sample rows, and write them out to a target.
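The end-to-end flow described above can be sketched as follows. This is a minimal example, not the tool’s generated code; the input and output paths are hypothetical:

import org.apache.spark.sql.SparkSession

object LimitPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-pipeline")
      .master("local[*]")
      .getOrCreate()

    // Read the source data ("orders.csv" is a hypothetical path)
    val in = spark.read.option("header", "true").csv("orders.csv")

    // Keep only a small sample of rows
    val sample = in.limit(100)

    // Write the sample to the target ("orders_sample" is a hypothetical path)
    sample.write.mode("overwrite").parquet("orders_sample")

    spark.stop()
  }
}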



The Limit component is fairly straightforward: the only option to pick is the number of rows to output. Here we’re picking one hundred.



Example Code

The code is very simple and contains only the limit operator.

import org.apache.spark.sql.{DataFrame, SparkSession}

object KeepSample {

  def apply(spark: SparkSession, in: DataFrame): DataFrame = {
    // Keep at most 100 rows from the input
    val out = in.limit(100)
    out
  }
}




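Assuming an active SparkSession and an input DataFrame are already in scope, KeepSample could be applied like this (the names here are illustrative):

// `spark` is an active SparkSession; `ordersDf` is a hypothetical input DataFrame
val sampled = KeepSample(spark, ordersDf)
sampled.show()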
Python code coming soon!