Lookup

Lookups are a special kind of Gem that allow you to mark a particular DataFrame as a Broadcast DataFrame. Spark will ensure that this data is available on every computation node so that these lookups can be done without shuffling data. This is useful for looking up values in tables, hence the name.
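For intuition, here is a minimal plain-PySpark sketch of the underlying idea (this is not the code Prophecy generates, and the table names and contents are hypothetical): a small DataFrame is marked for broadcast so that every executor holds a full copy and can resolve keys locally.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical small lookup table: customer_id -> order_category
lookup_df = spark.createDataFrame(
    [("0000", "default"), ("1234", "electronics")],
    ["customer_id", "order_category"],
)

# broadcast() marks the DataFrame for replication to every computation node;
# joins against it can then resolve keys locally instead of shuffling data
small_side = broadcast(lookup_df)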

Lookup Gem

Lookup UI

1. Key Columns: Specify one or more columns to use as the lookup key in the source DataFrame.
2. Value Columns: Pick which columns can be referenced wherever this Lookup is used.

Using Lookups

Lookups can be used wherever any other Expression can be used, though the exact syntax depends on your Expression language of choice. Lookup references follow this pattern:

lookup("<LOOKUP NAME>", <KEY COLUMN>).getField(<VALUE COLUMN>)

Column-based lookups

Based on the MyLookup example above, we'd write:

lookup("MyLookup", col("customer_id")).getField("order_category")

For example, let's look at a Reformat component:

Reformat example

Here we have a column named category that is set to the value of MyLookup(customer_id)['order_category'] in SQL Expression mode. For each row, the key found in the c_id column is matched against the source customer_id key column, and the corresponding order_category value is used for the new column.
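As a hedged illustration of what that expression amounts to, here is a plain-PySpark sketch (not Prophecy's generated code) that produces the same kind of category column with an explicit broadcast join; the DataFrame names and contents are hypothetical, mirroring the earlier sketch.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup table (key: customer_id, value: order_category)
lookup_df = spark.createDataFrame(
    [("0000", "default"), ("1234", "electronics")],
    ["customer_id", "order_category"],
)

# Hypothetical source DataFrame with a c_id column to look up
orders_df = spark.createDataFrame(
    [("1234", 10.0), ("9999", 25.0)],
    ["c_id", "amount"],
)

# Rough equivalent of the Reformat expression above: broadcast the small table,
# left-join on the key, and keep the looked-up value as the new category column
reformatted = (
    orders_df
    .join(F.broadcast(lookup_df), orders_df["c_id"] == lookup_df["customer_id"], "left")
    .withColumn("category", F.col("order_category"))
    .drop("customer_id", "order_category")
)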

Literal lookups

The key argument is just an expression, so you can pass a literal instead of a column reference and use a Lookup with a static key:

lookup("MyLookup", lit("0000")).getField("order_category")

In this case, the expression evaluates to the value of order_category where customer_id is 0000. This is useful when you want a table of predefined keys and their values available inside Expressions.
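As a rough illustration (again plain PySpark rather than Prophecy's generated code, with hypothetical names and data), a static-key lookup amounts to resolving one value from the small table and attaching it as a constant column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup table and source data, as in the earlier sketches
lookup_df = spark.createDataFrame(
    [("0000", "default"), ("1234", "electronics")],
    ["customer_id", "order_category"],
)
orders_df = spark.createDataFrame([("1234", 10.0)], ["c_id", "amount"])

# The lookup table is small, so resolving a fixed key on the driver is cheap
row = lookup_df.filter(F.col("customer_id") == "0000").first()
default_category = row["order_category"] if row else None

# Rough equivalent of lookup("MyLookup", lit("0000")).getField("order_category")
with_default = orders_df.withColumn("default_category", F.lit(default_category))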