Skip to main content

Deduplicate

SQL Gem

Removes rows with duplicate values of specified columns.

Parameters

ParameterDescriptionRequired
SourceInput sourceTrue
Row to keep- Distinct Rows: Keeps all distinct rows. This is equivalent to performing a select distinct operation
- Unique Only: Keeps rows that don't have duplicates
- First: Keeps first occurrence of the duplicate row
- Last: Keeps last occurrence of the duplicate row
Default is Distinct Rows
True
Deduplicate On ColumnsColumns to consider while removing duplicate rows (not required for Distinct Rows)True

Row to keep options

As mentioned in the previous parameters, there are four Row to keep options that you can use in your deduplicate Gem.

Deduplicate row to keep

In the Code view, you can see that the Deduplicate Gem contains SELECT DISTINCT * when using the Distinct Rows option.

Deduplicate code view

Example

Suppose you're deduplicating the following table.

First_NameLast_NameTypeContact
JohnDoephone123-456-7890
JohnDoephone123-456-7890
JohnDoephone123-456-7890
AliceJohnsonphone246-135-0987
AliceJohnsonphone246-135-0987
AliceJohnsonemailalice@johnson.com
AliceJohnsonemailalice@johnson.com
BobSmithemailbob@smith.com

For Distinct Rows, the interim data will show the following:

First_NameLast_NameTypeContact
JohnDoephone123-456-7890
AliceJohnsonphone246-135-0987
AliceJohnsonemailalice@johnson.com
BobSmithemailbob@smith.com

The First and Last options work similarly to Distinct Rows, but they keep the first and last occurrence of the duplicate rows respectively.

For Unique Only, the interim data will look like the following:

First_NameLast_NameTypeContact
BobSmithemailbob@smith.com

You'll be left with only one unique row since the rest were all duplicates.


You can add First_Name and Last_Name to Deduplicate On Columns if you want to further deduplicate the table.

For Distinct Rows, the interim data will show the following:

First_NameLast_Name
JohnDoe
AliceJohnson
BobSmith
note

For First, Last, and Unique Only, the interim data will contain all columns, irrespective of the columns that were added.

For First and Last, the interim data will look like the following:

First_NameLast_NameTypeContact
JohnDoephone123-456-7890
AliceJohnsonphone246-135-0987
AliceJohnsonemailalice@johnson.com
BobSmithemailbob@smith.com

For Unique Only, the interim data will look like the following:

First_NameLast_NameTypeContact
BobSmithemailbob@smith.com