Skip to main content

Deduplicate

SQL

When working with data, it’s common to run into duplicate information. Duplicates can come from multiple data sources, system errors, or repeated updates over time.

Leverage the Deduplication gem to remove these duplicates, and pay close attention to the Deduplication mode that you use.

Parameters

ParameterDescription
ModeDeduplication method
ExpressionColumn(s) to check for duplicates
Use Custom Order BySort rows before deduplicating (First and Last mode only).

Mode

Next to Deduplicate On Columns, choose how to keep certain rows.

ModeDescriptionOutput
Distinct RowsKeeps one version of each duplicated row, removing extra duplicates.All columns are passed through unless target columns are specified.
Unique OnlyKeeps only the rows that appear exactly once and removes any duplicate rowsAll columns are passed through.
FirstKeeps the first occurrence of the duplicate row.All columns are passed through.
LastKeeps the last occurrence of the duplicate row.All columns are passed through.

Example

Assume you have a table of contact information where some people appear more than once. This could be due to updates over time or repeated data entry. You want to identify and clean up these duplicates based on fields like email or phone number.

Here is your original table:

emailphonefirst_namelast_namedate_added
alex.t@example.com123-456-7890AlexTaylor2023-01-01
alex.t@example.com123-456-7890AlexTaylor2023-07-01
sam.p@example.com987-654-3210SamPatel2024-03-15
casey.l@example.com555-111-2222CaseyLee2024-05-01
casey.l@example.com555-111-2222CaseyLee2025-01-01
jordan.k@example.com333-444-5555JordanKelly2023-09-10
morgan.s@example.com666-777-8888MorganSmith2025-01-01

Distinct Rows

Let’s look at what happens with our original table when using Distinct Rows without selecting any specific columns. If two rows match exactly—every value in every column—they are considered duplicates, and only one will be kept.

emailphonefirst_namelast_namedate_added
alex.t@example.com123-456-7890AlexTaylor2023-01-01
alex.t@example.com123-456-7890AlexTaylor2023-07-01
sam.p@example.com987-654-3210SamPatel2024-03-15
casey.l@example.com555-111-2222CaseyLee2024-05-01
casey.l@example.com555-111-2222CaseyLee2025-01-01
jordan.k@example.com333-444-5555JordanKelly2023-09-10
morgan.s@example.com666-777-8888MorganSmith2025-01-01

The result is identical to the input because no two rows are exact matches across all columns.

Distinct Rows with Target Columns

Often, you want to identify duplicates based on just one or a few columns, like email. You can do this by selecting the columns to deduplicate on. Here is the result when email is selected as the target column:

email
alex.t@example.com
sam.p@example.com
casey.l@example.com
jordan.k@example.com
morgan.s@example.com

Only the distinct values from the email column are returned. Other columns are not included, because Prophecy doesn’t assume which corresponding values (like phone or date_added) to keep. If you want to keep related columns, use the First or Last deduplication methods.

First and Last

The First and Last options help you keep just one row for each duplicate based on the order of the data. You’ll typically combine this with sorting to control which version is retained.

For example, if you want to keep the most recent entry for each duplicate email address, you can:

  • Choose First and sort by date_added descending
  • Choose Last and sort by date_added ascending

Here’s the result when deduplicating by email, keeping the most recent record for each:

emailphonefirst_namelast_namedate_added
alex.t@example.com123-456-8888AlexTaylor2023-07-01
sam.p@example.com987-654-3210SamPatel2024-03-15
casey.l@example.com555-111-2222CaseyLee2025-01-01
jordan.k@example.com333-444-5555JordanKelly2023-09-10
morgan.s@example.com666-777-8888MorganSmith2025-01-01

Unique Only

The Unique Only option removes all rows with duplicates. It keeps only the rows that appear exactly once, based on the selected columns.

For example, if you choose to deduplicate based on email, any row that contains an email address that appears more than once will be removed entirely. Here’s the result when using Unique Only on the original table with email as the target column:

emailphonefirst_namelast_namedate_added
sam.p@example.com987-654-3210SamPatel2024-03-15
jordan.k@example.com333-444-5555JordanKelly2023-09-10
morgan.s@example.com666-777-8888MorganSmith2025-01-01