FuzzyMatch

Use the FuzzyMatch gem to identify non-identical duplicates in your data.

Input and Output

| Table | Description |
| --- | --- |
| in0 | Includes the table on which duplicates will be checked. Note: FuzzyMatch only allows one input. |
| out | Generates one record per fuzzy match. |

Parameters

Configuration

| Parameter | Description |
| --- | --- |
| Merge/Purge Mode | Records are compared either within a single source (Purge) or across multiple sources (Merge). Merge mode assumes that multiple sources exist in the same table in0. |
| Source ID Field | Unique identifier for each source when using Merge mode. This is necessary because the different sources exist in the same table in0. |
| Record ID Field | Unique identifier for each record. |
| Match threshold percentage | If the match score is less than the threshold, the record does not qualify as a match. |
| Include similarity score | Select this checkbox to add an output column that contains the similarity score. |
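
The difference between the two modes can be pictured as a difference in which record pairs get compared. The following is a minimal Python sketch of that idea, not the gem's implementation. It assumes that Merge mode skips pairs whose Source ID values are equal. The `candidate_pairs` helper, the `source` column, and the sample rows are all hypothetical.

```python
from itertools import combinations


def candidate_pairs(records, mode, source_field=None):
    """Yield the record pairs that would be compared in each mode.

    Assumption: Purge compares every pair in the table, while Merge
    compares only pairs that come from different sources.
    """
    for left, right in combinations(records, 2):
        if mode == "merge" and left[source_field] == right[source_field]:
            continue  # same source, so Merge mode skips this pair
        yield left, right


# Hypothetical rows: two sources stored in the same input table.
records = [
    {"id": 1, "source": "crm", "last_name": "Taylor"},
    {"id": 2, "source": "crm", "last_name": "Ttaylor"},
    {"id": 3, "source": "web", "last_name": "Tailor"},
]

purge_pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, "purge")]
merge_pairs = [
    (a["id"], b["id"]) for a, b in candidate_pairs(records, "merge", "source")
]
```

Under these assumptions, Purge mode compares records 1 and 2 with each other, while Merge mode compares them only against record 3, which comes from a different source.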

Match Fields

| Parameter | Description |
| --- | --- |
| Field name | Name of the column that you want to check for duplicates. |
| Match function | The method that generates the similarity score. |

Example

One common use case for the FuzzyMatch gem is to match similarly spelled names. Here's a table with two entries for Alex Taylor, whose phone number was updated and whose last name is misspelled in the second entry.

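| id | email | phone | first_name | last_name | date_added |
| --- | --- | --- | --- | --- | --- |
| 1 | alex.t@example.com | 123-456-7890 | Alex | Taylor | 2023-01-01 |
| 2 | alex.t@example.com | 123-456-9542 | Alex | Ttaylor | 2023-07-01 |
| 3 | sam.p@example.com | 987-654-3210 | Sam | Patel | 2024-03-15 |
| 4 | casey.l@example.com | 555-111-2222 | Casey | Lee | 2024-05-01 |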

You can use the FuzzyMatch gem to find the closely spelled name. In the gem configuration:

  1. Set the Merge/Purge Mode to Purge.
  2. For the Record ID Field, select the id column.
  3. Keep the Match threshold percentage at 80.
  4. Select the Include similarity score checkbox.
  5. In the Match Fields tab, add a match field for the last_name column.
  6. Set the Match function to Name.
  7. Save and run the gem.
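
To get a feel for what this configuration does, the steps above can be approximated in plain Python. This is only a sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is an assumption and not the gem's Name match function, so the scores it produces will differ from the gem's output.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The last_name column from the example table, keyed by record id.
rows = [
    {"id": 1, "last_name": "Taylor"},
    {"id": 2, "last_name": "Ttaylor"},
    {"id": 3, "last_name": "Patel"},
    {"id": 4, "last_name": "Lee"},
]

THRESHOLD = 0.80  # match threshold of 80 percent


def similarity(a, b):
    """Stand-in similarity score between 0 and 1 (not the gem's algorithm)."""
    return SequenceMatcher(None, a, b).ratio()


# Purge mode: compare every pair of records from the single source.
matches = []
for left, right in combinations(rows, 2):
    score = similarity(left["last_name"], right["last_name"])
    if score >= THRESHOLD:  # scores below the threshold do not qualify
        matches.append((left["id"], right["id"], score))

for id1, id2, score in matches:
    print(id1, id2, round(score, 4))
```

With this stand-in measure, only records 1 and 2 clear the threshold, which mirrors the single pair that the gem reports.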

Result

The output includes the Record IDs of the records with fuzzy matches above the defined threshold.

| id | id2 | similarityScore |
| --- | --- | --- |
| 1 | 2 | 0.9111111111111111 |

Info: Depending on your SQL provider, you might see different similarity scores based on the algorithm that runs under the hood.

Tip: To view the names per record, join the FuzzyMatch output with the original dataset.
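
As an illustration of that join, here is a small Python sketch that looks up each matched id in the original rows. In a pipeline you would more likely do this with a join step. The output column names `last_name` and `last_name2` are hypothetical and chosen only for this example.

```python
# Original dataset, reduced to the columns needed for the lookup.
rows = [
    {"id": 1, "first_name": "Alex", "last_name": "Taylor"},
    {"id": 2, "first_name": "Alex", "last_name": "Ttaylor"},
    {"id": 3, "first_name": "Sam", "last_name": "Patel"},
    {"id": 4, "first_name": "Casey", "last_name": "Lee"},
]

# FuzzyMatch output from the example above.
matches = [{"id": 1, "id2": 2, "similarityScore": 0.9111111111111111}]

# Index the original rows by record id so each lookup is a dict access.
by_id = {row["id"]: row for row in rows}

# Join each match to the original dataset twice: once per matched id.
joined = [
    {
        **match,
        "last_name": by_id[match["id"]]["last_name"],
        "last_name2": by_id[match["id2"]]["last_name"],
    }
    for match in matches
]

for row in joined:
    print(row)
```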