FuzzyMatch

Use the FuzzyMatch gem to identify non-identical duplicates in your data.

Input and Output

DataFrame | Description
in0 | The DataFrame to check for duplicates.
out | Contains one record per fuzzy match.

Parameters

Parameter | Tab | Description
Merge/Purge Mode | Configuration | Whether records are compared within a single source (Purge) or across multiple sources (Merge).
Source ID Field | Configuration | Unique identifier for each source when using Merge mode.
Record ID Field | Configuration | Unique identifier for each record.
Match threshold percentage | Configuration | If the match score is less than the threshold, the record does not qualify as a match.
Include similarity score | Configuration | Select this checkbox to add an output column that contains the similarity score.
Field name | Match Fields | Name of the column that you want to check for duplicates.
Match function | Match Fields | The method that generates the similarity score.
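
To make the threshold and similarity score concrete, here is a minimal sketch of a Purge-style comparison in plain Python. It is not the gem's implementation: the `similarity` and `purge_matches` helpers are hypothetical, and Python's `difflib.SequenceMatcher` stands in for whichever match function you select in the gem.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 (stand-in match function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def purge_matches(records: Dict[int, str], threshold: float) -> List[Tuple[int, int, float]]:
    """Compare every pair of records from a single source.

    A pair qualifies as a match only if its score is not below the threshold.
    """
    matches = []
    for (id_a, val_a), (id_b, val_b) in combinations(sorted(records.items()), 2):
        score = similarity(val_a, val_b)
        if score >= threshold:
            matches.append((id_a, id_b, score))
    return matches


# Hypothetical sample data: record ID -> first name.
records = {1: "Jonathan", 2: "Jonathon", 3: "Priya"}
print(purge_matches(records, threshold=0.8))  # [(1, 2, 0.875)]
```

Raising the threshold to 0.9 in this sketch drops the pair, because its score of 0.875 falls below the cutoff.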

Example

One common use case for the FuzzyMatch gem is to match similarly spelled names. This can be useful for identifying accidentally misspelled names.

  1. Create a FuzzyMatch gem and use the customer_id as the Record ID. Then, add a match field for the first_name column.

    [Figure: FuzzyMatch names]

  2. Run the gem and see that the output includes the Record IDs of the records with fuzzy matches.

    [Figure: FuzzyMatch output]

  3. Join the output with the original dataset to view the matched names.

    [Figure: FuzzyMatch joined]
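
The join in step 3 can be sketched as a lookup of each matched Record ID in the original dataset. The example below uses plain Python so it runs without a Spark session. The shape of `matches` (one pair of Record IDs per fuzzy match) and the `attach_names` helper are assumptions for illustration, and the actual output column names of the gem may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical original dataset: customer_id -> first_name.
customers: Dict[int, str] = {1: "Jonathan", 2: "Jonathon", 3: "Priya"}

# Hypothetical gem output: one (record ID, matched record ID) pair per fuzzy match.
matches: List[Tuple[int, int]] = [(1, 2)]


def attach_names(
    matches: List[Tuple[int, int]], customers: Dict[int, str]
) -> List[Tuple[int, str, int, str]]:
    """Look up both record IDs in the original dataset, like two inner joins."""
    return [
        (id_a, customers[id_a], id_b, customers[id_b])
        for id_a, id_b in matches
        if id_a in customers and id_b in customers
    ]


print(attach_names(matches, customers))  # [(1, 'Jonathan', 2, 'Jonathon')]
```

In a pipeline, the same result would typically come from joining the gem output to the original DataFrame twice, once per Record ID column.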

Example code

Tip: To see the generated source code of your project, switch to the Code view in the project header.

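from pyspark.sql import DataFrame, SparkSession

def fuzzy_match_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
    from prophecy.utils.transpiler.dataframe_fcns import fuzzyPurgeMode

    # Purge mode: compare records within the single input DataFrame.
    return fuzzyPurgeMode(
        in0,
        spark,
        recordId = "customer_id",               # Record ID Field
        matchFields = {"first_name" : "name"},  # Match Fields entry for the first_name column
        threshold = 0.8,                        # match threshold
        includeSimilarityScore = True           # add a similarity score column to the output
    )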