
DataQualityCheck

Spark Gem

The DataQualityCheck gem provides a set of checks built on the open-source library Deequ. Use this gem to verify that your data adheres to predefined constraints.

Requirements

Dependencies

  • ProphecySparkDataQualityPython 0.0.1+

Cluster requirements

  • Set the SPARK_VERSION environment variable to the cluster's Spark version (3.3 or later)
  • Install the PyDeequ library on the cluster
  • Install the Maven Deequ library on the cluster (choose the version that matches your Spark version)
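PyDeequ reads the SPARK_VERSION environment variable to pick a compatible Deequ build, so it can help to confirm the value before running the gem. The following is a minimal sketch of such a check. The helper names `spark_version_ok` and `check_cluster_env` are hypothetical and are not part of Prophecy or PyDeequ.

```python
import os

MINIMUM_SPARK = (3, 3)  # minimum version from the cluster requirements


def spark_version_ok(version, minimum=MINIMUM_SPARK):
    """Return True if a 'major.minor[.patch]' string is at least `minimum`."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= minimum


def check_cluster_env(env=os.environ):
    """Return 'ok' or a short description of what is wrong with SPARK_VERSION."""
    version = env.get("SPARK_VERSION")
    if version is None:
        return "SPARK_VERSION is not set"
    if not spark_version_ok(version):
        return f"SPARK_VERSION {version} is older than 3.3"
    return "ok"
```

For example, `check_cluster_env({"SPARK_VERSION": "3.4.1"})` returns `"ok"`, while an empty mapping returns `"SPARK_VERSION is not set"`.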

Input and Output

| DataFrame | Description |
|---|---|
| in0 | Includes the DataFrame for which the data quality will be checked. |
| out0 | Passes through the in0 DataFrame unchanged. |
| out1 | Outputs a DataFrame with the verification results and failure messages (if applicable) that you can define per check. |

Data Quality Checks

| Check Type | Success Criteria |
|---|---|
| Completeness | Fraction of non-null values is greater than X. The default is 100% non-null. |
| Row count | Input DataFrame has at least X rows. |
| Distinct count | Number of distinct values in selected columns is equal to X. |
| Uniqueness | Values in selected columns are unique (occurring exactly once). |
| Data type | Selected columns have a certain data type. |
| Min-max length | Strings in selected columns have a minimum length of X and a maximum length of Y. |
| Total sum | Sum of values in selected columns is equal to X. |
| Mean value | Mean of values in selected columns is equal to X. |
| Standard deviation | Standard deviation of values in selected columns is equal to X. |
| Non-negative value | Fraction of non-negative values is at least X percent. |
| Positive value | Fraction of positive values is at least X percent. |
| Lookup | Fraction of values in selected columns that match lookup values is at least X percent. |
| Column to constant value greater than | Selected column values are greater than a constant value X. |
| Column to constant value greater than or equal to | Selected column values are greater than or equal to a constant value X. |
| Column to constant value less than | Selected column values are less than a constant value X. |
| Column to constant value less than or equal to | Selected column values are less than or equal to a constant value X. |
| Column to column greater than | In every row, the value in the left column is greater than the value in the right column. |
| Column to column greater than or equal to | In every row, the value in the left column is greater than or equal to the value in the right column. |
| Column to column less than | In every row, the value in the left column is less than the value in the right column. |
| Column to column less than or equal to | In every row, the value in the left column is less than or equal to the value in the right column. |
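To make a few of these success criteria concrete, here is a small plain-Python sketch that evaluates them on in-memory rows. It is an illustration only: the gem itself runs its checks through Deequ on Spark. The function names and sample data are hypothetical, skipping null values is an assumption, and the column-to-column check is read as a row-by-row comparison, which is how Deequ's column comparison constraints behave.

```python
from collections import Counter

# Hypothetical sample data; None stands in for a null value.
rows = [
    {"id": 1, "price": 10.0, "cost": 4.0},
    {"id": 2, "price": None, "cost": 6.0},
    {"id": 2, "price": 8.0, "cost": 3.0},
    {"id": 4, "price": 5.0, "cost": 5.0},
]


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not null."""
    return sum(r[column] is not None for r in rows) / len(rows)


def distinct_count(rows, column):
    """Number of distinct non-null values in `column`."""
    return len({r[column] for r in rows if r[column] is not None})


def is_unique(rows, column):
    """True when every non-null value in `column` occurs exactly once."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return all(n == 1 for n in counts.values())


def column_greater_than(rows, left, right):
    """True when left > right in every row where both values are present."""
    pairs = [
        (r[left], r[right])
        for r in rows
        if r[left] is not None and r[right] is not None
    ]
    return all(a > b for a, b in pairs)
```

With this data, `completeness(rows, "price")` is 0.75, so a Completeness check with the default threshold of 100% would fail. `is_unique(rows, "id")` is `False` because the value 2 appears twice, and `column_greater_than(rows, "price", "cost")` is `False` because the last row has equal values.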

Post Actions

| Action | Description |
|---|---|
| Continue execution | Continue pipeline execution regardless of data quality success or failure. |
| Terminate execution | Stop pipeline execution after the DataQualityCheck gem runs based on a maximum number of failed checks. Review gem phases to understand the order in which gems run. |
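The two post actions can be sketched as a simple threshold rule. This is not the gem's generated code. The names `apply_post_action` and `DataQualityError`, the status strings, and the choice to stop only when the number of failures exceeds the maximum are all assumptions made for illustration.

```python
class DataQualityError(Exception):
    """Raised to stop the pipeline when too many checks fail."""


def apply_post_action(statuses, action="continue", max_failed_checks=0):
    """Return the number of failed checks, raising if the action says to stop.

    statuses: one status string per check, assumed to be "Success" or "Failure".
    action: "continue" never raises; "terminate" raises past the threshold.
    """
    failed = sum(1 for status in statuses if status != "Success")
    if action == "terminate" and failed > max_failed_checks:
        raise DataQualityError(
            f"{failed} checks failed (maximum allowed: {max_failed_checks})"
        )
    return failed
```

Under these assumptions, `apply_post_action(["Success", "Failure"], "terminate", 1)` returns 1 and lets the pipeline continue, while two failures with the same maximum raise `DataQualityError`.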

Code

Each check is compiled into readable code. When you switch to the code view and open the DataQualityCheck file, you can see that every check includes a hint. If you do not define a custom failure message, this hint populates the constraint_message column of out1.
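The fallback from custom failure message to hint can be pictured with the sketch below. It is a simplified stand-in, not the generated DataQualityCheck code. The function name `resolve_constraint_message` is hypothetical, and leaving the message empty for a passing check is an assumption.

```python
def resolve_constraint_message(status, hint, custom_message=None):
    """Pick the text for the constraint_message column of one check result."""
    if status == "Success":
        # Assumption: a passing check carries no failure message.
        return ""
    # Prefer the user-defined failure message; otherwise fall back to the hint.
    return custom_message if custom_message else hint
```

For a failed check with no custom message, `resolve_constraint_message("Failure", "id should be unique")` returns the hint text unchanged.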