Enhancing Data Quality: A Practical Introduction to PyDeequ and Automated Monitoring — Part 1

Aekanun Thongtae
4 min read · Apr 25, 2023

Poor data quality degrades machine learning models and business decision-making. Unresolved data errors can lead to data scars (lasting damage embedded in an organization’s historical records) and data shocks (sudden quality failures that disrupt downstream systems), both of which have long-term consequences. Automated data quality monitoring tools detect and resolve such issues promptly, increasing trust in data and making data processing more efficient. Automated data quality monitoring should therefore be treated as a core strategy for improving and maintaining an organization’s data systems.

Automated data quality monitoring is a process that continuously checks and validates the quality of data in an organization’s systems using automated tools and techniques. It aims to identify and fix data quality issues quickly, minimizing the impact on analytics, decision-making, and machine learning models.

Key aspects of automated data quality monitoring include:

  1. Data Profiling: Analyzing data sets to understand their structure, relationships, patterns, and anomalies, which helps in identifying potential data quality issues.
  2. Data Validation: Applying rules and constraints to ensure that data conforms to predefined requirements or business rules, such as data type, range, uniqueness, and consistency.
  3. Anomaly Detection: Identifying unusual or unexpected data patterns that may indicate data quality issues, using statistical techniques, machine learning algorithms, or other methods.
  4. Data Quality Dashboards: Visualizing data quality metrics and trends to provide insights into the overall health of the data and to help stakeholders make informed decisions.
  5. Issue Resolution: Automatically or semi-automatically fixing data quality issues, either by correcting the data, flagging it for manual review, or informing relevant teams to take appropriate actions.
  6. Alerts and Notifications: Proactively notifying data teams or other stakeholders when data quality issues are detected, enabling them to address issues before they cause significant harm.
  7. Continuous Monitoring: Regularly assessing data quality, tracking changes, and ensuring that data quality remains high over time.

By implementing automated data quality monitoring, organizations can increase trust in their data, reduce the risk of data scars and shocks, improve the performance of analytics and machine learning models, and enable more accurate decision-making.

To help illustrate the benefits of automated data quality monitoring, particularly focusing on aspects 1 (Data Profiling) and 2 (Data Validation), let’s demonstrate how to use PyDeequ, a Python API for Deequ, to analyze the data quality of a sample dataset using PySpark.

Setting up the PySpark environment:

First, we need to set up a PySpark environment and initialize a Spark session:

from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# Initialize the Spark session, pointing Spark at a local copy of the Deequ jar.
# Check, CheckLevel, and VerificationSuite are imported here for the constraint
# checks shown later in this article.
spark = (
    SparkSession.builder
    .appName("DEEQU")
    .master("spark://spark-master:7077")
    .config("spark.jars", "/opt/workspace/test/deequ-1.2.2-spark-3.0.jar")
    .config("spark.executor.memory", "3000m")
    .config("spark.executor.cores", "2")
    .config("spark.cores.max", "6")
    .getOrCreate()
)
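
If you would rather not manage the Deequ jar by hand, PyDeequ can also resolve a matching jar from Maven. Below is a minimal sketch of that alternative setup; note that recent PyDeequ releases read a SPARK_VERSION environment variable to pick the matching Deequ artifact, so setting it explicitly is a precaution rather than part of the original setup:

import os
os.environ["SPARK_VERSION"] = "3.0"  # expected by recent PyDeequ releases

import pydeequ
from pyspark.sql import SparkSession

# Let Spark download the Deequ jar that matches this PyDeequ version
spark = (
    SparkSession.builder
    .appName("DEEQU")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)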

Sample Data:

Let’s create a sample DataFrame for demonstration purposes and show it:

# Sample data for demonstration purposes, including some null values
data = spark.createDataFrame(
    [
        (1, "A", 100),
        (2, "B", None),
        (3, "C", 200),
        (4, None, 300),
        (5, "E", None),
    ],
    ["id", "name", "value"],
)

data.show()
+---+----+-----+
| id|name|value|
+---+----+-----+
| 1| A| 100|
| 2| B| null|
| 3| C| 200|
| 4|null| 300|
| 5| E| null|
+---+----+-----+

Analyzing Data Quality with PyDeequ:

Next, we’ll use PyDeequ’s AnalysisRunner to analyze the data quality by calculating various metrics like size, completeness, approximate distinct count, mean, and compliance:

from pydeequ.analyzers import (
    AnalysisRunner,
    AnalyzerContext,
    ApproxCountDistinct,
    Completeness,
    Compliance,
    Mean,
    Size,
)

# Run a set of analyzers over the sample DataFrame
analysisResult = (
    AnalysisRunner(spark)
    .onData(data)
    .addAnalyzer(Size())                              # total number of rows
    .addAnalyzer(Completeness("name"))                # fraction of non-null 'name' values
    .addAnalyzer(ApproxCountDistinct("name"))         # approximate distinct count of 'name'
    .addAnalyzer(Mean("value"))                       # mean of the non-null 'value' entries
    .addAnalyzer(Compliance("value", "value > 200"))  # fraction of rows where value > 200
    .run()
)

# Collect the computed metrics into a Spark DataFrame
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

This code will calculate various data quality metrics for the sample dataset and display them in a DataFrame:

+-------+--------+-------------------+-----+
| entity|instance| name|value|
+-------+--------+-------------------+-----+
|Dataset| *| Size| 5.0|
| Column| value| Compliance| 0.2|
| Column| name| Completeness| 0.8|
| Column| name|ApproxCountDistinct| 4.0|
| Column| value| Mean|200.0|
+-------+--------+-------------------+-----+

We have now calculated metrics that address the Data Profiling and Data Validation aspects of automated data quality monitoring. Data Profiling is demonstrated through the completeness and approximate distinct count of the ‘name’ column and the mean of the ‘value’ column: ‘name’ has a completeness of 0.8, since 4 out of 5 rows contain non-null values, and an approximate distinct count of 4.
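
Because successMetricsAsDataFrame returns an ordinary Spark DataFrame, these profiling metrics can feed simple automated alerts directly. As a minimal sketch, the following flags any column whose completeness drops below a threshold; the 0.9 floor is an illustrative assumption, not part of the original example:

from pyspark.sql.functions import col

# Surface profiling metrics that breach an illustrative completeness floor
low_completeness = analysisResult_df.filter(
    (col("name") == "Completeness") & (col("value") < 0.9)
)
low_completeness.show()  # the 'name' column (completeness 0.8) would appear here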

Data Validation is demonstrated through the compliance metric on the ‘value’ column. The result is 0.2 because only 1 of the 5 rows holds a value greater than 200; note that the two rows with null values fail the predicate and are therefore counted as non-compliant.
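
The setup code already imports Check, CheckLevel, and VerificationSuite, which turn the same idea into declarative, pass/fail validation rules. The sketch below shows how such constraints might look for the sample data; the specific rules and thresholds are illustrative assumptions rather than requirements from the original dataset:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Declare the constraints the data is expected to satisfy
check = Check(spark, CheckLevel.Error, "Sample data checks")

checkResult = (
    VerificationSuite(spark)
    .onData(data)
    .addCheck(
        check.hasSize(lambda size: size == 5)  # expect exactly 5 rows
        .isComplete("id")                      # 'id' must never be null
        .isComplete("name")                    # fails here: 'name' has one null
        .isUnique("id")                        # 'id' values must be unique
    )
    .run()
)

# One row per constraint, with its status and a failure message if any
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

Failed constraints surface with a status and message, which is the natural hook for the alerting aspects covered later in this series.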

Acknowledgments:

I would like to extend my gratitude to the insightful article, “How to Check Data Quality in PySpark”, which has been a significant source of inspiration for this work. The valuable ideas presented in that article have greatly contributed to the development of my own understanding and writing on this topic.

Summary:

Our demonstration of PyDeequ effectively showcases its ability to analyze and validate data quality, addressing key aspects 1 (Data Profiling) and 2 (Data Validation) of automated data quality monitoring. By incorporating such tools, organizations can improve their data quality, leading to more accurate analytics, machine learning models, and decision-making processes.

Keep an eye out for the next installment in our series, where we will delve into the remaining five aspects of automated data quality monitoring and the automation processes behind them, building toward a comprehensive understanding of how to bolster and maintain your organization’s data quality.

