Identify data quality issues with duplication of data
In this we will learn to identify data quality issues with duplication of data.
The Data Quality Services (DQS) data matching process enables you to reduce data duplication and improve data accuracy in a data source. Matching analyzes the degree of duplication in all records of a single data source, returning weighted probabilities of a match between each set of records compared. You can then decide which records are matches and take the appropriate action on the source data.
The DQS matching process has the following benefits:
- Firstly, matching enables you to eliminate differences between data values that should be equal, determining the correct value and reducing the errors that data differences can cause. For example, names and addresses are often the identifying data for a data source, particularly customer data. But, the data can become dirty and deteriorate over time.
- Secondly, matching enables you to ensure that values that are equivalent, but were entered in a different format or style, are rendered uniform.
- Thirdly, matching identifies exact and approximate matches, enabling you to remove duplicate data as you define it. However, you define the point at which an approximate match is in fact a match. You define which fields are assessed for matching, and which are not.
- Next, DQS enables you to create a matching policy using a computer-assisted process, modify it interactively based upon matching results. And add it to a knowledge base that is reusable.
- Lastly, you can re-index data copied from the source to the staging table, or not re-index. This depends on the state of the matching policy and the source data.
The following illustration displays how data matching is done in DQS:

How to Perform Data Matching
As with other data quality processes in DQS, you perform matching by building a knowledge base and executing a matching activity in a data quality project in the following steps:
- Firstly, create a matching policy in the knowledge base
- Secondly, perform a de-duplication process in a matching activity that is part of a data quality project.
Building a Matching Policy
- Firstly, you prepare the knowledge base for performing matching by creating a matching policy in the knowledge base to define how DQS assigns matching probability. However, a matching policy consists of one or more matching rules that identify which domains will be used when DQS assesses how well one record matches to another, and specify the weight that each domain value carries in the matching assessment.
- Then, you specify in the rule whether domain values have to be an exact match or can just be similar, and to what degree of similarity. You also specify whether a domain match is a prerequisite.
- Next, the matching policy activity in the Knowledge Base Management wizard analyzes sample data by applying each matching rule to compare two records at a time throughout the range of records. Records whose matching scores are greater than a specified minimum are grouped in clusters in the matching results.
- Lastly, you can specify for a domain that data strings will be normalized when you load data from the data source into the domain. This process consists of replacing special characters with a null or a space. This often removes the difference between two strings. This can increase matching accuracy, and can often enable a matching result to surpass the minimum matching threshold, when without normalization it would not pass.
Running a Matching Project
- DQS performs data de-duplication by comparing each row in the source data to every other row, using the matching policy defined in the knowledge base, and producing a probability that the rows are a match. This is done in a data quality project with a type of Matching. However, matching is one of the major steps in a data quality project. It is best performed after data cleansing, so that the data to be matched is free from error.
- However, a data matching project consists of a computer-assisted process and an interactive process. The matching project applies the matching rules in the matching policy to the data source to be assessed. This process assesses the likelihood that any two rows are matches in a matching score.
- Further, when DQS performs the matching analysis, it creates clusters of records that DQS considers matches. DQS randomly identifies one of the records in each cluster as the pivot, or leading, record. The data steward verifies the matching results, and rejects any record that is not an appropriate match for a cluster.
- Lastly, you can export the results of the matching process either to a SQL Server table or a .csv file. You can export matching results in two forms. Firstly, the matched records and the unmatched records. Secondly, survivorship records that include only the survivor record for a cluster and the unmatched results.
Reference: Microsoft Documentation