How to Implement DQS (Data Quality Services)

The Situation
We needed to load data from many different sources, and in order to match Gala apples to Gala apples and not Fuji apples, we needed a way to map/translate the items. I had architected a solution to create new mapping tables in MDS, but I hadn’t implemented it yet.

Then I attended SQL PASS 2012, where I learned about Microsoft’s new DQS offering (think of it as a spellchecker for data). It sounded like it would build the mapping tables for me and let me keep a clean MDS instance.

I got Matt Masson’s (http://www.mattmasson.com/) sample SSIS package. After looking at it in detail, I realized our solution needed a few more steps, so I created a series of sub packages. The master package runs each one in order, which lets the users rerun the DQS package repeatedly if necessary.

Package 0: Load the files from the external sources to a collection location (in case other people want to use the raw data)

Package 1: Load the data from the external sources into staging tables.

Package 2: Pull out the domain data and see if it exists in MDS/DQS. Cleanse the data and prepare it for inserting into/updating MDS.

Package 3: Using the mappings generated in Package 2, load the MDS data.

Package 4: Load the data to an internal system (so other people can access the cleansed data).

Package 5: Pull the data nightly to a reporting system (this does the transforms and merges of data).

This post will discuss Packages 2 & 3, since those deal with DQS.

First, a few caveats/regrets:

  • I didn’t do everything using best practices (since I was figuring them out as I went), but next time I will. I wanted to get this out to help other folks get up and running. I hope you will follow all the recommended practices (like using schema names to separate the tables more securely), and please forgive me for being a little sloppy at times!
  • I have an elegant SSIS logging solution designed, but have not had time to implement it. That said, my rudimentary logging is adequate.
  • Performance wasn’t a big concern with this process, so many things can be (and probably will be) improved upon there in the future.
  • I probably could have done a better layout, perhaps with no crossed lines, but again, that was a low priority in getting this up and running!
  • During the development, it became apparent that we needed a full copy of the DQS translations for parallel processes that we set up, so I would recommend generating those for manual data validation (instead of the transient Translation tables).

Setup
For each Domain, I created 4 tables similar to the following.

DqsReview_Country
This is a placeholder for the DQS Unknown/New items.

CREATE TABLE [dbo].[DqsReviewCountry](
    [rowid] [int] IDENTITY(1,1) NOT NULL,
    [Record Status] [nvarchar](255) NULL,
    [Item_Source] [nvarchar](255) NULL,
    [Item_Output] [nvarchar](255) NULL,
    [Item_Status] [nvarchar](100) NULL
) ON [PRIMARY]

MdsReview_Country
This is the place where we put the new MDS items.

CREATE TABLE [dbo].[MdsReviewCountry](
    [rowid] [int] IDENTITY(1,1) NOT NULL,
    [Record Status] [nvarchar](255) NULL,
    [Item_Source] [nvarchar](255) NULL,
    [Item_Output] [nvarchar](255) NULL,
    [Item_Status] [nvarchar](100) NULL
) ON [PRIMARY]

MdsStage_Country
This holds the code/name/attribute fields of items pulled from the MDS model.

CREATE TABLE [dbo].[MdsStageCountry](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Code] [nvarchar](250) NULL,
    [Name] [nvarchar](250) NULL,
    [NewCode] [nvarchar](250) NULL,
    [BigArea] [nvarchar](100) NULL,
    [A13] [nvarchar](100) NULL,
    [Region] [nvarchar](100) NULL,
    [SubRegion] [nvarchar](100) NULL,
    [Subsidiary] [nvarchar](100) NULL,
    CONSTRAINT [pk_MdsStageCountry] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Translation_Country
This brings it all together. Anything that is in DqsReview is added as Unknown, MdsReview is added as if it were mapped, and MdsStage as-is.

CREATE TABLE [dbo].[TranslationCountry](
    [rowid] [int] IDENTITY(1,1) NOT NULL,
    [Item_Source] [nvarchar](255) NULL,
    [Item_Corrected] [nvarchar](255) NULL
) ON [PRIMARY]
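Conceptually, the Translation table is assembled from the three tables above. Here is a minimal T-SQL sketch of that assembly (the actual SSIS data flow does the equivalent with Lookup and Union All components, not set-based SQL):

```sql
-- Sketch only: DqsReview items land as Unknown, MdsReview items as
-- already-mapped, and MdsStage items as-is.
INSERT INTO [dbo].[TranslationCountry] (Item_Source, Item_Corrected)
SELECT Item_Source, 'Unknown'    FROM [dbo].[DqsReviewCountry]  -- new/unreviewed items
UNION ALL
SELECT Item_Source, Item_Output  FROM [dbo].[MdsReviewCountry]  -- items headed into MDS
UNION ALL
SELECT Name, Name                FROM [dbo].[MdsStageCountry];  -- items already in MDS
```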

Package 2 Data Cleansing
This package can be run multiple times because it clears out the data and refreshes it based on the current information in DQS. For example, if the data steward has gone into DQS and updated the mappings or added new ones, those changes come through in the next run. Likewise, if they made corrections in MDS, those come through in the next run of this sub package.

SQL Clear Staging Tables
After creating a unique JobId (based on date/time) and logging that we’ve started, I clear all the consumption tables in a SQL task with TRUNCATE TABLE, e.g.:

  • TRUNCATE TABLE DqsReviewCountry
  • TRUNCATE TABLE MdsReviewCountry
  • TRUNCATE TABLE MdsStageCountry
  • TRUNCATE TABLE TranslationCountry

DFT MDS Consumption Views to StagingDB
Next, I pull down all the current data in MDS to the staging tables using the consumption views I created in my model. Since the users could make edits through the Admin UI directly, I want to make sure I always have the most current set of MDS data.
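The pull itself is a straight select from the consumption view (named after the model and entity, e.g. MyModelName_Country, per the naming convention noted later in the post). Something along these lines, where the database name is an assumption:

```sql
-- Hedged sketch: refresh the local MDS staging copy from the
-- consumption view in the MDS database. [MdsDb] is a placeholder name.
INSERT INTO [dbo].[MdsStageCountry]
       (Code, Name, NewCode, BigArea, A13, Region, SubRegion, Subsidiary)
SELECT  Code, Name, NewCode, BigArea, A13, Region, SubRegion, Subsidiary
FROM [MdsDb].[mdm].[MyModelName_Country];
```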

DFT Cleanse Country Data
Now that the MDS data exists, we start the cleansing process.

OLE DB Source Distinct Country
This is just a select distinct of all the terms in the source table.

SELECT DISTINCT
    Country
FROM
    ExternalSource_Stage

MDS – Find Items Raw
The speakers at SQL PASS mentioned that, for performance reasons, it is better to filter out all the known items first, and it works well for us too. We only have DQS try to resolve items that MDS has never seen.

In this example, I will only match on the one term I’m trying to correct. You may need something more complicated if you are using a composite domain. We send the results of the Lookup Match to the Union All to go to the Translation Table for use later.
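In set-based terms, the Lookup does roughly the following; matched rows go to the Translation table, and only the unmatched rows continue on to DQS:

```sql
-- Sketch of the Lookup match: terms MDS already knows are filtered out,
-- so only never-seen terms reach the DQS Cleansing component.
SELECT src.Country
FROM (SELECT DISTINCT Country FROM ExternalSource_Stage) AS src
LEFT JOIN [dbo].[MdsStageCountry] AS mds
       ON mds.Name = src.Country
WHERE mds.Name IS NULL;  -- no MDS match: send to DQS
```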

DQS Cleansing
This is where the magic happens. DQS takes each item and tries to match it. Anything that is an exact match it labels “Correct”. Anything it can make a good guess at is labeled “Corrected”. Everything else is a new record.

Split Corrected Records
New records go up to our DqsReview table. This isn’t strictly required, since DQS creates a project and stores all the data that needs to be reviewed there. However, I like having this table for troubleshooting purposes. It’s also nice to be able to verify that all the numbers match exactly.

MDS – Find Items Corrected
We send both the Correct and Corrected records to be verified against MDS again. Just because DQS knows what a record is, doesn’t mean it’s in MDS yet!

Anything that matches goes to the Translation table. Anything that doesn’t match needs more processing.

Split for Auto Suggest
If MDS doesn’t have the term, it’s either new to MDS or needs to be mapped to something else. Anything with a status of “Auto Suggest” will go up to DQS to be reviewed by the data steward. (Again, this is a helper table; the info is already captured in the DQS project that you can access through the DQS UI.)
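The split rule itself is simple; expressed in T-SQL rather than the SSIS conditional split expression, it is roughly:

```sql
-- Illustration of the routing rule only (in the package the split runs
-- on the pipeline buffer, not against a table).
SELECT Item_Source, Item_Output, Item_Status,
       CASE WHEN Item_Status = 'Auto Suggest'
            THEN 'DQS review (data steward)'
            ELSE 'New to MDS' END AS Route
FROM [dbo].[DqsReviewCountry];
```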

UNION ALL Corrected and Auto Suggest
We want to merge these two sets so we have the complete story, but also so we can populate the translation table with the values that are “Unknown”.

Multicast Send to DQS Review and Translation Table
This makes sure the Unknowns go to both destinations we care about.

OLE DB Store for DQS Review
This is just a write to the DqsReview table.

Derived Column Replace Output with Unknown
Anything that went to DQS for review is translated to “Unknown” so we can still load the data.

Multicast
Any record that goes to the MdsReview table (which is used for inserting new items in MDS) also needs to go to the Translation table so it can be used in the import.

ALL – Send Translation table for use with import
This joins all the data from the different parts of the process so we have everything needed to map the values.

OLE DB Destination Mds Review
Used for inserting new items into MDS in the next package.

OLE DB Destination Translation Table
Used for mapping the values from the source tables to MDS in the next package.

Here is some sample data that will be in Translation table at the end of this process:

Item_Source   | Item_Corrected | Explanation
United States | United States  | Known in MDS; added after “MDS – Find Items Raw”
US            | United States  | Corrected by DQS; added after “MDS – Find Items Corrected”
USA           | Unknown        | Not found in either MDS step; added after “Derived Column Replace Output with Unknown” or “Split for Auto Suggest”
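With the Translation table populated, mapping a source value in the downstream packages is just a join, something like:

```sql
-- Sketch: translate raw source country values to their MDS terms.
-- Rows whose Item_Corrected is 'Unknown' can be flagged for follow-up.
SELECT src.Country      AS RawCountry,
       t.Item_Corrected AS MdsCountry
FROM ExternalSource_Stage AS src
JOIN [dbo].[TranslationCountry] AS t
  ON t.Item_Source = src.Country;
```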

Package 3 PushMds

SEQ Get Table Physical Mds Names
In setting up our MDS Model, I identified the data that would be “domain” data. We had Country, OEM, Product Name, and a few others. However, there are other teams that have models with the same table names in the same MDS Dev, Test and Production instances, with different structures and slightly different data. There is no way to guarantee what the name of the tables would be when I deployed between environments. Would it be Country_Leaf, Country_1_Leaf, Country_1_1_Leaf, Country_2_Leaf? I saw all of those options occur.

To resolve this, I created SQL statements to pull back the physical names into variables. I also set these names in the config file, because I can’t run the packages directly through VS if the name validation fails. (Your mileage may vary with this step.)

SELECT [StagingBase]
FROM [mdm].[tblEntity]
WHERE Model_ID =
    (SELECT ID
     FROM [mdm].[tblModel]
     WHERE Name = 'MyModelName')
  AND Name = 'Country'

SEQ Load and Process MDS Tables

DFT Country
To dynamically create the destination table name, I had to use an ADO.NET Destination, which exposes the table name property on the Data Flow Task. Thank you Matthew Roche (http://ssimagine.wordpress.com/) for showing me that at SQL PASS!

Here is the expression: "stg." + @[User::tableCountry] + "_Leaf"


The OLE DB Source Country has this logic:

-- Straight select from the MdsReview table
-- We don't update Country values this way, only insert
SELECT
    rowid
  , [Record Status]
  , Item_Source
  , Item_Output
  , Item_Status
FROM
    MdsReviewCountry Mds
WHERE
    Mds.Item_Output NOT IN (SELECT Name FROM MdsStageCountry WHERE Name != 'Unknown')

I don’t want to insert anything that hasn’t been blessed by DQS. In a Derived Column transform I add the extra columns that the MDS staging table needs.

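The derived columns are the standard ones the MDS 2012 leaf staging tables expect in addition to Code and Name. Shaped as an insert, it looks roughly like this (the physical table name actually comes from @[User::tableCountry]; the literal values are assumptions):

```sql
-- Assumed shape of the staging insert. ImportType 1 = create new members
-- only; ImportStatus_ID 0 = ready to stage; BatchTag identifies this run.
INSERT INTO [stg].[Country_Leaf]
       (ImportType, ImportStatus_ID, BatchTag, Code, Name)
SELECT 1,              -- create-only: never overwrite existing members
       0,              -- ready for the staging process
       'Job_20130101', -- hypothetical batch tag derived from the JobId
       Item_Output,    -- Code
       Item_Output     -- Name
FROM [dbo].[MdsReviewCountry];
```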
SQL stg udp_Country_Leaf
I build this query with an expression as well: "EXEC [stg].[udp_" + @[User::tableCountry] + "_Leaf] '" + @[User::VersionName] + "', 1, '" + @[User::JobStartTime] + "', NULL"
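With example values substituted in, that expression evaluates to something like the following (the literals are illustrative only):

```sql
-- Hypothetical evaluated form of the expression above.
EXEC [stg].[udp_Country_Leaf] 'VERSION_1', 1, '2013-01-15 02:00:00', NULL
```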

SQL Clear Mds Staging Table Country
Now that we have updated MDS data, we want to make sure we have the latest version so we can correctly process our source data.

TRUNCATE TABLE MdsStageCountry

DFT Pull Updated MDS Stage Country
This is straightforward because we know what the consumption view name is – it doesn’t change between environments, since I preface it with our model name, e.g. MyModelName_Country.

Great! That’s how you do a simple, straightforward insert. But not everything is that simple. What about updates and inserts together? Product, for example, uses several domain values to create a single item: Product Name, OEM, Version, and Reseller (resellers sometimes name products differently than the OEM does).
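One way to sketch that upsert pattern in T-SQL is a MERGE keyed on the combination of domain values; all names here are hypothetical, since the post doesn’t show the Product tables:

```sql
-- Hypothetical sketch: upsert composite Product rows keyed on several
-- domain values at once. Table and column names are illustrative.
MERGE [dbo].[MdsStageProduct] AS tgt
USING [dbo].[MdsReviewProduct] AS src
   ON tgt.ProductName = src.ProductName
  AND tgt.OEM         = src.OEM
  AND tgt.[Version]   = src.[Version]
WHEN MATCHED THEN
    UPDATE SET tgt.Reseller = src.Reseller
WHEN NOT MATCHED THEN
    INSERT (ProductName, OEM, [Version], Reseller)
    VALUES (src.ProductName, src.OEM, src.[Version], src.Reseller);
```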

Thanks for reading! Questions? Comments? Please send us email for more information.
