To most, data analysis is as exciting as auditing or watching paint dry. Strangely, I enjoy scrutinizing and exploring data, and do it frequently when implementing CRM applications and data warehouses.
Properly analyzing your data before a data migration and integration is extremely important, because incomplete analysis leads to flawed logic, and a lot of hasty data remodeling and rewriting of ETL code before go-live.
Below are some steps I go through to understand existing data before ETL development. It’s quite tedious and by no means complete, but hopefully there are bits and pieces that you can leverage.
- Reports and UIs – Track down as many reports and forms as you can (web pages, app forms, even Excel files with summary data) and study them. Have business users point out important fields (this will come in handy when you get to step 4.)
- Tables – After step 1, you should have a better understanding of the application and the data from the business perspective. Now go to the database. Open each table and execute something like SELECT * FROM [TABLE1] ORDER BY [MODIFIEDDATE] DESC LIMIT 10. Copy the result set into a single spreadsheet and use that spreadsheet to take notes. Having quick access to the table structure and sample data along with your notes helps you learn the data faster.
- ERD – Using primary keys and foreign keys, diagram just the tables and relationships. In other words, create an entity-relationship diagram. The field info is captured in step 2, so you don’t need the RSD or a complete data model unless you like to see the data types as well. Modeling tools like ERwin and Visio can automatically render the ERD for you, but I do it manually so that I can examine every relationship during the process. Also, PKs and FKs are not sometimes set in the database, so the tool will not draw those relationships.
- Fields – Now the painful part – scan values in each column. Pull the first 250 and last 250 rows in each table and look for odd values while scrolling through the records. e.g. date columns with 1900-01-01, first name column with company names, email column with comma separated values, etc. Columns with odd values – as well as the fields that the business identified in step 1 – should get extra scrutiny with SQL code, using Group By, Distinct, Count of nulls, etc.
Here are some more specific things to look for:
A.) Lookup Fields – On fields that should have a finite set of values (like department or contact type) perform a DISTINCT. e.g. SELECT DISTINCT [STATECODE] FROM [TABLE1] , which should yield 50 values. If more than 50 are returned, you may need some clean up or mapping (for example, you might have WA, Wash., Washington, etc.)
B.) “International Value” Fields – Look closely at fields with values formatted differently by country. Address and phone fields are good examples. There might be a province or country name stored in a state field.
C.) Date / Time Zones / Currency Fields – Related to above, make sure you handle date format, time zones, and currency properly during the migration. Existing tables might have date time in PST in one table, and UTC in another.
D.) Fake Required Fields – A field could be required on the UI, but not in the database, so the field might contain nulls. Required fields also tend to have junk data (e.g. Email or Phone is required, so customers enter ‘email@example.com’ or 123-123-1234). Handle these by either excluding the value or entire record during the migration/integration, or by trying to find the real value.
E.) Duplicate Data – Look for duplicate records in each table. Use a combination of fields to identify duplicate records. e.g. first name, last name, email; state and company phone #. Share the potential list of duplicates with business users for reconciliation.
F.) Missing/Orphan Records – If there is no referential integrity, the database allows the user to delete the parent record, leaving the child record intact. Make sure every FK key value is associated to a PK value in the lookup table.
G.) Merges and Splits – Determine if one record in the source table is one record in the target table. If not, figure out the logic for merging or splitting records. Communicate and work with the business to come up with good algorithms to address this.
Lastly, remember that even though data analysis might take a couple of weeks to a month (depending on the number of tables, reports, and applications), it takes months to clean up data! So share your findings weekly with the various data owners, and have them divide and conquer the cleanup.
Hopefully this was informative. Have fun!
Do you have more questions about how to best analyze data, data migrations, or data integration? Leave your comments below, or email us at firstname.lastname@example.org.