Converting Transactional History Data to Snapshot Data (Part 2)

In my previous business intelligence blog, "Converting Transactional History Data to Snapshot Data – Part 1", we reviewed three challenges involved in converting transactional history data to snapshot data:

  • The First Challenge – Data Uniqueness: Dealing with multiple records for an Entity with the exact same timestamp. We needed a way to uniquely identify these records.
  • The Second Challenge – Data Granularity: Addressing multiple audit records in a single day for the same entity. In this solution, we only wanted the last record of each day.
  • The Third Challenge – Data Gaps: Finally, there were gaps on dates when the data didn’t change for an Opportunity, and those gaps needed to be filled. A gap could be one day or one year, depending on the state of the data and when the Opportunity was changed.

This article shows you how to convert transactional history data to a snapshot dataset, and provides sample SQL code to further illustrate the technical details behind this solution. (The SQL samples below are broken out into multiple steps that build on one another.)

To help illustrate this solution from a technical perspective, let’s build some source tables with data.  (We need a source table to simulate a CRM-like audit and a Date dimension table.) The following SQL script can be used to create these objects and insert sample data:

-- First, simulate the Audit table in CRM
IF NOT EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'tblAudit')
BEGIN
    CREATE TABLE [dbo].[tblAudit](
        AuditID INT NOT NULL IDENTITY(1,1),
        OpportunityID INT NOT NULL,
        ModifiedOn DATETIME NOT NULL,
        BookingTotal DECIMAL(19,4) NOT NULL DEFAULT(0),
        DateKey INT NOT NULL,
        CONSTRAINT [PK_TBLAUDIT] PRIMARY KEY CLUSTERED (AuditID ASC)
    ) ON [PRIMARY]
END

-- Second, create the Date dimension table for the BI database.
IF NOT EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'DateDim')
BEGIN
    CREATE TABLE [dbo].[DateDim](
        [DateKey] INT NOT NULL,
        [Date] DATETIME NOT NULL
    ) ON [PRIMARY]
END

-- Third, load sample data into our Audit table to simulate an Opportunity changing over time.
INSERT INTO tblAudit (OpportunityID, ModifiedOn, BookingTotal, DateKey)
SELECT 1, '07/09/2013 07:40', 100, 20130709 UNION
SELECT 1, '07/11/2013 10:21', 150, 20130711 UNION
SELECT 1, '07/11/2013 17:46', 200, 20130711 UNION
SELECT 1, '07/14/2013 11:30', 300, 20130714
ORDER BY 2

-- Fourth, load sample data into our Date dimension so we can fill in the gaps.
INSERT [dbo].[DateDim] ([DateKey], [Date])
SELECT 20130701, CAST(0x0000A1EE00000000 AS DATETIME) UNION
SELECT 20130702, CAST(0x0000A1EF00000000 AS DATETIME) UNION
SELECT 20130703, CAST(0x0000A1F000000000 AS DATETIME) UNION
SELECT 20130704, CAST(0x0000A1F100000000 AS DATETIME) UNION
SELECT 20130705, CAST(0x0000A1F200000000 AS DATETIME) UNION
SELECT 20130706, CAST(0x0000A1F300000000 AS DATETIME) UNION
SELECT 20130707, CAST(0x0000A1F400000000 AS DATETIME) UNION
SELECT 20130708, CAST(0x0000A1F500000000 AS DATETIME) UNION
SELECT 20130709, CAST(0x0000A1F600000000 AS DATETIME) UNION
SELECT 20130710, CAST(0x0000A1F700000000 AS DATETIME) UNION
SELECT 20130711, CAST(0x0000A1F800000000 AS DATETIME) UNION
SELECT 20130712, CAST(0x0000A1F900000000 AS DATETIME) UNION
SELECT 20130713, CAST(0x0000A1FA00000000 AS DATETIME) UNION
SELECT 20130714, CAST(0x0000A1FB00000000 AS DATETIME) UNION
SELECT 20130715, CAST(0x0000A1FC00000000 AS DATETIME)
ORDER BY 1

Selecting from tblAudit shows the transactional history data as the Opportunity data from CRM changes over time:

SELECT * FROM tblAudit

AuditID OpportunityID ModifiedOn BookingTotal
1 1 2013-07-09 07:40:00.000 100.0000
2 1 2013-07-11 10:21:00.000 150.0000
3 1 2013-07-11 17:46:00.000 200.0000
4 1 2013-07-14 11:30:00.000 300.0000

Selecting from DateDim shows the data in the date dimension:

SELECT * FROM DateDim

DateKey Date
20130701 2013-07-01 00:00:00.000
20130702 2013-07-02 00:00:00.000
20130703 2013-07-03 00:00:00.000
20130704 2013-07-04 00:00:00.000
20130705 2013-07-05 00:00:00.000
20130706 2013-07-06 00:00:00.000
20130707 2013-07-07 00:00:00.000
20130708 2013-07-08 00:00:00.000
20130709 2013-07-09 00:00:00.000
20130710 2013-07-10 00:00:00.000
20130711 2013-07-11 00:00:00.000
20130712 2013-07-12 00:00:00.000
20130713 2013-07-13 00:00:00.000
20130714 2013-07-14 00:00:00.000
20130715 2013-07-15 00:00:00.000

Next, let’s inspect how the data should look after we transform the transaction history data to the snapshot dataset. The snapshot dataset should show data on every day, even if the record was not changed on that day.  Additionally, the data should show the correct value for the field as if a snapshot was taken on each day.  The final dataset must look like the following:

DateKey OpportunityID ModifiedOn BookingTotal CRMAuditOrder
20130709 1 07/09/2013 07:40 AM $100.00 1
20130710 1 07/09/2013 07:40 AM $100.00 1
20130711 1 07/11/2013 05:46 PM $200.00 3
20130712 1 07/11/2013 05:46 PM $200.00 3
20130713 1 07/11/2013 05:46 PM $200.00 3
20130714 1 07/14/2013 11:30 AM $300.00 4

Please notice how the DateKey field increments on every record so that a snapshot value is captured for every day. It starts when the Opportunity was created on 7/9/2013, captures data every day, and finishes on the last day (in this case, 7/14/2013). Even though the data only changed on July 9th, 11th, and 14th, the records on the 10th, 12th, and 13th are populated with the value the record held in the system on that day.

The rest of this article will briefly revisit the challenges identified in Part 1 of this post, and show the SQL code used to generate the desired Snapshot dataset as seen above.

The First Challenge
The CRM auditing records didn’t have a datetime stamp with enough precision to always be unique when combined with the Opportunity. Data analysis found a very small percentage of records with the exact same datetime stamp but different BookingTotal values. (For the purposes of this solution, I identified one such record and placed a change request to increase the precision of the datetime field.)

Solution: 
The following code can be used to create the new unique CRMAuditOrderID field. It partitions the data using the ROW_NUMBER() function with the OVER clause:

DECLARE @CRMAudit TABLE (
    AuditID INT NOT NULL,
    CRMAuditOrderID INT NOT NULL,
    DateKey INT NOT NULL,
    OpportunityID INT NOT NULL,
    ModifiedOn DATETIME NOT NULL,
    BookingTotal DECIMAL(19,4) NOT NULL
)

INSERT INTO @CRMAudit (
    AuditID,
    DateKey,
    OpportunityID,
    ModifiedOn,
    BookingTotal,
    CRMAuditOrderID
)
SELECT
    t1.AuditID,
    t1.DateKey,
    t1.OpportunityID,
    t1.ModifiedOn,
    t1.BookingTotal,
    t2.CRMAuditOrderID
FROM tblAudit t1
INNER JOIN (
    SELECT
        t3.AuditID,
        ROW_NUMBER() OVER (
            PARTITION BY t3.OpportunityID
            ORDER BY t3.ModifiedOn
        ) AS CRMAuditOrderID
    FROM tblAudit t3
) AS t2 ON t1.AuditID = t2.AuditID

SELECT * FROM @CRMAudit

The following table illustrates the results from the query above with the new CRMAuditOrderID field:

AuditID CRMAuditOrderID DateKey OpportunityID ModifiedOn BookingTotal
1 1 20130709 1 2013-07-09 07:40:00.000 100.0000
2 2 20130711 1 2013-07-11 10:21:00.000 150.0000
3 3 20130711 1 2013-07-11 17:46:00.000 200.0000
4 4 20130714 1 2013-07-14 11:30:00.000 300.0000
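For readers who want to sanity-check the partitioning logic outside SQL, here is a small Python sketch that emulates ROW_NUMBER() OVER (PARTITION BY OpportunityID ORDER BY ModifiedOn) against the sample rows above. (The helper name row_number and the tuple layout are my own, chosen for illustration.)

```python
from itertools import groupby

# Sample rows from tblAudit: (AuditID, OpportunityID, ModifiedOn, BookingTotal)
audit = [
    (1, 1, "2013-07-09 07:40", 100),
    (2, 1, "2013-07-11 10:21", 150),
    (3, 1, "2013-07-11 17:46", 200),
    (4, 1, "2013-07-14 11:30", 300),
]

def row_number(rows, partition, order):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)."""
    ordered = sorted(rows, key=lambda r: (partition(r), order(r)))
    result = {}
    for _, group in groupby(ordered, key=partition):
        for n, row in enumerate(group, start=1):
            result[row[0]] = n  # map AuditID -> order within its partition
    return result

crm_audit_order = row_number(audit,
                             partition=lambda r: r[1],   # OpportunityID
                             order=lambda r: r[2])       # ModifiedOn (ISO-style strings sort correctly)
print(crm_audit_order)  # {1: 1, 2: 2, 3: 3, 4: 4}
```

Note that even if two rows share the exact same timestamp, each still receives a distinct number within its partition, which is what makes the new key unique.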

The Second Challenge
As noted in the previous blog post, multiple change records were logged each day for an attribute on a given entity because CRM auditing was turned on in the Dynamics CRM environment. CRM auditing stores a transaction history record every time an attribute is changed, and oftentimes a CRM user would update an attribute and then update it again seconds later. Based on the sample data in the tblAudit table above, the Opportunity was updated twice on July 11th: once at 10:21 AM and again at 5:46 PM. The last update, at 5:46 PM, is the only record desired in this scenario to meet our data granularity requirement of the last known record on each day.

Solution: 
The following SQL statement will create the MostRecentValueForDay field and only get the AuditID for the last record for each day:

DECLARE @MostRecentValueForDay TABLE (AuditID INT NOT NULL)

INSERT INTO @MostRecentValueForDay (AuditID)
SELECT
    t1.AuditID
FROM tblAudit t1
INNER JOIN (
    SELECT
        t5.AuditID,
        ROW_NUMBER() OVER (
            PARTITION BY t5.OpportunityID,
                CAST(CAST(t5.ModifiedOn AS VARCHAR(11)) AS DATETIME)  -- strip the time portion
            ORDER BY t5.ModifiedOn DESC
        ) AS MostRecentValueForDayID
    FROM tblAudit t5
) AS t4 ON t1.AuditID = t4.AuditID
WHERE t4.MostRecentValueForDayID = 1

SELECT * FROM @MostRecentValueForDay

The SQL code above partitions the table by OpportunityID and calendar date (the nested CAST strips the time portion from ModifiedOn). It then uses the ROW_NUMBER() function with the OVER clause to number the records in each partition, ordering by ModifiedOn in descending order, so the record numbered 1 within each partition is the last record of that day. Once the date changes, the numbering starts over. If there is only one record on a given day, that single record is numbered 1.

The following table illustrates the AuditID that was identified as the last record on each day from the tblAudit table:

AuditID
1
3
4
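The same day-level filtering can be sketched in a few lines of Python (sample rows hard-coded, variable names my own): per Opportunity and calendar day, keep only the row with the latest ModifiedOn.

```python
# Sample rows from tblAudit: (AuditID, OpportunityID, ModifiedOn, BookingTotal)
audit = [
    (1, 1, "2013-07-09 07:40", 100),
    (2, 1, "2013-07-11 10:21", 150),
    (3, 1, "2013-07-11 17:46", 200),
    (4, 1, "2013-07-14 11:30", 300),
]

last_per_day = {}
for row in audit:
    audit_id, opp_id, modified_on, total = row
    key = (opp_id, modified_on[:10])  # (OpportunityID, calendar day)
    # A later ModifiedOn on the same day replaces the earlier record.
    if key not in last_per_day or modified_on > last_per_day[key][2]:
        last_per_day[key] = row

keep_ids = sorted(r[0] for r in last_per_day.values())
print(keep_ids)  # [1, 3, 4]
```

As in the SQL solution, AuditID 2 (the 10:21 AM update on July 11th) drops out because the 5:46 PM update supersedes it on the same day.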

The Third Challenge
There are gaps in transactional history data: the CRM auditing history only logged records when a user made changes to the Opportunity. Having gaps in the dataset makes it difficult to perform snapshot reporting and other calculations. There are three parts to the solution. First, SQL code must generate a new sequence number for performing a unique join. Second, SQL code must perform a self-join on this new sequence number to determine the starting and stopping points for filling the gaps in chronological order. Finally, the gaps are filled in with the Date dimension.

Solution Part A: 
The following SQL code will generate a new incrementing SequenceNumber without gaps to identify the order of the last known update of each day:

DECLARE @NewSequenceNumber TABLE (
    AuditID INT NOT NULL,
    DateKey INT NOT NULL,
    OpportunityID INT NOT NULL,
    SequenceNumber INT NOT NULL
)

INSERT INTO @NewSequenceNumber (
    AuditID,
    DateKey,
    OpportunityID,
    SequenceNumber
)
SELECT
    crm.AuditID,
    crm.DateKey,
    crm.OpportunityID,
    ROW_NUMBER() OVER (
        PARTITION BY crm.OpportunityID
        ORDER BY crm.ModifiedOn
    ) AS SequenceNumber
FROM @CRMAudit AS crm
INNER JOIN @MostRecentValueForDay AS mr ON crm.AuditID = mr.AuditID

SELECT * FROM @NewSequenceNumber

There were only three updates made to this Opportunity over the six days we care about. The following table illustrates the results of creating the new SequenceNumber field:

DateKey OpportunityID ModifiedOn BookingTotal CRMAuditOrder SequenceNumber
20130709 1 2013-07-09 07:40:00.000 100.0000 1 1
20130711 1 2013-07-11 17:46:00.000 200.0000 3 2
20130714 1 2013-07-14 11:30:00.000 300.0000 4 3

Solution Part B:
The following SQL performs a self-join on the new SequenceNumber field with an offset of one to look up the next record in the sequence we just generated:

SELECT
    CurrSeq.AuditID AS StartAuditID,
    CurrSeq.DateKey AS StartDateKey,
    crm.BookingTotal AS StartBookingTotal,
    NextSeq.AuditID AS StopAuditID,
    NextSeq.DateKey AS StopDateKey
FROM @CRMAudit AS crm
INNER JOIN @NewSequenceNumber AS CurrSeq ON crm.AuditID = CurrSeq.AuditID
LEFT JOIN @NewSequenceNumber AS NextSeq ON CurrSeq.OpportunityID = NextSeq.OpportunityID
    AND CurrSeq.SequenceNumber = NextSeq.SequenceNumber - 1

This SQL code builds a pointer from each change to the next change across all the Opportunity records using a set-based operation. Each linked pair of records tells us when to start filling in the gaps and when to stop.

Here is the result of the data:

StartAuditID StartDateKey StartBookingTotal StopAuditID StopDateKey
1 20130709 100.0000 3 20130711
3 20130711 200.0000 4 20130714
4 20130714 300.0000 NULL NULL

In the table above, we now have the data needed to know when to start and stop filling in the gaps. On the first row, with StartAuditID = 1, the StartDateKey value of 20130709 tells us when the record was created (on the 9th). Adding one day to this value gives us the Start Date to begin filling in the gaps. The StopDateKey value of 20130711 tells us when the next record was created (on the 11th). Subtracting one day from the StopDateKey value gives us the Stop Date. Only one record is required to fill the gap for this first row: a new record on the 10th with a BookingTotal of $100.

The second row has StartAuditID = 3. The same logic determines how many records are needed to fill the gap. With the Start Date on the 11th and the Stop Date on the 14th, there are two days (the 12th and 13th) where the Opportunity was not modified, and those records must be created with a BookingTotal of $200.
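The gap computation just described can be sketched in Python: pair each last-known daily row with its successor (a zip with an offset of one plays the role of the SequenceNumber self-join) and emit one carried-forward row per missing day. The sample values are hard-coded from the tables above.

```python
from datetime import date, timedelta

# Last-known row per day, in order: (day, AuditID, BookingTotal)
daily = [
    (date(2013, 7, 9), 1, 100),
    (date(2013, 7, 11), 3, 200),
    (date(2013, 7, 14), 4, 300),
]

gap_rows = []
for (start_day, audit_id, total), (stop_day, _, _) in zip(daily, daily[1:]):
    day = start_day + timedelta(days=1)   # start filling one day after the change
    while day < stop_day:                 # stop one day before the next change
        gap_rows.append((day, audit_id, total))  # carry the last known value forward
        day += timedelta(days=1)

print(gap_rows)
# [(datetime.date(2013, 7, 10), 1, 100),
#  (datetime.date(2013, 7, 12), 3, 200),
#  (datetime.date(2013, 7, 13), 3, 200)]
```

Exactly as worked out above: one filler row for the 10th at $100, and two filler rows (the 12th and 13th) at $200.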

Solution Part C:
Finally, the following code uses a CROSS JOIN with the DateDim table to fill in the gaps with the missing data, and combines this dataset with the tblAudit data (transactional history plus the gaps) to provide a complete snapshot dataset:

SELECT
    dd.DateKey,
    crm.OpportunityID,
    crm.ModifiedOn,
    crm.BookingTotal,
    crm.CRMAuditOrderID
FROM @CRMAudit AS crm
INNER JOIN @NewSequenceNumber AS CurrSeq ON crm.AuditID = CurrSeq.AuditID
LEFT JOIN @NewSequenceNumber AS NextSeq ON CurrSeq.OpportunityID = NextSeq.OpportunityID
    AND CurrSeq.SequenceNumber = NextSeq.SequenceNumber - 1
CROSS JOIN DateDim dd
WHERE dd.DateKey > CurrSeq.DateKey
    AND dd.DateKey < NextSeq.DateKey

UNION

SELECT
    crm.DateKey,
    crm.OpportunityID,
    crm.ModifiedOn,
    crm.BookingTotal,
    crm.CRMAuditOrderID
FROM @CRMAudit AS crm
INNER JOIN @MostRecentValueForDay AS mr ON crm.AuditID = mr.AuditID

Please note that in the start/stop table from Solution Part B, the StopAuditID and StopDateKey of the last record are NULL; there is nothing for the last record to point to. The last record in the series might require some special filtering logic in your scenario to know when to stop filling in the gaps. (For example, you may want to stop filling in the gaps by using a WHERE clause once the Opportunity State field is 'Closed'.)
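To see the end-to-end result outside SQL, here is a compact Python sketch (sample values hard-coded, names my own) that unions the real last-of-day audit rows with the gap-filled rows into the final snapshot. The last record, with its NULL stop point, simply covers its own day here.

```python
from datetime import date, timedelta

# Last-known row per day: (day, ModifiedOn time, BookingTotal, CRMAuditOrderID)
daily = [
    (date(2013, 7, 9), "07:40 AM", 100, 1),
    (date(2013, 7, 11), "05:46 PM", 200, 3),
    (date(2013, 7, 14), "11:30 AM", 300, 4),
]

snapshot = []
for i, (day, modified, total, order) in enumerate(daily):
    # Carry each value forward until the day before the next change;
    # the last record (NULL stop point) covers only its own day.
    stop = daily[i + 1][0] if i + 1 < len(daily) else day + timedelta(days=1)
    while day < stop:
        snapshot.append((day, modified, total, order))
        day += timedelta(days=1)

for row in snapshot:
    print(row)
```

This yields six rows, July 9th through 14th, matching the desired snapshot table shown at the start of the article.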

This article frames the challenges and provides a technical SQL solution for converting transactional history data to snapshot data in a relatively short period of time. Some areas to consider when converting to snapshot data are the granularity of the data you’ll need, ensuring high data quality, and making sure there are no gaps between record changes.

Do you have further questions or comments? Leave us your feedback below!
