I came across this query the other day, and it led to quite a bit of head scratching. The goal of the query was to de-duplicate data using a row ranking algorithm; however, every time I ran it, it returned random and unpredictable results! Needless to say, I couldn't keep running a faulty query, so I had to get to the bottom of it. Troubleshooting consisted of removing the row rank and checking all the rows returned by the query. Looking at the full result set, I noticed that the ID ranked in the 1 spot by the row ranking algorithm seemed to change between runs. Since the original query only returns rows ranked in the 1 spot, I would receive inconsistent data every time it was executed. In this blog I'll explain the issue behind the query using a simplified version of it, with a test table and test data that I created.
This query attempts to de-duplicate the data. It partitions the rows by matching keys and orders each partition by transaction date, descending. I then grab the top row, since it carries the most recent date for that key. If there are two 'C's in the key column, I want to see the 'C' with the latest transaction date. This gives you a successful de-duplication: you see one 'C' and one ID returned instead of multiple 'C's and multiple IDs. When you run this query once, you get this:
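The de-duplication pattern described above can be sketched in Python with the stdlib `sqlite3` module. The table and column names (`txns`, `key`, `transaction_date`) are my own illustrative assumptions, not the author's actual schema, but the shape of the query — `ROW_NUMBER()` partitioned by key, ordered by transaction date descending, keeping only rank 1 — follows the description:

```python
import sqlite3

# Hypothetical test table mirroring the article's setup: a key column,
# an id, and a transaction date. Names are illustrative, not the author's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txns (id INTEGER, key TEXT, transaction_date TEXT);
INSERT INTO txns VALUES
  (1, 'C', '2023-01-01'),
  (2, 'C', '2023-03-15'),
  (3, 'J', '2023-02-10'),
  (4, 'J', '2023-02-01');
""")

# De-duplicate: rank rows within each key by transaction_date descending,
# then keep only the top-ranked row per key.
rows = conn.execute("""
SELECT id, key, transaction_date FROM (
  SELECT id, key, transaction_date,
         ROW_NUMBER() OVER (
           PARTITION BY key
           ORDER BY transaction_date DESC
         ) AS rn
  FROM txns
) WHERE rn = 1
""").fetchall()

print(rows)  # one row per key, carrying that key's latest transaction_date
```

With unique transaction dates per key, as in this toy data, the result is stable: one row per key, always the row with the latest date.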
Everything looks good here! However, let’s run it again just to make sure…
What happened? If you run it again, you will see totally different IDs returned. The issue is that some keys share the exact same transaction date. When the database orders these, it ends up returning an arbitrary row from among the tied dates, so the ID returned from that partition is random and unpredictable. The fix is to make sure the columns we order on to de-duplicate the data are unique. If I add another element to the ORDER BY, created_date, the ordering becomes unique and the query returns a correct, consistent result.
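The fix can be sketched the same way. Again the schema is my assumption, but it shows the key point: rows for 'C' tie on transaction_date, and adding created_date as a second ORDER BY term breaks the tie deterministically:

```python
import sqlite3

# Sketch of the fix: when transaction_date ties, ordering on it alone is
# nondeterministic, so a second column (created_date) breaks the tie.
# Table and column names are assumptions, not the author's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txns (id INTEGER, key TEXT,
                   transaction_date TEXT, created_date TEXT);
INSERT INTO txns VALUES
  (1, 'C', '2023-03-15', '2023-03-15 09:00'),  -- tied transaction_date...
  (2, 'C', '2023-03-15', '2023-03-15 11:30'),  -- ...broken by created_date
  (3, 'J', '2023-02-10', '2023-02-10 08:00');
""")

query = """
SELECT id, key FROM (
  SELECT id, key,
         ROW_NUMBER() OVER (
           PARTITION BY key
           ORDER BY transaction_date DESC, created_date DESC
         ) AS rn
  FROM txns
) WHERE rn = 1
"""

# With a unique tiebreaker, repeated runs agree on which row wins.
first = conn.execute(query).fetchall()
second = conn.execute(query).fetchall()
assert first == second
print(first)
```

Here 'C' resolves to the row with the later created_date every time, instead of an arbitrary one of the two tied rows.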
When we run the query twice with this added, we will get this result:
So ‘C’ and ‘J’ are returning correctly now, but not ‘E’. Let’s look at the data in the table to see why this is.
'C' and 'J' now have unique created dates mixed in with the many rows of duplicated data. The query still finds the duplicated keys and transaction dates, but it grabs the same row on every run thanks to the unique created date we also sort on. 'E', however, still has duplicated data in the created date column, so it will continue to return random row IDs. The created date must be unique within the partition for this to work!
The moral of this story: when you run a query to de-duplicate by date, you need to make sure the combination of partition and ORDER BY columns is unique. Letting a query like this run unchecked in your solutions can have a serious impact on the quality of your data. If you're relying on it to give you the most recent ID for a key, multiple runs could give you multiple differing "most recent" IDs, which will then cause duplicates to be added to another table! Then you would have to de-duplicate that data too, or run the risk of a butterfly effect rippling through your database solutions. An issue like this also hurts the development side during testing: when it's time to dive into the code, a query like this returns unpredictable, random results, leaving the developer clicking the execute button over and over in confusion, wondering why the query functions as a random ID generator.
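One way to make the ORDER BY unique by construction — my own suggestion here, not something the original query does — is to end it with a column that can never tie, such as the table's primary key. A minimal sketch, with an illustrative schema:

```python
import sqlite3

# Sketch: guarantee determinism by ending the ORDER BY with a column that
# is unique by construction (a primary-key id). Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txns (id INTEGER PRIMARY KEY, key TEXT,
                   transaction_date TEXT, created_date TEXT);
INSERT INTO txns VALUES
  (1, 'E', '2023-04-01', '2023-04-01 10:00'),  -- fully duplicated dates,
  (2, 'E', '2023-04-01', '2023-04-01 10:00');  -- like 'E' in the article
""")

picked = conn.execute("""
SELECT id FROM (
  SELECT id,
         ROW_NUMBER() OVER (
           PARTITION BY key
           -- id is unique, so the top-ranked row can never be a tie
           ORDER BY transaction_date DESC, created_date DESC, id DESC
         ) AS rn
  FROM txns
) WHERE rn = 1
""").fetchall()

print(picked)  # the same id for 'E' on every run
```

Even when every date column is duplicated, the final `id DESC` term pins down a single winner, so the query stops behaving like a random ID generator.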