Removing duplicates in a dataset
October 21, 2013 3 Comments
As most of you know, “data is only as good as its weakest record”. Or was it “The possession of facts is knowledge, the use of data is wisdom”? Anyway, data quality is one of the most important aspects of a dataset. But in practice, this is one of the first things that fails.
Data quality isn’t something you can create overnight, but it’s something that is achieved over a longer period of time. Because data is volatile, the quality of that data is becoming volatile itself. Who in your development- or administration-team is responsible for data quality? If you can’t think of anyone, you don’t have to be ashamed. In many companies data quality is underestimated.
One of the biggest mistakes that I saw at most companies I worked for, was data duplication. Developers created the database, forgot to put primary- and foreign keys on the tables, and they didn’t think of unique constraints. What do you get? Well, one of the least problems you’ll see is data duplication. Similar records being inserted twice or even more, without any constraint to prevent this.
I’m not going to tell you how to prevent duplicates, because that’s a whole other story. Start looking at primary keys, unique constraints, and (in most cases) the overall design of your database.
The main questions is: how do I get rid of duplicates, without trashing too much data. First of all, let’s create a small table:
CREATE TABLE #T1 (Descr VARCHAR(10))
Now let’s insert some data, including some duplicates:
INSERT INTO #T1 (Descr) VALUES ('Value 1'), ('Value 2'), ('Value 2'), ('Value 3'), ('Value 4'), ('Value 5'), ('Value 6'), ('Value 6'), ('Value 6'), ('Value 7'), ('Value 8')
If you’d try to remove the duplicates with a normal DELETE statement, you can only delete both versions of the duplicate record. So how can we remove only one of the duplicates?
ROW_NUMBER to the rescue!
In order to delete one of the duplicates, we need to mark the “extra version” of the record. But how can we do this? The records are identical, right? In this case, we can use the function ROW_NUMBER to add a unique ID to the records.
ROW_NUMBER returns a sequential number for every record in the database, based on the grouping (Partition) and order you choose.
In the table we just created, you see that there are two duplicates for “Value 2”, and that the “Value 6” record is even inserted three times. We want to completely remove these duplicates, and want to save only 1 record for each value.
Now we just add the ROW_NUMBER to our query:
SELECT Descr, ROW_NUMBER() OVER(PARTITION BY Descr ORDER BY Descr) AS RowNumber FROM #T1
The result looks like this:
As you can see, every record has a unique number, based on the partition (which is basically a group by clause) of the “Descr” column. The order by is mandatory, but doesn’t affect the result in this example.
WITH DeleteCTE AS (SELECT Descr, ROW_NUMBER() OVER(PARTITION BY Descr ORDER BY Descr) AS RowNumber FROM #T1) DELETE D FROM #T1 T INNER JOIN DeleteCTE D ON D.Descr = T.Descr WHERE D.RowNumber <> 1
ROW_NUMBER creates unique records for us, which we are able to delete in the following DELETE statement. We only want to keep one version, so we delete everything that has a “RowNumber” that isn’t 1.
The result of this looks like this:
Removing duplicates is something we can do very easily, and is only one single step in achieving data quality. There are many more steps to follow, before you have a trustworthy dataset, that is reliable any time of the day.
After removing the duplicates from your dataset, don’t forget to add primary- and foreign keys and unique constraint to prevent this situation in the future. Because running a script like this every week, isn’t going to solve your problems…