Removing duplicates in a dataset

As most of you know, “data is only as good as its weakest record”. Or was it “the possession of facts is knowledge, the use of data is wisdom”? Either way, data quality is one of the most important aspects of a dataset. In practice, however, it’s one of the first things that fails.

Data quality isn’t something you can create overnight; it’s achieved over a longer period of time. And because data is volatile, the quality of that data is volatile as well. Who in your development or administration team is responsible for data quality? If you can’t think of anyone, you don’t have to be ashamed: in many companies data quality is underestimated.

 
Duplicates
One of the biggest mistakes I’ve seen at most companies I worked for was data duplication. Developers created the database, forgot to put primary and foreign keys on the tables, and didn’t think of unique constraints. What do you get? Well, the least of the problems you’ll see is data duplication: similar records being inserted twice or even more often, without any constraint to prevent it.

I’m not going to tell you in detail how to prevent duplicates, because that’s a whole other story. Start by looking at primary keys, unique constraints, and (in most cases) the overall design of your database. A small sketch of the idea is shown below.
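Just to give you an idea of what that looks like (the table and constraint names below are made up for this example), a primary key combined with a unique constraint on the value itself would have prevented the duplicates from ever being inserted:

-- Hypothetical example: an IDENTITY column as primary key,
-- and a unique constraint on the actual value.
CREATE TABLE ExampleTable
  (Id    INT IDENTITY(1,1) NOT NULL,
   Descr VARCHAR(10) NOT NULL,
   CONSTRAINT PK_ExampleTable PRIMARY KEY (Id),
   CONSTRAINT UQ_ExampleTable_Descr UNIQUE (Descr))

-- The first insert succeeds; the second one fails with a
-- "Violation of UNIQUE KEY constraint" error instead of creating a duplicate:
INSERT INTO ExampleTable (Descr) VALUES ('Value 2')
INSERT INTO ExampleTable (Descr) VALUES ('Value 2')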

 
Removing duplicates
The main question is: how do I get rid of the duplicates, without throwing away too much data? First of all, let’s create a small table:

CREATE TABLE #T1
  (Descr VARCHAR(10))

 
Now let’s insert some data, including some duplicates:

INSERT INTO #T1
  (Descr)
VALUES
  ('Value 1'),
  ('Value 2'),
  ('Value 2'),
  ('Value 3'),
  ('Value 4'),
  ('Value 5'),
  ('Value 6'),
  ('Value 6'),
  ('Value 6'),
  ('Value 7'),
  ('Value 8')

 
If you try to remove the duplicates with a normal DELETE statement, you can only delete all versions of a duplicate record at once (see the example below). So how can we remove just one of the duplicates?
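Just to illustrate the problem, this straightforward DELETE removes both “Value 2” records, instead of just the extra one (if you’re following along, re-run the INSERT afterwards, because this really does delete both):

-- This deletes ALL records with this value, including the one we want to keep:
DELETE FROM #T1
WHERE Descr = 'Value 2'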

 
ROW_NUMBER to the rescue!
In order to delete just one of the duplicates, we need to mark the “extra” versions of the record. But how can we do that? The records are identical, right? This is where the ROW_NUMBER function comes in: we can use it to add a unique number to each record.

ROW_NUMBER returns a sequential number for every record in the result set, based on the grouping (PARTITION BY) and the ORDER BY you choose.

In the table we just created, you can see that “Value 2” appears twice, and that “Value 6” was even inserted three times. We want to remove these duplicates completely, keeping only one record for each value.

Now we just add the ROW_NUMBER to our query:

SELECT
  Descr,
  ROW_NUMBER() OVER(PARTITION BY Descr
                    ORDER BY Descr)       AS RowNumber
FROM #T1

 
The result looks like this:

Descr       RowNumber
----------  ---------
Value 1     1
Value 2     1
Value 2     2
Value 3     1
Value 4     1
Value 5     1
Value 6     1
Value 6     2
Value 6     3
Value 7     1
Value 8     1

As you can see, every record now has a number that is unique within its partition (which is basically a group by) on the “Descr” column. The ORDER BY is mandatory, but doesn’t affect the result in this example, because the records within a partition are identical.
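In a real table the ORDER BY does matter, because it determines which record gets number 1, and that’s the record that will survive the delete. As a hypothetical sketch (our #T1 doesn’t have such a column), imagine a table that also stores when each record was inserted:

-- Hypothetical table that records the insert moment:
CREATE TABLE #T2
  (Descr      VARCHAR(10),
   InsertedOn DATETIME)

INSERT INTO #T2
  (Descr, InsertedOn)
VALUES
  ('Value 2', '2015-01-01'),
  ('Value 2', '2015-06-01')

-- Ordering by InsertedOn gives the oldest record RowNumber 1,
-- so the oldest record is the one that survives:
SELECT
  Descr,
  InsertedOn,
  ROW_NUMBER() OVER(PARTITION BY Descr
                    ORDER BY InsertedOn ASC)  AS RowNumber
FROM #T2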

 
Removing the duplicates
Because ROW_NUMBER can only be used in the SELECT list or the ORDER BY clause, we can’t reference it directly in the WHERE clause of a DELETE statement. But we can use a CTE (Common Table Expression) to delete the duplicates:

WITH DeleteCTE AS
  (SELECT
      Descr,
      ROW_NUMBER() OVER(PARTITION BY Descr ORDER BY Descr) AS RowNumber
   FROM #T1)

DELETE FROM DeleteCTE
WHERE RowNumber <> 1

 
ROW_NUMBER makes every record uniquely identifiable within its partition, and because we delete through the CTE, SQL Server removes the underlying records from #T1. We only want to keep one version of each value, so we delete everything with a “RowNumber” that isn’t 1.

The result of this looks like this:

Descr
----------
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8

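If you want to double-check that the delete worked, you can count the records per value. The query below (a small sanity check I’m adding for illustration; it’s not part of the original script) should return no rows:

SELECT
  Descr,
  COUNT(*) AS NumberOfRecords
FROM #T1
GROUP BY Descr
HAVING COUNT(*) > 1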
Conclusion
Removing duplicates is something we can do very easily, but it’s only a single step towards data quality. There are many more steps to follow before you have a trustworthy dataset that is reliable at any time of the day.

After removing the duplicates from your dataset, don’t forget to add primary and foreign keys and unique constraints to prevent this situation in the future (see the sketch below). Because running a script like this every week isn’t going to solve your problems…
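As a final, minimal sketch (using the demo temp table, with a constraint name I made up), adding a unique constraint after the cleanup makes sure the duplicates can’t come back:

-- Now that every value is unique, we can add the constraint
-- that should have been there from the start:
ALTER TABLE #T1
  ADD CONSTRAINT UQ_T1_Descr UNIQUE (Descr)

-- From now on, inserting a duplicate fails instead of silently succeeding.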

3 Responses to Removing duplicates in a dataset

  1. Sander Kooij says:

    I’m not a SQL guy (so this might be a stupid comment), but since you’re already using a temp table and not looking at the related content of the data, wouldn’t it just be easier to select a distinct of all values, truncate the table with the original values, and insert the ‘distincted’ dataset?

    • DevJef says:

      You’re right. In this case I’m using a temp table without any relations to other tables, just as an example.

      But let’s look at a real datamodel: even if it’s possible to create a new table with only distinct values (which in many cases isn’t an option), you’re losing your statistics and indexes on the table. Maybe not physical, but your statistics will be stale. And re-creating or rebuilding indexes will cost you more time to fix, than you save using such a switch.

      Even though I’ve used the trick you mention several times, it might not always be the best way to fix your issue. And for this post, I took the example with the least impact.

      So your question is a very interesting one. Thank you for that! And I think I have a subject for the next blog post 🙂

  2. Pingback: Write readable and high-performance queries with Window Functions | SQL from the Trenches
