Removing duplicates in a dataset

As most of you know, “data is only as good as its weakest record”. Or was it “The possession of facts is knowledge, the use of data is wisdom”? Anyway, data quality is one of the most important aspects of a dataset. But in practice, this is one of the first things that fails.

Data quality isn’t something you can create overnight, but it’s something that is achieved over a longer period of time. Because data is volatile, the quality of that data is becoming volatile itself. Who in your development- or administration-team is responsible for data quality? If you can’t think of anyone, you don’t have to be ashamed. In many companies data quality is underestimated.

 
Duplicates
One of the biggest mistakes that I saw at most companies I worked for, was data duplication. Developers created the database, forgot to put primary- and foreign keys on the tables, and they didn’t think of unique constraints. What do you get? Well, one of the least problems you’ll see is data duplication. Similar records being inserted twice or even more, without any constraint to prevent this.

I’m not going to tell you how to prevent duplicates, because that’s a whole other story. Start looking at primary keys, unique constraints, and (in most cases) the overall design of your database.

 
Removing duplicates
The main questions is: how do I get rid of duplicates, without trashing too much data. First of all, let’s create a small table:

CREATE TABLE #T1
  (Descr VARCHAR(10))

 
Now let’s insert some data, including some duplicates:

INSERT INTO #T1
  (Descr)
VALUES
  ('Value 1'),
  ('Value 2'),
  ('Value 2'),
  ('Value 3'),
  ('Value 4'),
  ('Value 5'),
  ('Value 6'),
  ('Value 6'),
  ('Value 6'),
  ('Value 7'),
  ('Value 8')

 
If you’d try to remove the duplicates with a normal DELETE statement, you can only delete both versions of the duplicate record. So how can we remove only one of the duplicates?

 
ROW_NUMBER to the rescue!
In order to delete one of the duplicates, we need to mark the “extra version” of the record. But how can we do this? The records are identical, right? In this case, we can use the function ROW_NUMBER to add a unique ID to the records.

ROW_NUMBER returns a sequential number for every record in the database, based on the grouping (Partition) and order you choose.

In the table we just created, you see that there are two duplicates for “Value 2″, and that the “Value 6″ record is even inserted three times. We want to completely remove these duplicates, and want to save only 1 record for each value.

Now we just add the ROW_NUMBER to our query:

SELECT
  Descr,
  ROW_NUMBER() OVER(PARTITION BY Descr
                    ORDER BY Descr)       AS RowNumber
FROM #T1

 
The result looks like this:

 
As you can see, every record has a unique number, based on the partition (which is basically a group by clause) of the “Descr” column. The order by is mandatory, but doesn’t affect the result in this example.

 
Removing the duplicates
Because ROW_NUMBER can only be used in the SELECT or ORDER BY statement, we can’t directly use it in a DELETE statement. But we can use a CTE to delete the duplicates:

WITH DeleteCTE AS
  (SELECT
      Descr,
      ROW_NUMBER() OVER(PARTITION BY Descr ORDER BY Descr) AS RowNumber
   FROM #T1)


DELETE D
FROM #T1 T
INNER JOIN DeleteCTE D
  ON D.Descr = T.Descr
WHERE D.RowNumber <> 1

 
ROW_NUMBER creates unique records for us, which we are able to delete in the following DELETE statement. We only want to keep one version, so we delete everything that has a “RowNumber” that isn’t 1.

The result of this looks like this:

 
Conclusion
Removing duplicates is something we can do very easily, and is only one single step in achieving data quality. There are many more steps to follow, before you have a trustworthy dataset, that is reliable any time of the day.

After removing the duplicates from your dataset, don’t forget to add primary- and foreign keys and unique constraint to prevent this situation in the future. Because running a script like this every week, isn’t going to solve your problems…

Recursive CTEs: Opening the door to a new dimension

A few weeks ago, someone asked me what a Common Table Expression (CTE) was, and how it works. But writing about a CTE would result in a short blog post. So why not make it a little more interesting, and also write about recursive CTE’s, right?

Since SQL Server 2005 we have the opportunity to use something called a Common Table Expression, or CTE for short. This T-SQL feature allows us to use the resultset of a query, as a source for another statement, just like a physical table. This sounds weird, but let me show you.

 
Creating the resources needed
To explain the possibilities of a CTE, let’s create a simple example with employee data. We’re going to create a table that contains the employee’s ID, name, job title and the employee ID of the manager:

CREATE TABLE dbo.Employees
    (EmployeeID INT,
	 EmployeeName VARCHAR(25),
     JobTitle VARCHAR(50),
     ManagerID INT)
 
 
INSERT INTO dbo.Employees
    (EmployeeID, JobTitle, EmployeeName, ManagerID)
VALUES
    (1, 'CEO', 'Smith', NULL),
    (2, 'Department Manager', 'Winters', 1),
    (3, 'Team Manager', 'Thompson', 2),
    (4, 'Employee', 'Jones', 3);

 
If you select all records from this table, it will look like this:

 
So as you can see, every employee has a manager, except the CEO. He’s at the top of the food chain and no one has a higher position than him. And looking at the “ordinary” employee, you’ll see that he has a manager above him, but no employees beneath him. He’s at the bottom of the food chain so to speak.

But what about the CTE? Let’s write an easy example to show you how to add the employees last names and job titles together, without doing that in your final select statement:

WITH EmployeeList AS
(
    SELECT
		EmployeeID,
        EmployeeName
        + ' - ' +
        JobTitle AS EmployeeNameAndJobTitle
    FROM dbo.Employees
)


SELECT
	EmployeeID,
	EmployeeNameAndJobTitle
FROM EmployeeList;

 
If we analyse the example above, the select query that adds the employees last name and job title together becomes a new resultset that is called “EmployeeList”, and can be used just like a normal table in the query below the CTE. The result looks like this”

 
But remember: the CTE just exists in the scope of your query. This means that you can only run one select statement on the CTE. If you try to run a second select statement, you’ll see that the CTE will no longer exist.

Another thing you can do with a CTE, is directly add the column names to the CTE like this:

WITH EmployeeList (EmployeeID, EmployeeNameAndJobTitle) AS
(
    SELECT
		EmployeeID,
        EmployeeName
        + ' - ' +
        JobTitle
    FROM dbo.Employees
)


SELECT
	EmployeeID,
	EmployeeNameAndJobTitle
FROM EmployeeList;

 
This might be easier to read for you and your colleagues, instead of naming your column in the select with an “AS EmployeeID”. The result is exactly the same as in the example above.

This is just an easy example, but imagine that you can use this to join your aggregated temporary dataset from your CTE, onto another table that you need to build a final resultset.

 
Recursive
But now comes the best part: recursive CTE’s. To stick with this example, what if you want to combine the job title of the employee with the manager above him? This sort of resultsets can be created with a recursive CTE. In the example below, the job title of the manager is added to the employees job title:

WITH HierarchyCTE (EmployeeID, ManagerID, JobTitle, Hierarchy)
AS
	(SELECT
		EmployeeID,
		ManagerID,
		JobTitle,
		CONVERT(VARCHAR(35), '')
	 FROM dbo.Employees
	 WHERE ManagerID IS NULL

	 UNION ALL

	 SELECT
		E.EmployeeID,
		E.ManagerID,
		E.JobTitle,
		CONVERT(VARCHAR(35), H.JobTitle + ' \ ' + E.JobTitle)
	 FROM dbo.Employees E
	 INNER JOIN HierarchyCTE H
		ON H.EmployeeID = e.ManagerID
)


SELECT
	EmployeeID,
	ManagerID,
	JobTitle,
	Hierarchy
FROM HierarchyCTE;

 
This is what the result looks like:

 
What happens in the recursive CTE, is that the first query is used as an anchor, and a loop is executed to add every child-record to the anchor (parent)-record. If there are no child-records left for the parent, the next parent is fetched, and so on.

 
So what can you do with this?
The examples above were just easy examples, but you can imagine the opportunities with this functionality. For example, if you work for an electronics store, you could build a long string of all packages with sub-deliveries (when every phone is send with a screen protector and case for example). Or create a complete overview of contents for recipes (for example, which ingredients are put into a pie). The possibilities are endless…

Row_Number: Unique ID in select statement

Last week I was asked by a colleague, Sander (Blog | @SanderKooij), “What’s the easiest way of adding a unique identifier (or row ID) to a result set?”. That’s an interesting question. There are several ways to do this. For example, you could insert the resultset into a temp table (physical or memory temp table), with an extra column (Identity). But what if you want to do this in a Select statement itself? Here’s a solution.

If you use the scripts I’ve added to this post (check Code Samples), you’ve created a table with country abbreviation codes. This table contains the data as shown below:

If you select the data, and want to add a record ID to your result set, you can use the following script:

SELECT
	DefaultAbbreviation,
	CountryProper,
	CountryNumber,
	ROW_NUMBER() OVER(ORDER BY DefaultAbbreviation ASC) AS RowNumber
FROM CountryAbbreviationCodes

The statement above uses the ROW_NUMBER built-in function to generate a row number for the result set. The ORDER BY clause in the functions is used to tell SQL Server what column(s) to use, to create a row number. In this case, the result set is ordered by the abbreviation column.

But what if you want to select the data with a specific row number? If you try to use the ROW_NUMBER function in the where clause, you get the following error:

“Windowed functions can only appear in the SELECT or ORDER BY clauses.”

So you need another solution. From SQL Server 2005 onwards we can use a Common Table Expression (CTE). With a CTE you can use a select statement as a table. So if you want to return the rows that are numbered 50 through 60, you can use the following query:

WITH OrderedCountries AS
(
	SELECT
		DefaultAbbreviation,
		CountryProper,
		CountryNumber,
		ROW_NUMBER() OVER(ORDER BY DefaultAbbreviation ASC) AS RowNumber
	FROM CountryAbbreviationCodes
)

SELECT
	DefaultAbbreviation,
	CountryProper,
	CountryNumber,
	RowNumber
FROM OrderedCountries
WHERE 1 = 1
AND RowNumber BETWEEN 50 AND 60

The result of this statement is the 10 rows we requested. So with a CTE we can use the ROW_NUMBER function to return specific rows from a table.

With the ROW_NUMBER functions, you can also delete duplicate records from your tables. If interested, post a comment, mail or tweet me, and I might write a blog post about it.


Code samples:
CreateSources.sql
SelectStatement.sql
CTESelectStatement.sql

Follow

Get every new post delivered to your Inbox.

Join 34 other followers