SSIS – Remove empty rows from Excel import

From the first time that I started SSIS, I started to love it. In most cases it’s easy to create a package, easy to understand, and even readable for people who don’t “speak fluent SQL”. But what if you want to perform an easy task, and the result isn’t what you expect?

 
Formatting an Excel sheet
One of the most basic tasks you can create in SSIS, is importing an Excel sheet. Most of the time this works like a charm. But in my case, I wanted to filter out some rows from the workbook.

The business delivers an Excel sheet, that needs to be imported into the database. But because they don’t have the technical knowledge we have, they don’t know how important the format of the file is. They sent us this file (I’ve created a smaller sample, so it’s easier to read and understand):

 
The first thing you’ll notice as a data-professional is the 2 empty rows in the sheet. Beside that, we have an employee without a name. You know this is going to cause problems when you see it. These errors are easy to spot in the example, but imagine if these 2 rows are hidden in a dataset with 50.000 or more rows. So even though they might ended up there accidentally, your process is going to fail.

 
When you add an “Excel Source” to your package, and you look at the preview of that import, you immediately see the problem:

 
Table structure
In order to determine what columns can be left blank, and what columns can’t be NULL, I looked at the table structure:

CREATE TABLE ResultSSIS
  (ID INT IDENTITY(1, 1),
   FullName VARCHAR(50) NOT NULL,
   Department VARCHAR(50) NULL,
   EmployeeNumber INT NOT NULL)

 
So in the dataset, FullName and EmpolyeeNumber are mandatory, and Department is optional. With this in mind, I started to work on a way to exclude those rows.

 
Import without excluding
The first thing I tried is to import the file, and see what the results are. Because I knew the data wasn’t correct, I didn’t want to import the Excel sheet into a SQL Server database just yet. So as a destination, I used the “recordset destination” control in SSIS. Importing the data into this memory table also allowed me to use the “data viewer” to see the imported data, without the need to truncate a table after each run. You can enable the “data viewer” by right-clicking the import-connector (the arrow between controls), and click “Enable Data Viewer”:

 
If you run the SSIS package in debugging mode, you’ll see the data that is imported in a pop-up window:

 
As you can see in the screenshot above, the records with NULL values in it are included in this import. So which records do we want to exclude, based on our table structure?

 
So from the 6 records in the Excel sheet, we want to exclude 3 in our import because of NULL values. But how do we do that? The easiest way to solve it, is to import it into a temp table, delete the NULL records, and insert the other records in the destination table. But what if that isn’t possible, and you want to filter the records in your import? I’ve chose to use the “Conditional Split”.

 
Conditional Split
You don’t have to rebuild your whole package, when you want to exclude records with the “Conditional Split”. You can just add this control, at least in this case, in between you source file and your destination. If you open the control, you can add an expression that is used to filter records. In my case, I wanted to exclude the rows with an empty “FullName” and “EmployeeNumber”:

 
When connecting your “Conditional Split” to your destination, SSIS will ask you what output the “Conditional Split” needs to return. To output the entire set without the empty rows, chose the “Conditional Split Default Output”:

 
When you run your package with the extra “Conditional Split” (and you enable Data Viewer again), you’ll see the filtered output of the “Conditional Split”. The 3 NULL records are excluded like expected:

 
Conclusion
SSIS is easy to use, and yet a a really powerful tool. Even if you build your processes in SSIS, it’s not always necessary to rebuild your whole package. Sometimes you can save the day with just a minor change. That’s the power of SSIS!

Recursive CTEs: Opening the door to a new dimension

A few weeks ago, someone asked me what a Common Table Expression (CTE) was, and how it works. But writing about a CTE would result in a short blog post. So why not make it a little more interesting, and also write about recursive CTE’s, right?

Since SQL Server 2005 we have the opportunity to use something called a Common Table Expression, or CTE for short. This T-SQL feature allows us to use the resultset of a query, as a source for another statement, just like a physical table. This sounds weird, but let me show you.

 
Creating the resources needed
To explain the possibilities of a CTE, let’s create a simple example with employee data. We’re going to create a table that contains the employee’s ID, name, job title and the employee ID of the manager:

CREATE TABLE dbo.Employees
    (EmployeeID INT,
	 EmployeeName VARCHAR(25),
     JobTitle VARCHAR(50),
     ManagerID INT)
 
 
INSERT INTO dbo.Employees
    (EmployeeID, JobTitle, EmployeeName, ManagerID)
VALUES
    (1, 'CEO', 'Smith', NULL),
    (2, 'Department Manager', 'Winters', 1),
    (3, 'Team Manager', 'Thompson', 2),
    (4, 'Employee', 'Jones', 3);

 
If you select all records from this table, it will look like this:

 
So as you can see, every employee has a manager, except the CEO. He’s at the top of the food chain and no one has a higher position than him. And looking at the “ordinary” employee, you’ll see that he has a manager above him, but no employees beneath him. He’s at the bottom of the food chain so to speak.

But what about the CTE? Let’s write an easy example to show you how to add the employees last names and job titles together, without doing that in your final select statement:

WITH EmployeeList AS
(
    SELECT
		EmployeeID,
        EmployeeName
        + ' - ' +
        JobTitle AS EmployeeNameAndJobTitle
    FROM dbo.Employees
)


SELECT
	EmployeeID,
	EmployeeNameAndJobTitle
FROM EmployeeList;

 
If we analyse the example above, the select query that adds the employees last name and job title together becomes a new resultset that is called “EmployeeList”, and can be used just like a normal table in the query below the CTE. The result looks like this”

 
But remember: the CTE just exists in the scope of your query. This means that you can only run one select statement on the CTE. If you try to run a second select statement, you’ll see that the CTE will no longer exist.

Another thing you can do with a CTE, is directly add the column names to the CTE like this:

WITH EmployeeList (EmployeeID, EmployeeNameAndJobTitle) AS
(
    SELECT
		EmployeeID,
        EmployeeName
        + ' - ' +
        JobTitle
    FROM dbo.Employees
)


SELECT
	EmployeeID,
	EmployeeNameAndJobTitle
FROM EmployeeList;

 
This might be easier to read for you and your colleagues, instead of naming your column in the select with an “AS EmployeeID”. The result is exactly the same as in the example above.

This is just an easy example, but imagine that you can use this to join your aggregated temporary dataset from your CTE, onto another table that you need to build a final resultset.

 
Recursive
But now comes the best part: recursive CTE’s. To stick with this example, what if you want to combine the job title of the employee with the manager above him? This sort of resultsets can be created with a recursive CTE. In the example below, the job title of the manager is added to the employees job title:

WITH HierarchyCTE (EmployeeID, ManagerID, JobTitle, Hierarchy)
AS
	(SELECT
		EmployeeID,
		ManagerID,
		JobTitle,
		CONVERT(VARCHAR(35), '')
	 FROM dbo.Employees
	 WHERE ManagerID IS NULL

	 UNION ALL

	 SELECT
		E.EmployeeID,
		E.ManagerID,
		E.JobTitle,
		CONVERT(VARCHAR(35), H.JobTitle + ' \ ' + E.JobTitle)
	 FROM dbo.Employees E
	 INNER JOIN HierarchyCTE H
		ON H.EmployeeID = e.ManagerID
)


SELECT
	EmployeeID,
	ManagerID,
	JobTitle,
	Hierarchy
FROM HierarchyCTE;

 
This is what the result looks like:

 
What happens in the recursive CTE, is that the first query is used as an anchor, and a loop is executed to add every child-record to the anchor (parent)-record. If there are no child-records left for the parent, the next parent is fetched, and so on.

 
So what can you do with this?
The examples above were just easy examples, but you can imagine the opportunities with this functionality. For example, if you work for an electronics store, you could build a long string of all packages with sub-deliveries (when every phone is send with a screen protector and case for example). Or create a complete overview of contents for recipes (for example, which ingredients are put into a pie). The possibilities are endless…

T-SQL Toolbelt – Search for objects in databases – V 2.1.0

A few weeks ago, I received a message from an old colleague and friend Eric (Blog | @saidin). He wanted to know if I had a query in my (and I quote) “magic bag of SQL tricks”, to search through object in SQL server. The company he works for (he is a software developer, and independant consultant) wanted to change all stored procedures, that contained functionality to calculate VAT (Value Added Tax).

I remembered that a few years ago, I needed that same functionality, and I wrote a few scripts to search for specific dependencies in views and stored procedures. Next to a query that gets information from
sys.tables and sys.columns, I used these queries to get view and stored procedure content:

SELECT *
FROM sys.syscomments
WHERE text LIKE '%<SearchTerm>%'


SELECT *
FROM INFORMATION_SCHEMA.VIEWS
WHERE VIEW_DEFINITION LIKE '%<SearchTerm>%'

The first query uses information from sys.syscomments. Which, according to MSDN:

“Contains entries for each view, rule, default, trigger, CHECK constraint, DEFAULT constraint, and stored procedure within the database. The text column contains the original SQL definition statements.

The seconds query uses INFORMATON_SCHEMA, that contains SQL Server metadata (see MSDN article):

An information schema view is one of several methods SQL Server provides for obtaining metadata.

The VIEWS view (a view on all views?) returns a row for each view that can be accessed by the current user, in the current database. So this means that the view only returns rows for objects that you have permissions on.

Then I decided to write a script that does this in one query, and more… When working on this script, I thought about adding more functionality to it. Why wouldn’t you want to search for primary or foreign key columns, triggers, functions, etc? But adding more information to the resultset often means that the overview is lost along the way. Because of that I created a switch system. By setting a few bits you can turn on what you need to see, and turn off what you don’t want to see. This way the result is kept clean, and you’re not bothered with unnecessary information.

One of the issues I ran into is how to search for a specific string. Because I wanted to let the user enter the searchterm once, I needed to use a variable. But if you use a variable, and you ad a wildcard (%) as the first and last character, the query returns all objects. It has the same effect as returning all objects, instead of returning objects based on a specific searchterm.

So because of this, I used dynamic SQL to search through the list of objects. In dynamic SQL it’s possible to work with wildcards in a like statement. The only thing I needed to do is change one more table from memory to physical temp table, because it’s used in the dynamic SQL. Apparently dynamic SQL can’t use a memory table (DECLARE @Object TABLE) as a datasource.

So this is what I could build in the past few weeks. The only problem is that fixing issues that I found resulted in adding more and more new functionality to the script. With that in mind, I want to create a toolbelt with useful SQL scripts for myself. But of course, I want to share it with the community, so they can use it if they like.

So the upcoming weeks, I hope to build as much functionality in this script as I can. There are still a few wishes for the future, and a few features that I want to build in, just because they’re cool! For every new version, I will write a blog with releasenotes, so you’re aware of the changes in the script.

For further questions and wishes, please contact me via twitter or this blog. I’d love to hear your ideas and wishes, so that I can implement it in the script!

You can download the script by clicking on the image below.

Downloads

Version 2.1.0:


Row_Number: Unique ID in select statement

Last week I was asked by a colleague, Sander (Blog | @SanderKooij), “What’s the easiest way of adding a unique identifier (or row ID) to a result set?”. That’s an interesting question. There are several ways to do this. For example, you could insert the resultset into a temp table (physical or memory temp table), with an extra column (Identity). But what if you want to do this in a Select statement itself? Here’s a solution.

If you use the scripts I’ve added to this post (check Code Samples), you’ve created a table with country abbreviation codes. This table contains the data as shown below:

If you select the data, and want to add a record ID to your result set, you can use the following script:

SELECT
	DefaultAbbreviation,
	CountryProper,
	CountryNumber,
	ROW_NUMBER() OVER(ORDER BY DefaultAbbreviation ASC) AS RowNumber
FROM CountryAbbreviationCodes

The statement above uses the ROW_NUMBER built-in function to generate a row number for the result set. The ORDER BY clause in the functions is used to tell SQL Server what column(s) to use, to create a row number. In this case, the result set is ordered by the abbreviation column.

But what if you want to select the data with a specific row number? If you try to use the ROW_NUMBER function in the where clause, you get the following error:

“Windowed functions can only appear in the SELECT or ORDER BY clauses.”

So you need another solution. From SQL Server 2005 onwards we can use a Common Table Expression (CTE). With a CTE you can use a select statement as a table. So if you want to return the rows that are numbered 50 through 60, you can use the following query:

WITH OrderedCountries AS
(
	SELECT
		DefaultAbbreviation,
		CountryProper,
		CountryNumber,
		ROW_NUMBER() OVER(ORDER BY DefaultAbbreviation ASC) AS RowNumber
	FROM CountryAbbreviationCodes
)

SELECT
	DefaultAbbreviation,
	CountryProper,
	CountryNumber,
	RowNumber
FROM OrderedCountries
WHERE 1 = 1
AND RowNumber BETWEEN 50 AND 60

The result of this statement is the 10 rows we requested. So with a CTE we can use the ROW_NUMBER function to return specific rows from a table.

With the ROW_NUMBER functions, you can also delete duplicate records from your tables. If interested, post a comment, mail or tweet me, and I might write a blog post about it.


Code samples:
CreateSources.sql
SelectStatement.sql
CTESelectStatement.sql

SQL Server Temp Tables

In SQL Server we know 3 kinds of temporary tables: Local Temp Tables, Global Temp Tables and Table Variables. In this blog I will try to explain the differences, and tell you how to use it.
 
Local Temp Tables
The local temp table is the most commonly used temp table. This temp tables is created with the following statement:

CREATE TABLE #TempTable
     (ID INT IDENTITY(1,1) NOT NULL,
      Description VARCHAR(10) NULL)

 
The table that the script above created is a temporary table that is stored on disk. To be exact, it’s stored in the TempDB. The table can only be reached from within the same scope. It will be cleaned automatically, but it’s more wise to clean it up yourself by using a DROP TABLE statement.

A session specific identifier is added to the name, to make sure that temp tables from other sessions don’t use the same name. If you query the sys.tables, you will see a table name similar to this:

#TempTable___________________________________________________________________

_______________________________________000000000005

The benefits of using a temp table are (amongst others): reduction of required locks (because the user is the only one who has access), less logging necessary, easy debugging (in some cases), etc. One downside is that the objectname is limited to 116 characters.
 
 
Global Temp Tables

Global temporary tables work just like local temporary tables (stored in TempDB, less locking necessary). However, they are visible to all sessions, until the session that created the object goes out of scope and all referring sessions to the object go out of scope. Also, the name of the object will remain the same as you declared it. So if you query sys.tables in the database, you will notice that (instead of the script above) the name is still TempTable.

A global temp table can be created with the following statement:

CREATE TABLE ##TempTable
     (ID INT IDENTITY(1,1) NOT NULL,
      Description VARCHAR(10) NULL) 

 
 
Most of the time you will not encounter these Global temp tables “in the wild”, because permanent tables are mostly preferred.
 
 

Table Variables

A temporary table can also be created in memory. This has several advantages: less locking then permanent objects, performs slightly better then temporary- or global tables, not as much logging necessary, etc. Table variables are cleared automatically when the procedure, function or query goes out of scope.

There are a few disadvantages and/or demands for using table variables:
    •They are allowed in SQL Server 2000 and above, and databases with Compatibility level 80 or higher
    •They cannot be used for “SELECT * INTO”-queries
    •They cannot be changed after declaration. They need to be recreated
    •They cannot be truncated
    •They cannot contain: User Defined Function (UDF), User Defined Type (UDT), Computed Column or Default      Constraint
    •They cannot be dropped. You need to let them go out of scope
    •They cannot be created dynamically (“SELECT * INTO @TempTable”-statement), or used in a dynamic SQL      statement
    •They are not included in statistics by SQL Sever, and you cannot create it manually
    •They don’t use parallelism when you use an “INSERT INTO”-statement
    •They will always have a cardinality of 1, because the table doesn’t exist at compile time. Cardinality refers to the      uniqueness of a column. The lower the cardinality, the more duplicated items in a column
    •They must be referenced by an alias if you join the object in a query. If you don’t the compiler will tell you the object      doesn’t exist

Conclusion
As always, there is no right or wrong answer. In most cases you will use the Table Variable or Local Temporary Table. The right answer in your case will be a judgment call between performance and usability. There are a few rules of thumb you can use to determine which type of object suites your needs:
    •If you have less then 100 rows, generally you want to use a Table Variable
    •If you need an index on your object, use a Temporary Table (Local or Global)

Follow

Get every new post delivered to your Inbox.

Join 41 other followers