Write readable and high-performance queries with Window Functions

In SQL Server 2008, we gained a new and powerful feature in our bag of T-SQL tricks: Window Functions. The heart of a Window Function is the OVER clause, which allows you to define partitions or “groups” in your query before applying another function. In practice, this means you can create groups in your dataset that can be molded further by applying functions to those groups.

In the past, I’ve written a number of posts about Window Functions in SQL Server:

Row_Number: Unique ID in select statement
Calculating Running Totals
Removing duplicates in a dataset

But there’s more!

 
Finding missing numbers in sequence
How many times have you looked at a table and noticed that one of the values in a sequence is missing? Or a date range in a table where someone deleted a few records, so all of a sudden your year doesn’t have 365 days (or 366 in a leap year, or 365.242199 days if you want the astronomical average), but 315 days. There go your management reports that are calculated on a per-day average. So how do you find the missing dates, without having to write a cursor or create a full-blown date table like your BI colleagues do? You guessed right: a query with a Window Function!

Let’s start with declaring a table, and insert a sequence with missing values:

DECLARE @Sequence TABLE
  (Number INT)


INSERT INTO @Sequence
  (Number)
VALUES
  (1),
  (10),
  (7),
  (4),
  (2),
  (8),
  (5)

 
So how can we get the “nearest numbers” from that table with just a single select statement?

SELECT
  LAG(Number, 1, 0) OVER (ORDER BY Number) AS LAG_Value,
  Number,
  LEAD(Number, 1, 0) OVER (ORDER BY Number) AS LEAD_Value
FROM @Sequence AS S
ORDER BY Number ASC

 
LAG and LEAD are standard T-SQL functions from SQL Server 2012 onwards. These functions give you the opportunity to access the previous or next row, without the need for a so-called “self-join”. So what you see is the number, the value preceding it (LAG) and the value following it (LEAD). In this case, Number 2 is preceded by Number 1, and followed by Number 4.
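To actually find the missing numbers, you can build on LEAD: whenever the next value is more than one higher than the current value, there’s a gap. Here’s a minimal sketch against the @Sequence table variable from the script above (run it in the same batch); the CTE and column names are just my own choice:

;WITH NextNumbers AS
(
  SELECT
    Number,
    /* Default to the current Number, so the last row doesn't produce a false gap */
    LEAD(Number, 1, Number) OVER (ORDER BY Number) AS NextNumber
  FROM @Sequence
)
SELECT
  Number + 1 AS GapStart,
  NextNumber - 1 AS GapEnd
FROM NextNumbers
WHERE NextNumber - Number > 1

For the sample data above, this returns the gaps 3, 6 and 9.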

 
The lemonade stand
Now let’s look at another example. When you grew up, you probably wanted to save money for a new mobile phone, right? In my case it was either a Discman, a Walkman, or a stereo set. But let’s stick with the modern equivalent of the Walkman for now: the MP3 player. To earn money for the MP3 player, our fictitious friend Joe decides to start a lemonade stand. He needs to save up at least $150 to buy a new MP3 player. So every glass of lemonade he sells is accounted for, and at the end of the day he sums up all his earnings and puts them into a table:

DECLARE @Profit TABLE
  (DayNumber INT,
   Sales DECIMAL(10,2))


INSERT INTO @Profit
  (DayNumber, Sales)
VALUES
  (1,  6.90),
  (2,  4.17),
  (3,  2.69),
  (4,  7.26),
  (5,  2.93),
  (6,  8.98),
  (7,  7.25),
  (8,  5.88),
  (9,  1.51),
  (10, 7.97),
  (11, 3.44),
  (12, 3.76),
  (13, 9.96),
  (14, 0.92),
  (15, 8.28),
  (16, 6.05),
  (17, 9.40),
  (18, 4.03),
  (19, 9.14),
  (20, 7.25),
  (21, 0.06),
  (22, 9.12),
  (23, 7.39),
  (24, 6.57),
  (25, 4.54),
  (26, 0.09),
  (27, 4.42),
  (28, 9.53),
  (29, 5.09),
  (30, 0.89)

 
So as you can see, he earns quite a lot of money this way! But because he’s eager to buy his new MP3 player, he wants to see his day totals, and the amount he still needs to buy his new toy. And because Joe is a really smart guy, he doesn’t want to do this with a lot of self-joins, and he wants his results fast. So looking at performance, what is the easiest way to query this data? How about this:

DECLARE @Goal DECIMAL(10,2) = 150.00


SELECT
  DayNumber,
  Sales,
  @Goal - SUM(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MoneyNeeded
FROM @Profit

 
He declares a “goal” variable that is set to the amount he needs for his new gadget. Then, for every row, we subtract from that goal the sum of all sales from the first day (UNBOUNDED PRECEDING) up to and including the current day (CURRENT ROW). After day 28 he has earned enough to buy his MP3 player. But now he wants to know what his average sales were. So he calculates the average of his sales, based on every sale he’s made so far:

SELECT
  DayNumber,
  Sales,
  AVG(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS AverageSales
FROM @Profit

 
So where does it stop?
So now that we’ve seen the SUM and AVG options, what do we have left? How far can we take this? Thinking about it, how about a daily check to see whether we hit a new lowest or highest sales amount? We can do this with the MIN and MAX functions in the same kind of query:

SELECT
  DayNumber,
  Sales,
  MIN(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS LowestSale
FROM @Profit


SELECT
  DayNumber,
  Sales,
  MAX(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS HighestSale
FROM @Profit

 
Now Joe can save his daily sales in the evening, and check whether he had a very good or a very bad day.

 
Not only for the average Joe…
So how can the business profit from all this? In some cases, Window Functions make it easier to produce a certain resultset. In other cases they even give you a whole new way to output data with a single, well-performing query, which was impossible before. So if you’re running SQL Server 2008 or higher, start using (or at least start exploring) Window Functions right away!

 

To generate the random floats, I’ve used the generator of FYIcenter.com

If you want to read more about this topic, don’t forget to check out these blog posts:

- Julie Koesmarno: ABC Classification With SQL Server Window Function
- Mickey Stuewe: A Date At The End of The Month
- Chris Yates: Windows functions who knew

T-SQL Tuesday #50 – Automation: yea or nay

T-SQL Tuesday is a recurring blog party that was started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Automation”. If you want to read the opening post, please click the image below to go to the party-starter: Hemanth D. (Blog | @SqlChow).



 
Being a DBA, you want to automate as many processes as you can, in order to save time that you can spend on more important things. But have you ever considered that you might be over-automating your processes?

 
We’re safe! What can go wrong here?
At one of the companies I worked for, they thought they had everything sorted out. Indexes were rebuilt every day or every week (depending on the database), databases and log files were shrunk, databases were checked for corruption, backups were running, etc. They felt safe, knowing that if something happened they could handle any situation SQL Server would throw at them. It would blow up in their faces eventually…

One of the first things I checked were the backups. The backup job was running, but only a few databases were actually selected for backup. And the biggest database (500+ GB), which was pretty important, was skipped because it took too long to back up. And guess what: they didn’t EVER test recovering from a backup, because of a lack of disk space and time. And there you have it: a false sense of safety!

I don’t have to tell you not to shrink your databases and log files. Everybody knows that every time you shrink your database, a kitten dies… Or an index dies… Or the soul of your database… I’m not sure which one, but take your pick. It causes (and I quote Paul Randal (Blog | @PaulRandal) on this!): “*massive* index fragmentation”. Read more about that over at Paul’s blog. Besides that, if your next query needs more space in a data or log file, you’ll see extra wait time because of file growth.

The indexes were rebuilt every night on the important databases, and every weekend on the less used databases. But they never checked whether the problem they originally had was actually fixed by this solution.

Also, the corruption check was run only on user databases. They had never heard of running a corruption check on the system databases. The system databases were in the backup process, but they never took the time to check whether they could restore them, or whether they were backing up an already corrupted database.
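Checking the system databases yourself doesn’t have to be complicated. A minimal sketch (in real life you would of course schedule this in a maintenance job):

DBCC CHECKDB ('master') WITH NO_INFOMSGS
DBCC CHECKDB ('model') WITH NO_INFOMSGS
DBCC CHECKDB ('msdb') WITH NO_INFOMSGS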

 
Focus on the important stuff
So instead of automating all your processes, maybe you should focus on what’s really important. You just automated your backup process. But does it run every time? Are the backups actually written to disk? Can you restore one of the backups you created?
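A quick way to verify that your backups actually ran, is to check the backup history SQL Server keeps in msdb. Something like the sketch below lists the last successful full backup per database; databases that have never been backed up show up with a NULL date:

SELECT
  d.name AS DatabaseName,
  MAX(b.backup_finish_date) AS LastFullBackup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
  ON b.database_name = d.name
  AND b.type = 'D' /* Full backups only */
GROUP BY d.name
ORDER BY LastFullBackup ASC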

What I’m trying to say is: you can automate tasks whenever and wherever you like, but don’t forget to test them. Once you’ve automated something, plan regular tests to verify that the automated process runs the way you expect it to. And is the end result really the result you want and expect?

 
Don’t reinvent the wheel
Another tip: don’t reinvent the wheel. Other people have encountered the same issue, and have written or blogged about a solution. So before you build your own maintenance solution, or automate health status reports, check with your community members. Help can be found for almost every problem, but verifying that the solution works for you is all on you.

T-SQL Tuesday #49 – Wait for it…

T-SQL Tuesday is a recurring blog party that was started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Wait for it…”. If you want to read the opening post, please click the image below to go to the party-starter: Robert Davis (Blog | @SQLSoldier).



 
Explaining to developers how SQL Server works is something we all do. Maybe not on a daily basis, but you’re asked questions like “why is my index not working?”, or “what’s the best way to add multiple columns to a table?”. And most of the time, these questions lead to a whole bunch of other questions you need to answer. And there’s one question we’re all asked more than once: “why is my query running slow?”. So where do you start explaining?

 
Wait Types
There are lots and lots of wait types to be found in SQL Server. SQL Server 2005 has 230 different wait types, SQL Server 2008 has 475, and SQL Server 2008 R2 has 491. SQL Server 2012 added another 197 new ones to the list. The wait types can be found by running this query:

SELECT wait_type
FROM sys.dm_os_wait_stats
ORDER BY wait_type ASC
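The same DMV also tells you which wait types your server has spent the most time on since the last restart. A simple sketch, ordered by total wait time:

SELECT TOP 10
  wait_type,
  waiting_tasks_count,
  wait_time_ms,
  signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC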

 
These wait types tell you what SQL Server is doing to execute your statement, and where the possible delays are. I’m not going to sum up all the wait types, but here’s a short list of common wait types you’ll see on your SQL Server:

 
SOS_SCHEDULER_YIELD
Yielding processor time

LCK_M_*
Waiting for a lock

OLEDB
Wait on the OLEDB provider (Linked servers, Full-Text Search)

WRITELOG
Writing transaction log to disk

RESOURCE_SEMAPHORE
Waiting for a query memory grant

CXPACKET
Query parallelism

PAGEIOLATCH_*
Latch on a memory address while data is retrieved from disk

LAZYWRITER_SLEEP
System process waiting to start

 
All these different wait types could indicate a problem with your statement or with the server. Some are merely informative, while others point to a real issue. But what I’d really like to show you, is how you can find these wait types.

 
DIY or just ask for help…
One of the ways to find the wait types on your SQL Server is to dive into the seemingly endless list of DMVs. You could use the “sys.dm_exec_requests” and “sys.dm_os_waiting_tasks” DMVs to find what you want, or you could take the easy way out: sp_WhoIsActive by Adam Machanic (Blog | @AdamMachanic).
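If you do want to take the DIY route, a minimal sketch with those DMVs could look like the query below. It shows the sessions that are currently waiting, what they’re waiting on, and the statement they’re running (the WHERE clause is just a rough filter to hide most system sessions):

SELECT
  wt.session_id,
  wt.wait_type,
  wt.wait_duration_ms,
  wt.blocking_session_id,
  st.text AS statement_text
FROM sys.dm_os_waiting_tasks AS wt
INNER JOIN sys.dm_exec_requests AS er
  ON er.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
WHERE wt.session_id > 50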

Adam (also the party starter of T-SQL Tuesday) wrote a nothing short of brilliant script to find problems on your server. But how does it work?

 
Installing sp_WhoIsActive
The “installation” of sp_WhoIsActive couldn’t be easier. You just download the script and run it. This creates a stored procedure in the database of your choice. Usually I just create it in the master database, but if you have a DBA database with useful scripts, it’s okay to create it there.

 
Running it for the first time
The stored procedure can be executed without any parameters. That way, you use the default options. Just run the statement shown below:

EXEC master.dbo.sp_WhoIsActive

 
If you need it, or just like to see more information, you can also configure the procedure with a lot of parameters. If you want to see all the options you can configure, just set the documentation parameter to 1 (true):

EXEC master.dbo.sp_WhoIsActive
  @help = 1

 
The options
If you start using sp_WhoIsActive more and more, you’ll develop your own set of favorite options. It all depends on the purpose you’re using the procedure for. Most of the time, I use it to determine why queries run slow, or why the overall performance of the SQL Server is poor.

The information sp_WhoIsActive retrieves gives you a good indication of what SQL Server is doing, or which queries are bugging each other. I’ll list my favorite options below:

First, I set @show_own_spid on, so I can see my own query in the resultset.

The second option I love is @get_plans. This shows you the execution plans of the running queries:

 
Another great parameter to set is @get_outer_command. That way, you won’t just see the query running at the moment, but also the outer-command of the query (in the example below, the INSERT INTO the temp table is executed from within the stored procedure you see in the right column):

 
To see which transaction logs are used when running your query, set @get_transaction_info to 1:

 
Information regarding locks can be found, by setting @get_locks to 1:

 
If you click the XML, you’ll see which locks are granted, pending or denied:

 
The last option I’d like to set, is @get_additional_info. This will show you more information regarding the connection settings, session variables, etc:

 
Clicking the XML shows you the properties I mentioned above:

 
So this is what the query looks like, the way I personally like to use it:

EXEC master.dbo.sp_WhoIsActive
  @show_own_spid = 1,
  @get_plans = 1,
  @get_outer_command = 1,
  @get_transaction_info = 1,
  @get_locks = 1,
  @get_additional_info = 1

 
Conclusion
Wait types are your keys to open the door to the next level of SQL Server. Not all wait types are easy to read and understand, but there are plenty of resources to be found online. For example, just take a look at the rest of the posts today. Most of the posts for T-SQL Tuesday can be found on Twitter, if you search for #TSQL2sDay.


I want to say thanks to the employees at Coeo for the easy explanation of some of the wait types!

Incremental updates with Change Data Capture

When designing a database or ETL process, for example loading your production data into a reporting environment, you always start your design with performance in mind. In the beginning of the project, your scripts and ETL run blazing fast. But after a few months in production, the entire project grinds to a halt. So how do you fix that problem without a complete redesign of your applications and database? One of the many solutions is an easy one: incrementally load your data into the destination tables.

 
Change Data Capture
Incremental data loading can be a hard nut to crack. It’s not always an option, but it might be a good point to start from. One of the ways to start loading your data incrementally is by using a key in your database as a reference. If your table has a column called “Modified Date” that is updated every time the record is updated, you could use that: every night, when the process runs, you just add the records that were modified after the last successful process run (a small sketch of that approach follows below). But what if you don’t have that possibility? Change Data Capture (CDC) is an easy way out.
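A minimal sketch of that “Modified Date” approach: the table and column names are made up for the example, and in a real ETL process you would read and store @LastRunDate in some kind of metadata table:

DECLARE @LastRunDate DATETIME = '2014-01-01' /* Normally read from your ETL metadata */

INSERT INTO Reporting.dbo.Orders
  (OrderID, OrderDate, Amount)
SELECT
  OrderID, OrderDate, Amount
FROM Production.dbo.Orders
WHERE ModifiedDate > @LastRunDate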

 
CDC is a way to record inserts, updates and deletes on a specific table, without the need to write the triggers yourself. CDC reads the transaction log, and captures all changes made to the specific table. These changes are stored in an associated change table that is created by CDC.

Below I’m going to show you how to set up your first table with CDC. If you would like to know more about CDC, this TechNet article is a good place to start.

 
Create an example
To show you the basics of CDC, let’s start with creating a table called TestCDC in a database called Sandbox:

USE Sandbox
GO

CREATE TABLE dbo.TestCDC
  (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
   Descr varchar(50) NULL)
GO

 
Once you’ve created the table, turn on CDC at the database level by executing the system stored procedure created for that:

EXEC sys.sp_cdc_enable_db

 
There is also a system stored procedure to enable CDC at the table level. You need to enable CDC manually, and separately, for every table you want to track:

EXEC sys.sp_cdc_enable_table
  @source_schema = 'dbo',
  @source_name = 'TestCDC',
  @role_name = NULL

 
If the SQL Server Agent is running on your machine or server, you’ll see this confirmation (I’ll explain later why SQL Server Agent is needed):

 
If the Agent isn’t running, you’ll see this warning:

 
If you ran the enable table statement, you will see that SQL Server created the system objects needed to track changes in the table:

 
Because CDC uses 2 SQL Server Agent jobs to capture and clean up the change tables, the Agent needs to be running to start the data capture. If the jobs aren’t running, SQL Server won’t capture any changes made:

 
Start data changes
In order to see what happens when you change data, let’s insert some records:

INSERT INTO dbo.TestCDC
  (Descr)
VALUES
  ('This is a description')

INSERT INTO dbo.TestCDC
  (Descr)
VALUES
  ('This is a description too...')

 
And let’s update one of those 2 inserted records:

UPDATE dbo.TestCDC
SET Descr = 'UPD - ' + Descr
WHERE ID = 2

 
Now, let’s check the content of both the original table, and the change table:

/* Original table */
SELECT * FROM dbo.TestCDC

/* Change table */
SELECT * FROM cdc.dbo_TestCDC_CT

 
If you run both queries, you’ll see the resultset below:

 
The records in the CDC change table allow you to update the data in your reporting environment. You could query them yourself, by retrieving all the changes since your last update. You can also use the functions that return those changes for you, for example cdc.fn_cdc_get_net_changes_<capture_instance>. You can read more about that system function here.
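A sketch of how you could retrieve those net changes for the example table, using the LSN helper functions that CDC provides (this assumes net-change support was enabled, which is the default for a table with a primary key):

DECLARE @From_LSN BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_TestCDC')
DECLARE @To_LSN BINARY(10) = sys.fn_cdc_get_max_lsn()

SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_TestCDC(@From_LSN, @To_LSN, 'all')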

 
Cleaning up after an update
Now that you’ve updated your reporting environment, it’s wise to clean up your CDC data. You could drop the records yourself with a DELETE statement, but another option is to use the system procedure for that: “sys.sp_cdc_cleanup_change_table”. You can clean up your data using the following SQL statement:

DECLARE @Last_LSN VARBINARY(10) =
  (SELECT MAX(cdc.dbo_TestCDC_CT.[__$start_lsn]) FROM cdc.dbo_TestCDC_CT)

EXEC sys.sp_cdc_cleanup_change_table
  @capture_instance = 'dbo_TestCDC',
  @low_water_mark = @Last_LSN,
  @threshold = 5000

 
The query will retrieve the last LSN (Log Sequence Number), and remove everything that happened before that.

 
Cleanup of CDC
If you want to completely remove CDC (because we’re done testing), you can disable it on the table level by running the query below:

EXEC sys.sp_cdc_disable_table
  @source_schema = 'dbo',
  @source_name = 'TestCDC',
  @capture_instance = 'dbo_TestCDC'

 
That statement will clean up all the objects that were created to enable CDC on that specific table, but it only disables CDC on that one table. The fastest way to disable CDC on all tables in the database is to disable CDC on the database level, by running the query below:

EXEC sys.sp_cdc_disable_db

 
Conclusion
Loading data always takes time, and there are many factors that are important: your database size, your frequency of changes, your ETL process, etc. The time it costs you to move data can be reduced by rewriting your process to use incremental loads. CDC is one of the many ways to achieve this. It works out of the box, and doesn’t require you to build any process yourself. But maybe your environment needs a custom process to operate the way you want it to. Not every feature in SQL Server is a so-called silver bullet, but sometimes it comes darn close to one…

Data paging using offset

With every new release of SQL Server, we get new features that make our lives as developers and administrators so much easier. A few days ago, I came across an old piece of code (SQL Server 2005 if I remember correctly) that I used to page data for a CMS I built when I was still a web developer.

The company I worked for needed a new website, and wanted an HTML-editor to edit content on the website. This content was stored in SQL Server, and was retrieved by the website. With a few business rules we decided which content was visible, and which content was hidden from the website.

One of the features of the website was a news feed. But because there were so many news articles, we needed to show the top 10 articles on the first page, and let the user click through to the next page of articles. Every page should show the next 10 articles, so we needed to page the rest of the data.

 
Creating the sample data
In order to show you the problem and solution, we need to create a sample table:

CREATE TABLE dbo.Paging
  (ID INT IDENTITY(1,1),
   Title VARCHAR(50),
   Content VARCHAR(50))

 
The test data we need to insert looks the same for every record:

INSERT INTO dbo.Paging
  (Title, Content)
VALUES
  ('This is an article', 'This is the content')
GO 50

 
This script will insert 50 records into the Paging table.

 
The old way
In older versions of SQL Server you needed to build your own solution for this problem. Let’s assume you clicked the news feed button on the website, and we want to switch to page 2 of the results. The solution I built back then looked something like this:

DECLARE @RowsToShow INT = 10,
        @RowsToSkip INT = 10


SELECT TOP(@RowsToShow)
  ID ,
  Title ,
  Content
FROM dbo.Paging
WHERE ID NOT IN
  (SELECT TOP(@RowsToSkip) ID FROM dbo.Paging ORDER BY ID ASC) -- TOP needs an ORDER BY to skip a deterministic set of rows
ORDER BY ID ASC

 
In the query above, you can see we skip the first 10 rows, and retrieve the next 10 rows after that. That means the website needs to remember which records were already retrieved, and which records you want to see next. The easiest way to do this is by excluding the IDs you’ve already shown, and retrieving the next set of rows.

This means you get execution plans like this:

 
The new way
From SQL Server 2012 onwards, we can use a new feature called OFFSET. This feature allows us to “window” our dataset, and retrieve a small subset of the data without using a TOP and a subquery like in the example above. The new query looks like this:

SELECT
  ID ,
  Title ,
  Content
FROM dbo.Paging
ORDER BY ID ASC
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY

 
In the query above, you can see an offset of 10 rows, and a fetch of 10 rows. This means it skips the first 10 records, and retrieves the 10 records after that. But how can you get this to work with dynamic resultsets and pages? This is one way to do it:

DECLARE @RowsPerPage INT = 10,
        @PageNumber INT = 2

SELECT
  ID ,
  Title ,
  Content
FROM dbo.Paging
ORDER BY ID ASC
OFFSET ((@PageNumber - 1) * @RowsPerPage) ROWS
FETCH NEXT @RowsPerPage ROWS ONLY

 
The offset is calculated by taking the @PageNumber parameter minus one, so that page 1 starts at the first record; if we didn’t do that, the offset would always skip the first 10 records. We then multiply that number by the @RowsPerPage parameter, to calculate how many rows to skip.

After that, we use the @RowsPerPage in the FETCH NEXT clause to retrieve the number of rows we want (in this case the next 10). This results in a completely different execution plan:

 
As you can see, this has a lot less impact on SQL Server. This becomes really visible if you compare both in SSMS:

 
I/O Costs
Comparing query costs is nice when you quickly compare 2 different approaches for the same solution, but in the end it all comes down to I/O costs. So which query is the fastest solution, and how are we going to test that?

First of all, we need to measure the I/O calls. We do that by using a DBCC command called DROPCLEANBUFFERS, which allows us “to test queries with a cold buffer cache without shutting down and restarting the server”. SO DON’T USE THIS IN PRODUCTION!!!

So the complete test script looks like this:

SET NOCOUNT ON
SET STATISTICS IO ON


DBCC DROPCLEANBUFFERS

--==================================================
DECLARE @RowsToShow INT = 10,
        @RowsToSkip INT = 10


SELECT TOP(@RowsToShow)
  ID ,
  Title ,
  Content
FROM dbo.Paging
WHERE ID NOT IN
  (SELECT TOP(@RowsToSkip) ID FROM dbo.Paging ORDER BY ID ASC)
ORDER BY ID ASC
--==================================================

DBCC DROPCLEANBUFFERS

--==================================================
DECLARE @RowsPerPage INT = 10,
        @PageNumber INT = 2

SELECT
  ID ,
  Title ,
  Content
FROM dbo.Paging
ORDER BY ID ASC
OFFSET ((@PageNumber - 1) * @RowsPerPage) ROWS
FETCH NEXT @RowsPerPage ROWS ONLY
--==================================================

So we clean the SQL Server buffers, run the first query, clean the buffers again, and run the second query. Now the effect of the new statement becomes really obvious if you look at the I/O costs:

 
So the old version of the query (with the sub-select) scans the table twice, and reads 51 pages from the cache. The new approach (with the OFFSET) scans the table only once, and reads only 1 page from the cache.

 
Conclusion
The fewer I/O calls SQL Server needs to retrieve the result from disk or cache, the faster your query will run. In this case, we’ve tuned the query from 51 page reads down to 1 page read. And we’ve only tested this on a table with 50 records, yet the amount of data read already decreased significantly. So that’s a lot of performance improvement from just one new piece of functionality. And there is a lot more out there.

T-SQL Tuesday #43 – Hello, Operator?

T-SQL Tuesday is a recurring blog party that was started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Hello, Operator?”. If you want to read the opening post, please click the image below to go to the party-starter: Rob Farley (Blog | @rob_farley).



 
Execution Plan
Everybody who ever got serious with SQL Server knows about execution plans. And like Rob said in his opening post: if you don’t, start looking into them! But when you start working with execution plans, it gets confusing really fast. All those weird “building blocks” that tell you what SQL Server did with your query?! It almost looks like magic…!

But when you dive into them, they become easier to read, and suddenly they’re somewhat understandable. But hey, even after all those years of looking at execution plans, some things still amaze me.

Last week I was asked to look at a query that was used in an SSIS package. This package was one of the slowest in the nightly ETL processes. So I started off by looking at the execution plan:

 
Start pinpointing the first issue
Then I started digging, and I noticed this:

LEFT JOIN DB.dbo.T1 m
	ON m.Col1 = ISNULL(LOWER(t.Col1) COLLATE Latin1_General_CS_AS, '(N/A)')
LEFT JOIN DB.dbo.T2 s
	ON s.Col2 = ISNULL(LOWER(t.Col2) COLLATE Latin1_General_CS_AS, '(N/A)')
LEFT JOIN DB.dbo.T3 k
	ON k.Col3 = ISNULL(LOWER(t.Col3) COLLATE Latin1_General_CS_AS, '(N/A)')

 
But wait a minute… The collations we use aren’t case sensitive, so why use the LOWER() function at all? And the collations of the 2 databases used in the query are practically the same (“SQL_Latin1_General_CP1_CI_AS” and “Latin1_General_CI_AS”), while the column collations on both sides are identical (“SQL_Latin1_General_CP1_CI_AS”). SQL Server doesn’t throw an error if I join both sets, and the results aren’t different whether I use COLLATE or not. So we learned another thing: it’s the column collation that is used in the ON clause of the JOIN. Both functions aren’t necessary because the collations are equal, and thus they’re just wasting our precious CPU cycles…
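If you want to check the collations yourself, you can look them up on both the database and the column level. A quick sketch, using the database and one of the columns from the JOIN above (adjust the names to your own schema):

SELECT DATABASEPROPERTYEX('DB', 'Collation') AS DatabaseCollation

SELECT c.name, c.collation_name
FROM DB.sys.columns AS c
INNER JOIN DB.sys.tables AS t
  ON t.object_id = c.object_id
WHERE t.name = 'T1'
  AND c.name = 'Col1'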

Now here comes the moment you’ve all been waiting for. How does the execution plan differ? This is what the new execution plan looks like:

So what are we missing? You guessed right! We’re missing this bad boy:

So what does the “Compute Scalar” actually do? According to MSDN:

 

The Compute Scalar operator evaluates an expression to produce a computed scalar value

So that means it computes a new value for every row that passes through it. This calculation can be as simple as a conversion of a value, or a concatenation of multiple values. Most of the time these “Compute Scalar” operators aren’t a major cost in your execution plan, but they can become an issue that uses a lot of CPU.

But what does this “Compute Scalar” do in this query? If I remove the ISNULL() from the query, it’s still there. It only disappears if I remove the COLLATE. So in this case, the “Compute Scalar” is “calculating” a new value for every record; that “calculation” is actually a conversion from one collation to the other.

 
But what’s the difference?
If we compare both versions of the query, the modified version has a lot less impact on the server:

And this is just from removing the LOWER() and COLLATE functions. With a little more effort you can rewrite another part of the query, maybe add an index, etc. So as you can see, small changes can have a huge impact!

 
Conclusion
A “Compute Scalar” isn’t that hard to understand, but you’ll find it very often in your execution plans. So even though it isn’t the worst performance blocker in the world, it’s an interesting one to get familiar with.

If you don’t use SQL Sentry Plan Explorer yet, download your copy now!

Moving system database: rocket science or piece of cake?

Being a DBA often makes you the “Default Blame Acceptor”, according to Buck Woody (Website | @buckwoody). This means that everything is your fault by default. Server broke down? Your fault! Database corrupt? Your fault! Query of a user doesn’t compile because of a syntax error? Yeah, you guessed right… Your fault!

But on the other hand, you have a lot of opportunities to find out the best practice for doing things. An example of that is moving a system database. About two weeks ago we decided to order 4 SSDs for our SQL Server. We plan to store tempdb and the SSAS data on these disks, hoping that it will reduce resource costs in our environment.

So with no experience of moving system databases, I started thinking about how to do this. You probably need to stop SQL Server, move the MDF and LDF files, change the start-up options of SQL Server, start the service, hope that SQL Server finds the new location, etc. But after a quick peek I found a much simpler solution: just modify the current file location!

 
Check the current location and file sizes
Before moving your database (in this case I’m moving my tempdb), run the query below, and store the result just in case all goes south:

SELECT name, physical_name, state_desc, (size * 8 / 1024.00) AS InitialSize
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

 
The reason you also want to store the initial sizes, is that if you restart the SQL Server service (one of the next steps), SQL Server may recreate the files at their default sizes. And you don’t want to run on those default settings of course!

 
Set the new file location
You can set the new file location for your tempdb by running the query below. In this example I’m moving my data and log files to the D:\ volume of my machine:

USE master
GO

ALTER DATABASE tempdb 
	MODIFY FILE (NAME = tempdev, FILENAME = 'D:\MSSQL\DATA\tempdb.mdf')
GO
ALTER DATABASE tempdb 
	MODIFY FILE (NAME = templog, FILENAME = 'D:\LOG\templog.ldf')
GO

 
After executing this statement, you’ll see a message like this appear in the Messages window:

 

The file “tempdev” has been modified in the system catalog. The new path will be used the next time the database is started.
The file “templog” has been modified in the system catalog. The new path will be used the next time the database is started.

 
So the file location is altered, but the running values are not changed until your machine is rebooted, or the SQL Service is restarted.

Now just restart the SQL Service (or the machine if you like to), and run the first query again. This way you can check if your tempdb is stored in the right folder, and if the initial sizes are correct:

SELECT name, physical_name, state_desc, (size * 8 / 1024.00) AS InitialSize
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

 
The service should stop and start without issues. After the restart you’ll see that SQL Server created a new MDF and LDF file at the new file location. After a successful restart, you can delete the MDF and LDF files from the old location.
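If the initial sizes did reset, you can put them back with the same kind of ALTER DATABASE … MODIFY FILE statement, using the values you stored with the first query. A sketch (the sizes below are just an example):

ALTER DATABASE tempdb
	MODIFY FILE (NAME = tempdev, SIZE = 8192MB)
GO
ALTER DATABASE tempdb
	MODIFY FILE (NAME = templog, SIZE = 2048MB)
GO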

 
Now, was that so hard?
So as you can see, not all changes in SQL Server are rocket science. One thing I’ve learned, is that from now on, I’m not going to assume the worst, and hope for the best!

 
UPDATE
As Pieter Vanhove (Blog | @Pieter_Vanhove) mentions in his tweets about msdb/model and master, in some cases you need to do a little bit more work. Because tempdb is the system database least likely to break SQL Server (it’s recreated when the SQL Server service starts), you can move it just by changing the settings.

On the restart, the “Running values” (current settings) are overwritten by the “Configured values” (new settings) you set with the query you ran, and tempdb is recreated.

But the other system databases require a little bit more effort. If you want to move master, model or msdb, check out this link.

And thank you Pieter, for pointing out this stupid mishap to me!

Deadlock Detector: Drop it like it’s hot!

Last time I talked about the right tool for the right job. After that post, I noticed that during the nightly backups and job runs, the SQL Server log filled up with deadlocks. Hoping to find the cause of this issue, I installed SQL Solutions’ Deadlock Detector. But I must warn you: it’s easier to install it than to get rid of it!

 
Downloading it
The drama starts with downloading the tool. On the download page, there’s no mention of a trial version. They do like to tell you that it’s free… A lot of times… But guess what… It isn’t!

 
Installing it
After downloading the tool, I started the installation. Again, there’s no mention of any kind that I’ve just downloaded a trial version, and that you need to pay for the tool after 7 days.

 
Use it
After the installation I started the tool. The tool shows you for the first time that you’re actually running a trial:

After I clicked “Try”, I was prompted to enter a SQL Server instance to run the tool on. After giving it my SQL Server instance name, it showed a pop-up really quickly. Apparently it installed a bunch of crap on my SQL Server instance! Again, there is NO mention that it needs to install objects in your msdb! Also, SQL Server Agent needs to run on your instance, because Deadlock Detector can’t run without it.

So what does Deadlock Detector install? I’ve created a screenshot of all the objects that are installed:

 
Uninstalling it
Okay, the tool wasn’t what I was looking for, so I wanted to uninstall it. But how do I get rid of all those objects that were created in my msdb database?

First I uninstalled the tool, hoping that it would also drop all the objects it created. But at the end of the uninstall, it sent me to this webpage. So I’m supposed to download a special uninstaller to get rid of what the installer left behind?

Okay, so I downloaded the special installer-uninstaller, and ran it. Close, but no cigar… The tool ran, returned no errors, but the objects still existed. Running it again ended with the same results.

To check it yourself, run this query:

SELECT *
FROM msdb.LakeSideLockLogger._LakeSide_DbTools_LockLog

If it’s still running, you’ll see records appear that tell you there were deadlocks on your system. Even though you closed the tool and, like me, expect it to shut down, the services still run.

To check if the objects still exist on your server, run this query:

SELECT *
FROM msdb.sys.objects
WHERE 1 = 1
AND is_ms_shipped = 0
AND Name LIKE '_LakeSide%'

So what do you do then? Right, I sent the company a support request. I’ve waited more than a week by now, but still no reply. Not even a confirmation that they’ve received my mail. So really, really bad service from SQL Solutions if you ask me…

So after a lot of searching, I found a really small reference to another uninstall method: from within the tool itself. So I re-installed the tool, and tried that:

And that worked for me!

 
Conclusion
Apparently more people had issues with uninstalling the tool, looking at results like this.

The thing I learned from this, is to dig a little bit deeper when I look at a new tool. The website of the tool’s manufacturer might not be the best source to find out how a tool functions…

 
UPDATE
In the meantime, I found some more objects that aren’t uninstalled by Deadlock Detector: Operators and Alerts. Even though you’ve “uninstalled” the tool, objects, queues, etc., there are still some leftovers: 1 operator and 3 alerts.

You can drop them by running the script below:

--Drop Operator
EXEC msdb.dbo.sp_delete_operator @name=N'Deadlock Detector – Default operator'
GO

--Drop Alerts
EXEC msdb.dbo.sp_delete_alert @name=N'Deadlock Detector – Process is killed'
GO

EXEC msdb.dbo.sp_delete_alert @name=N'Deadlock Detector – Lock is detected'
GO

EXEC msdb.dbo.sp_delete_alert @name=N'Deadlock Detector – Deadlock is detected'
GO

Enriching your dataset – What are your technical options?

In my previous post I tried to explain how to enrich your dataset, so you can start using spatial data. You’ve seen that there are free options, or you can buy a bunch of files with the information you need.

This time I’ll try to explain the technical options you have. All the options use the Google Geocoding API, because it’s free and fairly accurate. But they will also work with the files you buy from (for example) GfK GeoMarketing, which I did business with once.

Write your own application
One of the easiest and quickest options for me was to write an application that fetches my data from the Google API. I’m not a die-hard programmer, but I know my way around C# as far as I need to. So with a little help from my colleagues, I managed to write a Windows Forms application that calls the Geocoding API. The results of the API call are either shown on screen (in a GridView), or exported to Excel (using LinqToExcel and ExcelExporter).

Another option I intend to build in, is the export from and to a SQL Server database. But because this is a project I work on in my own time, this could take a couple of weeks.

CLR
One of the other options I’ve found is a CLR assembly that calls the API. The CLR code you write is basically a .NET application that you load into SQL Server. It’s more complicated than that, but to keep this story moving, I’ll leave it at that.

One of the many downsides of using a CLR assembly is your local DBA. Every DBA I’ve encountered in my life told me: “Don’t use a CLR, EVER!!! Because that’s the biggest security hole you can open up on a SQL Server”. To be completely honest, I’m not that familiar with CLR assemblies, but I guess it isn’t as simple as that. There are pros and cons on both sides…

SSIS Package
In my quest to find all possible options, I found this great article by Donabel Santos (Blog | @sqlbelle). In her article she describes how you can create an SSIS package that fetches a Lat/Long for every record in the dataset you use as input.

Talking about this with Koen Verbeeck (Blog | @Ko_Ver) on Twitter made me realize that for most businesses, this might be the best option. In every company I know, there’s a server running SQL Server Integration Services that you can use to execute such a package.

So, what’s your choice?
Looking at the options above, I’m guessing the option you choose depends on the company you work for. I guess smaller companies would choose the SSIS package over building a tool. But if your company has a development department with a bunch of software developers, writing your own tool might be the better option. It gives you more flexibility, because not every developer can debug an error in an SSIS package, but every developer can debug a tool they’ve written themselves.

If you’ve got some .NET/C# knowledge, and you have the time to dive into CLR assemblies, that might be your best option. CLR code is loaded into SQL Server, and is (as far as I can tell) blazing fast. So if you’re looking for performance, and don’t want external tools or packages to fetch your data, go for it!

Geometry vs Geography

In the last few months I’ve often been asked the question: “Which datatype do I need to use? Geometry or Geography?”. The answer I gave them is the most used answer in the IT world: “It depends…”. Because the choice isn’t as easy as you might think…

 
Storage
The storage of both Geometry and Geography is the same. Even though their functionality is different, the system datatype records look the same:

SELECT *
FROM sys.types
WHERE system_type_id = 240

The content you will find in a Geometry or Geography column is binary. But by using a number of built-in methods, you can easily convert it to readable text.
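A small sketch of that conversion: both datatypes can be created from and converted back to Well-Known Text (WKT) with built-in methods. The coordinates are just an example:

DECLARE @geom GEOMETRY = geometry::STGeomFromText('POINT(5.0 52.0)', 0)
DECLARE @geog GEOGRAPHY = geography::STGeomFromText('POINT(5.0 52.0)', 4326)

SELECT
  @geom AS StoredBinary, /* The raw binary that ends up in the column */
  @geom.STAsText() AS GeometryAsText,
  @geog.STAsText() AS GeographyAsText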

 
Performance
Looking at performance, it could be an easy choice: Geometry wins this one. For the basic polygons the difference isn’t that big, but the extended methods do make a big difference! That’s a whole different world though, so I’ll write an article about it in the upcoming weeks.

Bob Beauchemin (Blog | @bobbeauch) also gave a session about this, and he had some great examples to illustrate this. You might want to watch his Spatial Data sessions!

 
Functionality
If you’re looking for the biggest difference between the two datatypes, it’s the functionality. A Geometry object is just a 2D, flat polygon on a plane. This means it doesn’t matter whether you take a country near the top of the earth (for example Canada, which is “curved”), or a country close to the equator (for example Brazil, which is “flat”).

A Geography object, on the other hand, is a 3D (or even 4D) polygon that follows the curvature of the earth. That means that the distance between 2 points isn’t calculated as a straight line, but needs to take the curvature of the earth into account.

A great example of this concept is shown by Morten Nielsen (Blog | @dotMorten) on his blog. He made a tool (Shape2Sql, downloadable from his blog) that can be used to import shape files into SQL Server.

He explains the difference between Geometry and Geography with 2 pictures, where he draws a straight line between Europe and the USA:


Geometry


Geography

As you can see, the straight line is actually a straight line if you use Geometry (a planar system). But if you use Geography (a spherical system), the line isn’t straight but curved. This becomes an issue if you want to know the distance between Berlin and Los Angeles, and you use Geometry data to calculate it. If you use that to calculate the amount of fuel for your plane, you might end up swimming the last part of your journey!
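To make that concrete, here’s a sketch that calculates the distance between Berlin and Los Angeles with both datatypes. The coordinates are rounded, so treat the numbers as an illustration:

DECLARE @BerlinGeo GEOGRAPHY = geography::Point(52.52, 13.40, 4326)
DECLARE @LAGeo GEOGRAPHY = geography::Point(34.05, -118.24, 4326)

DECLARE @BerlinGeom GEOMETRY = geometry::Point(13.40, 52.52, 0)
DECLARE @LAGeom GEOMETRY = geometry::Point(-118.24, 34.05, 0)

SELECT
  @BerlinGeo.STDistance(@LAGeo) / 1000.0 AS Geography_DistanceInKm, /* Follows the curve of the earth, result in meters */
  @BerlinGeom.STDistance(@LAGeom) AS Geometry_PlanarDistance        /* A flat distance in coordinate units (degrees) */

The Geography result comes back in meters (divided by 1000 for kilometers), while the Geometry result is a planar distance in degrees that doesn’t mean much in the real world: exactly the difference the two pictures above show.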

Popping the big question
The big question you need to ask yourself is: “what do I want to do with this?”. If you want to use it just for visualization purposes, you might want to stick with Geometry data. It’s accurate, fast, and doesn’t require difficult calculations. But if you need to calculate distances across the globe, or work with data that represents the actual earth, then Geography is the way to go.
