Using FOR XML in a CROSS APPLY

A few blog posts ago, I showed how easy it is to generate JSON, JavaScript or arrays with the FOR XML clause. But in those examples, the output was transformed to text, so selecting it works just like a normal select. But what if you really want to output XML? Does this work the same way? Let’s create the same example table again, and rebuild the query to output XML.

 
Creating resources
Just as in my last blog post, let’s use the airline example:

CREATE TABLE #TMP_AirlineTickets
  (ID INT IDENTITY(1,1),
   Airline VARCHAR(10),
   Departure_Airport_Code VARCHAR(5),
   Price DECIMAL(10,4))


INSERT INTO #TMP_AirlineTickets
  (Airline, Departure_Airport_Code, Price)
VALUES
  ('BA', 'RTM', 10.00),
  ('KLM', 'AMS', 125.00),
  ('BA', 'LHR', 15.00),
  ('KLM', 'BCN', 50.00),
  ('KLM', 'BHX', 75.00)

 
When you write a query with a CROSS APPLY on this table, it works like it’s supposed to:

SELECT AT.Airline, CA.Departure_Airport_Code
FROM #TMP_AirlineTickets AT
CROSS APPLY
	(
		SELECT Departure_Airport_Code
		FROM #TMP_AirlineTickets A
		WHERE A.Airline = AT.Airline
	) CA
GROUP BY AT.Airline, CA.Departure_Airport_Code

 
A DISTINCT or GROUP BY is needed in this query to return only the unique records. As we’ll see later on, DISTINCT doesn’t work for this query, so a GROUP BY is used. This results in a dataset where every airline code is joined to every airport it flies to:
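
Since the original screenshots aren’t included here, the result looks roughly like this (row order may differ, as there is no ORDER BY):

Airline  Departure_Airport_Code
-------  ----------------------
BA       LHR
BA       RTM
KLM      AMS
KLM      BCN
KLM      BHX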

 
Generating XML in CROSS APPLY
In the previous post, we got a string as output. This time we want to return XML instead of a normal resultset. So what happens if we put the FOR XML clause inside this query? It throws an exception:

SELECT AT.Airline, CA.Departure_Airport_Code
FROM #TMP_AirlineTickets AT
CROSS APPLY
	(
		SELECT Departure_Airport_Code
		FROM #TMP_AirlineTickets A
		WHERE A.Airline = AT.Airline
		FOR XML PATH('')
	) CA
GROUP BY AT.Airline, CA.Departure_Airport_Code
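
The error message you get back is along these lines (the exact state and line number will differ on your system):

Msg 8155: No column name was specified for column 1 of 'CA'.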

 

 
Apparently column 1 of the CROSS APPLY result has no column name. Looking at the result that comes from the CROSS APPLY, the column name is generated automatically:

 
Giving a column alias to the column in the select doesn’t work. But you can also provide a column list to a cross apply:

SELECT AT.Airline, CONVERT(XML, CA.Airport_XML) AS Airport_XML
FROM #TMP_AirlineTickets AT
CROSS APPLY
    (
        SELECT Departure_Airport_Code AS DAC
        FROM #TMP_AirlineTickets A
        WHERE A.Airline = AT.Airline
        FOR XML PATH('')
    ) CA (Airport_XML)
GROUP BY AT.Airline, CA.Airport_XML

 
By adding a column list to the CROSS APPLY, we can now use the column name Airport_XML in the outer select. By doing that, we get the airline name, and the departure airport code list:
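
The result should look something like this (the order of the elements may differ, because the inner query has no ORDER BY):

Airline  Airport_XML
-------  ---------------------------------------------
BA       <DAC>RTM</DAC><DAC>LHR</DAC>
KLM      <DAC>AMS</DAC><DAC>BCN</DAC><DAC>BHX</DAC>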

 
Returning XML
So when you want to generate XML in a query, you can do that with a CROSS APPLY. It’s just as easy as writing a normal query, but it makes you a lot more flexible when you need to output XML from SQL Server. With this, there is no need for CLRs or external applications to generate the XML you need; you can just do it with T-SQL.

Using OUTPUT clause to track changes

Whenever something changes in a database, you would like to know what has changed. A change might break your database or application later, so all schema or server changes are monitored. But how about data changes? Do you monitor those? You don’t have to feel bad if you don’t, because most people don’t track those changes. But you can track them in a very easy way, by using the OUTPUT clause in your query. Let’s take a look at how that works.

 
Create resources
As example, let’s take a look at a bank account. If you have a bank account, there’s some money in it, and you spend money which will be subtracted from your account balance. First of all, let’s make two tables: one for the account balance, and one for logging the changes:

CREATE TABLE #Account
  (ID INT IDENTITY(1,1),
   Balance FLOAT)


CREATE TABLE #Logging
  (DEL_ID INT,
   DEL_Balance FLOAT,
   UPD_ID INT,
   UPD_Balance FLOAT,
   Balance_Difference FLOAT)

 
To keep it simple, the account table has just one column, and that column contains the current balance of the account. Let’s insert a balance to start with:

INSERT INTO #Account
  (Balance)
VALUES
  (250.25)

 
If we run an update query now, the account balance will change. But do we still know what the original balance was? Not without logging. So let’s see how the output clause can change this:

UPDATE #Account
SET Balance = 125.00
OUTPUT Deleted.ID,
       Deleted.Balance,
       Inserted.ID,
       Inserted.Balance,
       Inserted.Balance - Deleted.Balance
INTO #Logging

 
The account table now only contains the new value. But because the OUTPUT clause logged the changes we made, we can check the logging table and find the old value. In the OUTPUT clause, you can use the pseudo-tables called Deleted and Inserted. The Deleted table contains the old values (before the change), and the Inserted table contains the new values (after the change). When checking the logging table, you’ll see both the old and the new values:
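
The logging table should now contain something like this:

DEL_ID  DEL_Balance  UPD_ID  UPD_Balance  Balance_Difference
------  -----------  ------  -----------  ------------------
1       250.25       1       125          -125.25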

 
So as you can see, the logging table contains both the old and the new version of the record. This not only works for UPDATE queries, but also for INSERT, DELETE and MERGE statements. This gives you the opportunity to log changes in the queries you think need it. Just think about what you want to log, create a log table, and add an OUTPUT clause to your query. It’s as easy as that.
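
As a quick illustration (a minimal sketch of my own, not tied to the logging table above), a DELETE can return the removed rows in exactly the same way, this time straight to the client instead of into a table:

DELETE FROM #Account
OUTPUT Deleted.ID,
       Deleted.Balance
WHERE ID = 1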

 
If you want to read more about CRUD, don’t forget to check out these blog posts:

- Julie Koesmarno: On sabbatical
- Mickey Stuewe: Going Beyond The INSERT Statement
- Chris Yates: C-R-U-D The Basics

Using FOR XML to create arrays, JSON and JavaScript

Every new project brings its own challenges and pitfalls. New projects also give you a chance to be creative in finding solutions, and try new things. This keeps it interesting, and allows you to learn new things. And sometimes, the things you avoided for all these years, turn out to be your best option.

 
A missing feature in SQL Server
For a project, we needed to generate JSON or JavaScript from T-SQL. Even though the whole world seems to use JSON at this point, there is no feature to generate JSON from T-SQL or SQL Server. There are more than enough people who would like such a feature (just look at the Connect item made for this). Maybe they’ll put it in the vNext of SQL Server, who knows. But at this moment, we need to generate the JSON ourselves.

 
Generating the data array
First of all, we need to generate an array or list. The problem with this is the number of rows. For example, if you only have 2 rows per product, you can join the table onto itself and concatenate the columns that way. But what if you have more rows per product, or you don’t know how many rows per product are returned?

One of the ways to resolve this, is by using the FOR XML clause. You can use this in a CROSS APPLY, to create an array/list of data, grouped by another column.

So let’s create an example, by using airlines. When you fly to another state, country or continent, you’ll book a flight with an airline. Those airlines have a name, a departure airport code, and of course a ticket price. Let’s put this data in a table:

CREATE TABLE #TMP_AirlineTickets
  (ID INT IDENTITY(1,1),
   Airline VARCHAR(10),
   Departure_Airport_Code VARCHAR(5),
   Price DECIMAL(10,4))


INSERT INTO #TMP_AirlineTickets
  (Airline, Departure_Airport_Code, Price)
VALUES
  ('BA', 'RTM', 10.00),
  ('KLM', 'AMS', 125.00),
  ('BA', 'LHR', 15.00),
  ('KLM', 'BCN', 50.00),
  ('KLM', 'BHX', 75.00)

 
When we query this data, we could easily use a sub-select. But because I want my query to be readable and to perform well, I used the FOR XML clause:

SELECT
  AT.Airline,
  STUFF((SELECT
            CONCAT(',', SUB.Departure_Airport_Code)
         FROM #TMP_AirlineTickets SUB
         WHERE SUB.Airline = AT.Airline
         ORDER BY SUB.Departure_Airport_Code ASC
         FOR XML PATH('')
        ), 1, 1, '') AS Array_Airport_Codes
FROM #TMP_AirlineTickets AT
WHERE 1 = 1
GROUP BY AT.Airline

 
What happens in this query? The FOR XML clause creates an XML for the airline that is selected in the outer query. So a query like this is executed:

SELECT
  SUB.Departure_Airport_Code
FROM #TMP_AirlineTickets SUB
WHERE SUB.Airline = 'KLM'
ORDER BY SUB.Departure_Airport_Code ASC
FOR XML PATH('')

 
The result from that query is an XML column:
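
For KLM, that XML looks like this (reconstructed here, since the original screenshot isn’t included):

<Departure_Airport_Code>AMS</Departure_Airport_Code><Departure_Airport_Code>BCN</Departure_Airport_Code><Departure_Airport_Code>BHX</Departure_Airport_Code>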

 
By concatenating a character to that column, the expression loses its column name, so FOR XML PATH('') no longer wraps the values in XML nodes. This gives us the possibility to use the result as an array:

SELECT
  CONCAT(',', SUB.Departure_Airport_Code)
FROM #TMP_AirlineTickets SUB
WHERE SUB.Airline = 'KLM'
ORDER BY SUB.Departure_Airport_Code ASC
FOR XML PATH('')

 
The result of this statement looks like this:
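
For KLM, that result is (reconstructed, since the screenshot isn’t included):

,AMS,BCN,BHX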

 
The next step is to use this in a query that creates an XML column for all airlines in the table:

SELECT
  AT.Airline,
  (SELECT
            CONCAT(',', SUB.Departure_Airport_Code)
         FROM #TMP_AirlineTickets SUB
         WHERE SUB.Airline = AT.Airline
         ORDER BY SUB.Departure_Airport_Code ASC
         FOR XML PATH('')
        ) AS Array_Airport_Codes
FROM #TMP_AirlineTickets AT
WHERE 1 = 1
GROUP BY AT.Airline

 
The result is an array of airport departure codes:

 
As you can see, this creates the array/list for all airlines in the table. With STUFF I replace the first character, because I don’t want my result to start with a comma. This could also be done with RIGHT, using LEN - 1 to strip off the first character on the left, but STUFF looks better in my opinion:

SELECT
  AT.Airline,
  STUFF((SELECT
            CONCAT(',', SUB.Departure_Airport_Code)
         FROM #TMP_AirlineTickets SUB
         WHERE SUB.Airline = AT.Airline
         ORDER BY SUB.Departure_Airport_Code ASC
         FOR XML PATH('')
        ), 1, 1, '') AS Array_Airport_Codes
FROM #TMP_AirlineTickets AT
WHERE 1 = 1
GROUP BY AT.Airline

 
The result looks the same, except the first comma is gone:
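
Reconstructed from the sample data, the final result looks like this:

Airline  Array_Airport_Codes
-------  -------------------
BA       LHR,RTM
KLM      AMS,BCN,BHX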

 
One of the best things when creating demos like this for a blog post, is that you can make it as unrealistic as you want. What I mean by that, is I’m making it easy to generate XML, in order to keep the examples clear and understandable. But what if you want to use this in the real world? You’re probably seeing datasets that aren’t as easy to use as the demo above. The strings you want to concatenate are filled with special characters, etc. Luckily there’s a solution for that. Rob Farley (Blog | @rob_farley) wrote a really nice blog post about that.

 
So now we have our array/list ready, but how do we convert this to JSON or JavaScript? The main difference (sorry if I insult my developer friends with this statement) is that JSON requires property names and string values to be enclosed in double quotes (“), while a JavaScript object literal can do without the quotes around its property names. In the following example, I’ll create a JSON object. But because just airport codes isn’t what we need, I’ll adjust the query so it adds the prices to the object as well. This is done with an additional CONCAT in the XML select, which concatenates the airport code and the price into one string:

SELECT
  AT.Airline,
  CONCAT('{"ap":[',
         STUFF(
               (SELECT
                    CONCAT(',{"', SUB.Departure_Airport_Code, '":', SUB.Price, '}')
                FROM #TMP_AirlineTickets SUB
                WHERE SUB.Airline = AT.Airline
                ORDER BY SUB.Departure_Airport_Code ASC
                FOR XML PATH('')
                ), 1, 1, '')
         , ']}') AS JSON
FROM #TMP_AirlineTickets AT
WHERE 1 = 1
GROUP BY AT.Airline

 
The GROUP BY can be replaced by a DISTINCT. But either one of those is needed to make sure you’re only returning unique records:
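
Reconstructed from the sample data, the JSON column should look roughly like this:

Airline  JSON
-------  ---------------------------------------------------------
BA       {"ap":[{"LHR":15.0000},{"RTM":10.0000}]}
KLM      {"ap":[{"AMS":125.0000},{"BCN":50.0000},{"BHX":75.0000}]}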

 

 
By parsing the second result in the query in an online tool, you can see that the result is valid JSON:

 

 
With a few small tweaks to the quotes in the script above, you can just as easily generate a JavaScript object from T-SQL. And if you just want to use FOR XML to generate arrays, this will also work perfectly.

 
Why would you want to use this?
Generating JSON or JavaScript might come in handy for web developers who can get data from a database and use it directly on their website. By using the FOR XML clause to generate JSON, you make sure your queries are flexible enough to generate valid JSON, even if your dataset grows. Generating arrays like this can also come in handy for developers that need data from their databases. Instead of creating an array in code, they can now use the array that is returned from the database.

So even though I tried to work my way around using FOR XML, it turned out to be the best option thinkable!

Write readable and high-performance queries with Window Functions

In SQL Server 2008, we gained a new and powerful feature in our bag of T-SQL tricks: Window Functions. The heart of a Window Function is the OVER clause, which allows you to determine partitions or “groups” in your query before applying another function. In practice, this means you can create groups in your dataset, which can be molded further by applying functions to those groups.

In the past, I’ve written a number of posts about Window Functions in SQL Server:

Row_Number: Unique ID in select statement
Calculating Running Totals
Removing duplicates in a dataset

But there’s more!

 
Finding missing numbers in sequence
How many times did you look at a table and notice that one of the values in a sequence is missing? Or a date range in a table where someone deleted a few records? All of a sudden, your year doesn’t have 365 days (or 366 in a leap year, or 365.242199 days if you take the astronomical average), but 315 days. There go your management reports that are calculated on a per-day average. So how do you find the missing dates, without having to write a cursor or create a full-blown date table like your BI colleagues do? You guessed right: a query with a Window Function!

Let’s start with declaring a table, and insert a sequence with missing values:

DECLARE @Sequence TABLE
  (Number INT)


INSERT INTO @Sequence
  (Number)
VALUES
  (1),
  (10),
  (7),
  (4),
  (2),
  (8),
  (5)

 
So how can we get the “nearest numbers” from that table with just a single select statement?

SELECT
  LAG(Number, 1, 0) OVER (ORDER BY Number) AS LAG_Value,
  Number,
  LEAD(Number, 1, 0) OVER (ORDER BY Number) AS LEAD_Value
FROM @Sequence AS S
ORDER BY Number ASC

 
LAG and LEAD are standard T-SQL functions from SQL Server 2012 onwards. These functions give you the opportunity to access the previous or next row, without the need for a so-called “self-join”. So what you see is the Number, the value preceding it (LAG) and the value following it (LEAD). In this case, Number 2 is preceded by Number 1, and followed by Number 4.
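
The original post stops here, so this next step is a sketch of my own that builds on the same idea: wrap the LEAD value in a CTE, and keep only the rows where the gap to the next number is bigger than 1. That gives you the actual missing ranges:

/* Sketch (not in the original post): list the ranges of missing numbers */
;WITH NumberedSequence AS
(
  SELECT
    Number,
    LEAD(Number, 1, Number) OVER (ORDER BY Number) AS NextNumber
  FROM @Sequence
)
SELECT
  Number + 1 AS FirstMissing,
  NextNumber - 1 AS LastMissing
FROM NumberedSequence
WHERE NextNumber - Number > 1

For the sample data, this returns the ranges 3-3, 6-6 and 9-9.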

 
The lemonade stand
Now let’s look at another example. How about you? When you grew up, you wanted to save money for a new mobile phone, right? In my case it was either a Discman, a Walkman, or a stereo set. But let’s stick with the modern equivalent of the Walkman for now: the MP3 player. To earn money for the MP3 player, our fictitious friend, who is called Joe, decides to start a lemonade stand. He needs to save up at least $150 to buy a new MP3 player. So every glass of lemonade he sells is accounted for, and at the end of the day he sums up all his earnings and puts them into a table:

DECLARE @Profit TABLE
  (DayNumber INT,
   Sales DECIMAL(10,2))


INSERT INTO @Profit
  (DayNumber, Sales)
VALUES
  (1,  6.90),
  (2,  4.17),
  (3,  2.69),
  (4,  7.26),
  (5,  2.93),
  (6,  8.98),
  (7,  7.25),
  (8,  5.88),
  (9,  1.51),
  (10, 7.97),
  (11, 3.44),
  (12, 3.76),
  (13, 9.96),
  (14, 0.92),
  (15, 8.28),
  (16, 6.05),
  (17, 9.40),
  (18, 4.03),
  (19, 9.14),
  (20, 7.25),
  (21, 0.06),
  (22, 9.12),
  (23, 7.39),
  (24, 6.57),
  (25, 4.54),
  (26, 0.09),
  (27, 4.42),
  (28, 9.53),
  (29, 5.09),
  (30, 0.89)

 
So as you can see, he earns quite a lot of money this way! Because he’s eager to buy his new MP3 player, he wants to see his day totals and the amount he still needs to buy his new toy. And because Joe is a really smart guy, he doesn’t want to do this with a lot of self-joins, and he wants his results fast. So looking at performance, what is the easiest way to query this data? How about this:

DECLARE @Goal DECIMAL(10,2) = 150.00


SELECT
  DayNumber,
  Sales,
  @Goal - SUM(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MoneyNeeded
FROM @Profit

 
He declares a “goal” variable that is set to the amount he needs for his new gadget. Then, for every row, we subtract the running total of sales, from the first day (UNBOUNDED PRECEDING) up to and including today (CURRENT ROW), from that goal. After day 28 he has earned enough to buy his MP3 player. But now he wants to know what his average sales were. So he calculates the average of his sales, based on every sale he’s made so far:

SELECT
  DayNumber,
  Sales,
  AVG(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS AverageSales
FROM @Profit

 
So where does it stop?
So now that we’ve seen SUM and AVG, what do we have left? How far can we take this? How about a daily check to see whether we hit our lowest or highest sales amount? We can do this with MIN and MAX in the same kind of query:

SELECT
  DayNumber,
  Sales,
  MIN(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS LowestSale
FROM @Profit


SELECT
  DayNumber,
  Sales,
  MAX(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS HighestSale
FROM @Profit

 
Now Joe can save his daily sales in the evening, and check if he had a very good, or a very bad day.
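
All of these window aggregates can also be combined in one statement. This is a sketch of my own, not from the original post, but it shows how Joe could get his running total, average, lowest and highest sale from a single query:

SELECT
  DayNumber,
  Sales,
  SUM(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal,
  AVG(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS AverageSales,
  MIN(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS LowestSale,
  MAX(Sales) OVER(ORDER BY DayNumber
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS HighestSale
FROM @Profit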

 
Not only for the average Joe…
So how can the business profit from all this? In some cases, Window Functions make it easier to output a certain resultset. In other cases they even give you a whole new way to output data with a single, well-performing query, which was impossible before. So if you’re running SQL Server 2008 or higher, start using (or at least start exploring) Window Functions right away!

 

To generate the random floats, I’ve used the generator of FYIcenter.com

If you want to read more about this topic, don’t forget to check out these blog posts:

- Julie Koesmarno: ABC Classification With SQL Server Window Function
- Mickey Stuewe: A Date At The End of The Month
- Chris Yates: Windows functions who knew

T-SQL Tuesday #50 – Automation: yea or nay

T-SQL Tuesday is a recurring blog party started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Automation”. If you want to read the opening post, please click the image below to go to the party-starter: Hemanth D. (Blog | @SqlChow).



 
Being a DBA, you want to automate as many processes as you can, in order to save time that you can spend on more important things. But have you ever considered that you might be over-automating your processes?

 
We’re safe! What can go wrong here?
At one of the companies I worked for, they thought they had everything sorted out. Indexes were rebuilt every day or every week (depending on the database), databases and log files were shrunk, databases were checked for corruption, backups were running, etc. They felt safe, knowing that if something happened, they could handle any situation SQL Server would throw at them. It would blow up in their faces eventually…

One of the first things I checked were the backups. The backup job was running, but only a few databases were actually selected for backup. And the biggest database (500+ GB), which was pretty important, was skipped because it took too long to back up. And guess what: they didn’t EVER test restoring from a backup, because of a lack of disk space and time. And there you have it: a false sense of safety!

I don’t have to tell you not to shrink your databases and log files. Everybody knows that every time you shrink your database, a kitten dies… Or an index dies… Or the soul of your database… I’m not sure which one, but take your pick. It causes (and I quote Paul Randal (Blog | @PaulRandal) on this!): “*massive* index fragmentation”. Read more about that over at Paul’s blog. Besides that, if your next query needs more space in a data or log file, you’ll see extra wait time because of the file growth.

The indexes were rebuilt every night on the important databases, and every weekend on the less used databases. But they never checked whether the problem they had before was actually fixed by this solution.

Also, the corruption check was run only on user databases; they had never heard of running a corruption check on system databases. The system databases were included in the backup process, but they never took the time to check whether they could restore them, or whether they were backing up an already corrupted database.

 
Focus on the important stuff
So instead of automating all your processes, maybe you should focus on what’s really important. You just automated your backup process. But does it run every time? Are the backups actually written to disk? Can you restore one of the backups you created?
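
As a quick (and admittedly limited) sanity check, you could at least verify that a backup file is readable. The path below is just a placeholder, and VERIFYONLY is no substitute for an actual test restore:

RESTORE VERIFYONLY
FROM DISK = N'D:\Backup\MyDatabase.bak'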

What I’m trying to say is, you can automate tasks whenever and wherever you like, but don’t forget to test them. Once you’ve automated something, plan regular tests to verify if the automated process runs the way you expect it to. And is the end result really the result you want and expect?

 
Don’t reinvent the wheel
Another tip is: don’t reinvent the wheel. There are plenty of people who have encountered the same issue and wrote or blogged about a solution. So before you build your own maintenance solution, or automate health status reports, check with your community members. Help can be found for almost every problem, but checking whether that solution really works for you is all on you.

T-SQL Tuesday #49 – Wait for it…

T-SQL Tuesday is a recurring blog party started by Adam Machanic (Blog | @AdamMachanic). Each month a blog will host the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Wait for it…”. If you want to read the opening post, please click the image below to go to the party-starter: Robert Davis (Blog | @SQLSoldier).



 
Explaining to developers how SQL Server works is something we all do. Maybe not on a daily basis, but you’re asked questions like “why is my index not working”, or “what’s the best way to add multiple columns to a table”. And most of the time, these questions lead to a whole bunch of other questions you need to answer. And there’s one question we’re all asked more than once: “why is my query running slow?”. So where do you start explaining?

 
Wait Types
There are lots and lots of wait types that can be found in SQL Server. In SQL Server 2005 there are 230 different wait types, 475 in SQL Server 2008 and 491 in SQL Server 2008 R2. In SQL Server 2012 they added another 197 new ones to the list. The wait types can be found by running this query:

SELECT wait_type
FROM sys.dm_os_wait_stats
ORDER BY wait_type ASC

 
These wait types can tell you what SQL Server is doing to execute your statement, and what the possible delays are. I’m not going to sum up all the wait types, but here’s a short list of common wait types you’ll see on your SQL server:

 
- SOS_SCHEDULER_YIELD: yielding processor time
- LCK_M_*: waiting for a lock
- OLEDB: waiting on the OLEDB provider (linked servers, Full-Text Search)
- WRITELOG: writing the transaction log to disk
- RESOURCE_SEMAPHORE: waiting for a query memory grant
- CXPACKET: query parallelism
- PAGEIOLATCH_*: latch on a memory address while data is retrieved from disk
- LAZYWRITER_SLEEP: system process waiting to start

 
All these different wait types could indicate a problem with your statement or the server. Some are more informative, while others show you a real issue. But what I really would like to show you, is how you can find these wait types.

 
DIY or just ask for help…
One of the ways to find the wait types on your SQL Server is to dive into the seemingly endless list of DMVs. You could use the “sys.dm_exec_requests” and “sys.dm_os_waiting_tasks” DMVs to find what you want, or you could take the easy way out: sp_WhoIsActive by Adam Machanic (Blog | @AdamMachanic).
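
If you do want to go the DIY route first, a minimal sketch with those two DMVs could look something like this (the column choice is mine, pick whatever you need):

/* Show requests that are currently running or waiting; parallel queries can return multiple rows */
SELECT
  r.session_id,
  r.status,
  r.command,
  r.wait_type,
  r.wait_time,
  wt.blocking_session_id
FROM sys.dm_exec_requests r
LEFT JOIN sys.dm_os_waiting_tasks wt
  ON wt.session_id = r.session_id
WHERE r.session_id > 50 /* sessions above 50 are usually user sessions */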

Adam (also the party starter of T-SQL Tuesday) wrote a no less than brilliant script to find problems on your server. But how does it work?

 
Installing sp_WhoIsActive
The “installation” of sp_WhoIsActive couldn’t be easier. You just need to download the script, and run it. This creates a stored procedure in the database of your choice. Usually I just create it in the master database. But if you have a DBA database with useful scripts, it’s okay to create it there.

 
Running it for the first time
The stored procedure can be executed without any parameters. That way, you use the default options. Just run the statement shown below:

EXEC master.dbo.sp_WhoIsActive

 
If you need it, or just like to see more information, you can also configure the procedure with a lot of parameters. If you want to see all the options you can configure, just set the documentation parameter to 1 (true):

EXEC master.dbo.sp_WhoIsActive
  @help = 1

 
The options
If you start using sp_WhoIsActive more and more, you’ll get your own set of favorite options. It all depends on the purpose you’re using the procedure for. Most of the time, I use it to determine why queries run slow, or why the performance of the SQL server is so low.

The information sp_WhoIsActive retrieves gives you a good indication of what SQL Server is doing, or what queries are bugging each other. I’ll list my favourite options below:

First, I set @show_own_spid on, so I can see my own query in the resultset.

The second option I love is @get_plans. This shows you the execution plans of the running queries:

 
Another great parameter to set is @get_outer_command. That way, you won’t just see the query running at the moment, but also the outer-command of the query (in the example below, the INSERT INTO the temp table is executed from within the stored procedure you see in the right column):

 
To see how much is written to the transaction log(s) when running your query, set @get_transaction_info to 1:

 
Information regarding locks can be found, by setting @get_locks to 1:

 
If you click the XML, you’ll see which locks are granted, pending or denied:

 
The last option I’d like to set, is @get_additional_info. This will show you more information regarding the connection settings, session variables, etc:

 
Clicking the XML shows you the properties I mentioned above:

 
So this is what the query looks like, the way I personally like to use it:

EXEC master.dbo.sp_WhoIsActive
  @show_own_spid = 1,
  @get_plans = 1,
  @get_outer_command = 1,
  @get_transaction_info = 1,
  @get_locks = 1,
  @get_additional_info = 1

 
Conclusion
Wait types are your keys to open the door of the next level of SQL Server. Not all wait types are that easy to read and understand, but there are plenty of resources to be found online. For example, just take a look at the rest of the posts today. Most of the posts for T-SQL Tuesday can be found on Twitter, when you search for #TSQL2sDay.


I want to say thanks to the employees at Coeo for the easy explanation of some of the wait types!

Incremental updates with Change Data Capture

When designing a database or ETL process, for example loading your production data into a reporting environment, you always start your design with performance in mind. In the beginning of the project, your scripts and ETL run blazing fast. But after a few months in production, the entire project grinds to a halt. But how do you fix that problem, without a complete redesign of your applications and database? One of the many solutions is an easy one: incrementally load your data into the destination tables.

 
Change Data Capture
Incremental data loading could be a hard nut to crack. It’s not always an option, but it might be a good point to start from. One of the ways to start loading your data incrementally, is by using the keys in your database as a reference. If your table has a column called “Modified Date”, and that is updated every time the record is updated, you could use that. Every night, when the process runs, you just add the records that were modified after the last successful process run. But what if you don’t have that possibility? Change Data Capture (CDC) is an easy way out.
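
Before moving on to CDC: to make the “Modified Date” approach a bit more concrete, here is a minimal sketch. The table, column and log table names are made up for the example:

/* Hypothetical example: only pick up the rows that changed since the last successful run */
DECLARE @LastRun DATETIME2 =
  (SELECT MAX(LastSuccessfulRun) FROM dbo.ETL_ProcessLog)

SELECT CustomerID, Name, ModifiedDate
FROM dbo.Customer
WHERE ModifiedDate > @LastRun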

 
CDC is a way to record inserts, updates and deletes on a specific table, without the need of writing the triggers yourself. CDC reads the transaction log, and captures all changes made to the specific table. These changes are stored in the associated change table, that is created by CDC.

Below I’m going to show you how to set up your first table with CDC. If you would like to know more about CDC, this TechNet article is a good place to start.

 
Create an example
To show you the basics of CDC, let’s start with creating a table called TestCDC in a database called Sandbox:

USE Sandbox
GO

CREATE TABLE dbo.TestCDC
  (ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
   Descr VARCHAR(50) NULL)
GO

 
Once you’ve created the table, turn on CDC at the database level, by executing the system stored procedure created for that:

EXEC sys.sp_cdc_enable_db

 
There is also a system stored procedure to enable CDC on the table level. You need to enable CDC on tables manually, and separately for every table you need:

EXEC sys.sp_cdc_enable_table
  @source_schema = 'dbo',
  @source_name = 'TestCDC',
  @role_name = NULL

 
If the SQL Server Agent is running on your machine or server, you’ll see this confirmation (I’ll explain later why SQL Server Agent is needed):

 
If the Agent isn’t running, you’ll see this warning:

 
If you ran the enable table statement, you will see that SQL Server created the system objects needed to track changes in the table:

 
Because CDC uses 2 SQL Server Agent jobs to capture changes and to clean up the change tables, you need to run the Agent to start the data capture. If the jobs aren’t running, SQL Server won’t capture any changes made:
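
If you want to check the settings of those capture and cleanup jobs, you can list them with a system procedure:

EXEC sys.sp_cdc_help_jobs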

 
Start data changes
In order to see what happens when you change data, let’s insert some records:

INSERT INTO dbo.TestCDC
  (Descr)
VALUES
  ('This is a description')

INSERT INTO dbo.TestCDC
  (Descr)
VALUES
  ('This is a description too...')

 
And let’s update one of those 2 inserted records:

UPDATE dbo.TestCDC
SET Descr = 'UPD - ' + Descr
WHERE ID = 2

 
Now, let’s check the content of both the original table, and the change table:

/* Original table */
SELECT * FROM dbo.TestCDC

/* Change table */
SELECT * FROM cdc.dbo_TestCDC_CT

 
If you run both queries, you’ll see the resultset below:
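
The screenshot isn’t included here, but the results should look roughly like this (the LSN and mask columns of the change table are left out):

dbo.TestCDC:
  ID  Descr
  --  -----------------------------------
  1   This is a description
  2   UPD - This is a description too...

cdc.dbo_TestCDC_CT (__$operation: 2 = insert, 3 = value before update, 4 = value after update):
  __$operation  ID  Descr
  ------------  --  -----------------------------------
  2             1   This is a description
  2             2   This is a description too...
  3             2   This is a description too...
  4             2   UPD - This is a description too...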

 
The records in the CDC change table allow you to update the data in your reporting environment. You could query them yourself, by retrieving all the changes since your last update. You can also use the functions that return those changes for you, for example cdc.fn_cdc_get_net_changes_<capture_instance> (in this case cdc.fn_cdc_get_net_changes_dbo_TestCDC). You can read more about that system function here.
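
A minimal sketch of how you could call that function for this capture instance, assuming net changes support was enabled when CDC was set up (the default when the table has a primary key); the 'all' row filter returns one net row per changed key:

DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_TestCDC')
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn()

SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_TestCDC(@from_lsn, @to_lsn, 'all')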

 
Cleaning up after an update
Now that you’ve updated your reporting environment, it’s a wise thing to clean up your CDC data. You could drop the records yourself with a DELETE statement, but another option is to use the system procedure made for that: “sys.sp_cdc_cleanup_change_table”. You can clean up your data using the following SQL statement:

DECLARE @Last_LSN VARBINARY(10) =
  (SELECT MAX(cdc.dbo_TestCDC_CT.[__$start_lsn]) FROM cdc.dbo_TestCDC_CT)

EXEC sys.sp_cdc_cleanup_change_table
  @capture_instance = 'dbo_TestCDC',
  @low_water_mark = @Last_LSN,
  @threshold = 5000

 
The query will retrieve the last LSN (Log Sequence Number), and remove everything that happened before that.

 
Cleanup of CDC
If you want to completely remove CDC (because we’re done testing), you can disable it on the table level by running the query below:

EXEC sys.sp_cdc_disable_table
  @source_schema = 'dbo',
  @source_name = 'TestCDC',
  @capture_instance = 'dbo_TestCDC'

 
The statement will clean up all the objects that were created to enable CDC on that specific table. But it only stops CDC for that specific table. The fastest way to disable CDC on all tables in the database is to disable CDC at the database level, by running the query below:

EXEC sys.sp_cdc_disable_db

 
Conclusion
Loading data always takes time, and there are many factors that are important: your database size, your frequency of changes, your ETL process, etc. The time it costs you to move data can be reduced by rewriting your process to use incremental loads. CDC is one of the many ways to achieve this. It works out of the box, and doesn’t require you to build any process yourself. But maybe your environment needs a custom process to operate the way you want it to. Not every feature in SQL Server is a so-called silver bullet, but sometimes it comes darn close to one…
