Incremental updates with Change Data Capture
November 26, 2013 4 Comments
When designing a database or ETL process, for example loading your production data into a reporting environment, you always start your design with performance in mind. In the beginning of the project, your scripts and ETL run blazing fast. But after a few months in production, the entire project grinds to a halt. But how do you fix that problem, without a complete redesign of your applications and database? One of the many solutions is an easy one: incrementally load your data into the destination tables.
Change Data Capture
Incremental data loading could be a hard nut to crack. It’s not always an option, but it might be a good point to start from. One of the ways to start loading your data incrementally, is by using the keys in your database as a reference. If your table has a column called “Modified Date”, and that is updated every time the record is updated, you could use that. Every night, when the process runs, you just add the records that were modified after the last successful process run. But what if you don’t have that possibility? Change Data Capture (CDC) is an easy way out.
CDC is a way to record inserts, updates and deletes on a specific table, without the need of writing the triggers yourself. CDC reads the transaction log, and captures all changes made to the specific table. These changes are stored in the associated change table, that is created by CDC.
Below I’m going to show you how to setup your first table with CDC.If you would like to know more about CDC, this TechNet article is a place to start.
Create an example
To show you the basics of CDC, let start with creating a table called TestCDC in the database called Sandbox:
USE Sandbox GO CREATE TABLE dbo.TestCDC (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY, Descr varchar(50) NULL) GO
Once you’ve created the table, turn on CDC at the database level, by execution the system stored procedure created to do that:
There is also a system stored procedure to enable CDC on the table level. You need to enable CDC on tables manually, and separately for every table you need:
EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'TestCDC', @role_name = NULL
If the SQL Server Agent is running on your machine or server, you’ll see this confirmation (I’ll explain later why SQL Server Agent is needed):
If the Agent isn’t running, you’ll see this warning:
If you ran the enable table statement, you will see that SQL Server created the system objects needed to track changes in the table:
Because CDC uses 2 SQL Server Agent jobs to capture and cleanup the change tables, you need to run the Agent to start the data capture. If the jobs aren’t running, SQL Server won’t capture any changes made:
Start data changes
In order to see what happens when you change data, let’s insert some records:
INSERT INTO dbo.TestCDC (Descr) VALUES ('This is a description') INSERT INTO dbo.TestCDC (Descr) VALUES ('This is a description too...')
And let’s update one of those 2 inserted records:
UPDATE dbo.TestCDC SET Descr = 'UPD - ' + Descr WHERE ID = 2
Now, let’s check the content of both the original table, and the change table:
/* Original table */ SELECT * FROM dbo.TestCDC /* Change table */ SELECT * FROM cdc.dbo_TestCDC_CT
If you run both queries, you’ll see the resultset below:
The records in the CDC change table allow you to update the data in your reporting environment. You could query them yourself, by retrieving all the changes since your last update. You can also use the procedures that return those changes for you, for example the cdc.fn_cdc_get_net_changes_. You can read more about the system function here.
Cleaning up after an update
Now that you’ve updated your reporting environment, it’s a wise thing to cleanup your CDC data. You could also drop the records yourself with a DELETE statement. Another option is using the system procedure for that: “sys.sp_cdc_cleanup_change_table”. You can clean your data using the following SQL statement:
DECLARE @Last_LSN VARBINARY(10) = (SELECT MAX(cdc.dbo_TestCDC_CT.[__$start_lsn]) FROM cdc.dbo_TestCDC_CT) EXEC sys.sp_cdc_cleanup_change_table @capture_instance = 'dbo_TestCDC', @low_water_mark = @Last_LSN, @threshold = 5000
The query will retrieve the last LSN (Log Sequence Number), and remove everything that happened before that.
Cleanup of CDC
If you want to completely remove CDC (because we’re done testing), you can disable it on the table level by running the query below:
EXEC sys.sp_cdc_disable_table @source_schema = 'dbo', @source_name = 'TestCDC', @capture_instance = 'dbo_TestCDC'
The statement will cleanup all the objects that were created to enable CDC on that specific table. But the statement will only stop the CDC on the specific table. The fastest way to disable CDC on all tables in the database, is disabling CDC on the database level, by running the query below:
Loading data always takes time, and there are many factors that are important: your database size, your frequency of changes, your ETL process, etc. The time it costs you to move data can be changed by rewriting your process to incremental loads. CDC is one of the many ways to achieve this. It works out of the box, and doesn’t require you to build any process yourself. But maybe your environment needs a custom process to operate the way you want it to. Not every feature in SQL Server is a so called silver bullet, but sometimes it comes darn close to one…