Find SQL Server Agent job history retention with T-SQL

In SQL Server, the SQL Server Agent has its own retention setting. This retention determines how much execution history of SQL Server Agent jobs is kept. But if you have a lot of instances across your domain, or if you install a new server, how do you determine whether this value is set correctly?

Doing it old-school: via the GUI
You can check the SQL Server Agent retention via the GUI by right-clicking on the SQL Server Agent and clicking on Properties. In the window that pops up, click on History, and you’ll see something like this:

[Screenshot: SQL Server Agent Properties window, History page]
T-SQL / Registry
The way I wanted to check it is with T-SQL. Because I want to run the statement against 40+ instances, I use a multi-server query. The reason I do that is that it returns all the results in a single table, which makes reviewing and comparing the results a lot easier.

There doesn’t seem to be a way to query the configured values directly. But with a little bit of searching, I found out that you can use the stored procedure below to change one of the values:

EXEC msdb.dbo.sp_set_sqlagent_properties @jobhistory_max_rows = 999
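
The same procedure also accepts a parameter for the per-job limit. A minimal sketch setting both values at once (hedged: verify the parameter names on your version before running this):

-- Sketch: set both the total and the per-job history limits
EXEC msdb.dbo.sp_set_sqlagent_properties @jobhistory_max_rows = 999,
                                         @jobhistory_max_rows_per_job = 100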

So by dissecting this procedure, I found out that SQL Server just reads the value from a registry key. I took out the piece I needed to read the retention values:

DECLARE @jobhistory_max_rows INT = NULL,
        @jobhistory_max_rows_per_job INT = NULL

-- Read the maximum number of history rows kept for all jobs combined
EXECUTE master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE',
                                        N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent',
                                        N'JobHistoryMaxRows',
                                        @jobhistory_max_rows OUTPUT,
                                        N'no_output'
SELECT @jobhistory_max_rows = ISNULL(@jobhistory_max_rows, -1)

-- Read the maximum number of history rows kept per job
EXECUTE master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE',
                                        N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent',
                                        N'JobHistoryMaxRowsPerJob',
                                        @jobhistory_max_rows_per_job OUTPUT,
                                        N'no_output'

SELECT @jobhistory_max_rows, @jobhistory_max_rows_per_job
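
A small variation for the multi-server query: include the server name and column aliases in the final SELECT, so the combined result set is easier to compare across instances. The aliases below are my own additions:

-- Sketch: label the output for easy comparison in a multi-server query
SELECT @@SERVERNAME                 AS [ServerName],
       @jobhistory_max_rows         AS [JobHistoryMaxRows],
       @jobhistory_max_rows_per_job AS [JobHistoryMaxRowsPerJob]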

But when I wanted to look up the values in the registry, to see if the script returned the correct results, I couldn’t find the actual keys. I thought I was losing my mind, but there’s a catch…

xp_instance_regread vs xp_regread
There are two stored procedures you can use to read registry keys from your SQL Server:

DECLARE @jobhistory_max_rows INT = NULL

EXECUTE master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE',
                                        N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent',
                                        N'JobHistoryMaxRows',
                                        @jobhistory_max_rows OUTPUT,
                                        N'no_output'

SELECT @jobhistory_max_rows

or

DECLARE @jobhistory_max_rows INT = NULL

EXECUTE master.dbo.xp_regread N'HKEY_LOCAL_MACHINE',
                              N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent',
                              N'JobHistoryMaxRows',
                              @jobhistory_max_rows OUTPUT,
                              N'no_output'

SELECT @jobhistory_max_rows

The difference between the two is perfectly explained in this Stack Overflow thread:

xp_regread reads the literal registry path that you specify. xp_instance_regread “converts” the path you specify so that it matches the instance of SQL Server that you’re currently using.

So if you run multiple instances on your machine, you want to use the instance version of the stored procedure. If you’re running just one (default) instance, either version works.

So eventually I searched for the instance-specific folder (on my server something like SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL<version>.<InstanceName>\SQLServerAgent), and there I did find the registry keys:

[Screenshot: Registry Editor showing the JobHistoryMaxRows and JobHistoryMaxRowsPerJob keys]
Conclusion
Especially for very large environments, it’s a good idea to write system- and/or server-checks in T-SQL. That way you can script them in SCOM (or another monitoring tool), use them in SQL Server Policy Based Management (and I would really like to recommend this Pluralsight course from a very good friend of mine: Boris Hristov (Blog | @BorisHristov)!), or just run them as a multi-server query.


Avoid a big SSISDB by logging less…?

Last week I blogged about how logging can grind your SSIS to a halt. After posting the blog, I got a really interesting reaction from Koen Verbeeck (Blog | @Ko_Ver).

I knew there were different logging levels in SSIS, but I couldn’t recall the difference between them. So I discussed the differences with Koen. At first glance we thought that the biggest difference is the fact that performance logging only logs warnings and errors. So basically you decrease the number of messages logged, which should result in a slightly better performance. But is that really the case?

Different logging levels
In SSIS there are 4 different logging levels. So what are the options, and what’s the difference between them? You can read all about it on MSDN, but the short version is:

None: Logging is turned off
Basic (default value): All events are logged, except custom and diagnostic events
Performance: Only performance statistics, OnError and OnWarning events are logged
Verbose: All events are logged

The next question is: where do you change these options? For example, how do I change my logging level to Performance?

SSISDB / Catalog
When you want to change the logging level for all your deployed SSIS packages, you can change the setting on your catalog. You can do that by opening the “Integration Services Catalogs” node on your instance, right-clicking on your catalog, and choosing an option for “Server-wide Default Logging Level”:

[Screenshot: Catalog Properties window with the Server-wide Default Logging Level option]
If you change this, the logging level for all packages in this catalog will change. But maybe you don’t want that, and you only want it for a specific package.
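
If you prefer T-SQL over the GUI, the same server-wide default can be read and changed through the catalog. A minimal sketch, assuming the default catalog name SSISDB (the numbers map to the levels above: 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose):

-- Read the current server-wide default logging level
SELECT property_name, property_value
FROM SSISDB.catalog.catalog_properties
WHERE property_name = 'SERVER_LOGGING_LEVEL'

-- Change the server-wide default to Performance
EXEC SSISDB.catalog.configure_catalog @property_name  = N'SERVER_LOGGING_LEVEL',
                                      @property_value = 2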

SQL Server Agent Job
Another option is to configure this in your SQL Server Agent job:

[Screenshot: SQL Server Agent job step configuration with the logging level option]
If you use this option, the logging level will change for the package that is executed in the job step. So no permanent change, but only for the duration of the job.

At package execution
Another option is to do it at package runtime:

[Screenshot: Execute Package window with the Logging level option]
Personally I don’t use that option for executing packages, but you might. The same applies as for the change on the SQL Server Agent job: this will only change the option for the duration of the execution, so there’s no permanent change.
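
For completeness: the same can be done when you start a package from T-SQL. A sketch, assuming the catalog is named SSISDB and using made-up folder, project, and package names (2 again stands for the Performance level):

-- Sketch: execute a catalog package with the Performance logging level
DECLARE @execution_id BIGINT

EXEC SSISDB.catalog.create_execution @folder_name  = N'MyFolder',
                                     @project_name = N'MyProject',
                                     @package_name = N'MyPackage.dtsx',
                                     @execution_id = @execution_id OUTPUT

-- Object type 50 = execution (server) parameter
EXEC SSISDB.catalog.set_execution_parameter_value @execution_id,
                                                  @object_type     = 50,
                                                  @parameter_name  = N'LOGGING_LEVEL',
                                                  @parameter_value = 2

EXEC SSISDB.catalog.start_execution @execution_id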

How to determine your best choice
Making a good decision is difficult in this case. When something goes wrong, you want to log everything. But when things run smoothly, it’s a waste of resources to log every event that passes by in SSIS. And how do you determine the level of logging you need, without knowing what data is in your SSISDB? If you want to analyze that, you could use a query like this:

SELECT
  EventCounts.EventName,
  EventCounts.NumberOfEvents,
  CONVERT(FLOAT,(CONVERT(FLOAT,[NumberOfEvents]) / SUM(NumberOfEvents) OVER ()) * 100) AS 'Percentage'
FROM
  (
    SELECT
      EM.event_name AS 'EventName',
      COUNT(*) AS 'NumberOfEvents'
    FROM SSISDB.catalog.event_messages AS EM
    WHERE EM.event_name IS NOT NULL
    GROUP BY EM.event_name
  ) EventCounts
ORDER BY EventCounts.EventName ASC

This shows you the types of events stored in your SSISDB, the number of events per type, and a percentage over the whole dataset. This can help you determine the logging level you need in your specific case.

But here’s the catch…
Performance logging doesn’t actually make your packages run faster… Koen sent me this blog post from Matt Masson (Blog | @mattmasson). In his blog post he explains which events are logged at each specific level. And this is what he says about the Performance level:

The Performance log level should be used when you are doing benchmarking and performance tuning for your packages. While it actually logs less messages to the [catalog].[operation_messages] view than Basic, it captures a lot more events internally to analyze the performance of the data flow components. As a result, there is a bit more overhead during execution – packages run with Basic will actually run a little faster than Performance (in this case Performance means “give me all of the performance details”, not “run my packages as fast as you can”).

Conclusion
Even though SSIS is easy to use, there are some pros and cons. The longer I work with SSIS (and that’s not on a daily basis), the more of these pitfalls I discover. And even though they can be fixed pretty quickly, it normally takes time to figure out the problem, because I’m not that familiar with the inner workings of SSIS. And even when you think you’ve found a solution for your problem, SSIS just works slightly differently than expected. I’m just glad I have friends like Koen to help me out when needed! Thanks again Koen!

SSISDB: Why bigger isn’t always better

Two weeks ago I encountered some strange issues with the SSISDB. Packages and processes started failing, and all we could find was this error message:

[Screenshot: error message in the All Executions report]
The image above is an extract of the standard SSMS “All Executions” report. You can find this by right-clicking your SSISDB, click on Reports, then Standard Reports, and choose the “All Executions” report.
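
As a side note: the same information can also be queried directly from the catalog views. A small sketch listing the most recent failed executions, assuming the catalog is named SSISDB (status 4 means “failed”):

-- Sketch: recent failed executions straight from the catalog views
SELECT execution_id,
       folder_name,
       project_name,
       package_name,
       start_time
FROM SSISDB.catalog.executions
WHERE [status] = 4
ORDER BY start_time DESC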

While the packages started failing, we couldn’t find a cause. Deploying new packages wasn’t possible either. There were no events logged in the report, in the Windows event viewer, or in the SQL Server error log. So there was no starting point to further investigate this issue.

Finding the cause
Checking the size of the SSISDB revealed a possible cause of the issue: it was 72 GB! This seemed odd, because we don’t store a lot of packages in the SSISDB. I knew the SSISDB contained some logging information, but I didn’t know how much. Checking the “Disk usage per table” report (another standard report in SSMS), the problem became a bit more obvious:

[Screenshot: Disk usage per table report for SSISDB]

So basically there were over 67 million records in the database! This shouldn’t have happened, because the log retention should be set to 14 days.
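
If you want to check this on your own system, a quick record count of the two biggest logging tables would look something like the sketch below. Note that the internal schema isn’t really meant to be queried directly, so treat this as read-only diagnostics:

-- Sketch: record counts of the two largest SSISDB logging tables
SELECT COUNT_BIG(*) AS [EventMessages]     FROM SSISDB.internal.event_messages
SELECT COUNT_BIG(*) AS [OperationMessages] FROM SSISDB.internal.operation_messages

But we quickly found the problem: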

[Screenshot: SSISDB catalog properties, Retention Period (days) set to 365]

The retention was set to 365 days. This could only mean that the SSISDB was re-created, and this setting wasn’t changed after the deployment. This resulted in a lot of extra logging data in the database. Because of the number of records, the standard “SSISDB Maintenance job” ran for over 40 minutes, instead of the maximum of 8 minutes it normally needs to clean up logging. But just setting the retention to 14 days and running the maintenance job would result in an endlessly running job. So how could you prevent that?
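
By the way, you don’t need the GUI to check the retention: it can be read from the catalog properties as well. A minimal sketch, again assuming the catalog is named SSISDB:

-- Read the current retention window (in days) of the SSIS catalog
SELECT property_name, property_value
FROM SSISDB.catalog.catalog_properties
WHERE property_name = 'RETENTION_WINDOW'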

Grabbing the bigger hammer
In most cases, grabbing a bigger hammer to solve the problem isn’t your best option. In this case, I didn’t see another way. After some searching, I found this blog post about truncating the log tables in SSISDB. Below you find a copy of the script. The credits for this script go to Ibrahim Naji (Blog | @thinknook).


CREATE PROCEDURE [internal].[cleanup_server_retention_window_truncateall]
AS  
     
SET NOCOUNT ON
     
DECLARE @enable_clean_operation bit
DECLARE @retention_window_length INT
     
DECLARE @caller_name nvarchar(256)
DECLARE @caller_sid  varbinary(85)
DECLARE @operation_id BIGINT
     
EXECUTE AS CALLER
    SET @caller_name =  SUSER_NAME()
    SET @caller_sid =   SUSER_SID()
REVERT
          
     
BEGIN TRY
    SELECT @enable_clean_operation = CONVERT(bit, property_value) 
        FROM [catalog].[catalog_properties]
        WHERE property_name = 'OPERATION_CLEANUP_ENABLED'
         
    IF @enable_clean_operation = 1
    BEGIN
        SELECT @retention_window_length = CONVERT(INT,property_value)  
            FROM [catalog].[catalog_properties]
            WHERE property_name = 'RETENTION_WINDOW'
                 
        IF @retention_window_length <= 0 
        BEGIN
            RAISERROR(27163    ,16,1,'RETENTION_WINDOW')
        END
             
        INSERT INTO [internal].[operations] (
            [operation_type],  
            [created_time], 
            [object_type],
            [object_id],
            [object_name],
            [STATUS], 
            [start_time],
            [caller_sid], 
            [caller_name]
            )
        VALUES (
            2,
            SYSDATETIMEOFFSET(),
            NULL,                     
            NULL,                     
            NULL,                     
            1,      
            SYSDATETIMEOFFSET(),
            @caller_sid,            
            @caller_name            
            ) 
        SET @operation_id = SCOPE_IDENTITY() 


        -- Remove all [internal].[executions] dependencies
        TRUNCATE TABLE [internal].[executable_statistics]
        TRUNCATE TABLE [internal].[execution_component_phases]
        TRUNCATE TABLE [internal].[execution_data_statistics]
        TRUNCATE TABLE [internal].[execution_data_taps]
        TRUNCATE TABLE [internal].[execution_parameter_values]
        TRUNCATE TABLE [internal].[execution_property_override_values]


        -- Remove all [internal].[event_message_context] dependencies
        TRUNCATE TABLE [internal].[event_message_context]

        -- Remove all non-dependent tables
        TRUNCATE TABLE [internal].[operation_os_sys_info]
        TRUNCATE TABLE [internal].[operation_permissions]
        TRUNCATE TABLE [internal].[validations]
        TRUNCATE TABLE [internal].[extended_operation_info]

        -- Deal with [internal].[event_messages] and [internal].[operation_messages]
        ALTER TABLE [internal].[event_message_context] DROP CONSTRAINT [FK_EventMessageContext_EventMessageId_EventMessages]
         
        TRUNCATE TABLE internal.event_messages
         
        ALTER TABLE [internal].[event_message_context]  WITH CHECK ADD  CONSTRAINT [FK_EventMessageContext_EventMessageId_EventMessages] FOREIGN KEY([event_message_id])
        REFERENCES [internal].[event_messages] ([event_message_id])
        ON DELETE CASCADE

        ALTER TABLE [internal].[event_messages] DROP CONSTRAINT [FK_EventMessages_OperationMessageId_OperationMessage]
         
        TRUNCATE TABLE [internal].[operation_messages]

        ALTER TABLE [internal].[event_messages]  WITH CHECK ADD  CONSTRAINT [FK_EventMessages_OperationMessageId_OperationMessage] FOREIGN KEY([event_message_id])
        REFERENCES [internal].[operation_messages] ([operation_message_id])
        ON DELETE CASCADE

        -- Deal with [internal].[executions]

        ALTER TABLE [internal].[executable_statistics] DROP CONSTRAINT [FK_ExecutableStatistics_ExecutionId_Executions]
        ALTER TABLE [internal].[execution_component_phases] DROP CONSTRAINT [FK_ExecCompPhases_ExecutionId_Executions]
        ALTER TABLE [internal].[execution_data_statistics] DROP CONSTRAINT [FK_ExecDataStat_ExecutionId_Executions]
        ALTER TABLE [internal].[execution_data_taps] DROP CONSTRAINT [FK_ExecDataTaps_ExecutionId_Executions]
        ALTER TABLE [internal].[execution_parameter_values] DROP CONSTRAINT [FK_ExecutionParameterValue_ExecutionId_Executions]
        ALTER TABLE [internal].[execution_property_override_values] DROP CONSTRAINT [FK_ExecutionPropertyOverrideValue_ExecutionId_Executions]

        TRUNCATE TABLE [internal].[executions]

        ALTER TABLE [internal].[execution_property_override_values]  WITH CHECK ADD  CONSTRAINT [FK_ExecutionPropertyOverrideValue_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE

        ALTER TABLE [internal].[execution_parameter_values]  WITH CHECK ADD  CONSTRAINT [FK_ExecutionParameterValue_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE

        ALTER TABLE [internal].[execution_data_taps]  WITH CHECK ADD  CONSTRAINT [FK_ExecDataTaps_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE

        ALTER TABLE [internal].[execution_data_statistics]  WITH CHECK ADD  CONSTRAINT [FK_ExecDataStat_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE
         
        ALTER TABLE [internal].[execution_component_phases]  WITH CHECK ADD  CONSTRAINT [FK_ExecCompPhases_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE
         
        ALTER TABLE [internal].[executable_statistics]  WITH CHECK ADD  CONSTRAINT [FK_ExecutableStatistics_ExecutionId_Executions] FOREIGN KEY([execution_id])
        REFERENCES [internal].[executions] ([execution_id])
        ON DELETE CASCADE
         

        -- Deal with [internal].[operations]
        DECLARE @deleted_ops TABLE(operation_id BIGINT, operation_type SMALLINT)

        DELETE --TOP (@delete_batch_size)
        FROM [internal].[operations] 
        OUTPUT DELETED.operation_id, DELETED.operation_type INTO @deleted_ops
        WHERE operation_id != @operation_id

             
             
        DECLARE @execution_id BIGINT
        DECLARE @sqlString              nvarchar(1024)
        DECLARE @key_name               [internal].[adt_name]
        DECLARE @certificate_name       [internal].[adt_name]
             
             
        DECLARE execution_cursor CURSOR LOCAL FOR 
            SELECT operation_id FROM @deleted_ops 
            WHERE operation_type = 200
             
        OPEN execution_cursor
        FETCH NEXT FROM execution_cursor INTO @execution_id
             
        WHILE @@FETCH_STATUS = 0
        BEGIN
            SET @key_name = 'MS_Enckey_Exec_'+CONVERT(VARCHAR,@execution_id)
            SET @certificate_name = 'MS_Cert_Exec_'+CONVERT(VARCHAR,@execution_id)
            SET @sqlString = 'IF EXISTS (SELECT name FROM sys.symmetric_keys WHERE name = ''' + @key_name +''') '
                +'DROP SYMMETRIC KEY '+ @key_name
                EXECUTE sp_executesql @sqlString
            SET @sqlString = 'IF EXISTS (select name from sys.certificates WHERE name = ''' + @certificate_name +''') '
                +'DROP CERTIFICATE '+ @certificate_name
                EXECUTE sp_executesql @sqlString
            FETCH NEXT FROM execution_cursor INTO @execution_id
        END
        CLOSE execution_cursor
        DEALLOCATE execution_cursor

        END
END TRY
BEGIN CATCH
         
         
    IF (CURSOR_STATUS('local', 'execution_cursor') = 1 
        OR CURSOR_STATUS('local', 'execution_cursor') = 0)
    BEGIN
        CLOSE execution_cursor
        DEALLOCATE execution_cursor            
    END
         
    UPDATE [internal].[operations]
        SET [STATUS] = 4,
        [end_time] = SYSDATETIMEOFFSET()
        WHERE [operation_id] = @operation_id;       
    THROW
END CATCH
     
RETURN 0

After running this on the test environment, I found that it worked as I expected. I then ran the script on production (where the problem was found in the first place), and when it finished I changed the retention from 365 to 14 days:

[Screenshot: SSISDB catalog properties, Retention Period (days) changed to 14]
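
I made the change via the GUI, but if you prefer T-SQL, the same change can be made with catalog.configure_catalog (again assuming the catalog is named SSISDB):

-- Set the catalog retention window to 14 days
EXEC SSISDB.catalog.configure_catalog @property_name  = N'RETENTION_WINDOW',
                                      @property_value = 14
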
After doing that, I ran the “SSISDB Maintenance job” (which ran fine), and an SSIS package as a test (which worked fine as well). And looking at the record counts, everything worked fine:

[Screenshot: record counts in SSISDB after the cleanup]

I’m not saying you should run this on your system; I’m just saying this worked fine on my system, and in my specific case. If you find any code on the internet, test it on a non-critical system first, before you run it in production!

Conclusion
Having error logging is a good thing. It can help you determine problems, and you can see if and how processes grow over time. But too much logging in SSIS causes performance issues, prevents you from deploying new packages, and can even cause unexpected termination of SSIS packages. One way to prevent that is by changing the history retention of your SSISDB.

T-SQL Tuesday #50 – Automation: yea or nay

T-SQL Tuesday is a recurring blog party, started by Adam Machanic (Blog | @AdamMachanic). Each month a blog hosts the party, and everyone who wants to can write a blog post about a specific subject.

This month the subject is “Automation”. If you want to read the opening post, please visit the party-starter: Hemanth D. (Blog | @SqlChow).

Being a DBA, you want to automate as many processes as you can, in order to save time that you can spend on more important things. But have you ever considered over-automating your processes?

We’re safe! What can go wrong here?
At one of the companies I worked for, they thought they had everything sorted out. Indexes were rebuilt every day or every week (depending on the database), databases and log files were shrunk, databases were checked for corruption, backups were running, etc. They felt safe, knowing that if something happened they could handle any situation SQL Server would throw at them. It would blow up in their faces eventually…

One of the first things I checked were the backups. The backup job was running, but only a few databases were actually selected for backup. And the biggest database (500+ GB), which was pretty important, was skipped because it took too long to back up. And guess what: they didn’t EVER test recovering from a backup, because of a lack of disk space and time. And there you have it: a false sense of safety!

I don’t have to tell you not to shrink your databases and log files. Everybody knows that every time you shrink your database, a kitten dies… Or an index dies… Or the soul of your database… I’m not sure which one, but take your pick. It causes (and I quote Paul Randal (Blog | @PaulRandal) on this!): “*massive* index fragmentation”. Read more about that over at Paul’s blog. Besides that, if your next query needs more space in a data or log file, you’ll see extra wait time because of the file growth.

The indexes were rebuilt every night on the important databases, and every weekend on the less-used databases. But they never checked whether the problem they had before was actually fixed by this solution.
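
Checking isn’t hard, by the way. Here is a minimal sketch of a fragmentation check you could run to verify whether the nightly rebuilds are even needed; the 30% threshold and LIMITED scan mode are my own arbitrary choices:

-- Sketch: indexes in the current database with more than 30% fragmentation
SELECT OBJECT_NAME(ips.[object_id]) AS [TableName],
       i.[name] AS [IndexName],
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
INNER JOIN sys.indexes AS i ON i.[object_id] = ips.[object_id]
                           AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC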

Also, the corruption check was run only on user databases. They had never heard of running a corruption check on system databases. The system databases were included in the backup process, but they never took the time to check whether they could restore them, or whether they were backing up an already corrupted database.
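
And a corruption check on the system databases is just as easy as on user databases. A minimal sketch:

-- Sketch: corruption check on the system databases
DBCC CHECKDB (N'master') WITH NO_INFOMSGS
DBCC CHECKDB (N'model') WITH NO_INFOMSGS
DBCC CHECKDB (N'msdb') WITH NO_INFOMSGS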

Focus on the important stuff
So instead of automating all your processes, maybe you should focus on what’s really important. You just automated your backup process. But does it run every time? Are the backups actually written to disk? Can you restore one of the backups you created?
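
A quick way to verify that backups actually happened is to ask msdb. A sketch that lists the last full backup per database; adjust the backup type if you also want differentials or log backups:

-- Sketch: last full backup per database, according to msdb
SELECT d.[name] AS [DatabaseName],
       MAX(b.backup_finish_date) AS [LastFullBackup]
FROM master.sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b ON b.database_name = d.[name]
                                 AND b.[type] = 'D' -- D = full backup
GROUP BY d.[name]
ORDER BY d.[name]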

What I’m trying to say is: you can automate tasks whenever and wherever you like, but don’t forget to test them. Once you’ve automated something, plan regular tests to verify that the automated process runs the way you expect it to. And is the end result really the result you want and expect?

Don’t reinvent the wheel
Another tip: don’t reinvent the wheel. Plenty of people have encountered the same issue, and have written or blogged about a solution. So before you build your own maintenance solution, or automate health status reports, check with your community members. Help can be found for almost every problem, but verifying that the solution works in your environment is all on you.

SQL Server Agent Jobs Schedules – A hidden evil…

One of the many beautiful features of SQL Server is the possibility to schedule a query or job within a few minutes by using SQL Server Agent jobs. Especially for people who want to run a query at a quiet moment of the day, but don’t want to stay up, or wake up at 4:00 AM.

Types of schedules
If you dig into SQL Server Agent jobs, you can create a job-specific schedule, or create a shared schedule. Both are essentially the same type of schedule, except that a shared schedule can be part of many jobs.

What type do you need?
If you create a new SQL Server Agent job, you need to figure out which type of schedule you want to use. If you want to start a number of jobs at the same time, you want to go for a shared schedule. But if you want to be able to easily move jobs in time (change the start date or start time), then a single (job-specific) schedule is what you want.

How it works
Once you’ve created a job, you can click on “Schedules”, and then either choose “New” (single schedule), or “Pick” (shared schedule):

[Screenshot: Job properties, Schedules page with the New and Pick buttons]

New schedule
If you decide to create a new schedule, you’ll get a window like this:

[Screenshot: New Job Schedule window]

In this window you can set the runtime of the schedule, the start and end date, the frequency, etc.

Pick a schedule
If you go for the “Pick” option, you’ll see a window like this:

[Screenshot: Pick Schedule window]

In this window you can select a previously created schedule, so you can, for example, run jobs in specifically created time windows.

Edit schedule
But what’s the catch? If you change a schedule, this might affect other jobs using the same schedule! This is where my issue comes in. A few weeks back a schedule was changed (we wanted to run a job 2 hours earlier every day), and suddenly several other jobs changed schedule as well. The schedule was changed without checking, and that resulted in a really strange situation with parallel running jobs that almost locked out all other processes.

So if you want to change the schedule of a job, click the “Jobs in schedule” button first. Double-check that your schedule wasn’t turned into a shared schedule by you, one of your colleagues, or a customer:

[Screenshot: Jobs in Schedule window]
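
You can also check this with T-SQL. A sketch that lists every schedule attached to more than one job, based on the msdb system tables:

-- Sketch: schedules that are shared by more than one job
SELECT s.[name] AS [ScheduleName],
       j.[name] AS [JobName]
FROM msdb.dbo.sysschedules AS s
INNER JOIN msdb.dbo.sysjobschedules AS js ON js.schedule_id = s.schedule_id
INNER JOIN msdb.dbo.sysjobs AS j ON j.job_id = js.job_id
WHERE s.schedule_id IN (SELECT schedule_id
                        FROM msdb.dbo.sysjobschedules
                        GROUP BY schedule_id
                        HAVING COUNT(*) > 1)
ORDER BY s.[name], j.[name]
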
Manage schedules on server
But that’s not all. Last week I read an article about shared job schedules that was tweeted by Pieter Vanhove (Blog | @Pieter_Vanhove). In that article I discovered that you can also right-click on “Jobs” and click on “Manage Schedules”:

This results in a window where you can see the shared schedule overview, the number of jobs per schedule, etc.:

[Screenshot: Manage Schedules window]

Conclusion
As I’ve said in previous posts: always check, double check, and triple check before you make any changes, or you’ll run into more issues than you solve with your change!