Tuesday, September 7, 2010

Monitoring Replication Latency With SQL Sentry Event Manager

To start, let me state now that this post applies to a plain vanilla transactional publication on SQL Server 2005 and higher that does not allow anonymous subscribers. That’s not to say it won’t make sense for other topologies, but I have not explored others, and some of the concepts involved don’t make sense or exist for them anyway.

I recently had a really good request come along asking if SQL Sentry can send notification when replication latency breaks a specified threshold.

I like this type of request, because it’s something that isn’t addressed “out of the box”, but I knew that I could put something together with SQL Server functionality and some tools provided by SQL Sentry. I also admittedly have a little bit of a soft spot for replication going back to early 2002.

There are actually two solutions I’ve came up with for this. They both have pros and cons, so there is really nothing wrong with implementing both.

The first, and probably simplest to setup is to use the performance counter collection facility built into SQL Sentry Event Manager (details). Microsoft has seen fit to provide us with some performance counters specifically for tracking replication latency, both for distribution and for the log reader.

The setup procedure is outlined here. The only difference being that you will use the performance counters below, and not the counters discussed in the KB article.

One of the “not so great” aspects of using this method is that some of these counters are specific to a publication/subscription pair, and if your subscribers change, you will have to frequently add and remove counters from this job in the SQL Sentry console application. If your subscriptions are fairly static, then you have no worries here.

The counters you want to add are as follows:

  • SQLServer:Replication Dist.
    • Dist:Delivery Latency
  • SQLServer:Replication Logreader
    • Logreader:Delivery Latency

For the first, you will have a counter instance per publication/subscription combination. This is the one you’ll need to keep an eye on if your adding and dropping subscriptions regularly.

Summing these two counters gives you the time it takes for a transaction to go from the publisher to the subscriber. Unfortunately, the counter values can’t be summed in SQL Sentry and compared to a threshold, and there is no counter that sums them for you, so you will have to track the thresholds separately using the “Performance Counter: Threshold Max” condition for each one. This too is a minor shortcoming of using this method.

Even with a few hurdles though, setting things up using performance counters offers some advantages. It’s not overly complicated to set up, since you can literally do everything you need to do right from the SQL Sentry console application; there is no need to add a stored procedures to your publication database, and you can add the counters from any server to any watched job. That means that you don’t technically have to use a SQL Sentry license to monitor each server that has a publication to monitor. You would only need a license for the server running the job that has the counters attached.

The other method I worked out is a bit more technical, but it provides a way to alert on the latency for all subscriptions at once for a given publication, and bases the threshold on the total latency instead of the two separate values.

Here we’re going to add a stored procedure to the publication database, then create a SQL Server Agent job to run the procedure on schedule.

I will list the code first, then explain what it is doing:

-- This proc must be created in the publication database
-- use CTRL + SHIFT + M to replace variables
use <DatabaseName, varchar(35), Mydatabase>
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[<StoredProcedureName, varchar(50), MyProc>]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[<StoredProcedureName, varchar(50), MyProc>]
/* Prototypes:
exec <StoredProcedureName, varchar(50), MyProc> 'MyPublication', 8, 1, 'myemail@mycompany.com'
exec <StoredProcedureName, varchar(50), MyProc> @pub = 'MyPublication', @emailResults = 1, @emailRecipients='me@mycompany.com;you@yourcompany.com'
exec <StoredProcedureName, varchar(50), MyProc> 'MyPublication'
NOTE: Email functionality works on SQL Server 2005 instances or greater watched by SQL Sentry.
    create proc <StoredProcedureName, varchar(50), MyProc>
        @pub sysname,
        @maxLatencySeconds int = 5,
        @emailResults bit = 0,
        @emailRecipients nvarchar(255) = '',
        @dropTable bit = 0
set nocount on
    -- table def matches sp_helptracertokenhistory output 
    if object_id('tempdb.dbo.##repl_latency_stats') is NULL
        create table ##repl_latency_stats
            ( distributor_latency int NULL
            , subscriber varchar(1000) NULL
            , subscriber_db varchar(1000) NULL
            , subscriber_latency int NULL
            , overall_latency int NULL );
    --clean out the temp table
    truncate table ##repl_latency_stats
    --send the token
    declare @tokenHandle bigint
    exec sp_postTracerToken @publication = @pub, @tracer_token_id = @tokenHandle output;
    --just sit until the max latency * 2 runs out so the token has time to get there
    declare @maxWait int = @maxLatencySeconds * 2;
    if @maxWait > 30
        set @maxWait = 30;
    declare @waitfor varchar(10) = '00:00:' + CAST((@maxWait) as varchar(2));
    waitfor delay @waitfor;
    --get the results the results
    insert ##repl_latency_stats (distributor_latency, subscriber, subscriber_db, subscriber_latency, overall_latency)            
    exec sp_helpTracerTokenHistory @pub, @tokenHandle;
    --get rid of history
    declare @cutoff datetime = getdate();
    exec sp_deleteTracerTokenHistory @publication = @pub, @cutoff_date = @cutoff;
    --process results    
    if (@emailResults > 0 AND (select COUNT(*) from ##repl_latency_stats where overall_latency > @maxLatencySeconds) > 0)
        declare @subject varchar(100)
        declare @message varchar(100)
        declare @query varchar(500)
            @subject = 'Replication Latency Threshold Exceeded',
            @message = 'Replication latency for the following subscriptions has exceeded ' + CAST(@maxLatencySeconds as nvarchar(2)) + ' seconds:' + char(10) + char(13),
            @query =
                'set nocount on
                    cast(subscriber as varchar(25)) as [Subscription Server],
                    cast(subscriber_db as varchar(40)) as [Subscription Database],
                    cast(distributor_latency as varchar(15)) as [Distributor Latency],
                    cast(subscriber_latency as varchar(15)) as [Subscriber Latency],
                    cast(overall_latency as varchar(15)) as [Latency Overall]
                where overall_latency > ' + CAST(@maxLatencySeconds as nvarchar(2));
        exec msdb.dbo.sp_sentry_dbmail_20
            @recipients = @emailRecipients
        ,    @subject = @subject
        ,    @body = @message
        ,    @query = @query
                subscriber as [Subscription Server],
                subscriber_db as [Subscription Database],
                distributor_latency as [Distributor Latency: Seconds],
                subscriber_latency as [Subscriber Latency: Seconds],
                overall_latency as [Latency Overall: Seconds]
    IF @dropTable = 1
        drop table ##repl_latency_stats;

First off, I wrote the procedure using replacement variables, so you can name it whatever you want easily by using CTRL + SHIFT + M in SSMS. That is important to point out, since it won’t run until that is done.

The procedure takes the name of the publication to monitor, the latency threshold to use, 1 or 0 for whether to send an email alert if the threshold is broken, a list of standard format email addresses (separated by ‘;’), and 1 or 0 for whether or not to drop the temp table, which you generally won’t want to do, since it needs to be there when the email alerts are picked up.

We’re using functionality that was added to SQL server 2005 called a tracer token. Basically we’re doing the same thing a plumber does when he drops a microchip into your plumbing, and tracks it through to find a clog, only we’re just seeing how long it takes to get from the publication to the subscriber.

The procedure creates the token, then waits long enough for it to make it through and drops the results into a global temp table.

If you’ve specified to email alerts about breaking the threshold, those results are added to the SQL Sentry email Queue for this server (details), which is why the server needs to be monitored by SQL Sentry Event Manager. The alert will be delivered by the SQL Sentry server service, including the current contents of the temp table that break the runtime threshold, during the next message monitor poll.

Keep in mind that the threshold in this case is on the total latency, which is in seconds. Also, if the tracer token doesn’t make it to the subscriber within 30 seconds, you won’t get any results. This can be increased easily by changing some values just above the “WAITFOR DELAY” call, but 30 seconds should already be well beyond what is needed to get full results.

As always, anyone interested in setting this up that wants more information can post on forum.sqlsentry.net or email support at sqlsentry.net.


No comments:

Post a Comment