Friday, August 17, 2012

My Perspective: The Top 5 Most Common SQL Server Performance Problems

What Prompted This Post?

As lots of you know already, Kevin Kline (b|t) has joined us here at SQL Sentry. I’ve followed Kevin myself for a good while, so I’m really excited to have him here with us.

One of the first things Kevin asked me for was a list of the top 5 technical issues we see while providing customer service for SQL Sentry. For our Client Services team, technical support is far more than just supporting our own software products or managing licenses. Everyone on the team is deeply trained in SQL Server, Windows, and Microsoft .NET. In fact, any member of our Client Services team should be able to step into the role of a Windows sysadmin, SQL Server DBA, or .NET application developer with very little adjustment.

When presented with a technical issue or question, we try to see past the surface, and read between the lines. Many technical issues or questions are not rooted in what seems apparent on the surface, and the Client Services team is expected to work to resolve the root problem rather than simply addressing the immediate or obvious symptom.

As a result of this expectation, we do come across a variety of technical issues with both Windows and SQL Server, and I thought that filling Kevin’s request for a "Top 5" would make for an interesting blog post. I’m starting with 5 SQL Server Performance issues, and I’ll have another post later on 5 general technical issues that may or may not have anything to do with SQL Server.

I’ll make this a countdown from 5 to 1, with 1 being the most common technical problem we see.

Number 5: Tempdb PAGELATCH Contention

Jonathan Kehayias (b|t) covers this topic very well in a great article on Optimizing tempdb configuration with SQL Server 2012 Extended Events, so I’m not going to attempt to go into it very deeply here, but I will talk about my experience with it.

I’m starting to see this more and more. It is usually with a system making heavy use of tempdb for some type of ETL process. This is especially common if it is an ongoing "real-time" style ETL process. I’m not saying data loads are the cause of this by the way; I’m just relaying my observations.

The symptoms of this can vary, but some things are always the same: high PAGELATCH waits in tempdb, and poor performance recorded for processes using tempdb. I’ll typically follow the waits to Top SQL in Performance Advisor and see lots of queries that use temp tables listed there. These queries usually run in milliseconds and should never be counted among the "Top SQL" for the server. This can leave people feeling like these queries are a large part of the problem, but that is not necessarily the case at all; the queries are the victims of the real problem.

Once I suspect this to be the case, I will usually jump to the Disk Activity tab in Performance Advisor to see how tempdb is configured. Most of the time I see the same thing: a busy tempdb with a single data file defined. From there I’ll usually recommend reconfiguring tempdb, and direct them to a resource like Jonathan's article for more information.
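For anyone wanting to check their own server, a quick sketch of both the diagnosis and the common fix might look like the following. The file path and sizes are placeholders; size all tempdb data files equally on your own system:

```sql
-- Look for live PAGELATCH waits on tempdb pages (database_id = 2).
-- Resource descriptions of 2:1:1, 2:1:2, and 2:1:3 point at the PFS,
-- GAM, and SGAM allocation pages respectively -- classic contention.
SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE N'PAGELATCH%'
  AND resource_description LIKE N'2:%';

-- A common mitigation: add equally sized data files so allocations
-- round-robin across them. Path, name, and sizes here are hypothetical.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\TempDB\tempdb2.ndf',
          SIZE = 4GB,
          FILEGROWTH = 512MB);
```

How many files to add depends on core count and observed contention; Jonathan's article linked above covers the details.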

Number 4: Expecting Auto Update Statistics to Keep Statistics Updated

The problem here is that the threshold for triggering an auto statistics update ends up being effectively the same in most cases, even for a very large table. Without going into a very deep explanation, the threshold is roughly 20% of the rows in the table (500 rows plus 20% of the row count, to be precise), so on a really big table it takes a lot of data change to trigger an update. Kevin Kline has a nice, easy-to-follow explanation of this here as well.

The reason this makes the list is that DBAs seem genuinely surprised to find out that the auto update isn’t taking care of things the way the name implies. Many DBAs also believe statistics should be handled by their maintenance job, but on closer inspection the maintenance is doing index reorgs most of the time, and a reorg won’t update the statistics (though a rebuild will). I also want to note here that if you are using the Fragmentation Manager feature in SQL Sentry 7.0 and higher, you can see a running history of when your indexes were reorganized rather than rebuilt. This can help you decide whether the problem you’re seeing could be related to the auto update not happening.

The lesson here is really to keep an eye on statistics, and make sure they’re updated regularly, especially on large tables, which are becoming more and more common as time goes on. Another option here can be to use trace flag 2371 to actually change the formula used to trigger the update. The nicest explanation of this option I have found is at this blog post by Juergen Thomas.
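To keep that eye on statistics, something like the sketch below can help. The table name is hypothetical, and trace flag 2371 (which makes the update threshold dynamic on large tables) should be tested before enabling it globally:

```sql
-- When was each statistic last updated? Sort oldest first to spot stale ones.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
JOIN sys.objects AS o ON o.object_id = s.object_id
WHERE o.is_ms_shipped = 0
ORDER BY last_updated;

-- Refresh statistics on a specific large table as part of scheduled maintenance.
-- dbo.BigTable is a placeholder for one of your own large tables.
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;

-- Or opt in to the dynamic threshold formula via trace flag 2371.
DBCC TRACEON (2371, -1);
```

Scheduling the `UPDATE STATISTICS` work for your largest, most-churned tables is usually the safer first step; the trace flag changes behavior instance-wide.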

Number 3: The CXPACKET Wait Type

I would say that this is hands down the single most common wait type I see on larger SQL Server systems when someone asks me to look into query performance with them.

There is a lot of information out there on how to deal with this, but sadly I still see a lot of people make the initial assumption that the problem should be solved by setting MAXDOP to 1, either for the query or for the entire server. More often than not the problem can be handled by proper indexing or statistics maintenance. It could also be that the plan cached for the query is simply not optimal, in which case you can mark it for recompile using sp_recompile, set RECOMPILE at the query level, or just evict the plan using DBCC FREEPROCCACHE with a plan handle. It is best to exhaust these options before deciding to change MAXDOP to 1, because you could be throwing away a lot of processing power without realizing it.

Paul Randal (b|t) has a great survey on his blog here that seems to support what I’m used to seeing as well. In fact, he’s the one who first taught me that MAXDOP 1 is not necessarily the answer to this.
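For reference, here is what those "try first" options look like in T-SQL. The procedure and table names are hypothetical, and the plan handle is deliberately elided; you would pull a real one from sys.dm_exec_query_stats:

```sql
-- Mark a procedure so its plan is recompiled on the next execution.
EXEC sp_recompile N'dbo.usp_SlowProc';

-- Or evict one specific cached plan by its handle (found via
-- sys.dm_exec_query_stats), rather than clearing the whole cache:
-- DBCC FREEPROCCACHE (<plan_handle>);

-- As a last resort, cap parallelism for the one problem query
-- instead of setting MAXDOP 1 server-wide.
SELECT OrderId, SUM(Amount) AS Total
FROM dbo.OrderDetail          -- hypothetical table
GROUP BY OrderId
OPTION (MAXDOP 4);
```

Note that even the query-level hint keeps some parallelism available, which is usually far less drastic than an instance-wide MAXDOP 1.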

Number 2: Misunderstood “Timeout Expired Prior to the Completion of…”

This one is huge. Outside of some very edge-case behavior, there are two basic types of timeouts you *might* deal with for SQL Server: connection timeouts and operation (or query) timeouts. In both cases these are values set by the client connecting to SQL Server. There is a remote query timeout setting on the server side, but that is the edge case I mentioned, and I’ll leave it for another discussion.

I’m going to focus on operation timeouts, since they are the most common. Operation timeout errors from various software tools might be the most misunderstood situation I come across. The cause really boils down to one simple thing: the client executing the command has set a maximum amount of time that it will wait for the command to complete. If this maximum is reached prior to completion, the command is aborted and an error is raised by the client.

Many times a timeout error will induce panic mode, because the error can look a bit intimidating. The reality, though, is that this is not much different from hitting the stop button in SSMS because a query was taking too long. In fact, it shows up exactly the same way in a Profiler trace, with Error = 2 (Aborted).

So, what does a timeout like this really tell us? It tells us that queries are taking longer than expected. We should go into “performance tuning” mode rather than “something is broken” mode. The error information from the client is really just some good information on where you might start to focus your tuning efforts.
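If you want to see these aborts happening on the server side, an Extended Events session on the attention event (which fires for both client timeouts and manual cancels) is a lightweight way to do it. This is a minimal sketch; the session name is arbitrary:

```sql
-- Capture client-side aborts (timeouts and cancels) as attention events.
CREATE EVENT SESSION [CaptureAttentions] ON SERVER
ADD EVENT sqlserver.attention (
    ACTION (sqlserver.sql_text,
            sqlserver.client_app_name,
            sqlserver.session_id))
ADD TARGET package0.ring_buffer;  -- in-memory target; swap for event_file to persist

ALTER EVENT SESSION [CaptureAttentions] ON SERVER STATE = START;
```

The captured sql_text tells you exactly which queries are being cut off, which is a good starting list for that "performance tuning" mode.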

If you receive timeout errors from the SQL Sentry monitoring service, and one of the servers you are monitoring is the source, this is not telling you that SQL Sentry is broken. This is SQL Sentry telling you that this server is experiencing performance issues. Again, it is time for “performance tuning” mode. These errors could be easily consumed internally, and retried later, but this would be doing our customers a huge disservice. We believe that you should know about *any* potential problem on your monitored server, even if it is SQL Sentry encountering the problem.

Incidentally, this is true for SQL Sentry just as it is for any other system that uses an RDBMS for a repository: your SQL Sentry database needs some TLC now and again. Without it you may indeed experience some timeouts from your SQL Sentry client. We spend a lot of time tuning our queries for performance before they ever make it out the door, but proper maintenance will ensure they continue to run as intended.

Number 1: Memory Pressure

This is the big one. As soon as Kevin mentioned wanting this list it’s the first thing that popped into my head. Not only because I see it so often, but also because it is so often mistaken for poor disk performance.

There are lots of caches in SQL Server, but the most well-known is the data cache (aka buffer pool). The easiest way to describe the data cache is that it is the data stored in memory, rather than persisted to disk. Being able to store lots of data in memory long term is desirable because working with data in memory is generally much quicker than having to perform physical IOs.

I could turn this post into a very long discussion on memory pressure in SQL Server at this point, but I promise I will try to avoid that. There is already a ton of information available on this subject, and that is not really the intent of this post. What I will say is that, usually, memory pressure manifests as a few different symptoms. When viewed individually, some of these symptoms can lead you to incorrect, and sometimes costly, conclusions.

The two misleading symptoms are that you may start to see higher than normal latency across the disk subsystem, and you may start to see abnormally high waits related to disk activity. If you look at nothing but these two symptoms, you may come to the conclusion that you need to work on your disk system.

This is why being presented with all relevant metrics on one dashboard is so important. You have to look at the bigger picture, and having the memory-related data available along with the disk activity and waits helps to paint a clearer picture of what is really going on.

Typically what I’ll see (along with the disk waits and disk latency) is a PLE (Page Life Expectancy) that is fairly low for that server. I describe it this way because what is good or bad for this value really “depends”: the larger your buffer cache is, the higher your “critical” threshold for PLE will be, because the more data there is to churn in and out of the buffer, the worse off you will be when the churn actually happens. Another consideration is NUMA. The way the PLE counter is calculated can make this value alone very misleading when multiple NUMA nodes are involved, as described by Paul Randal in his blog post “Page Life Expectancy isn't what you think...”. Luckily, in SQL Sentry 7.0 and higher you can actually see PLE for the individual NUMA nodes in history mode, which makes this a bit less of a problem.

I’ll usually also see consistently higher lazy writer activity and SQL Server page faults (SQL Server going to disk). Sometimes I’ll see what I call “buffer tearing”: the data buffer rising and falling frequently, creating a jagged (or torn) edge on the history chart in Performance Advisor. Finally, I may also see an abnormally large plan cache reducing the memory available for the data cache.
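If you want to eyeball a couple of these symptoms yourself without a dashboard, the performance counter DMV exposes both PLE (overall and per NUMA node) and lazy writer activity. A minimal sketch:

```sql
-- PLE: the Buffer Manager row is the instance-wide value; on NUMA hardware,
-- each Buffer Node row is the per-node value Paul Randal's post discusses.
SELECT [object_name], instance_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy';

-- Lazy writes/sec is a cumulative counter; sample it twice and diff
-- the values to get a rate. Consistently climbing fast = memory pressure.
SELECT cntr_value AS lazy_writes_total
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Lazy writes/sec';
```

Low per-node PLE combined with sustained lazy writer activity points at memory pressure rather than a disk problem, which is exactly the distinction this section is about.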

All of these things together spell memory pressure, and there are various ways to deal with them, but the important thing to note is that this is not a disk issue. It’s not saying that your disk system is necessarily wonderful either, but I am saying I wouldn’t call up your SAN guy and order a bunch of new hardware based on this situation. Once you get the memory pressure situation under control, SQL Server will not need to go to disk as much, and the few symptoms related to disk may disappear entirely!

The moral here is really to always consider the full picture of performance, because looking at one thing out of context could severely limit your options for a solution.

Honorable Mention: SQL Server Agent History Retention Settings Unlimited

We see this enough to include it in this list, and I think anyone that uses SQL Server Agent should be aware of it.

In SQL Server Agent Properties, under History, you can adjust retention settings.

[Image: SQL Server Agent Properties – History page]

For some reason, I’ve seen quite a few people set this to unlimited by unchecking both checkboxes. If you do this, and you use Agent jobs frequently, eventually you’re going to run into problems with job history in MSDB, because these tables aren’t really indexed very well. The settings I’m using above are generally fine for most cases, and if you’re using SQL Sentry Event Manager, you’re keeping this information in the SQL Sentry database anyway, so retaining it here is just redundant.
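If you prefer scripting over the dialog, the same limits can be set (and old history pruned) from T-SQL. Note that sp_set_sqlagent_properties is an undocumented procedure that the dialog itself calls under the covers, so treat this as a sketch; the row counts and cutoff date are examples only:

```sql
-- Cap total job history rows and rows per job (the two checkboxes above).
EXEC msdb.dbo.sp_set_sqlagent_properties
     @jobhistory_max_rows         = 1000,
     @jobhistory_max_rows_per_job = 100;

-- One-time cleanup: purge history older than a cutoff date.
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '2012-07-01';
```

The purge is also worth scheduling as its own Agent job if history has already grown large, since deleting millions of rows from the poorly indexed msdb tables in one pass can be painful.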

Conclusion

So there are my (current) top 5 most common SQL Server performance issues/topics. For #4 and #5, I actually had to run some numbers to find out what they were, but for the top three, I knew without having to consider it much at all.

Thanks for reading!

4 comments:

  1. The tongue-in-cheek response:

    1) End users writing poor queries
    2) Power users writing poor queries
    3) Junior developers writing poor queries
    4) Senior developers writing poor queries
    5) Software architects writing poor queries.

    ReplyDelete
    Replies
    1. You know, I started with that, but I was having trouble filling a page ;)

      Delete
  2. Another quick note on client timeouts.

    The knee jerk reaction is very frequently "can we increase the timeout". The answer is no. That query is performing poorly. It is a big ugly troll eating up time, and the answer is not to feed the troll more time. You want to take that troll to the gym, and get him running faster by tuning him instead!

    ReplyDelete
  3. You bring up an excellent point about the regular system process histories, like those in the SQL Agent. I've also seen that happen with job history and job step history in MSDB where that database grows to enormous size too.

    Great post, btw!

    -Kevin

    ReplyDelete