Monday, July 22, 2013

My Perspective: The Top 5 Most Common Windows Problems That Might Prevent Monitoring SQL Server

Introduction

It’s been quite some time since I promised that I would write this list, so I’ll get to it without a lot of needless chat.

We all know that it is very important to monitor events and performance on SQL Server, but many times our well meaning paths are blocked by the host operating system itself. This is a list of the most common issues I run into here at SQL Sentry when troubleshooting issues that keep folks from being able to fully monitor their database server.

Note that everything on this list has something to do with a native piece in Windows, or drivers written for Windows. There are many things that can reduce your ability to monitor a server, but this is a Windows subsystem only list. The basic idea is that these are things outside of SQL Server that you may want to look at if you’re having trouble collecting counters or other valuable system information.

Number 5: Cross Domain Security

In a lot of cases, we’re trying to access servers in a different domain than our user account belongs to. I run across this frequently when a DMZ type configuration is involved. Typically SEs will not set up any sort of trust between the corporate domain and the internal domain of a DMZ. I won’t go into design principles of different network topologies or security models here, but suffice to say that it is for a very good reason.

If you can’t access performance counters or WMI because of this, the answer is pretty clear. You need to gain access to the other domain in some way.

The two most common recommendations I usually provide are to use a “pass-through” account, or to set up a leg of your monitoring environment inside the other domain.

By “pass-through” account I am referring to the age old act of having a user account with the same login name and password as yours that lives on the server you’re trying to access. Windows has allowed this for as long as I can remember, but I also wouldn’t trust that it will always be there. Newer concepts like UAC also complicate things when using this method, so I usually recommend moving into the other domain over pass-through.

Number 4: WMI Repository Corruption

In my opinion, the whole WMI stack is a bit of a train wreck for something that is so vital when it comes to system management and monitoring. Something as simple as a provider delivered by a new NIC driver or disk controller can corrupt the repository to the point that it can’t be used. I don’t see this as often since Windows Server 2008, but it does still happen.

It is also difficult to determine that you have corruption because you will generally just get strange errors that don’t mean a lot just by reading them. If you suspect issues with the WMI repository, there are three tools you can use to help ferret it out.

  1. WBEMTest
  2. WMIDiag/winmgmt
  3. WMI Tracing

A lot of times, if the repository is not actually corrupt, it may be just a bad provider that can be fixed by updating to the latest driver for one device or another, but if you have indeed got a corrupt repository you may need to rebuild it. You can find instructions for that here.

Number 3: Strict Firewall Rules

Remote registry is needed to access performance counters on a remote Windows server. Remote registry works over SMB. SMB uses RPC. Therefore, besides having the Remote Registry Service running, you need to have the ports needed for RPC open to be able to read performance counters remotely. This can get “interesting” when firewalls are involved, because RPC uses dynamic ports. No firewall admin is going to let you open the default dynamic port range for MSRPC through the firewall, which is different depending on what version of Windows you’re dealing with, but is quite large regardless.

If you’re interested in this topic, here are a couple of good reads on the subject:

  1. How RPC Works
  2. Setting Windows Dynamic Port Range

So, what needs to happen here is that all three parts (firewall, host and target server) need to be in alignment with what range of dynamic ports can be used. If they are not in alignment you can easily have RPC that just does not work at all, or that works sometimes, and fails when you least expect it.

If you know a firewall is involved, you really need to do some planning and get your SEs involved in your monitoring effort to ensure that everything in that pipeline gets configured correctly for you.

Number 2: Name Resolution Issues

This happens all the time. You can’t connect to WMI or Perfmon, but you know everything is configured correctly, and there are no network issues. You have all the rights you need, but something is still not working.

Open up a command prompt and run a ping on your host name or perform an nslookup. Better yet, run a tracert to ensure that you are not only getting to a server, but to the right server.

Countless times I’ve beat my head on my desk wondering what the heck is wrong, and when I finally tried using only the IP address everything was fine.

Sometimes DNS entries are changed without everyone involved being told. Sometimes the IP address is changed and not everyone is told. Sometimes there are even duplicate entries. Regardless of what caused it, the issue can be resolved by just fixing DNS, using the correct server name, using the IP directly, or sometimes you can just run “ipconfig /flushdns” from a command prompt to refresh your local DNS cache.

Number 1: PerfLib Corruption

Without a doubt, the number one problem I see with monitoring Windows performance counters is when the counter repository is corrupted.

The performance counter system in Windows can be very touchy. Some of the things that I have seen cause corruption in PerfLib counters are Windows Updates, SQL Server Service Packs and/or CUs, Software Installations, Cluster Failovers and Power Outages. There really seems to be not much rhyme or reason to what can bring down the performance counters we rely on so much to be able to tell what resources we’re in short supply of on our servers.

This happens so much in fact, that we have a frequently visited Q&A thread that provides instructions on rebuilding your counter library in whole or in part when the need arises. You can find that here.

Honorable Mention: TCP Offloading

I wanted to include this one, because it has really bitten me a few times. There are some settings that allow certain functions to be offloaded to your NIC to save CPU cycles. There have been more than a few cases where having these enabled has caused some really crazy behavior. For example, in one case the user was getting timeouts from the client, but there were never any queries running long enough on the server to cause the client to timeout. After an embarrassingly long time, I came across this blog post. It wasn’t exactly my problem, but it was sort of similar. After some experimenting we discovered that disabling TCP Large Send Offloading solved the timeout problem.

If you are having dropped connections, strange timeouts, or slow responses that make absolutely no sense, this information might be for you.

Conclusion

So there we have my list of 5, or 6, very common issues related to Windows itself (or drivers written for Windows) that can really ruin your day if you need to monitor SQL Server performance or events from outside of SQL Server. I hope that someone can learn from my experiences and really save some time if they run into any of these.

Until next time,
-JRH

No comments:

Post a Comment