Detecting Storage I/O Problems in SQL Server
So here's a scenario. You have a client that has implemented SAN storage for your SQL Servers (could be virtual or physical). Your manager tells you that ever since the SAN team got involved things are running slower! And He's sure it's a SAN problem.
However, the SAN guys aren't going to do anything unless you provide them with some sort of proof. I'm the type that likes to compound common sense with actual metrics. I'm also the type that likes to work together to resolve the issue for our customers rather than play the blame game. So the goal is get the SAN team on board with the investigation.
So, where do we start?
Perfmon Counters:
I like to keep it simple at first and use Perfmon. Why? Most people who work in a Windows environment will understand these metrics. It's like the universal language that can bridge communication problems.
It's a great way to start the conversation as you move into more...advanced metrics (like using DMOs or DMVs.).
So here are a few counters that I like to use (as always, it's good to keep a baseline of performance, so that you have something to compare it to). Under the PhysicalDisk(Data and Log disk) object:
Avg. Disk Sec/Read & Avg. Disk Sec/Write: In milliseconds, these counters track the average time it takes to read/write to and from disk. Ideally, we want this to be below 10ms (but always have a baseline for more accurate results). If it's taking a lot longer, explain to your SAN team what this metric is and see if they have any hardware configurations that can be adjusted.
implement their plan, and revisit these metrics.
That one is the most important to me, other perfmon counters include:
%Disk Time: A good metric to get the overall load of the disk. It won't help us identify bottlenecks, but it's good to include and compare as you tune things.
Also look into Disk Transfer/Sec and Disk Bytes/Sec.
Now let's move on to SQL Server specific tools:
DMO/DMVs:
sys.dm_io_virtual_file_stats: this probably the most useful to query IO within SQL Server. I'm not going to reinvent the wheel here, so see Paul Randal's break down here where you can view aggregate information.
Now on a server with known I/O issues,it's good to periodically capture the information and store it. The reason being is that this information is a cumulative since the last SQL Server service restart. The longer the up time, the harder it is to see the spikes. Also, DBCC checks are quite intensive and they get involved in the calculations here.
But this can help you determine which data files are using the most I/O. So you have a place to start tuning.
The io_Stall_read_ms and the io_stall_write_ms show's the time that users are waiting on disk I/O. You can then cross reference this information with what's in the sys.dm_os_wait_stats to bring around a full argument with metrics to show that disk IO is causing a bottleneck.