
Diagnosing Disk Performance Issues

Disk performance issues can be hard to track down, yet they can cause a wide variety of problems.  The disk performance counters available in Windows are numerous, and being able to select the right counters for a given situation is a valuable troubleshooting skill.  Here, we’ll review two basic scenarios – measuring overall disk performance and determining whether the disks are a bottleneck.

Measuring Disk Performance

When it comes to disk performance, there are two important considerations: IOPS and byte throughput.  IOPS is the raw number of disk operations performed per second.  Byte throughput is the effective bandwidth the disk is achieving, usually expressed in MB/s.  The two are closely related – throughput is roughly IOPS multiplied by the average IO size, so a disk that can sustain more IOPS can also deliver more throughput.
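To make that relationship concrete, here is a minimal sketch (Python, purely illustrative – not something perfmon or diskspd produces) showing how the same IOPS figure translates into very different throughput depending on IO size:

```python
# Rough rule of thumb: throughput (bytes/sec) ~= IOPS x average IO size.
def throughput_mb_per_sec(iops, avg_io_size_bytes):
    """Estimate throughput in MB/s from IOPS and average IO size."""
    return iops * avg_io_size_bytes / (1024 * 1024)

# The same IOPS number means very different throughput at different IO sizes:
print(throughput_mb_per_sec(3600, 8 * 1024))   # 8 KB IOs  -> ~28 MB/s
print(throughput_mb_per_sec(3600, 64 * 1024))  # 64 KB IOs -> ~225 MB/s
```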

These can be measured in perfmon with the following counters:

  • Disk Transfers/sec
    • Total number of IOPS.  This should be about equal to Disk Reads/sec + Disk Writes/sec
  • Disk Reads/sec
    • Disk read operations per second (IOPS which are read operations)
  • Disk Writes/sec
    • Disk write operations per second (IOPS which are write operations)
  • Disk Bytes/sec
    • Total disk throughput per second.  This should be about equal to Disk Read Bytes/sec + Disk Write Bytes/sec
  • Disk Read Bytes/sec
    • Disk read throughput per second
  • Disk Write Bytes/sec
    • Disk write throughput per second

These performance counters are available in both the LogicalDisk and PhysicalDisk categories.  In a standard setup, with a 1:1 disk-partition mapping, these would provide the same results.  However, if you have a more advanced setup with storage pools, spanned disks, or multiple partitions on a single disk, you would need to choose the correct category for the part of the stack you are measuring.
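If you prefer to capture these numbers from a script instead of the perfmon UI, the sketch below approximates the IOPS and throughput counters by sampling the system-wide disk statistics twice and diffing them.  It uses the third-party psutil package (an assumption on my part – the tests in this article were measured with perfmon itself), whose counters are roughly at the PhysicalDisk level:

```python
import time
import psutil  # pip install psutil

INTERVAL = 1.0  # seconds, matching perfmon's default sample interval

# Take two snapshots of the cumulative disk counters and diff them.
before = psutil.disk_io_counters()
time.sleep(INTERVAL)
after = psutil.disk_io_counters()

read_iops = (after.read_count - before.read_count) / INTERVAL
write_iops = (after.write_count - before.write_count) / INTERVAL
read_mbps = (after.read_bytes - before.read_bytes) / INTERVAL / (1024 * 1024)
write_mbps = (after.write_bytes - before.write_bytes) / INTERVAL / (1024 * 1024)

print(f"Disk Transfers/sec ~ {read_iops + write_iops:,.0f} "
      f"(reads {read_iops:,.0f}, writes {write_iops:,.0f})")
print(f"Disk Bytes/sec     ~ {read_mbps + write_mbps:.1f} MB/s "
      f"(reads {read_mbps:.1f} MB/s, writes {write_mbps:.1f} MB/s)")
```

Passing perdisk=True to psutil.disk_io_counters() breaks the numbers out per physical disk instead of aggregating them.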

disk-perf-1.png

Here are the results on a test VM.  In this test, diskspd was used to simulate an average mixed read/write workload.  The results show the following:

  • 3,610 IOPS
    • 2,872 read IOPS
    • 737 write IOPS
  • 17.1 MB/s total throughput
    • 11.2 MB/s read throughput
    • 5.9 MB/s write throughput

In this case, we’re seeing a decent number of IOPS with fairly low throughput.  The expected results vary greatly depending on the underlying storage and the type of workload that is running.  In any case, you can use these counters to get an idea of how a disk is performing under real-world usage.
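A quick back-of-the-envelope check on the numbers above shows why the throughput looks low relative to the IOPS – the workload is issuing small IOs:

```python
# Derived from the counters above: average IO size = throughput / IOPS.
iops = 3610
throughput_mb_per_sec = 17.1

avg_io_kb = throughput_mb_per_sec * 1024 / iops
print(f"Average IO size ~ {avg_io_kb:.1f} KB per operation")  # ~4.9 KB
```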

Disk Bottlenecks

Determining whether storage is a performance bottleneck relies on a different set of counters than the above.  Instead of looking at IOPS and throughput, you need to look at latency and queue length.  Latency is the amount of time it takes to get a piece of requested data back from the disk and is measured in milliseconds (ms).  Queue length refers to the number of outstanding IO requests that are waiting to be sent to the disk.  This is measured as an absolute number of requests.

The specific perfmon counters are:

  • Avg. Disk sec/Transfer
    • The average number of seconds it takes to get a response from the disk.  This is the total latency.
  • Avg. Disk sec/Read
    • The average number of seconds it takes to get a response from the disk for read operations.  This is read latency.
  • Avg. Disk sec/Write
    • The average number of seconds it takes to get a response from the disk for write operations.  This is write latency.
  • Current Disk Queue Length
    • The current number of IO requests in the queue waiting to be sent to the storage system.
  • Avg. Disk Read Queue Length
    • The average number of read IO requests in the queue waiting to be sent to the storage system.  The average is taken over the perfmon sample interval (default of 1 second)
  • Avg. Disk Write Queue Length
    • The average number of write IO requests in the queue waiting to be sent to the storage system.  The average is taken over the perfmon sample interval (default of 1 second)
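As with the throughput counters, you can approximate the latency counters from a script if needed.  The sketch below again uses the third-party psutil package (an assumption – the results in this article came from perfmon); it derives per-operation latency from the cumulative IO times psutil reports:

```python
import time
import psutil  # pip install psutil

INTERVAL = 1.0  # seconds

before = psutil.disk_io_counters()
time.sleep(INTERVAL)
after = psutil.disk_io_counters()

reads = after.read_count - before.read_count
writes = after.write_count - before.write_count

# read_time/write_time are cumulative milliseconds spent servicing IO, so
# time spent during the interval divided by operations in the interval gives
# the average latency per IO, analogous to Avg. Disk sec/Read and sec/Write.
read_latency_ms = (after.read_time - before.read_time) / reads if reads else 0.0
write_latency_ms = (after.write_time - before.write_time) / writes if writes else 0.0

print(f"Read latency  ~ {read_latency_ms:.1f} ms")
print(f"Write latency ~ {write_latency_ms:.1f} ms")
```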

disk-perf-2.png

Here are the results on a test VM.  In this test, diskspd was used to simulate an IO-intensive read/write workload.  Here is what the test shows:

  • Total disk latency: 42 ms (perfmon reports this counter in seconds, so 0.042 seconds equals 42 ms)
    • Read latency: 5 ms
    • Write latency: 80 ms
  • Total disk queue: 48
    • Read queue: 2.7
    • Write queue: 45

These results show that the disk is clearly a bottleneck and underperforming for the workload.  Both the write latency and write queue are very high.  If this were a real environment, we would be digging deeper into the storage to see where the issue is.  It could be that there’s a problem on the storage side (like a bad drive or a misconfiguration), or that the storage is simply too slow to handle the workload.

Generally speaking, these counters can be interpreted using the following guidelines:

  • Disk latency should be below 15 ms.  Disk latency above 25 ms can cause noticeable performance issues.  Latency above 50 ms is indicative of extremely underperforming storage.
  • Disk queues should be no greater than twice the number of physical disks serving the drive.  For example, if the underlying storage is a 6-disk RAID 5 array, the total disk queue should be 12 or less.  For storage that isn’t mapped directly to an array (such as in a private cloud or in Azure), queues should be below 10 or so.  Queue length alone isn’t directly indicative of a performance issue, but it can help confirm one.

These are general rules and may not apply in every scenario.  However, if you see the counters exceeding the thresholds above, it warrants a deeper investigation.
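As a simple illustration of applying these rules of thumb, here is a small sketch (the thresholds are the ones listed above; the spindle_count parameter is a hypothetical way to account for the number of physical disks behind the volume):

```python
def assess_disk(latency_ms, queue_length, spindle_count=None):
    """Apply the rule-of-thumb thresholds from the section above."""
    if latency_ms > 50:
        verdict = "latency indicates extremely underperforming storage"
    elif latency_ms > 25:
        verdict = "latency is high enough to cause noticeable issues"
    elif latency_ms > 15:
        verdict = "latency is elevated and worth watching"
    else:
        verdict = "latency looks healthy"

    # Queue limit: twice the spindle count, or ~10 for abstracted storage
    # such as a private cloud or Azure.
    queue_limit = 2 * spindle_count if spindle_count else 10
    if queue_length > queue_limit:
        verdict += f"; queue length {queue_length} exceeds ~{queue_limit}"
    return verdict

# The bottlenecked test VM above: 42 ms total latency, queue length of 48.
print(assess_disk(latency_ms=42, queue_length=48))
```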

General Troubleshooting Process

If a disk performance issue is suspected to be causing a larger problem, we generally start by running the second set of counters above.  This determines whether the storage is actually a bottleneck or the problem is being caused by something else.  If the counters indicate that the disk is underperforming, we would then run the first set of counters to see how many IOPS and how much throughput we are getting.  From there, we would determine whether the storage is under-spec’ed or there is a problem on the storage side.  In an on-premises environment, that would be done by working with the storage team.  In Azure, we would review the disk configuration to see if we’re getting the advertised performance.