Performance Instrumenting Device Drivers
Copyright© 1998 Mark Russinovich
Last Updated February 24, 1998
Introduction Many device drivers work with high-performance devices such as disk drives, data acquisition cards, and network adapters. Its often the case that developers are charged with the task of making their driver squeeze as much performance from a device as possible, within the constraints of being a well-behaved kernel-mode citizen. In competitive product areas a positive difference of a few percentage points of benchmark performance can make for boosted sales through good reviews. Unfortunately, the only support for accurate performance measurements is KeQueryPerformanceCounter, which on many HALs offers less than stellar resolution. For example, on uniprocessor x86's with the standard NT HAL the resolution of the KeQueryPerformanceCounter source is 1 microsecond. This resolution is unacceptably low for most device driver functions since they would be lost in the noise.

To address the issue of obtaining ultra-accurate device driver profiles, I've developed a set of macros and support routines that rely on the Pentium and higher processor's built-in cycle counter. By definition the cycle counter runs at the rate of the CPU system clock, and is therefore extremely fine grained (e.g. on a 200MHz machine the resolution is about 5 nanoseconds). Defining a new counter and instrumenting a driver, while a manual process, is quite easy, so a typical driver can be instrumented within minutes. When the performance summary function is invoked, you'll not only see how many times a counter entry/exit pair was tripped, but how many total cycles, total microseconds, and average cycles and microseconds were spent in each entry/exit pair you've defined.

The rest of this page will describe how to instrument your own drivers, and I include source code to an instrumented version of NTRegmon as an example you can refer to and play with.
Defining and Placing a Counter The first step in using the performance macros is to determine which functions or code segments you want to instrument. Next you must define a unique macro identifier to represent the measured code. To insure that these macro names have little chance of colliding with existing macro definitions, I've defined the standard of using "PERF_" appended with the name of an instrumented function, or an artificial name describing a code segment. For example, if I was instrumenting a StartIo routine I would define PERF_STARTIO as the name of its performance counter.

Next, you place the calls to start and stop the counter you are defining at the appropriate places in your code. Place the macro PERF_ENTRY(COUNTERNAME) at the function or code segment's entry point, and PERF_EXIT(COUNTERNAME) at all exit points if the code will be running at DISPATCH_LEVEL or lower. If the function is an ISR, or is called from an ISR, use PERF_ISR_ENTRY() and PERF_ISR_EXIT() instead. Note that a PERF_ENTRY must be placed at the start of a C-logical block if you are using C, since the macro declares temporary variables.

As a simple example, if you were instrumenting your StartIo routine you would place the counters like this:

  VOID  Startio(IN PDEVICE_OBJECT Device, IN PIRP Irp)
  {
      Local variables;

      PERF_ENTRY( PERF_STARTIO );

      .... // Driver Entry code

      PERF_EXIT( PERF_STARTIO );
  }

The next step is to add the macro definition to the PERF.H include file that I've provided. Simply add it to the list of your defined counters, giving each one a unique ID number. Then add a textual description for the counter to the CounterNames array in PERF.C. The description's index in the array must correspond to the ID that you assigned the counter.

Finally, add a call to InitPerfCounters() in your DriverEntry, and calls to DumpPerfCounters() whereever you want to see the state of the counters. A call to DumpPerfCounters would typically be in a DeviceIoControl where a user-mode program is triggering a dump of performance information, in an unload routine, or in response to IRP_MJ_SHUTDOWN. That's all there is to it.

Add PERF.C (a file I supply) to your driver's SOURCES file, and include PERF.H in any modules that make use of counters. Note that by default the function DumpPerfCounters uses a form of DbgPrint that will only print output when building a Checked version of your driver. When you want the counters removed for a release of the driver simply edit the define "#define PERFORMANCE 1" to "#define PERFORMANCE 0" in PERF.H and rebuild.

Interpreting the Counters What you'll see when DumpPerfCounters will look something like the following, which is taken from the instrumented version of NTRegmon that can be downloaded below:
==============================================================================

 Function                            Count          Length             Average

==============================================================================

 MatchWithPattern                     9789           49500              5 (uS)

                                                   9868614           1008 (CY)

------------------------------------------------------------------------------

 UpdateStore                          3263          555448            170 (uS)

                                                 110737077          33937 (CY)

------------------------------------------------------------------------------

...

-----------------------------------------------------------------------------

 RegmonDeviceControl                    59           15645            265 (uS)

                                                   3119075          52865 (CY)

------------------------------------------------------------------------------



Cycles Per Second   : 199365268

Cycles in PERF_XXXX : 208



NT Performance counter frequency: 199300000
Each counter section includes the descriptive name supplied in CounterNames, the number of times the entry counter of that name was executed, and the total time spent in the entry/exit pair in both microseconds (uS) and cycles (CY). In addition you'll see the average time spent in each execution of the entry exit pair (simply the total time divided by the count).

At the bottom of the output you'll see the number of cycles per second InitPerfCounters measured for the system. If this does not match your processor's clock rate than something is amiss. During the calibration phase, InitPerfCounters also determines the approximate overhead of an individual entry/exit pair; the output statistics are adjusted to remove this calculated overhead, which is shown on the "Cycles in PERF_XXX" line of the output. Finally, just for kicks the resolution of the NT performance counter (KeQueryPerformanceCounter) is shown.
The Sample I've instrumented NTRegmon as a demonstration of the counters and their use. Defining a total of 15 counters and placing them in the code took about 5 minutes. When you exit the NTRegmon GUI (run it from \gui\release) the driver will unload and dump the counter statistics via DbgPrint. Use DebugMon, Windbg, or SoftICE to view the results.

Download Instrumented NTRegmon (188KB)

Download Perf.c and Perf.h alone (2KB)