A Long and Winding Road? Doing DMA More Directly

© 1997 OSR Open Systems Resources, Inc.

Those of us who have wrestled our way through the arcana of adapter objects and mapping registers know how much "fun" implementing drivers for busmaster DMA devices can be. NT’s architecture for DMA drivers is consistent, extensible, and designed to enable transparent support for different CPU architectures. Unfortunately, it can also be convoluted and obscure, and it can result in higher latency and higher CPU utilization than many of us would like. How about some more direct alternatives? But first…

Words To The Wise

This is the sort of article that will inevitably create headaches for our friends (and we do consider them our friends, by the way) in Microsoft’s support group. In this article, we describe some advanced alternative techniques for implementing busmaster DMA drivers for NT. These techniques aren’t supported, and should only be used by experienced driver writers. If you don’t already know how to write DMA drivers the supported way on NT, you really won’t sufficiently understand the trade-offs made by the techniques presented in this article.

Finally, this article focuses on drivers for "packet based" busmaster DMA devices only. Nothing discussed in this article applies to system (i.e., slave) DMA or to common buffer busmaster DMA designs.

The NT Model For Busmaster DMA

This section presents a very brief review of how drivers deal with packet based busmaster DMA on NT. Most devices that process one request at a time, directly into or out of a user buffer, implement this type of DMA. Again, this is a review. We assume you already really know how this stuff works.

The basic design of drivers for DMA devices on NT centers around the Adapter Object. During driver initialization, the capabilities of the device are described in detail to the HAL as part of the call to HalGetAdapter(), using the DEVICE_DESCRIPTION data structure.
These characteristics indicate:

* If the device is a busmaster;
* If the device supports 32-bit addressing;
* If the device supports scatter/gather;
* The device’s maximum transfer length.

HalGetAdapter() returns a pointer to an (opaque) ADAPTER_OBJECT, which the typical driver will save for later use. It also returns the maximum number of mapping registers that the HAL will allow the driver to use at any one time. These registers are implemented by the HAL. Their main purpose is to allow devices to reach user buffers located anywhere in memory, irrespective of the DMA device’s native addressing capabilities. This capability allows a busmaster DMA device on the ISA bus (which has only 24 address lines) to address all of NT’s physical memory (which is presently limited to 32-bit physical addresses). Mapping registers also play a part in helping devices that do not support scatter/gather. But more about that later.

When processing a transfer request, a driver first calls KeFlushIoBuffers() to ensure the data buffer described by the MDL is ready for DMA. It then calls IoAllocateAdapterChannel(). This function takes as input a pointer to the Adapter Object returned by HalGetAdapter(), a pointer to the Device Object for the device doing the transfer, the number of mapping registers required for this transfer, and a driver-defined context pointer. IoAllocateAdapterChannel() coordinates access to the mapping registers (which are shared resources) and, when a sufficient number are available, calls the driver back at its AdapterControl() function, a pointer to which is also supplied as part of the call.

A driver receives the following inputs to its AdapterControl() routine:

* A pointer to the Device Object;
* A context pointer;
* A pointer to the current IRP to process, if IoAllocateAdapterChannel() was called from the driver’s StartIo() routine;
* A MapRegisterBase value, for subsequent use.
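
How many mapping registers should a driver ask for in its call to IoAllocateAdapterChannel()? Since each mapping register covers one page, the usual answer is one per page touched by the buffer, capped at the limit that HalGetAdapter() returned. The user-mode sketch below shows only that arithmetic. It assumes 4KB pages, and the names sketch_span_pages() and sketch_map_registers_needed() are ours, invented for illustration; in a real driver the DDK macro ADDRESS_AND_SIZE_TO_SPAN_PAGES performs this kind of page-span computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4KB pages, as on x86 NT systems. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical helper: number of pages touched by a buffer that starts
 * at virtual address va and is length bytes long. A buffer that starts
 * mid-page can touch one more page than length / page size suggests.
 */
static size_t sketch_span_pages(uintptr_t va, size_t length)
{
    uintptr_t offset_in_page = va & (SKETCH_PAGE_SIZE - 1);

    return (size_t)((offset_in_page + length + SKETCH_PAGE_SIZE - 1)
                    >> SKETCH_PAGE_SHIFT);
}

/*
 * Hypothetical helper: map registers to request for one transfer.
 * hal_limit stands in for the maximum returned by HalGetAdapter();
 * a transfer needing more than that must be split by the driver.
 */
static size_t sketch_map_registers_needed(uintptr_t va, size_t length,
                                          size_t hal_limit)
{
    size_t needed = sketch_span_pages(va, length);

    assert(hal_limit > 0);
    return needed < hal_limit ? needed : hal_limit;
}
```

Note that a two-byte buffer straddling a page boundary needs two registers, while a page-aligned 4KB buffer needs only one.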

In its AdapterControl() routine, the driver retrieves the starting virtual address of the user’s buffer by calling MmGetMdlVirtualAddress(). Armed with all this information, the driver can now get the base address and length of the data buffer to be handed off to the device for the DMA operation. It does this by calling IoMapTransfer(). This function takes the following parameters:

* Pointer to the Adapter Object. Busmaster devices always pass a NULL for this parameter;
* Pointer to the MDL describing this transfer;
* The MapRegisterBase, which was passed into the AdapterControl() routine by the I/O Manager as a parameter (and is never modified by the driver);
* The current virtual address of the transfer. This is just the value returned by MmGetMdlVirtualAddress() if this is the first fragment being mapped for this MDL, or else that value advanced by the length mapped so far (see below);
* The (remaining) length of the data buffer;
* A BOOLEAN indicating the direction of the transfer (TRUE is a write to the device).

Each call to IoMapTransfer() returns a (physical) base address and length to be passed to the device for this transfer. Since the length returned may be less than the length of the entire transfer, a driver must keep track of how much of the buffer has been mapped so far. Drivers for devices that support scatter/gather call IoMapTransfer() iteratively within the AdapterControl() routine. The resulting base address and length pairs become the device’s scatter/gather list. Given this information, a driver can initiate the DMA transfer on the device. Drivers for busmaster devices complete their AdapterControl() functions by returning the special status value DeallocateObjectKeepRegisters.
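
The iterative use of IoMapTransfer() described above is easier to see in code. What follows is a user-mode simulation, not driver code: SketchMdl, sketch_map_transfer() and sketch_build_sg_list() are hypothetical stand-ins we made up for the MDL, for IoMapTransfer(), and for the loop in an AdapterControl() routine. The mock assumes 4KB pages and assumes no intermediate buffering, so it simply returns the longest physically contiguous run starting at the current position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4KB pages */

/* Hypothetical stand-in for an MDL. */
typedef struct {
    const uint32_t *page_phys;   /* physical base of each page of the buffer */
    size_t          page_count;
    uint32_t        byte_offset; /* where the data starts within page 0 */
    uint32_t        byte_count;  /* total length of the buffer */
} SketchMdl;

typedef struct {
    uint32_t phys;
    uint32_t length;
} SketchSgEntry;

/*
 * Mock of IoMapTransfer(): "offset" plays the role of the current
 * virtual address (bytes already mapped). On input *length is the
 * remaining length; on output it is the length actually mapped.
 */
static uint32_t sketch_map_transfer(const SketchMdl *mdl, uint32_t offset,
                                    uint32_t *length)
{
    uint32_t pos = mdl->byte_offset + offset;
    size_t page = pos / SKETCH_PAGE_SIZE;
    uint32_t in_page = pos % SKETCH_PAGE_SIZE;
    uint32_t start, run;

    assert(page < mdl->page_count);
    start = mdl->page_phys[page] + in_page;
    run = SKETCH_PAGE_SIZE - in_page;

    /* Extend the run across pages that are physically adjacent. */
    while (run < *length && page + 1 < mdl->page_count &&
           mdl->page_phys[page + 1] == mdl->page_phys[page] + SKETCH_PAGE_SIZE) {
        run += SKETCH_PAGE_SIZE;
        page++;
    }
    if (run < *length)
        *length = run;   /* fragment is shorter than what was asked for */
    return start;
}

/* The loop a scatter/gather driver would run: one entry per fragment. */
static size_t sketch_build_sg_list(const SketchMdl *mdl, SketchSgEntry *list,
                                   size_t max_entries, uint32_t *bytes_mapped)
{
    uint32_t mapped = 0;
    size_t count = 0;

    while (mapped < mdl->byte_count && count < max_entries) {
        uint32_t length = mdl->byte_count - mapped;   /* remaining length */

        list[count].phys = sketch_map_transfer(mdl, mapped, &length);
        list[count].length = length;
        mapped += length;   /* keep track of how much has been mapped */
        count++;
    }
    *bytes_mapped = mapped;
    return count;
}
```

For a 0x2000-byte buffer that starts 0x800 bytes into three pages located at 0x10000, 0x11000 and 0x40000, this loop produces two entries: (0x10800, 0x1800) and (0x40000, 0x800). A device without scatter/gather would instead have to be started once per fragment.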

After this particular DMA operation is complete, the driver calls IoFlushAdapterBuffers(), passing a pointer to the Adapter Object (again, busmasters pass NULL), a pointer to the MDL, the MapRegisterBase, the base virtual address for the transfer, the total transfer length, and a BOOLEAN indicating the direction of the transfer. Sounds like almost the same parameters passed to IoMapTransfer(), doesn’t it? If the entire user buffer has not been transferred, a driver will issue additional calls to IoMapTransfer(), and issue more DMA requests to the device, followed by more calls to IoFlushAdapterBuffers(). When the entire transfer has been completed, the driver calls IoFreeMapRegisters(), passing a pointer to the Adapter Object, the map register base, and the number of map registers that were used.

But, I Thought I Was Doing DMA?

So, you’ve got a device that needs a driver. It’s a PCI device, so it supports 32-bit addressing. But, it doesn’t support scatter/gather (and there are plenty of devices that don’t). You carefully implement your driver The NT Way. The code path is long and, to put it kindly, complicated. Being the performance freak you are, you carefully profile and benchmark your driver. You find:

* There is a widely varying delay between your calls to IoAllocateAdapterChannel() and your Adapter Control routine being called. Sometimes, maybe even most times, there is no delay. Other times, the delay is "significant";
* The CPU utilization when running your driver is much higher than you had hoped.

What’s going on? Well, first of all, it’s important to understand that the way many of the DMA support functions are implemented is up to the HAL. For virtually every x86 HAL, mapping registers are just an abstraction: there is no specific hardware that implements "mapping registers" in these systems. Rather, the HAL implements these mapping registers by reserving (at system startup time) a group of page-sized buffers in low system memory (below the 16MB mark).
When a DMA device that is incapable of addressing all of physical memory (such as an ISA bus device) wants to perform a DMA transfer, the HAL uses one or more of its pre-reserved buffers in low memory as an intermediate buffer for the transfer.

As an example, let’s consider a DMA read from such an ISA bus device. In the call to IoAllocateAdapterChannel(), the HAL notices that this device does not support 32-bit addresses, and reserves the requested number of contiguous buffers in low memory (the number of "map registers", i.e. buffers, being an input parameter to IoAllocateAdapterChannel()). Then, in response to the driver’s call to IoMapTransfer(), the HAL returns the base physical address of these buffers in place of the base physical address of the actual data buffer. The data from the device is then DMA’ed into the HAL’s buffers. When the driver calls IoFlushAdapterBuffers(), the HAL merely copies the data from its buffers to the original data buffer.

But, you may ask, what does this have to do with our PCI busmaster device? Our PCI device supports 32-bit addresses. So there’s no reason for the HAL to use these intermediate buffer mapping registers, right? Wrong! There’s a catch: Our PCI device does not support scatter/gather. When you called HalGetAdapter(), the HAL noticed this and decided to "help". It did this by performing the same intermediate buffering action for each of your device’s transfers that it did for our example ISA bus device. Thus, when you called IoMapTransfer(), it allocated a set of contiguous buffers, and returned to you the base physical address and length of those buffers as the base address for the DMA input buffer. As a result, the device DMA’ed the data to this intermediate buffer. The data is later copied again, to the real destination data buffer. The HAL did this in its quest to agglomerate fragments of the data buffer, and thereby reduce the number of iterations required by your driver to transfer the complete buffer.
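
To make the intermediate buffering concrete, here is a toy user-mode model of a read from the device. Nothing in it is a real HAL interface: SketchHal, sketch_map_for_read(), sketch_device_dma_write() and sketch_flush_after_read() are names we invented to stand in for the HAL’s low-memory buffer, the IoMapTransfer() step, the device’s DMA engine, and the IoFlushAdapterBuffers() step. It assumes a single page-sized buffer and counts the bytes the CPU has to copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BOUNCE_SIZE 4096u   /* assumption: one page-sized buffer */

typedef struct {
    uint8_t bounce[SKETCH_BOUNCE_SIZE]; /* stands in for the HAL's low-memory buffer */
    size_t  bytes_copied;               /* bytes the CPU has copied so far */
} SketchHal;

/*
 * Map step for a read from the device: the "HAL" hands back its own
 * buffer in place of the real data buffer.
 */
static uint8_t *sketch_map_for_read(SketchHal *hal, size_t length)
{
    assert(length <= SKETCH_BOUNCE_SIZE);
    return hal->bounce;
}

/* The "device": it DMAs its data to whatever address it was given. */
static void sketch_device_dma_write(uint8_t *target, const uint8_t *data,
                                    size_t length)
{
    memcpy(target, data, length);
}

/*
 * Flush step: the "HAL" copies from its buffer into the real data
 * buffer. This copy is done by the CPU, not by the device.
 */
static void sketch_flush_after_read(SketchHal *hal, uint8_t *real_buffer,
                                    size_t length)
{
    assert(length <= SKETCH_BOUNCE_SIZE);
    memcpy(real_buffer, hal->bounce, length);
    hal->bytes_copied += length;
}
```

In this model the real data buffer stays untouched until the flush step runs, and bytes_copied ends up equal to the transfer length: every byte the device transfers is also copied once by the CPU.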
This accounts for the higher than expected CPU utilization. After the data has been DMA’ed, it’s being copied out of the HAL’s buffers and into the real data buffer. You therefore get all the complexity of supporting a DMA device, with all the CPU usage of a device that does programmed I/O!

This also accounts for the varied latency between your calls to IoAllocateAdapterChannel() and the I/O Manager’s callback to your Adapter Control function. The HAL’s mapping registers and low memory buffers are scarce resources. Sometimes, they might be in use by another device when you call IoAllocateAdapterChannel(). This causes your request to be queued until the requested number of map registers/buffers are available. Hence, you wait.

Just Lie

So, if you don’t like the way the HAL is helping you out, what can you do about it? The answer is simple: You can lie! If you’re writing a driver for a 32-bit busmastering DMA device, and you want to avoid the HAL’s help described above, all you need to do is tell the HAL that you support scatter/gather. It doesn’t matter whether you really support scatter/gather or not, of course. This way, the HAL won’t get it into its head that it should help you, and the physical buffer base addresses returned by IoMapTransfer() will be the ones corresponding to the actual data buffer.

But, just as your mother taught you, lying has its consequences. Not every HAL will implement mapping registers for non-scatter/gather devices the way I’ve described. Thus, you sacrifice some degree of compatibility and portability.

There’s another issue, however. Will it really take less CPU time for your driver to perform multiple iterations to DMA the data buffer? If the data buffer has a couple of large fragments, then the shortcut is probably worthwhile.
If the data buffer has lots of little page-size fragments, it might take more CPU time to set up and complete the multiple DMA operations than it would to just copy the data buffer and issue one DMA (as the HAL does).

So, is this shortcut worth it? Those of you who’ve attended our famous Windows NT Kernel Mode Device Driver seminar already know the answer to this question: "It depends." This particular lie seems pretty safe. So, if you’re interested in reducing your driver’s CPU utilization as much as possible, and the data buffer chunks are large, then the gain in CPU time from using this shortcut is probably worth the "cost" in potential (rare) loss of portability.

Getting Less Abstract

How about drivers for 32-bit busmastering devices that do support scatter/gather? Are there any shortcuts they can take?

Well, obviously NT’s model for a driver doing DMA is heavily focused around coordinating access to shared resources using the Adapter Object. But devices that support scatter/gather really don’t have any resources that they need to share, since they don’t use mapping registers. So, do we need to go through the complexity of calling IoAllocateAdapterChannel(), and then having the I/O Manager (immediately) call us back at our Adapter Control routine?

Well, we build the scatter/gather list for our device by calling IoMapTransfer(), which is called within our Adapter Control routine. The reason it’s called there is that one of the parameters on the call to IoMapTransfer() is the MapRegisterBase, which is passed to us as input to our Adapter Control routine.

Perhaps we can build the scatter/gather list without calling IoMapTransfer()? Well, this is certainly possible. We could build the scatter/gather list directly by decoding the MDL, since the current MDL format is fairly clearly described in ntddk.h. Is this a good idea? No. It is in fact a particularly hideous idea. The MDL is one of the fully opaque data structures used by the I/O Manager.
There is absolutely no reason why the format of the MDL couldn’t change from release to release of NT. So, while we’re in favor of taking shortcuts, this particular shortcut doesn’t seem to us to be a reasonable one to take.

If we need to use IoMapTransfer() to build the scatter/gather list, is there a way to call it without having an Adapter Control routine? The answer here is "sure"! The secret is in knowing that drivers that will not actually be using mapping registers for their transfers seem to always be passed a MapRegisterBase value of zero. Thus, we can simply call IoMapTransfer(), without having an AdapterControl() function, as long as we supply zero as the value for MapRegisterBase.

At least to me, this optimization seems a bit more reasonable than decoding the MDL. The abstraction of allocating an Adapter Object, and the code it involves, is all avoided. And, we still use the I/O Manager supplied function to decode the MDL format.

While reasonably safe, this optimization is also not without cost. What are the potential pitfalls? For one thing, not every HAL might use a value of zero for MapRegisterBase to mean "no mapping registers used". So, there is a potential compatibility issue.

In addition, by using this shortcut you are bypassing almost all of NT’s carefully crafted DMA model. This in itself could cause some compatibility problems in the future. For example, it might cause you problems when 64-bit memory support is added to Windows NT. What happens when the system on which your driver is running has more memory than can be reached by your 32-bit busmastering device? If you use the standard NT DMA model, it’s possible that the HAL will provide intermediate buffering for your transfers by using its mapping registers in the same way it does for 24-bit devices today. If you bypass the IoAllocateAdapterChannel() step, you will not get this help, and your driver will not work.
So, once again, whether this optimization is worth the cost can only be answered given knowledge of your project.

It’s All Tradeoffs

As you have seen, there are a number of shortcuts you can take in NT’s DMA model to optimize your driver’s performance. But, as we mentioned at the outset, you’re on your own with these methods. Use them with care, when the cost is justified by the gain, and you’ll be on your way to more direct DMA.