A Long and Winding Road? Doing DMA More Directly

                  © 1997 OSR Open Systems Resources, Inc.



Those of us who have wrestled our way through the arcana of adapter objects
and  mapping  registers  know  how  much  "fun"  implementing  drivers  for
busmaster  DMA  devices  can  be.  NT’s  architecture for  DMA  drivers  is
consistent,  extensible, and  designed  to enable  transparent support  for
different  CPU architectures.  Unfortunately,  it can  also be  convoluted,
obscure, and result in  higher latency and higher CPU utilization than many
of  us would  like.  How about  some more  direct alternatives?  But first…

Words To The Wise

This is  the sort of article that will  inevitably create headaches for our
friends (and,  we do consider them our friends,  by the way) in Microsoft’s
support  group. In  this  article, we  describe some  advanced  alternative
techniques for implementing busmaster  DMA drivers for NT. These techniques
aren’t supported, and should only be used by experienced driver writers. If
you don’t  already know how to  write DMA drivers the  supported way on NT,
you  really  won’t  sufficiently  understand  the trade-offs  made  by  the
techniques presented in this article.

Finally, this  article focuses on drivers  for "packet based" busmaster DMA
devices  only. Nothing discussed  in this article applies  to system (i.e.,
slave) DMA or to common buffer busmaster DMA designs.

The NT Model For Busmaster DMA

This section  presents a very brief review of  how drivers deal with packet
based busmaster DMA on NT. Most devices that process one request at a time,
directly into  or out of a user buffer, implement  this type of DMA. Again,
this is a  review. We assume you already really  know how this stuff works.

The  basic design  of  drivers for  DMA devices  on  NT centers  around the
Adapter  Object.  During driver  initialization,  the  capabilities of  the
device  are  described  in  detail to  the  HAL  as part  of  the  call to
HalGetAdapter(), using  the  DEVICE_DESCRIPTION  data  structure.  These
characteristics indicate:

   * If the device is a busmaster;

   * If the device supports 32 bit addressing;

   * If the device supports scatter/gather;

   * The device’s maximum transfer length.

HalGetAdapter() returns a pointer  to an (opaque) ADAPTER_OBJECT, which the
typical driver will save  for later use. It also returns the maximum number
of mapping  registers that the driver will be allowed by  the HAL to use at
any  one time.  These  registers are  implemented  by the  HAL. Their  main
purpose  is to  allow devices to  reach user  buffers located  anywhere in
memory, irrespective  of the  DMA device’s native  addressing capabilities.
This capability  allows a busmaster DMA  device on the ISA  bus (which only
has 24  address lines),  to address all  of NT’s physical  memory (which is
presently  limited to  32  bits). Mapping  registers  also play  a part  in
helping  devices that do  not support  scatter/gather. But more  about that
later.

When processing a transfer request, a driver first calls KeFlushIoBuffers()
to  ensure the  described  data buffer  is ready  for  DMA. It  then  calls
IoAllocateAdapterChannel(). This function  takes as input  a pointer to the
Adapter Object returned by  HalGetAdapter(), a pointer to the Device Object
for the device doing the transfer, the number of mapping registers required
for this transfer, and a driver-defined context pointer.
IoAllocateAdapterChannel()  coordinates  access  to  the mapping  registers
(which are  shared resources), and when  a sufficient number are available,
calls the driver back  at its AdapterControl() function, a pointer to which
is also supplied as part of the call.
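
As a rough illustration of where that "number of mapping registers" comes
from: each mapping register covers one page, so a driver asks for as many
as its buffer spans pages, capped at the limit HalGetAdapter() returned.
The sketch below is plain C, not DDK code. The names pages_spanned() and
map_registers_to_request() are ours, invented for illustration, and the 4KB
page size is an assumption (x86). In a real driver the DDK’s
ADDRESS_AND_SIZE_TO_SPAN_PAGES macro does the page arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u   /* assumption: x86 page size */

/* Number of pages touched by a buffer of `length` bytes starting at `va`.
 * A buffer that starts mid-page can touch one page more than its length
 * alone would suggest. */
unsigned long pages_spanned(uintptr_t va, size_t length)
{
    size_t offset_in_page = (size_t)(va & (SIM_PAGE_SIZE - 1));

    return (unsigned long)((offset_in_page + length + SIM_PAGE_SIZE - 1)
                           / SIM_PAGE_SIZE);
}

/* Hypothetical helper: how many mapping registers to ask for, given the
 * per-request limit the HAL reported at initialization time. */
unsigned long map_registers_to_request(uintptr_t va, size_t length,
                                       unsigned long hal_limit)
{
    unsigned long needed = pages_spanned(va, length);

    assert(hal_limit > 0);
    return needed < hal_limit ? needed : hal_limit;
}
```

When the buffer needs more registers than the limit allows, the transfer
has to be split into several DMA operations, which is the bookkeeping the
rest of this section describes.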

A  driver receives the  following inputs  to its  AdapterControl() routine:

   * A pointer to the Device Object;

   * A context pointer;

   * A pointer to the current IRP to process, if the
     IoAllocateAdapterChannel() was called from the driver’s StartIo()
     routine;

   * A MapRegisterBase value, for subsequent use.

In its AdapterControl()  routine, the driver retrieves the starting virtual
address  of the user’s  buffer by  calling  MmGetMdlVirtualAddress(). Armed
with  all this  information, the driver  can now  get the base  address and
length  of the  data buffer  to be  handed off  to the  device for  the DMA
operation.

To  get the  base address  and length of  the data  buffer, a driver  calls
IoMapTransfer(). This function takes the following parameters:

   * Pointer to the Adapter Object. Busmaster devices always pass a NULL
     for this parameter;

   * Pointer to the MDL describing this transfer;

   * The MapRegisterBase, which was passed into the AdapterControl() routine
     by the I/O Manager as a parameter (and is never modified by the
     driver);

   * The current virtual address of the transfer. This is just the value
     returned by MmGetMdlVirtualAddress() if this is the first fragment
     being mapped for this MDL, or else is that value updated by the length
     mapped so far (see below);

   * The (remaining) length of the data buffer;

   * A BOOLEAN indicating the direction of the transfer (TRUE is a write to
     the device).

Each call  to IoMapTransfer() returns a  (physical) base address and length
to be passed to the device for this transfer. Since the length returned may
be less than the length of the entire transfer, a driver must keep track of
this.  Devices  which  support  scatter/gather will  call   IoMapTransfer()
iteratively within the AdapterControl() routine. The resulting base address
and length  pairs will become the  device’s scatter/gather list. Given this
information,  a  driver  can  initiate  the  DMA transfer  on  the  device.
Busmaster devices  complete their  AdapterControl() functions  by returning
the special status value DeallocateObjectKeepRegisters.
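
The iteration just described can be sketched in plain C. This is a
simulation, not DDK code: sim_mdl, sim_map_transfer() and build_sg_list()
are hypothetical names of ours, and the page-frame array is only a stand-in
for whatever the real MDL contains. We assume 4KB pages, and we assume the
mapping call hands back the longest physically contiguous run starting at
the current position, which simplifies what a HAL does for a busmaster that
needs no intermediate buffering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u   /* assumption: x86 page size */

/* Hypothetical stand-in for an MDL. NOT the real structure. */
struct sim_mdl {
    const uint32_t *pfn;         /* physical page frame of each page   */
    size_t          page_count;  /* entries in pfn[]                   */
    uint32_t        byte_offset; /* start of the buffer within page 0  */
    size_t          byte_count;  /* total length of the buffer         */
};

struct sg_element {
    uint32_t phys;    /* physical base address handed to the device */
    uint32_t length;  /* bytes in this fragment                     */
};

/* Stand-in for IoMapTransfer(): returns the physical address of the byte
 * at `offset`, and trims *length to the physically contiguous run. */
uint32_t sim_map_transfer(const struct sim_mdl *mdl, size_t offset,
                          size_t *length)
{
    size_t pos, first, page, in_page, run;

    assert(*length > 0 && offset + *length <= mdl->byte_count);

    pos     = mdl->byte_offset + offset;
    first   = pos / SIM_PAGE_SIZE;
    in_page = pos % SIM_PAGE_SIZE;
    run     = SIM_PAGE_SIZE - in_page;   /* bytes left in the first page */
    assert(first < mdl->page_count);

    /* Extend the run while the next page is physically adjacent. */
    for (page = first;
         run < *length && page + 1 < mdl->page_count &&
         mdl->pfn[page + 1] == mdl->pfn[page] + 1;
         page++) {
        run += SIM_PAGE_SIZE;
    }
    if (run < *length)
        *length = run;
    return mdl->pfn[first] * SIM_PAGE_SIZE + (uint32_t)in_page;
}

/* Call the mapping function repeatedly, advancing by the length returned
 * each time, until the buffer is covered or the device's list is full.
 * Returns the number of elements; *bytes_mapped tells the caller whether
 * another DMA operation is needed for the remainder. */
size_t build_sg_list(const struct sim_mdl *mdl, struct sg_element *list,
                     size_t max_elements, size_t *bytes_mapped)
{
    size_t offset = 0, count = 0;

    while (offset < mdl->byte_count && count < max_elements) {
        size_t length = mdl->byte_count - offset;   /* remaining length */

        list[count].phys   = sim_map_transfer(mdl, offset, &length);
        list[count].length = (uint32_t)length;
        offset += length;
        count++;
    }
    *bytes_mapped = offset;
    return count;
}
```

A device without scatter/gather is the same loop with max_elements set to
one: it gets a single base address and length per DMA operation, and the
driver comes back for the rest after each completion.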

After  this  particular  DMA  operation  is  complete,  the  driver  calls
IoFlushAdapterBuffers(),  passing a  pointer to the  Adapter Object (again,
busmasters pass NULL), a  pointer to the MDL, the MapRegisterBase, the base
virtual address for the  transfer, the total transfer length, and a BOOLEAN
indicating  the direction  of  the transfer.  Sounds like  almost  the same
parameters passed to IoMapTransfer(), doesn’t it?

If the  entire user  buffer has not  been transferred, a  driver will issue
additional calls  to  IoMapTransfer(), and  issue more DMA  requests to the
device, followed by more  calls to IoFlushAdapterBuffers(). When the entire
transfer has been completed, the driver calls IoFreeMapRegisters(), passing
a pointer  to the Adapter Object, the map register  base, and the number of
map registers that were used.

But, I Thought I Was Doing DMA?

So,  you’ve got  a device  that needs a  driver. It’s  a PCI device,  so it
supports  32-bit addressing.  But, it  doesn’t support  scatter/gather (and
there  are plenty  of  devices that  don’t). You  carefully  implement your
driver  The  NT  Way.  The  code  path  is  long  and, to  put  it  kindly,
complicated. Being the performance freak you are, you carefully profile and
benchmark your driver. You find:

   * There is a widely varying delay between your calls to
     IoAllocateAdapterChannel() and your Adapter Control routine being
     called. Sometimes, maybe even most times, there is no delay. Other
     times, the delay is "significant";

   * The CPU utilization when running your driver is much higher than you
     had hoped.

What’s going  on? Well first of all, it’s  important to understand that the
way many of the DMA support functions are implemented is up to the HAL. For
virtually every x86 HAL,  mapping registers are just an abstraction – there
is  no  specific  hardware that  implements  "mapping  registers" in  these
systems. Rather,  the HAL  implements these mapping  registers by reserving
(at system startup time) a group of page-sized buffers in low system memory
(below the  16MB mark). When a  DMA device that is  incapable of addressing
all of  physical memory (such as an ISA bus device)  wants to perform a DMA
transfer,  the HAL  uses one  or more  of its  pre-reserved buffers  in low
memory as an intermediate buffer for the transfer.

As an  example, let’s consider a  DMA read from such  an ISA bus device. In
the call  to IoAllocateAdapterChannel(),  the HAL notices  that this device
does not  support 32  bit addresses, and  reserves the requested  number of
contiguous  buffers in  low  memory (the  number of  "map  registers", i.e.
buffers, being  an input parameter to  IoAllocateAdapterChannel()). Then in
response to the driver’s call to IoMapTransfer(), the HAL returns the base
physical address of these  buffers in place of the base physical address of
the actual  data buffer. The data  from the device is  then DMA’ed into the
HAL’s  buffers. When  the  driver calls  IoFlushAdapterBuffers(),  the  HAL
merely copies the data from its buffers to the original data buffer.
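
To make the sequence concrete, here is a toy model of that read in plain C.
It is not HAL source. sim_hal, sim_map_read(), sim_device_dma_in() and
sim_flush_read() are hypothetical names of ours, the two-page buffer size
is arbitrary, and pointers stand in for physical addresses. The point is
only to show where the extra copy happens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_BYTES 8192u   /* hypothetical: two page-sized "map registers" */

/* Stands in for the HAL's buffers reserved in low memory at startup. */
struct sim_hal {
    unsigned char bounce[BOUNCE_BYTES];
};

/* Stand-in for IoMapTransfer() on a read: the driver is handed the
 * address of the HAL's buffer, not of the caller's data buffer. The
 * length is trimmed to what the intermediate buffer can hold. */
unsigned char *sim_map_read(struct sim_hal *hal, size_t *length)
{
    if (*length > BOUNCE_BYTES)
        *length = BOUNCE_BYTES;
    return hal->bounce;
}

/* Stand-in for the device: it "DMAs" a byte pattern to whatever address
 * the driver programmed into it. */
void sim_device_dma_in(unsigned char *target, size_t length,
                       unsigned char pattern)
{
    memset(target, pattern, length);
}

/* Stand-in for IoFlushAdapterBuffers() on a read: the CPU copies the data
 * from the HAL's buffer to the real destination. */
void sim_flush_read(const struct sim_hal *hal, unsigned char *user_buffer,
                    size_t length)
{
    assert(length <= BOUNCE_BYTES);
    memcpy(user_buffer, hal->bounce, length);
}
```

Note that the caller’s buffer is untouched until the flush. Every byte the
device transfers is moved a second time by the CPU in sim_flush_read().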

But, you may ask,  what does this have to do with our PCI busmaster device?
Our PCI device supports  32-bit addresses. So there’s no reason for the HAL
to    use   these    intermediate   buffer   mapping    registers,   right?

Wrong!  There’s a catch:  Our PCI  device does not  support scatter/gather.
When  you called   HalGetAdapter(),  the HAL  noticed  this and  decided to
"help". It  did this  by performing the same  intermediate buffering action
for each  of your  device’s transfers that  it did for our  example ISA bus
device.  Thus, when  you  called  IoMapTransfer(),  it  allocated a  set of
contiguous buffers,  and returned you the  base physical address and length
of those buffers as the base address for the DMA input buffer. As a result,
the device DMA’ed the data to this intermediate buffer. The data was later
copied to the real destination data buffer. The HAL did this in its quest
to agglomerate fragments of  the data buffer, and thereby reduce the number
of  iterations required  by your  driver to  transfer the  complete buffer.

This accounts for the  higher than expected CPU utilization. After the data
has been  DMA’ed, it’s being copied  out of the HAL’s buffers  and into the
real data buffer. You  therefore get all the complexity of supporting a DMA
device, with all the CPU usage of a device that does programmed I/O!

This  also  accounts  for   the  varied  latency  between  your  calls  to
IoAllocateAdapterChannel(), and the  I/O Manager’s callback to your Adapter
Control function.  The HAL’s  mapping registers and low  memory buffers are
scarce resources.  Sometimes, they might  be in use by  another device when
you call IoAllocateAdapterChannel(). This causes your request to be queued
until the requested number of map registers/buffers is available. Hence,
you wait.

Just Lie

So, if you don’t  like the way the HAL is helping you  out, what can you do
about it? The answer is simple: You can lie! If you’re writing a driver for
a  32-bit busmastering  DMA device, and  you want  to avoid the  HAL’s help
described  above, all  you  need to  do is  tell the  HAL that  you support
scatter/gather. It doesn’t matter whether you really support scatter/gather
or not,  of course. This  way, the HAL won’t  get it into its  head that it
should  help  you, and  the  physical  buffer base  addresses  returned by
IoMapTransfer() will  be the ones corresponding  to the actual data buffer.
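
In driver terms, the lie is a single field in the description handed to
HalGetAdapter(). The sketch below is plain C, not DDK code. The struct is a
cut-down stand-in of ours that borrows four field names from
DEVICE_DESCRIPTION (Master, ScatterGather, Dma32BitAddresses and
MaximumLength), and sim_hal_will_double_buffer() is a hypothetical function
that models only the x86 HAL behavior described in this article. A real HAL
is free to decide differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cut-down stand-in for DEVICE_DESCRIPTION. */
struct sim_device_description {
    bool          Master;             /* device is a busmaster          */
    bool          ScatterGather;      /* device CLAIMS scatter/gather   */
    bool          Dma32BitAddresses;  /* device supports 32-bit addrs   */
    unsigned long MaximumLength;      /* maximum transfer length, bytes */
};

/* Model of the decision described in the text: the HAL steps in with its
 * intermediate buffers if the device cannot reach all of memory, or if
 * it says it cannot do scatter/gather. */
bool sim_hal_will_double_buffer(const struct sim_device_description *d)
{
    assert(d->Master);   /* this article covers busmaster devices only */
    return !d->Dma32BitAddresses || !d->ScatterGather;
}
```

Setting ScatterGather to true for a device that has no such hardware flips
the model’s answer from "buffered" to "direct". The price is that the
driver now receives one fragment at a time and must issue one DMA operation
per fragment itself.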

But, just as your  mother taught you, lying has its consequences. Not every
HAL will implement mapping registers for non-scatter/gather devices the way
I’ve  described.  Thus,  you sacrifice  some  degree  of compatibility  and
portability.

There’s another issue, however.  Will it really take less CPU time for your
driver to  perform multiple iterations to DMA the  data buffer? If the data
buffer  has a  couple  of large  fragments, then  the shortcut  is probably
worthwhile. If  the data buffer has lots  of little page-size fragments, it
might take more CPU time to set up and complete the multiple DMA operations
than it  would to just copy  the data buffer and issue  one DMA (as the HAL
does).
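
The trade-off can be put as a back-of-the-envelope comparison. The model
below is ours and deliberately crude: it assumes a fixed CPU cost per DMA
operation (setup plus completion) and a fixed CPU cost per byte copied, in
whatever units you measure them. Both figures are parameters you would have
to obtain by profiling your own driver on your own hardware. None of the
function names come from the DDK.

```c
#include <assert.h>
#include <stddef.h>

/* CPU cost of DMAing each fragment directly: one setup and completion
 * per fragment, no copying. */
double direct_cost(size_t fragments, double per_dma_overhead)
{
    assert(fragments > 0 && per_dma_overhead >= 0.0);
    return (double)fragments * per_dma_overhead;
}

/* CPU cost of the HAL's way: copy every byte once, then one DMA. */
double buffered_cost(size_t bytes, double copy_cost_per_byte,
                     double per_dma_overhead)
{
    assert(copy_cost_per_byte >= 0.0 && per_dma_overhead >= 0.0);
    return (double)bytes * copy_cost_per_byte + per_dma_overhead;
}

/* Nonzero when issuing one DMA per fragment costs less CPU time than
 * copying the whole buffer and issuing a single DMA. */
int direct_is_cheaper(size_t fragments, size_t bytes,
                      double per_dma_overhead, double copy_cost_per_byte)
{
    return direct_cost(fragments, per_dma_overhead) <
           buffered_cost(bytes, copy_cost_per_byte, per_dma_overhead);
}
```

For a fixed buffer size, the direct path wins when the buffer arrives in a
few large fragments and loses as the fragment count climbs, which is the
pattern described above.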

So,  is this shortcut  worth it? Those  of you  who’ve attended our  famous
Windows NT  Kernel Mode  Device Driver  seminar already know  the answer to
this question: "It depends."  This particular lie seems pretty safe. So, if
you’re  interested in  reducing your  driver’s CPU  utilization as  much as
possible, and the data buffer chunks are large, then the gain in CPU time from
using this  shortcut is probably worth the  "cost" in potential (rare) loss
of portability.

Getting Less Abstract

How  about  drivers  for  32-bit  busmastering  devices  that   do  support
scatter/gather? Are there any shortcuts they can take?

Well, obviously NT’s model for a driver doing DMA is heavily focused around
coordinating  access to  shared  resources using  the Adapter  Object. But,
devices that  support scatter/gather  really don’t have  any resources that
they need  to share since they don’t use mapping  registers. So, do we need
to  go through the  complexity of  calling  IoAllocateAdapterChannel(), and
then  having the  I/O  Manager (immediately)  call us  back at  our Adapter
Control routine?

Well,  we  build  the  scatter/gather  list  for  our  device  by  calling
IoMapTransfer(),  which is  called within our Adapter  Control routine. The
reason  it’s called  there is  that one of  the parameters  on the call  to
IoMapTransfer() is  the MapRegisterBase, which is passed  to us as input to
our Adapter Control routine.

Perhaps   we   can  build   the   scatter/gather  list   without  calling
IoMapTransfer()? Well,  this  is certainly  possible.  We  could  build the
scatter/gather  list directly by  decoding the  MDL, since the  current MDL
format is fairly clearly  described in ntddk.h. Is this a good idea? No. It
is in fact  a particularly hideous idea. The MDL is one of the fully opaque
data structures used by  the I/O Manager. There is absolutely no reason why
the format  of the MDL couldn’t  change from release to  release of NT. So,
while we’re in favor  of taking shortcuts, this particular shortcut doesn’t
seem to us to be a reasonable one to take.

If  we need  to use  IoMapTransfer() to  build the scatter/gather  list, is
there  a way  to call  it without  having an  Adapter Control  routine? The
answer here is "sure"! The secret is in knowing that drivers that will not
actually be using mapping registers for their transfers always seem to be
passed  a  MapRegisterBase  value  of  zero.  Thus,  we  can  simply  call
IoMapTransfer(), without having an AdapterControl() function, as long as we
supply zero as the value for MapRegisterBase.

At least to me, this optimization seems a bit more reasonable than decoding
the MDL.  The abstraction of allocating an Adapter  Object, and the code it
involves,  is all  avoided.  And, we  still  use the  I/O Manager  supplied
function to decode the MDL format. While reasonably safe, this optimization
is also not without cost.

What are  the potential pitfalls? For one thing, not  every HAL might use a
value of zero for  MapRegisterBase to mean "no mapping registers used". So,
there is a potential compatibility issue.

In addition,  by using this shortcut you are  by-passing almost all of NT’s
carefully crafted DMA model.  This in itself could cause some compatibility
problems in the future. For example, it might cause you trouble when 64-bit
memory  support is  added to Windows  NT. What  happens when the  system on
which your  driver is running has  more memory than can  be reached by your
32-bit  busmastering device?  If you  use the  standard NT DMA  model, it’s
possible  that  the  HAL  will  provide  intermediate  buffering  for  your
transfers by using its mapping registers in the same way it does for 24-bit
devices today. If you  bypass the IoAllocateAdapterChannel() step, you will
not get this help, and your driver will not work.
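
A driver that takes this shortcut could at least detect the situation
rather than silently programming the device with a truncated address. The
check below is a sketch in plain C under our own assumptions: sg_element64
and reachable_by_32bit_device() are hypothetical names, and we assume
physical addresses are available to the driver as 64-bit values. It shows
only the arithmetic, not how a driver would obtain those addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter/gather element with a 64-bit physical address. */
struct sg_element64 {
    uint64_t phys;    /* physical base address of the fragment */
    uint32_t length;  /* bytes in the fragment, must be > 0    */
};

/* True only if every byte of every fragment lies below 4GB, i.e. can be
 * expressed in the 32 address bits the device drives. */
bool reachable_by_32bit_device(const struct sg_element64 *list, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        uint64_t last;

        assert(list[i].length > 0);
        last = list[i].phys + list[i].length - 1;   /* last byte touched */
        if (last > UINT32_MAX)
            return false;
    }
    return true;
}
```

When the check fails, the shortcut has nothing to fall back on. Some form
of intermediate buffering is needed, which is the help the standard model
would have provided.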

So,  once again,  whether this  optimization is  worth the cost can only be
answered given knowledge of your project.

It’s All Tradeoffs

As you have seen,  there are a number of shortcuts you can take in NT’s DMA
model to  optimize your driver’s  performance. But, as we  mentioned at the
outset, you’re on your own with these methods. Use them with care, when the
cost is  justified by  the gain, and you’ll  be on your way  to more direct
DMA.


