Everhart, Glenn
From:	Michael McAllister [mcallister@annapolis.net]
Sent:	Thursday, October 15, 1998 1:32 AM
To:	'srinivasarao'
Cc:	'ntdev@atria.com'
Subject:	RE: [ntdev] slow data transfer 
Oh, my.   And I thought it was just me going crazy a few months ago!

I wrote to this list a while back about slow accesses to memory.  In my 
case, I wrote a driver (for a PCI device) that needed to map a buffer from 
the kernel's virtual space into a process in user virtual space.  I, too, 
used ZwMapViewOfSection.  And I, too, saw VERY slow accesses to this memory. 
For example, performing a memcpy() in the user process to or from this 
"ZwMap'd" buffer took 4 TIMES as long as a memcpy() to a second buffer that 
was just malloc'd normally in that user process.
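
A comparison like the one above can be measured with a small user-mode timing helper. This is only a sketch: the function names are made up for illustration, clock() is coarse (so use many repetitions), and in a real test one buffer would be the driver-mapped address while the other comes from malloc().

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time `reps` memcpy() calls of `len` bytes from src to dst.
 * Returns elapsed CPU time in seconds as reported by clock(). */
double time_memcpy_seconds(void *dst, const void *src, size_t len, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        /* Read back through a volatile pointer so the copy is not optimized away. */
        (void)*(volatile unsigned char *)dst;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* How many times slower the mapped buffer is than the baseline buffer.
 * Returns 0.0 if the baseline time is too small to divide by. */
double slowdown_ratio(double mapped_seconds, double baseline_seconds)
{
    return baseline_seconds > 0.0 ? mapped_seconds / baseline_seconds : 0.0;
}
```

Calling time_memcpy_seconds() once with the mapped buffer and once with a malloc'd buffer of the same size, then passing both results to slowdown_ratio(), would give the kind of "4 times" figure quoted above.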

I wrote to this list asking for help.  I got a lot of responses (thanks to 
all!!).  The one that fixed my problem came from George Blat... it read (in 
part):

"Allocate contiguous physical memory in your driver with 
MmAllocateContiguousMemory. Get the system virtual address.  call 
IoAllocateMdl to get a MDL pointer. Do not associate it with any Irp.  call 
MmBuildMdlForNonPagedPool to fill the page information. call 
MmMapLockedPages with UserMode to get an user reachable address that you can 
return on your IOCTL."

As soon as I did the mapping between Kernel Virtual and User Virtual with 
"MmMapLockedPages" instead of with "ZwMapViewOfSection", the thing took 
off... the buffer accesses to the mapped Kernel Contiguous Buffer were JUST 
AS FAST as buffers malloc'd from the user process.  It was that simple. I 
had spent weeks going off on all kinds of tangents (and learning A LOT in 
the process) and all it took was that simple change to see the buffer 
accesses improve to where they should have been all along.  I never did 
figure out WHY ZwMapViewOfSection was such a dog....  I am now working on a 
different project.
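
For anyone who wants to try it, George's steps might look roughly like the following in a driver. This is a sketch and not my original code: the SHARED_BUFFER structure, the function names, and the 32-bit address limit are assumptions made for illustration. It has to be built with the DDK, and the mapping call has to run in the context of the process that will use the address (for example, inside that process's IOCTL).

```c
#include <ntddk.h>

/* Hypothetical bookkeeping for one shared buffer (names are illustrative). */
typedef struct _SHARED_BUFFER {
    PVOID KernelVa;   /* system virtual address of the contiguous block */
    PMDL  Mdl;        /* MDL describing the block's physical pages      */
    PVOID UserVa;     /* address valid only in the mapping process      */
} SHARED_BUFFER;

/* Release everything; call in the same process context that did the mapping. */
VOID UnmapContiguousBuffer(SHARED_BUFFER *Buf)
{
    if (Buf->UserVa != NULL)   MmUnmapLockedPages(Buf->UserVa, Buf->Mdl);
    if (Buf->Mdl != NULL)      IoFreeMdl(Buf->Mdl);
    if (Buf->KernelVa != NULL) MmFreeContiguousMemory(Buf->KernelVa);
    Buf->UserVa = NULL;
    Buf->Mdl = NULL;
    Buf->KernelVa = NULL;
}

/* Allocate contiguous memory and map it into the current process. */
NTSTATUS MapContiguousBufferToUser(SHARED_BUFFER *Buf, ULONG Length)
{
    PHYSICAL_ADDRESS highest;

    /* Assumption: the DMA engine can reach any 32-bit physical address. */
    highest.QuadPart = 0xFFFFFFFF;

    Buf->UserVa = NULL;
    Buf->Mdl = NULL;
    Buf->KernelVa = MmAllocateContiguousMemory(Length, highest);
    if (Buf->KernelVa == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* Irp = NULL: the MDL is not associated with any IRP. */
    Buf->Mdl = IoAllocateMdl(Buf->KernelVa, Length, FALSE, FALSE, NULL);
    if (Buf->Mdl == NULL) {
        UnmapContiguousBuffer(Buf);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Fill in the page information for the nonpaged block. */
    MmBuildMdlForNonPagedPool(Buf->Mdl);

    /* With UserMode, a failed mapping raises an exception rather than
     * returning NULL, so guard the call. */
    __try {
        Buf->UserVa = MmMapLockedPages(Buf->Mdl, UserMode);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        Buf->UserVa = NULL;
    }
    if (Buf->UserVa == NULL) {
        UnmapContiguousBuffer(Buf);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    return STATUS_SUCCESS;   /* return Buf->UserVa to the caller via the IOCTL */
}
```

The user address is only meaningful inside the process that was current when MmMapLockedPages was called, so the unmap should also happen in that process before it exits.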

I don't know if you can use the same mapping technique... I was trying to 
map contiguous memory (only allocatable from the kernel) into user space so 
that the user program could fill it with data, then a DMA engine on my 
hardware could efficiently move it to the hardware device's 32 MB buffer. 
But it is something to maybe take a look at.  It sped up my buffer 
accesses dramatically!   And to anyone who remembers my question about 
whether people really see 132 MB/s on the PCI bus:   YES!   I average 
125MB/s throughput on my 333MHz Dell Pentium II, but I have hit 130 on 
occasion.
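
For reference, the 132 MB/s figure is the theoretical peak of a 32-bit bus clocked at 33 MHz (4 bytes per clock). The bus width and clock are assumed here, and the helper names are just for illustration. A small sketch of the arithmetic:

```c
/* Peak transfer rate in MB/s for a parallel bus moving one word per clock. */
double bus_peak_mb_per_s(double clock_mhz, unsigned bus_width_bits)
{
    return clock_mhz * (bus_width_bits / 8.0);
}

/* Fraction of the peak rate that a measured rate achieves. */
double bus_efficiency(double measured_mb_per_s, double peak_mb_per_s)
{
    return measured_mb_per_s / peak_mb_per_s;
}
```

With these assumptions, bus_peak_mb_per_s(33.0, 32) gives 132, so an average of 125 MB/s is roughly 95% of peak and a burst of 130 MB/s is roughly 98%.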

Again, I don't know if this will help you... I really don't know as much as 
the rest of the guys on this list about how NT does things, and their 
responses are probably much better than mine.  But if you are stuck, maybe 
the above info will help.


Thanks again Mr. Blat & all those who helped me!


-Michael McAllister





-----Original Message-----
From:	srinivasarao [SMTP:dsrao@ada.ernet.in]
Sent:	Wednesday, October 14, 1998 1:55 PM
To:	Dave Harvey; srinivasarao; 'Bhaduri, Arnab'; ntdev@atria.com; 'Mark 
Roddy'
Subject:	RE: [ntdev] slow data transfer

Hi friends,

	Thank you so much for taking pains on my behalf. At present I am following
the example "MAPMEM" in the DDK exactly. I am not using any cache. In fact I
have declared this carefully in the translation commands. My hardware consists
of device memory of 64 KB size, some ports at 0x140, and an interrupt.  I
have mapped both ports and memory separately. Then I made use of
"ZwMapViewOfSection" in kernel mode to get a "Virtual Address" in user space.

So, I get an address from kernel mode to my application through a
DeviceIoControl command. Using this address as a base, I am trying to
write to different locations by varying the offset.
	The reason why I believe my driver is working is that I am following the
standard method using examples. I am checking the return values at each
level. Initially I, too, had a doubt whether I was writing to my
driver or not. Later I found that it needs some delay for proper
operation. Anyway, this is only my understanding. Please let me know if I
am doing anything wrong. I think Mark has given some suggestion, but I don't
know how to implement it. In the example "MAPMEM", he is using
"ioaccess.h" to write into registers. These macros are as good as using I/O
calls. Here is the code:


//
// I/O space read and write macros.
//
//  The READ/WRITE_REGISTER_* calls manipulate MEMORY registers.
//  (Use x86 move instructions, with LOCK prefix to force correct behavior
//   w.r.t. caches and write buffers.)
//
//  The READ/WRITE_PORT_* calls manipulate I/O ports.
//  (Use x86 in/out instructions.)
//



#if defined(_X86_)

#define READ_REGISTER_UCHAR(Register)          (*(volatile UCHAR *)(Register))
#define READ_REGISTER_USHORT(Register)         (*(volatile USHORT *)(Register))
#define READ_REGISTER_ULONG(Register)          (*(volatile ULONG *)(Register))
#define WRITE_REGISTER_UCHAR(Register, Value)  (*(volatile UCHAR *)(Register) = (Value))
#define WRITE_REGISTER_USHORT(Register, Value) (*(volatile USHORT *)(Register) = (Value))
#define WRITE_REGISTER_ULONG(Register, Value)  (*(volatile ULONG *)(Register) = (Value))
#define READ_PORT_UCHAR(Port)                  inp (Port)
#define READ_PORT_USHORT(Port)                 inpw (Port)
#define READ_PORT_ULONG(Port)                  inpd (Port)
#define WRITE_PORT_UCHAR(Port, Value)          outp ((Port), (Value))
#define WRITE_PORT_USHORT(Port, Value)         outpw ((Port), (Value))
#define WRITE_PORT_ULONG(Port, Value)          outpd ((Port), (Value))


#elif defined(_PPC_) || defined(_MIPS_)

#define READ_REGISTER_UCHAR(x)      (*(volatile UCHAR * const)(x))
#define READ_REGISTER_USHORT(x)     (*(volatile USHORT * const)(x))
#define READ_REGISTER_ULONG(x)      (*(volatile ULONG * const)(x))
#define WRITE_REGISTER_UCHAR(x, y)  (*(volatile UCHAR * const)(x) = (y))
#define WRITE_REGISTER_USHORT(x, y) (*(volatile USHORT * const)(x) = (y))
#define WRITE_REGISTER_ULONG(x, y)  (*(volatile ULONG * const)(x) = (y))
#define READ_PORT_UCHAR(x)          READ_REGISTER_UCHAR(x)
#define READ_PORT_USHORT(x)         READ_REGISTER_USHORT(x)
#define READ_PORT_ULONG(x)          READ_REGISTER_ULONG(x)

//
// All these macros take a ULONG as a parameter so that we don't
// force an extra typecast in the code (which will cause the X86 to
// generate bad code).
//

#define WRITE_PORT_UCHAR(x, y)      WRITE_REGISTER_UCHAR(x, (UCHAR) (y))
#define WRITE_PORT_USHORT(x, y)     WRITE_REGISTER_USHORT(x, (USHORT) (y))
#define WRITE_PORT_ULONG(x, y)      WRITE_REGISTER_ULONG(x, (ULONG) (y))

#endif


In the above code I am using the WRITE_REGISTER_*() calls to write to
registers and the READ_REGISTER_*() calls to read from them, using the
virtual address that I got from kernel mode. I hope you can give me some
solution to this.
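
To make the base-plus-offset access concrete, here is a rough user-mode sketch of what the application side does with that address. The helper names are hypothetical and the 64 KB window size is taken from the hardware description above; the volatile casts do the same job as the x86 register macros. An ordinary array can stand in for the mapped device memory when trying it out.

```c
#include <stddef.h>
#include <stdint.h>

#define DEVICE_WINDOW_BYTES (64u * 1024u)   /* 64 KB of device memory */

/* Check that a 32-bit access at `offset` is aligned and inside the window. */
static int offset_ok(size_t offset)
{
    return offset % sizeof(uint32_t) == 0 &&
           offset <= DEVICE_WINDOW_BYTES - sizeof(uint32_t);
}

/* Write a 32-bit value at base + offset; returns 0 on success, -1 on a bad offset. */
int reg_write32(void *base, size_t offset, uint32_t value)
{
    if (!offset_ok(offset))
        return -1;
    *(volatile uint32_t *)((unsigned char *)base + offset) = value;
    return 0;
}

/* Read a 32-bit value from base + offset; returns 0 on success, -1 on a bad offset. */
int reg_read32(const void *base, size_t offset, uint32_t *out)
{
    if (!offset_ok(offset))
        return -1;
    *out = *(const volatile uint32_t *)((const unsigned char *)base + offset);
    return 0;
}
```

Checking the offset before every access costs little and catches writes that would run past the end of the mapped window.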


Thanks in advance

srinivas


 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[ To unsubscribe, send email to ntdev-request@atria.com with body
UNSUBSCRIBE (the subject is ignored). ]
