From: "Jim McCollum"
Newsgroups: comp.os.ms-windows.programmer.nt.kernel-mode
Subject: KeInsertQueueDpc bug on SMP systems?
Date: Thu, 15 Jan 1998 12:43:00 -0500
Organization: UltraNet Communications, Inc. http://www.ultranet.com/
Message-ID: <69lhm3$q2r$1@decius.ultra.net>

I've been getting IRQL_NOT_LESS_OR_EQUAL bugchecks out of the NT kernel that, after lots of debugging, appear to me to be caused by a bug in KeInsertQueueDpc. Analysis of the crash shows that the DPC queue, the header of which is located in the processor control region (PCR), is corrupted. The nature of the corruption is that a DPC has its forward link on the DPC queue in one processor's PCR while its backward link points into the PCR DPC queue of another processor.

I spent a great deal of time tracking this down, including unassembling the code in KeInsertQueueDpc, and I've managed to convince myself that KeInsertQueueDpc is not SMP-safe.

Here's what happens. Two threads running on separate processors make simultaneous calls to KeInsertQueueDpc specifying the same DPC object. Because the DPC is not targeted to a specific processor, each thread attempts to queue the object to the local processor's DPC queue. KeInsertQueueDpc performs the following steps (this is slightly simplified):

1) Raises IRQL to HIGH_LEVEL (31).
2) Acquires a spinlock which is located in the PCR (hence, each thread is acquiring a *different* spinlock).
3) Manipulates the flink/blink in the DPC object to place it on the local processor's DPC queue.
4) Requests a software interrupt (to force processing of the DPC queues).
5) Releases the spinlock acquired in (2).
6) Restores IRQL.
7) Returns to the caller.

Because the spinlock acquired in step (2) is located in the PCR, each thread acquires a different spinlock. Both threads then proceed to step (3) and simultaneously manipulate the list entry in the DPC object, corrupting its links. Both threads then release their respective spinlocks, lower IRQL and exit KeInsertQueueDpc, leaving behind corrupted processor DPC queues. When NT later attempts to retire the DPC and remove it from the queue, it stumbles over the corrupted links and crashes. My driver is nowhere near the stack, but its DPC object is always implicated in the resulting corrupted DPC queues.

After staring at this code, it became clear that while the spinlock in step (2) above protects the DPC queue in the PCR, the DPC object itself is not protected. I am able to work around the crash by associating a spinlock with my DPC object and acquiring it before calling KeInsertQueueDpc. This prevents multiple threads from going through KeInsertQueueDpc simultaneously for the same DPC object, and with it in place the crash disappears.

I distilled the code from my driver and wrote a small driver with only a few routines which will crash an SMP system in a matter of minutes, if not seconds. All this driver does is start up a bunch of threads, which do nothing but sit in a loop and periodically call KeInsertQueueDpc, specifying a dummy DPC routine. I then applied the above workaround and sure enough the crashes went away.

Has anyone else seen these crashes? I'm running 4.0 with service pack 3. While I am able to prevent these crashes in my driver with the workaround, I'm concerned that other NT drivers may unknowingly be susceptible to this problem.
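
For anyone who wants to see the failure mode without an SMP box, the interleaving described above can be replayed deterministically in a small user-mode model. This is only a sketch under stated assumptions: list_entry, init_head, replay_racing_inserts and sole_entry_of are made-up names, not NT's, and the order of the four stores in a tail insert is assumed from a generic doubly linked list insert rather than taken from the KeInsertQueueDpc disassembly. No real threads or spinlocks are involved; the two "CPUs" are simply interleaved by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Doubly linked list entry, modeled on the flink/blink pair in the post. */
typedef struct list_entry {
    struct list_entry *flink;   /* forward link  */
    struct list_entry *blink;   /* backward link */
} list_entry;

/* An empty circular list: the head points at itself in both directions. */
static void init_head(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

/*
 * Replays ONE possible interleaving of two tail inserts of the SAME entry
 * into two DIFFERENT queues. Each hypothetical CPU holds only the lock for
 * its own queue, so nothing orders the stores to the shared entry.
 */
static void replay_racing_inserts(list_entry *queue_a, list_entry *queue_b,
                                  list_entry *dpc)
{
    /* Both queues are assumed to start out empty. */
    assert(queue_a->flink == queue_a && queue_b->flink == queue_b);

    dpc->flink = queue_a;           /* CPU A: entry->flink = head      */
    dpc->flink = queue_b;           /* CPU B: entry->flink = head      */
    dpc->blink = queue_b->blink;    /* CPU B: entry->blink = old tail  */
    dpc->blink = queue_a->blink;    /* CPU A: entry->blink = old tail  */
    queue_a->blink->flink = dpc;    /* CPU A: old tail->flink = entry  */
    queue_a->blink = dpc;           /* CPU A: head->blink = entry      */
    queue_b->blink->flink = dpc;    /* CPU B: old tail->flink = entry  */
    queue_b->blink = dpc;           /* CPU B: head->blink = entry      */
}

/* True only if the entry is correctly linked as the single element of head. */
static bool sole_entry_of(const list_entry *head, const list_entry *entry)
{
    return head->flink == entry && head->blink == entry &&
           entry->flink == head && entry->blink == head;
}
```

Calling init_head on two heads and then replay_racing_inserts leaves the entry with its flink pointing at the second queue and its blink pointing at the first, so sole_entry_of is false for both queues. That is the same shape as the corruption seen in the crash dumps: a forward link on one processor's queue and a backward link on another's.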
I'll include the source code for the driver that crashes an SMP system in a reply to this post.

Thanks,
Jim McCollum
Marathon Technologies Corporation