From: Dave Harvey [dharvey-ntdev@syssoftsol.com]
Sent: Tuesday, May 14, 2002 11:37 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Generic Spin Lock Question

As someone who has spent a good portion of his life on these issues (e.g., getting a 256-CPU machine to stay up for more than 4 hours), it almost absolutely does not "depend". Unless those ten lines do something extremely expensive, there is NEVER any point in the extra release/acquire. Here is why:

1) Each acquire/release has one (the ...AtDpcLevel variants) or three (not at DPC) serializing instructions (at least on x86). So a normal acquire/release with IRQL manipulation will cost as much as your 10 lines, even without contention.

2) If there is contention, and another CPU gets in between the two lock acquisitions, then you pay through the nose shifting cache lines back and forth. Your average 10 lines of code (without calls) will be cheaper than a single cache-line miss, and cache misses cost hundreds of instruction times on P4-class machines. In more detail:

   CPU 1 acquires lock A; the line is now dirty in CPU 1's cache.
   CPU 2 tries to get the lock, takes a cache miss, and invalidates the line in CPU 1.
   CPU 1 writes a zero to the lock to free it, taking a cache miss and invalidating the line in CPU 2.
   CPU 2 gets a read-only copy of the cache line (another miss) and sees the lock free.
   CPU 2 invalidates the line on CPU 1 and sets the lock.
   CPU 1 tries to get the lock again, takes a cache miss, invalidates the line on CPU 2, and spins.
   CPU 2 writes a zero to the lock to free it, taking a cache miss and invalidating the line on CPU 1.
   CPU 1 gets a read-only copy of the cache line and sees the lock free.
   CPU 1 invalidates the line on CPU 2 and sets the lock.

In the above, there are 6 misses and 2 transitions from shared to exclusive (invalidate other copies). Holding the lock the entire time instead, you get:

   CPU 1 acquires lock A; the line is now dirty in CPU 1's cache.
   CPU 2 tries to get the lock, takes a cache miss, invalidates the line in CPU 1, and spins.
   CPU 1 writes a zero to the lock to free it, taking a cache miss and invalidating the line in CPU 2.
   CPU 2 gets a read-only copy of the cache line (another miss) and sees the lock free.
   CPU 2 invalidates the line on CPU 1 and sets the lock.

In this case, there are 3 misses and 1 shared-to-exclusive bus transaction. So in the latter case, I stall CPU 2 for 10 lines. Those 10 lines would have to be an awfully large number of instructions to be worse than 3 extra cache misses.

3) Fewer lock acquires/releases => simpler code => more likely to work in all cases.

4) Cache-line contention occurs at a much lower duty cycle than lock contention, and you pay plenty for it. That is, if CPU 1 gets lock A, releases it, and then later CPU 2 gets lock A, CPU 2 takes a cache miss. Instead of worrying about how to reduce the lock-held time, it is much more cost effective to worry about avoiding having locks change CPUs. For example, instead of one free list, have one per CPU: free to the list corresponding to the current CPU, and allocate from that list unless it is empty, in which case search the other CPUs' lists (see the sketch below). Under most loads the lock on each list will not move between CPUs, nor will the data structures themselves (and make the free lists LIFO, not FIFO). And, by the way, no lock contention either.
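To make the per-CPU free-list idea in point 4 concrete, a minimal sketch in C against the standard WDK spin-lock and singly-linked-list routines might look like the following. The names PER_CPU_FREE_LIST, FreeEntry, AllocateEntry, and MAX_CPUS are illustrative assumptions, not code from the post; callers are assumed to already be at DISPATCH_LEVEL, and a real driver would also pad each list head to its own cache line so the per-CPU lists do not falsely share lines.

    #include <ntddk.h>

    #define MAX_CPUS 64  /* illustrative fixed upper bound for the sketch */

    typedef struct _PER_CPU_FREE_LIST {
        KSPIN_LOCK        Lock;   /* one lock per CPU's list */
        SINGLE_LIST_ENTRY Head;   /* LIFO keeps hot entries in this CPU's cache */
        /* pad to a cache-line boundary in real code */
    } PER_CPU_FREE_LIST;

    static PER_CPU_FREE_LIST FreeLists[MAX_CPUS];

    VOID InitFreeLists(VOID)
    {
        ULONG i;
        for (i = 0; i < MAX_CPUS; i++) {
            KeInitializeSpinLock(&FreeLists[i].Lock);
            FreeLists[i].Head.Next = NULL;
        }
    }

    /* Free to the list of the CPU we are running on. */
    VOID FreeEntry(PSINGLE_LIST_ENTRY Entry)
    {
        ULONG cpu = KeGetCurrentProcessorNumber() % MAX_CPUS;

        KeAcquireSpinLockAtDpcLevel(&FreeLists[cpu].Lock);
        PushEntryList(&FreeLists[cpu].Head, Entry);        /* LIFO push */
        KeReleaseSpinLockFromDpcLevel(&FreeLists[cpu].Lock);
    }

    /* Allocate from the current CPU's list; search the others only if it is empty. */
    PSINGLE_LIST_ENTRY AllocateEntry(VOID)
    {
        ULONG self = KeGetCurrentProcessorNumber() % MAX_CPUS;
        ULONG i;

        for (i = 0; i < MAX_CPUS; i++) {
            ULONG cpu = (self + i) % MAX_CPUS;
            PSINGLE_LIST_ENTRY entry;

            KeAcquireSpinLockAtDpcLevel(&FreeLists[cpu].Lock);
            entry = PopEntryList(&FreeLists[cpu].Head);
            KeReleaseSpinLockFromDpcLevel(&FreeLists[cpu].Lock);

            if (entry != NULL) {
                return entry;  /* common case: i == 0, the lock's line never moved */
            }
        }
        return NULL;  /* all lists empty */
    }

In the common case both the lock and the list head stay in the local CPU's cache, which is exactly the "no cache-line movement, no lock contention" property described above.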
Another way to think about this is that the cache-line transfer between CPUs to acquire the lock is part of the lock-held time: even though the CPU that is about to get the lock has not yet marked it set, no other CPU can get the lock. So CPU 2 does not acquire the lock at the moment its test-and-set succeeds; rather, it acquired the lock when it shot down the cache line in the other CPUs. Due to this fact alone, the lock-held time increases with contention, and two CPUs can take substantially longer than one to do the same work. Then you can add in the fact that the data protected by that lock is often in the same cache line, and the other CPU is spinning reading that cache line.....

-DH

"Bill McKenzie" wrote in message news:28200@ntdev...

This absolutely depends. What are those 10 lines doing, i.e., how long will they take to execute? You stated this code is called frequently, so how many other places in the driver is this spin lock taken? If it is used throughout and frequently, you might not want to hold it that long, because you might be holding up other threads on other processors. You would probably have to profile the driver to find the sweet spot at which it becomes more beneficial to hold the lock than to drop it, and this would likely vary from single- to multiple-processor environments. And then you would have to profile under load, or some such. There is no general answer to this question, although you could certainly determine how many cycles it takes to grab the lock and make a rough estimation that way.

--
Bill McKenzie

"PJ Kirner" wrote in message news:28180@ntdev...

I recently had some code that was structured like this:

 - Claim SpinLock A
 - Do about 10 lines of code where claiming the spin lock is NECESSARY
 - Do another 10 lines of code where holding the spin lock is NOT NECESSARY
 - Do another 10 lines of code where claiming the spin lock is NECESSARY
 - Release SpinLock A

The obvious solution is to reorganize so that the middle 10 lines of code are outside the spin lock, but for argument's sake let's say that is not possible. Is the above code structure more efficient, or is the following structure more efficient?

 - Claim SpinLock A
 - Do about 10 lines of code where claiming the spin lock is NECESSARY
 - Release SpinLock A
 - Do another 10 lines of code where holding the spin lock is NOT NECESSARY
 - Claim SpinLock A
 - Do another 10 lines of code where claiming the spin lock is NECESSARY
 - Release SpinLock A

I know there is probably no good answer for this because there are so many variables coming into play, but let's say this code is executed very frequently. I guess the fundamental question is: what is the real overhead of claiming a spin lock on an MP machine, and where do you draw the line between claiming a spin lock too many times and holding it for too long?

Thanks for any input.

PJ
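For reference, the two structures PJ is comparing might be sketched in C roughly as follows; DEVICE_DATA, LockA, and the Step* helpers are placeholder names, not PJ's actual code. Per the analysis at the top of the thread, HoldOnce is the form to prefer on an MP machine: ReleaseAndReacquire adds an extra acquire/release pair and, under contention, roughly three extra cache misses.

    #include <ntddk.h>

    typedef struct _DEVICE_DATA {
        KSPIN_LOCK LockA;
        /* ... fields touched by the steps below ... */
    } DEVICE_DATA;

    /* Placeholder bodies standing in for the three 10-line sections. */
    static VOID StepNeedingLock1(DEVICE_DATA *Data)   { UNREFERENCED_PARAMETER(Data); }
    static VOID StepNotNeedingLock(DEVICE_DATA *Data) { UNREFERENCED_PARAMETER(Data); }
    static VOID StepNeedingLock2(DEVICE_DATA *Data)   { UNREFERENCED_PARAMETER(Data); }

    /* Structure 1: hold the lock across all three sections
     * (one acquire/release, one cache-line round trip under contention). */
    VOID HoldOnce(DEVICE_DATA *Data)
    {
        KIRQL oldIrql;

        KeAcquireSpinLock(&Data->LockA, &oldIrql);
        StepNeedingLock1(Data);     /* ~10 lines that need the lock  */
        StepNotNeedingLock(Data);   /* ~10 lines that do not need it */
        StepNeedingLock2(Data);     /* ~10 lines that need the lock  */
        KeReleaseSpinLock(&Data->LockA, oldIrql);
    }

    /* Structure 2: drop and retake the lock around the middle section
     * (two acquire/release pairs, extra serializing instructions, and
     * potentially several extra cache misses if another CPU slips in). */
    VOID ReleaseAndReacquire(DEVICE_DATA *Data)
    {
        KIRQL oldIrql;

        KeAcquireSpinLock(&Data->LockA, &oldIrql);
        StepNeedingLock1(Data);
        KeReleaseSpinLock(&Data->LockA, oldIrql);

        StepNotNeedingLock(Data);

        KeAcquireSpinLock(&Data->LockA, &oldIrql);
        StepNeedingLock2(Data);
        KeReleaseSpinLock(&Data->LockA, oldIrql);
    }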