To B-Cache or Not to B-Cache

The following note was written in response to questions about the
"cacheless" Rawhide, also known as the AlphaServer 4100 5/300E. It tries
to explain why some applications may find the cacheless Rawhide faster
than the cached variant. The note was originally posted to:

        MVBLAB::ALPHASERVER_4100

====================================================
Note 79.1        To B-cache or not to B-cache        1 of 10
PERFOM::HENNING          214 lines        28-MAY-1996 17:16
                      -< Draft 1 >-
----------------------------------------------------

A few edits have been made for this posting to the web 8-October-1996.

----------------------------------------------------------------------------

                     To B-Cache or Not to B-Cache

                              J. Henning
                        CSD Performance Group

A question was asked about the performance benefits of a cached Rawhide
(AlphaServer 4100 5/300) vs. the uncached Rawhide (AlphaServer 4100
5/300E). The benefit will vary depending on the application. This note
contains some thumbnail predictions about application types, and some
data based on the SPEC95 suite.

This note may be forwarded both inside and outside Digital, if you see a
need to do so -- for example, if you are trying to explain the difference
between the two systems to an individual customer. (AlphaServer Marketing
approved this usage 16 Oct 1996.)

Thumbnail Predictions

  * Integer programs will tend to benefit from the cache. This includes
    TP, database, and general timesharing.

  * Scientific programs will show more of a mixture. Some will benefit
    greatly from the cache; some will not show much benefit; a few will
    even do better without it.

----------------------------------------------------------------------------

Details

Rawhide Caches

The cache structure on the Alpha 21164 (aka ev5 and ev56) is:

        Level 1 -- 8 KB Instruction + 8 KB data
        Level 2 -- 96 KB Secondary cache, on chip
        Level 3 -- Optional board-level cache

These caches are often referred to as the I-cache, D-cache, S-cache, and
B-cache.
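
As a rough way to read the list above, the following Python sketch reports
the smallest data-side level that could hold a working set of a given
size. It is only an illustration: the function name and the example sizes
are hypothetical, and the model deliberately ignores associativity, line
size, and the sharing of the S-cache and B-cache between instructions and
data.

```python
KB = 1024
MB = 1024 * KB

# Data-side capacities taken from the list above (Alpha 21164).
D_CACHE_BYTES = 8 * KB
S_CACHE_BYTES = 96 * KB


def smallest_level_that_holds(footprint_bytes, bcache_bytes=0):
    """Return the name of the smallest level whose capacity covers the footprint.

    bcache_bytes is the size of the optional board-level cache;
    0 means the system has no B-cache.
    This is a capacity-only simplification, not a performance model.
    """
    levels = [("D-cache", D_CACHE_BYTES), ("S-cache", S_CACHE_BYTES)]
    if bcache_bytes > 0:
        levels.append(("B-cache", bcache_bytes))
    for name, capacity in levels:
        if footprint_bytes <= capacity:
            return name
    return "main memory"


if __name__ == "__main__":
    # Hypothetical working-set sizes, with and without a 2 MB B-cache.
    for footprint in (4 * KB, 64 * KB, 1 * MB, 9 * MB):
        print(footprint,
              smallest_level_that_holds(footprint),
              smallest_level_that_holds(footprint, bcache_bytes=2 * MB))
```

A working set that lands in "main memory" either way is the case where the
board-level cache has nothing to contribute.
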
The caches form a portion of an overall memory hierarchy. Above them, at
the fastest part of the hierarchy, are the internal processor registers;
below the caches are main memory, disks, and tapes.

For Rawhide, the 5/300E has no B-cache; the 5/300 has a 2 MB B-cache; and
the 5/400 has a 4 MB B-cache.

Integer codes

8 out of 8 SPECint95 benchmarks ran faster on the 5/300 than on the
5/300E.

                ------------     ------------
                AlphaServer      AlphaServer
                4100 5/300E      4100 5/300
                no B-cache       2 MB B-cache
                ------------     ------------
                Run    Spec      Run    Spec      BR/
  Benchmarks    Time   Ratio     Time   Ratio     NBR
  ------------  -----  -----     ----   ------    ----
  099.go         553   8.31      455    10.1      1.22
  124.m88ksim    243   7.81      242     7.86     1.01
  126.gcc        275   6.19      235     7.23     1.17
  129.compress   372   4.83      274     6.57     1.36
  130.li         291   6.54      246     7.73     1.18
  132.ijpeg      328   7.31      323     7.43     1.02
  134.perl       198   9.59      194     9.77     1.02
  147.vortex     352   7.66      305     8.86     1.16

  SPECint95            7.15              8.11     1.13

The column "BR/NBR" is the result of dividing "Peak Ratio" (the "Spec
Ratio" column above) for the B-cache system by Peak Ratio for the
No-B-cache system. The higher the number, the more the 5/300 outperformed
the 5/300E.

----------------------------------------------------------------------------

Although not studied here, it is likely that transaction processing and
database applications would benefit from the B-cache at least as much as
SPECint95 does. Dick Sites and others have studied TP/DB applications and
have seen a large instruction stream demand, which is satisfied in the
B-cache. Shifting that demand to main memory would both slow the
individual transaction and add unwanted system-wide traffic.

General timesharing can also be expected to provide an I-stream demand
that is more diverse than that of a system dedicated to running a
SPECint95 benchmark (but not so diverse as to render the B-cache useless).
For example, if one adds up the executables required to implement "The Ten
Most Useful [UNIX] Commands and Constructs" (UNIX for the Impatient,
Abrahams and Larson, 1992, pp. 10-11), 795KB is required.
In short, SPECint95 may understate the benefit that a B-cache will give to
more heterogeneous real-life db/tp/timesharing workloads.

Scientific codes

What kind of applications wouldn't mind if one level of the memory
hierarchy is removed? When would the B-cache not matter?

Obviously, if the application "footprint" (the heavily used portion) is
under 96KB, the B-cache might not be necessary. But such applications are
rare. Some old benchmarks may fit in 96KB; realistic applications are
likely to demand much more.

Is there any other kind of application that would not be harmed by
removing a level in the memory hierarchy? Hint: think about the 90's
corporate world. When can you remove a level of a hierarchy?

Right. If a level of the hierarchy does nothing but pass information from
one level to another, you may find that removing that level does no harm.

Notice that for SPECfp95, 5 of the 10 benchmarks differed by less than 5%
between the 5/300E and the 5/300:

                ------------     ------------
                AlphaServer      AlphaServer
                4100 5/300E      4100 5/300
                no B-cache       2 MB B-cache
                ------------     ------------
                Run    Spec      Run    Spec      BR/
  Benchmarks    Time   Ratio     Time   Ratio     NBR    Winner
  ------------  ----   ------    ----   ------    ----   ------
  101.tomcatv   182    20.3      197    18.8       .93   N
  102.swim      274    31.4      308    27.9       .89   NN
  103.su2cor    237     5.90     186     7.52     1.27   BBB
  104.hydro2d   346     6.93     351     6.83      .99   -
  107.mgrid     242    10.3      245    10.2       .99   -
  110.applu     274     8.03     282     7.80      .97   -
  125.turb3d    363    11.3      356    11.5      1.02   -
  141.apsi      217     9.67     163    12.9      1.33   BBB
  145.fpppp     439    21.9      437    22.0      1.00   -
  146.wave5     241    12.4      192    15.6      1.26   BBB

  SPECfp95             12.0             12.7      1.06   B

The column "BR/NBR" is the result of dividing "Peak Ratio" (the "Spec
Ratio" column above) for the B-cache system by Peak Ratio for the
No-B-cache system. Numbers greater than 1.0 mean the 5/300 outperformed
the 5/300E.

The column "Winner" awards one letter (B for the B-cache system, N for the
no-B-cache system) for being 5% faster (in the previous column), another
for 10%, another for 20%.
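
The arithmetic behind the last two columns can be reproduced with a short
Python sketch. The ratios are copied from the SPECfp95 table; everything
else is an assumption made for illustration: the function names are
hypothetical, the "Winner" rule is one reading of the description above
(the winner's speedup is the larger ratio divided by the smaller, compared
against 1.05, 1.10, and 1.20), and the composite is taken to be the
geometric mean of the per-benchmark ratios, which is how SPEC defines its
composite metrics.

```python
import math

# (benchmark, Spec Ratio on the 5/300E, Spec Ratio on the 5/300),
# copied from the SPECfp95 table.
SPECFP95 = [
    ("101.tomcatv", 20.3, 18.8),
    ("102.swim", 31.4, 27.9),
    ("103.su2cor", 5.90, 7.52),
    ("104.hydro2d", 6.93, 6.83),
    ("107.mgrid", 10.3, 10.2),
    ("110.applu", 8.03, 7.80),
    ("125.turb3d", 11.3, 11.5),
    ("141.apsi", 9.67, 12.9),
    ("145.fpppp", 21.9, 22.0),
    ("146.wave5", 12.4, 15.6),
]

# Assumed reading of "5% faster ... 10% ... 20%".
THRESHOLDS = (1.05, 1.10, 1.20)


def br_nbr(no_bcache_ratio, bcache_ratio):
    """B-cache ratio divided by no-B-cache ratio (the BR/NBR column)."""
    return bcache_ratio / no_bcache_ratio


def winner(no_bcache_ratio, bcache_ratio):
    """One letter per threshold the faster system clears; '-' if none."""
    ratio = br_nbr(no_bcache_ratio, bcache_ratio)
    if ratio >= 1.0:
        letter, speedup = "B", ratio
    else:
        letter, speedup = "N", 1.0 / ratio
    count = sum(speedup >= t for t in THRESHOLDS)
    return letter * count or "-"


def geometric_mean(values):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, nbr, br in SPECFP95:
        print(f"{name:12s} {br_nbr(nbr, br):5.2f}  {winner(nbr, br)}")
    print("SPECfp95 5/300E:", round(geometric_mean([n for _, n, _ in SPECFP95]), 1))
    print("SPECfp95 5/300: ", round(geometric_mean([b for _, _, b in SPECFP95]), 1))
```

On this table the alternative reading of the rule (compare BR/NBR itself
against .95, .90, and .80 for the N letters) would award the same letters,
so the data here cannot distinguish between the two.
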
----------------------------------------------------------------------------

One benchmark, swim, runs substantially faster without a B-cache. What
does swim do? According to IPROBE (a profiling tool available by
contacting Greg Tarsa of the CSD Performance Group), three routines add up
to more than 90% of swim's CPU time:

        Routine   %cpu
        -------   ----
        calc1_    26.7
        calc2_    35.8
        calc3_    33.6

Looking at calc3, IPROBE says that over 2/3 of its time is spent in one
loop's memory instructions, sweeping through 9 arrays. Each of these
arrays occupies a bit over 1M bytes, for a total footprint of over 9MB -
far larger than the 5/300's 2 MB B-cache. Therefore each execution of the
loop ends up loading the arrays in their entirety from main memory; the
caches do not provide any re-use of data.

Similarly, calc1 has a tight loop that sweeps through 6 arrays (each a bit
over 1M) and calc2 sweeps through 9.

The relatively large array sizes and the lack of data re-use explain why
the B-cache does not benefit swim. But why is the 5/300E *faster* than the
5/300 for swim?

The reason that the 5/300E sometimes outperforms the 5/300 on scientific
codes is that Rawhide's main memory bandwidth is actually better without a
cache. For example, the McCalpin STREAM memory bandwidth benchmark shows:

  Benchmarks and Metrics         4100 5/300E   4100 5/300   4100 5/400
  ======================         ===========   ==========   ==========
  McCalpin STREAMS Triad 1 CPU      301.7         234.2        268.0

(The above is extracted from
PERFOM::CSG_REPORTS:UNIX_SVR_PERF_FLASH_960506.TXT or .PS published by the
CSD Performance Group.)

Returning to the analogy with 90's corporate downsizing, if a level of a
hierarchy is truly doing nothing but passing information from one level to
another, eliminating the level may even make the work proceed more
quickly. It's one less link in the chain.

Summary

Q. So when is a customer likely to find a 5/300E just as satisfying as (or
   more satisfying than) a 5/300?

A. If the customer routinely crunches large scientific problems that sweep
   the caches, they may find the 5/300E performs as well as the 5/300. But
   for integer workloads (including database, transaction processing, and
   general timesharing) most customers will probably find better
   performance with the 5/300.

/John Henning
 CSD Performance Group
 Digital Equipment Corporation
 Email: henning@zko.dec.com
 Digital Internal Use Only Homepage: http://tlg-www.zko.dec.com/~henning