From: RTPARK::RLB "Bob Boyd 8*565-3627 31-Jan-1989 0906" 31-JAN-1989 09:06
To: ARISIA::EVERHART
Subj: 8800's & V5 -- also, read PP 8.49 in V5 Rel. Notes

From: C5VB::POWNALL "The Wizard" 30-JAN-1989 15:16:04.59
To: RTPARK::RLB,POWNALL
CC:
Subj: RE: VAX 88xx + V5.0-*

Bob,

An update - I think we are going to see a resolution of this problem soon.

An attempt was made last Thursday evening/night to reproduce the crash under controlled circumstances. UETP was run with lots of looping data to the UB DMF32, and lots of load on the CPUs. Guess what! It didn't cooperate! It stayed up in spite of all their (Carl N. from Rochester and Doug S. from Syracuse) efforts.

That prompted a visit from a Dick White from the Boston area. He was here over the weekend and DID reproduce the bus errors on a regular basis, with logic analyzers and all hanging out on the busses. He had to resort to (figuratively) toggling some code into the naked VAX to do this, though. He was mostly able to reproduce the double error halts, which we see along with the invalid I/O exception. The double error halt is caused by the bus problem.

Analysis of his findings goes like this. The hardware (bus converters) works. The basic problem starts at the NMI bus and its timing/latency. Add the intermediate BI timing/latency and the UNIBUS timing/latency and this is what happens.

CPU A (either) requests a read-modify-write to a specific UNIBUS address (device register). Picoseconds (or whatever) later, CPU B requests a read from the SAME UNIBUS address. CPU B must hang out waiting for CPU A's long access to complete, then get its own data. In 99.99% of the cases this works, because the second read completes a tad before the NMI would time it out. In the other 0.01%, the read is timed out. The differentiation of the various types of crashes and bus states seems to depend on just how late the read actually completes.
Once the read times out, its completion is signalled as a fault, which seems to show up in one of two distinct modes: an invalid I/O exception or a double error halt.

On our system, the YC driver is the one and only culprit, since we have no other devices on the UNIBUS. It, for historical and other reasons, has several (many?) places where it 'touches' the actual UNIBUS device registers. An obvious situation which will cause the above scenario is the primary CPU handling a port at interrupt level while the other CPU is doing a start I/O on a different terminal on the same DMF. The start I/O diddles the device at a non-interrupt IPL. On a uniprocessor or ASMP machine there is no conflict, since start I/O usually does little more than set the interrupt bit.

Mr. White proposes modifying the YC driver to insert a mutex or semaphore (possibly a spinlock) to gate access to the device in software. DEC says they will not change the timing through the bus adapters, probably due to the very tight timing needed on the high end systems. I would accept a modified YC driver with software contention, as I like (need) to run lots of old UNIBUS stuff, and understand it is considered a LOW (LOW LOW LOW) performance bus on new VAXes. But I DO insist (demand, throw tantrums, etc.) that the stuff WORK.

The other driver mentioned by Dick was the XM driver (DMC/DMR?), which makes me wonder if the XE driver is a contender. It seems to me that what DEC is saying in a roundabout way (besides GET OFF THE UNIBUS!!!!!) is to restrict drivers of UNIBUS devices to a single CPU, or otherwise absolutely prohibit concurrent access. (I wonder if this is the problem that got terminals et al. banished from 62xx UNIBUSses, or if that was purely a marketing decision?)

Anyway, their plan is to find a dual processor machine with UNIBUS/DMF inside DEC to do the patch on, then try it here once they have it to the point where it doesn't crash the system on its own.
I suggested they (DEC) get in touch with you, as you have the appropriate configuration (complete with crashes), and might be able to give them a bit of system time.

Just an aside - I am curious as to why you did not drop back to V4.7 when the sparks flew, and where you acquired such tolerant users. Ours threatened bodily harm after two weeks, and made us shift a significant portion of the load back to the 8800 (V4.7), which slowed them down some, and made the 8820 stay up longer. I am also curious as to what you define as 'off-hours', as our users consider them to be Christmas and New Year's every third year.

Jim Pownall
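For what it's worth, the software gating Mr. White proposes for the YC driver might look roughly like the sketch below. This is portable C11, not VMS driver code, and every name in it (sim_csr, csr_lock, csr_set_bits, csr_read) is made up for illustration; the "register" is an ordinary variable standing in for a UNIBUS device register. It also ignores interrupt priority levels, which a real driver could not. The only point is that the read-modify-write on one CPU and the read on the other both go through the same lock, so they never reach the bus at the same time.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMF32 device register on the UNIBUS.
   In a real driver this would be a mapped I/O address, not a variable. */
static volatile uint16_t sim_csr;

/* One lock for the whole device: every register touch must hold it. */
static atomic_flag csr_lock = ATOMIC_FLAG_INIT;

void csr_acquire(void)
{
    /* Spin until the flag was previously clear, meaning we now own it. */
    while (atomic_flag_test_and_set_explicit(&csr_lock, memory_order_acquire))
        ;
}

void csr_release(void)
{
    atomic_flag_clear_explicit(&csr_lock, memory_order_release);
}

/* Read-modify-write, as the interrupt-level port handling might do. */
void csr_set_bits(uint16_t bits)
{
    csr_acquire();
    sim_csr = (uint16_t)(sim_csr | bits);
    csr_release();
}

/* Plain read, as a start I/O path on the other CPU might do. */
uint16_t csr_read(void)
{
    uint16_t value;

    csr_acquire();
    value = sim_csr;
    csr_release();
    return value;
}
```

Note the lock is per device rather than per port. That is deliberate: the collision described above happens between two different terminals on the same DMF, so a per-port lock would not keep the two CPUs apart.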