From:	RTPARK::RLB  "Bob Boyd 8*565-3627  31-Jan-1989 0906" 31-JAN-1989 09:06
To:	ARISIA::EVERHART
Subj:	8800's & V5 -- also, read PP 8.49 in V5 Rel. Notes

From:	C5VB::POWNALL      "The Wizard" 30-JAN-1989 15:16:04.59
To:	RTPARK::RLB,POWNALL     
CC:	
Subj:	RE: VAX 88xx + V5.0-*

Bob,
   An update - I think we are going to see a resolution of this 
problem soon. An attempt was made last Thursday evening/night to 
reproduce the crash under controlled circumstances. UETP was run 
with lots of looping data to the UB DMF32, and lots of load on 
the CPUs. Guess what! It didn't 
cooperate! Stayed up in spite of all their (Carl N. from 
Rochester and Doug S. from Syracuse) efforts. That prompted a 
visit from a Dick White from the Boston area. He was here over 
the weekend and DID reproduce the bus errors on a regular basis, 
with logic analyzers and all hanging out on the busses. He had to 
resort to (figuratively) toggling some code into the naked VAX to 
do this, though. He was mostly able to reproduce the double error 
halts, which we see along with the invalid io exception. The 
double error halt is caused by the bus problem.
   Analysis of his findings goes like this. The hardware (bus 
converters) works. The basic problem starts at the NMI bus and its 
timing/latency. Add the intermediate BI timing/latency and the 
UNIBUS timing/latency and this is what happens. CPU A (either) 
requests a read-modify-write to a specific UNIBUS address (device 
register). Picoseconds (or whatever) later, CPU B requests a read 
from the SAME UNIBUS address. CPU B must hang out waiting for CPU 
A's long access to complete, then get its own data. In 99.99% of 
the cases this works, because the second read completes a tad 
before the NMI would time it out. In the other 0.01%, the read is 
timed out. The difference between the various types of crashes 
and bus states seems to depend on just how late the read actually 
completes. Once the read times out, its completion is signalled 
as a fault, which seems to take one of two distinct forms: an 
invalid io exception or a double error halt.
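
The stack-up described above comes down to simple arithmetic, which a toy model can make concrete. This is only a sketch: the function name and every number in it are invented for illustration (arbitrary time units, not measured NMI, BI or UNIBUS figures), and it says nothing about which of the two failure modes results.

```python
# Toy model of the latency stack-up: CPU B's read queues behind CPU A's
# read-modify-write. All numbers are hypothetical, in arbitrary time units.

def second_read_lateness(rmw_time, read_time, nmi_timeout):
    """Return how far past the NMI timeout CPU B's read completes.

    CPU B's read cannot start until CPU A's read-modify-write is done,
    so its completion time is the sum of both accesses. A result <= 0
    means the read finished in time; > 0 means the NMI timed it out first.
    """
    completion = rmw_time + read_time
    return completion - nmi_timeout


# Hypothetical usual case: the read squeaks in just under the limit.
usual = second_read_lateness(rmw_time=60, read_time=35, nmi_timeout=100)
# Hypothetical rare case: a slightly slower RMW pushes the read past it.
rare = second_read_lateness(rmw_time=70, read_time=35, nmi_timeout=100)

print("usual case lateness:", usual)  # negative: completes in time
print("rare case lateness:", rare)    # positive: read is timed out
```

The point of the sketch is only that a small change in the first access is enough to flip the second one from "just in time" to "timed out".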
   On our system, the YC driver is the one and only culprit since 
we have no other devices on the UNIBUS. It, for historical and 
other reasons, has several (many?) places where it 'touches' the 
actual UNIBUS device registers. An obvious situation which will 
cause the above scenario is the primary CPU handling a port at 
interrupt level while the other CPU is doing a start i/o on a 
different terminal on the same DMF. The start i/o diddles the 
device at a non-interrupt ipl. On a uniprocessor or ASMP machine, 
no conflict, since start i/o usually does little more than set 
the interrupt bit. Mr. White proposes modifying the YC driver to 
insert a mutex or semaphore (possibly a spinlock) to gate access 
to the device in software. DEC says they will not change the 
timing through the bus adapters, probably due to the very tight 
timing needed on the high end systems.
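
The idea of gating device access in software can be sketched in a few lines. This is not VMS driver code: Python's threading.Lock merely stands in for whatever mutex, semaphore or spinlock the patched driver would actually use, and the DeviceRegister class, the run function and the access counts are all hypothetical names made up for the illustration.

```python
import threading


class DeviceRegister:
    """Stand-in for one UNIBUS device register (hypothetical)."""

    def __init__(self):
        self.value = 0
        self.busy = False   # True while an access is in progress
        self.overlaps = 0   # number of times two accesses overlapped

    def read_modify_write(self):
        if self.busy:       # another "CPU" is already on the register
            self.overlaps += 1
        self.busy = True
        self.value += 1     # the read-modify-write itself
        self.busy = False


def run(n_per_cpu=1000):
    """Have two threads ("CPUs") touch one register, gated by a lock."""
    reg = DeviceRegister()
    gate = threading.Lock()  # the software gate in front of the device

    def cpu(accesses):
        for _ in range(accesses):
            with gate:       # only one CPU may touch the device at a time
                reg.read_modify_write()

    cpus = [threading.Thread(target=cpu, args=(n_per_cpu,)) for _ in range(2)]
    for t in cpus:
        t.start()
    for t in cpus:
        t.join()
    return reg


reg = run()
print("accesses:", reg.value, "overlaps:", reg.overlaps)
```

With the gate in place the second CPU waits in software instead of on the bus, so its access never queues behind a long read-modify-write in hardware and cannot be pushed toward the timeout.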
   I would accept a modified YC driver with software contention, 
as I like (need) to run lots of old UNIBUS stuff, and understand 
it is considered a LOW (LOW LOW LOW) performance bus on new 
VAXes. But I DO insist (demand, throw tantrums, etc) that the 
stuff WORK. The other driver mentioned by Dick was the XM driver 
(DMC/DMR?) which makes me wonder if the XE driver is a contender.
It seems to me that what DEC is saying in a roundabout way 
(besides GET OFF THE UNIBUS!!!!!) is to restrict drivers of 
UNIBUS devices to a single CPU, or otherwise absolutely prohibit 
concurrent access.
   (I wonder if this is the problem that got terminals et al 
banished from 62xx UNIBUSses, or if that was a purely marketing 
decision?)
   Anyway, their plan is to find a dual processor machine with 
UNIBUS/DMF in DEC to do the patch, then try it here after they 
get it so it doesn't crash the system on its own. I suggested 
they (DEC) get in touch with you, as you have the appropriate 
configuration (complete with crashes), and might be able to give 
them a bit of system time.
    Just an aside - I am curious as to why you did not drop back 
to V4.7 when the sparks flew, and where you acquired such 
tolerant users. Ours threatened bodily harm after two weeks, and 
made us shift a significant portion of the load back to the 8800
(V4.7), which slowed them down some, and made the 8820 stay up 
longer. I am also curious as to what you define as 'off-hours', 
as our users consider them Christmas and New Year's every third 
year.
                                          Jim Pownall