Article 163050 of comp.os.vms:

moroney@world.std.com (Michael Moroney) writes:

>In article , greg@indiana.edu (Gregory Travis) wrote:
>> campbejr@phu989.mms.sbphrd.com (John R. Campbell) writes:
>> >4> Instructions were 15/30/60 bit so you could have up to 4 instructions
>> per word.
>The 60 bit instructions (character manipulation) were added as an
>afterthought.  Before these, the machine was RISC long before RISC became
>"popular", and there were only 78 instructions.  Opcodes were the first 6
>bits except for two "opcodes" which used the next 3 bits to determine which
>of 8 instructions it was.

You're talking about the "CMU" (Compare and Move Unit), which implemented
VAX-like (close enough) character-manipulation instructions.  They allowed
you (in assembly) to work on byte (12-bit (6-bit?)) sized data instead of
having to grab whole words.  I don't remember exactly, but I don't believe
you could put a CMU in a 6600.  I know it certainly was an option on our
later/slower/smaller Cyber 172, though.

>Minor nit: CIO (combined input/output) was only one of several PPU programs
>that could be requested.  CIO was the most common PPU program run since it
>did all disk (and other?) I/O.  Other PPU programs were somewhat similar
>to VMS system services.  Which PPU program was requested was determined by
>the 3 letter code at the high 18 bits of word at address 1, the lower bits
>specified parameters or addresses of various sorts.

Absolutely correct - I oversimplified.  As Mike points out, there was one
PPU (PPU 0) which always ran just one program.  The other PPUs were
available for running any PPU program.  The program in PPU 0 is closest to
what we would today call the "kernel".  Its duties, if I remember
correctly, were roughly:

1. Make sure the CPU had not halted (because a program had executed the
   "PS" (program stop) instruction).
2. Timeslice among the running processes (i.e. interrupt the CPU).
3. Scan each program's address 1 (as Mike pointed out) to see if the
   program was requesting a system service (CIO/etc.)
4. Schedule PPU programs on the other PPUs and monitor their health.

In addition to the program in PPU 0, there was a small system "kernel" in
CPU memory as well.  PPU 0 would schedule this code to perform certain
operations that the CPU could do faster than a PPU (such as memory copies)
and other stuff.  The PPUs could stop/start the CPU because they could
write the CPU's program counter.  I cannot, for the life of me, remember
the name of the program which ran in PPU 0.

The number of PPUs that a system had varied - you could order more if you
wanted, up to a maximum of 20 (not quite sure on the number).  Each PPU
consisted of only 4K of 12-bit bytes.  Many of the larger PPU programs,
towards the end of the system's development, were not fitting into 4K.
The solution to this was PPU overlays.

Any PPU could attach to any "channel".  A channel was a datapath to a
device.  When PPUs were idle they looped on a certain channel waiting for
data.  The other end of this channel would get connected to PPU 0 when it
wanted to load a PPU program into a given PPU.

Deadstart (IPL to you IBM boys, or "reboot" to UNIX people) on the 6600
was something people talked about in low whispers.  On the 6600 there was
a bay which could be opened up.  Inside the bay was a matrix of 12 x 12
toggle switches and a pushbutton labelled "deadstart".  The switches
represented 12 words of 12-bit PPU instructions.  When the deadstart
switch was pressed, the contents of the switch matrix were dumped, via
channel 0, into PPU 0's memory starting at location 0.  The instruction
matrix represented a simple bootstrap program which loaded a larger
bootstrap from a disk or tape, beginning the system startup procedure.
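The deadstart panel dump is simple enough to model.  Here is a minimal
Python sketch; the bit ordering (most-significant switch first) and the
function name are my own assumptions, not anything from the manuals:

```python
# Toy model of the 6600 deadstart panel: a 12 x 12 matrix of toggle
# switches, each row holding one 12-bit PPU word.  Pressing "deadstart"
# dumps the rows into PPU 0's memory at locations 0..11 (via channel 0).
# Assumption: each row is given most-significant switch first.

def deadstart(panel_rows):
    """panel_rows: 12 rows of 12 switch settings (0 or 1).
    Returns the 12 words written to PPU 0 locations 0..11."""
    assert len(panel_rows) == 12 and all(len(r) == 12 for r in panel_rows)
    ppu0_memory = []
    for row in panel_rows:
        word = 0
        for bit in row:
            word = (word << 1) | (bit & 1)   # pack 12 switches into a word
        ppu0_memory.append(word)             # each word is 0..0o7777
    return ppu0_memory

# Row 0 with all switches up, the rest down:
panel = [[1] * 12] + [[0] * 12] * 11
print(oct(deadstart(panel)[0]))              # 0o7777
```

From here the real panel's 12 words had to be a bootstrap loop that read
the larger bootstrap in from disk or tape - there was no room for anything
else.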
>For some reason unconditional branches were always compiled as "conditional"
>branches where register B0 was compared to register B0 and if the same
>the branch was taken, despite the existence of an unconditional Jump
>instruction.  B0 was always equal to B0, of course.  (B0 was also always 0,
>this was done so fewer instructions were needed)  I *believe* this was done
>since the unconditional Jump instruction wiped out this cache, or was slower
>somehow.

Yes, as Mike pointed out, there was the "JP Bn, Address" instruction,
which jumped to Address + contents of the B register.  The most common
form was probably just:

  JP Address

which Compass generated as:

  JP B0, Address

Since B0 was hardwired to the value 0, this was an unconditional jump to
Address.  However, there were also comparison jumps which could be done on
B registers, such as:

  EQ B4, B5, Address

which jumped to Address if B4 and B5 were equal.  This meant that:

  EQ B0, B0, Address

was functionally equivalent to "JP Address".  However, "EQ" was actually a
little faster than "JP" (by 100ns), so everyone, including the compilers,
always coded an unconditional jump as:

  EQ B0, B0, Address
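The two jump forms can be sketched in a few lines of Python (a toy model;
the register values and addresses below are made up, and no timing is
modeled):

```python
# Toy model of the two 6600 jump forms.  All B registers start at 0;
# B0 is hardwired to 0, which is what makes "EQ B0,B0,K" an
# unconditional branch.

B = {i: 0 for i in range(8)}      # B registers (B0 stays 0)
B[4] = 8                          # arbitrary example value

def jp(b, target):
    """JP Bn,K -- jump to K + (Bn)."""
    return target + B[b]

def eq(pc, b1, b2, target):
    """EQ Bi,Bj,K -- branch to K if (Bi) == (Bj), else fall through."""
    return target if B[b1] == B[b2] else pc + 1

print(jp(0, 64))                  # 64: JP B0,K is an unconditional jump to K
print(eq(10, 0, 0, 64))           # 64: EQ B0,B0,K always branches
print(eq(10, 0, 4, 64))           # 11: B0 != B4, so fall through to pc+1
```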
>> Since 6600 had no JUMP TO <register>
>> instruction, I constructed

>The JP instruction didn't specify a B register as one of its inputs?  Been
>so long now.

You are right, I just looked it up in "Assembly Language Programming."
Now I'm wondering what rationale I had, if any, for constructing the jump
instruction on the fly as I did; it was eighteen years ago :-(.

>Another interesting feature of this machine was no stack.  Also subroutines
>were implemented such that the HARDWARE performed self-modifying code!
>(The first word of a subroutine was left blank and the subroutine call
>instruction wrote a Branch to the instruction after the subroutine call into
>this cell!  The subroutine "returned" by branching to this cell!)

Yes, this was the only real weak point of the machine.  The fact that the
hardware stored the return address IN the subroutine made it damn
difficult to implement recursive or reentrant subroutines.

Some more "stuff" about the 6600.  As most people know, the machine's
memory had no hardware error detection at all.  Cray's famous quote at the
time was "Parity is for farmers."  When parity was added to the later 7600
series, Cray was asked what made him change his mind.  His response:
"Farmers buy a lot of computers."

Anyway, it was still important to diagnose and locate failing memory.
There was a system CPU program (whose name I forget) whose whole purpose
in life was to write patterns into memory and then read them back,
flagging an error if it didn't get what it expected.  This program ran at
a low priority level and would write and read a pattern and then ask the
system to "roll it out" (what we would call a swapout today).  When it got
rolled back in, it was almost always rolled back in at a different
physical address.  In that way it eventually got around to probing all of
memory.

There was a separate program, much like the memory checker, which did the
following all day.  He would RANDOMLY generate a set of instructions and
then INTERPRET those instructions and note the result.
Then he would execute the same set of instructions on the hardware.  If
the results that the hardware came up with were different from what the
interpreter produced, then the program would flag a failing CPU.  I can't
remember what a divide-by-zero did on the CPU hardware.  I don't recall
there being any mechanism (until the later XJ instruction) for trapping
it.

As for the speed of the machines, here are some numbers to chew on.

It became apparent that the 18-bit address size of the 6600 (and its
descendant, the big 7600) was a constraint in big systems.  So CDC came
out with ECS - extended core storage.  ECS was simply bulk RAM to which
the contents of CPU memory could be quickly stored or retrieved.
Instructions could not be executed directly out of ECS.

The 6600 could transfer information to and from the ECS at 10 million
words per second, or roughly 75MB/s.  This was late 1960s technology.
The 7600 could transfer information to and from the ECS at 36 million
words per second, or roughly 270MB/s.  This was early 1970s tech!

Instruction cycle times on the 6600 were expressed in either minor or
major cycles.  A minor cycle was 100 nanoseconds, a major cycle a
microsecond.  The 6600 could add two 60-bit REALs in four minor cycles
(400ns) - thus it could do 2.5 million 60-bit REAL adds per second.  A
multiply took quite a while longer - a whole major cycle - and a divide
took nearly three major cycles.

Most 6600 instructions typically took either 3 or 4 minor cycles.
Branches took a little longer (branches were very expensive on the 6600).
Thus the 6600 was nominally about 3 million instructions per second (I
earlier wrote 1 mips).  However, the instruction unit was capable of
fetching an instruction from memory every minor cycle (100ns), giving a
MAXIMUM speed of 10 million instructions per second.  Not bad for 1964.

It was possible to get much higher speeds than the cycle time of any one
instruction because of the multiple function units.
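The rates quoted above hang together arithmetically.  A quick Python
sanity check (the input numbers are the ones from the text; the helper
name is mine):

```python
# Sanity-check the ECS transfer rates and the floating-add rate quoted
# above.  A 60-bit word is 7.5 eight-bit bytes.

BYTES_PER_WORD = 60 / 8                    # 7.5

def ecs_mb_per_s(words_per_second):
    """Convert an ECS transfer rate in words/s to MB/s."""
    return words_per_second * BYTES_PER_WORD / 1e6

print(ecs_mb_per_s(10_000_000))            # 6600: 75.0 MB/s
print(ecs_mb_per_s(36_000_000))            # 7600: 270.0 MB/s

# A 60-bit REAL add in four 100ns minor cycles:
adds_per_second = 1e9 / (4 * 100)          # ns per add -> adds per second
print(adds_per_second)                     # 2500000.0, i.e. 2.5 million/s
```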
Instructions were issued to the functional units based on their type, and
different functional units could run in parallel if they did not conflict
in their register usage.  Parallelism occurred more often than not.  Here
is the breakdown of the 6600/7600 functional units:

  Branch
  Boolean
  Shift
  Long (60 bit) add
  Floating add
  Divide
  Multiply (there were TWO of these, both identical)
  Increment (again, TWO increment units)

Thus:

  SB4 B3+40    (Set B4 to the contents of B3 + 40)
  SB4 B4+20    (Increment B4 by 20)

took a total of 600ns (because of the conflict on B4, and each increment
is a 300ns instruction), but:

  SB1 B2+10
  SB3 B4+5

took only 300ns, because there was no register conflict and each
instruction could be issued to one of the two increment units in parallel.

Likewise, since there were two floating-point multiply units, and since a
60-bit FP multiply nominally took a major cycle (1000ns), if you coded
right you could get two of them going in parallel, giving an effective
60-bit FP multiply time of 500ns (2 million per second).

Also, remember that loading a value into an "A" register caused the value
of the corresponding memory location to be loaded into the associated X
register:

  SA4 10    (load the contents of location 10 into X4)

This instruction took 300ns to execute, plus an additional 500ns to get
the word from memory.  Thus it was bad practice to load a value and then
IMMEDIATELY use it:

  SA4 10
  IX4 X4+X4    (Add X4 to itself)

This ties up the machine, since IX4 can't start until the load finishes.
It was much better to anticipate the load a few instructions before using
it:

  SA4 10
  IX3 X2+X5
  SB3 10
  IX4 X4+X4

which pretty much filled up the functional unit pipeline.  The Fortran
compilers were QUITE adept at scheduling this kind of thing, as were the
various math libraries.

The 7600 was quite a bit faster than the 6600 - it could do a floating
point multiply in 137.5ns (see memory timing below) - or over 7 million
60-bit floating multiplies per second.
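The increment-unit timings above can be reproduced with a toy issue model.
This is only a sketch, not a cycle-accurate 6600 scoreboard: it assumes
two identical 300ns increment units, and makes an instruction wait for any
source register that an earlier instruction is still producing.

```python
# Toy model of functional-unit parallelism: two identical increment
# units, 300ns each, with an instruction stalling until its source
# registers are ready.  (Issue-slot timing and other 6600 scoreboard
# details are deliberately ignored.)

def schedule(instrs, unit_count=2, latency=300):
    """instrs: list of (dest_reg, [src_regs]).  Returns total time in ns."""
    units_free = [0] * unit_count          # time each unit becomes free
    reg_ready = {}                         # time each result register is ready
    finish = 0
    for dest, srcs in instrs:
        start = min(units_free)            # wait for a free unit...
        for r in srcs:
            start = max(start, reg_ready.get(r, 0))   # ...and for operands
        end = start + latency
        units_free[units_free.index(min(units_free))] = end
        reg_ready[dest] = end
        finish = max(finish, end)
    return finish

# SB4 B3+40 ; SB4 B4+20 -- second reads B4, so it must wait: 600ns
print(schedule([("B4", ["B3"]), ("B4", ["B4"])]))   # 600
# SB1 B2+10 ; SB3 B4+5  -- no conflict, both units run in parallel: 300ns
print(schedule([("B1", ["B2"]), ("B3", ["B4"])]))   # 300
```

The same model, with a 1000ns latency and two units, also gives the
effective 500ns-per-multiply figure for back-to-back independent FP
multiplies.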
The 7600 could do a 60-bit integer add in 55ns (18 million per second) vs
300ns for the 6600.

I was incorrect when I wrote that the 6600 had four banks of memory.  It
had thirty-two banks of core, which gave the 6600 the ability to do 10
million word reads or writes per second.  The 7600 was faster, with an
access time of 137.5 nanoseconds and a full cycle time of 275 nanoseconds.
Again, these are for 60-bit words.

>instruction execution) even if much seemed kludgy.  Mr. Cray (RIP) was a
>genius.

I can't think of anything in the 6600 architecture that I could remotely
consider kludgy, even the subroutine calling convention.  The instruction
set was remarkably rational.  The CMU instructions and the CMU's use were
awkward and kludgy, but that was a Control Data add-on and didn't come
from Seymour.  Well, the PPU instruction set was a little weird.

Seymour Cray is the only person in the computer industry who I considered
absolutely infallible.  The 6600 was a tour-de-force, and its existence
makes machines like the S/360 family simply incomprehensible.

I can't believe I wrote this much,
greg
-- 
greg greg@indiana.edu http://gtravis.ucs.indiana.edu/