setast

When testing with a trailblazer on a uV2000, I noticed that transmission from
the 2000 was quite "bursty".  I'd send three packets, get three acks, and
repeat.  This is not the continuous transmission I'd like to get, and indeed,
effective throughput was only about 9000 bps.  Sometimes it even slipped into
a two-at-a-time mode (send two, get two acks).  In no case that I observed was
putdata waiting for the window to open.  I didn't measure it while receiving.
It is not at all inconceivable that the 2000 is just slow to get data from the
disk, but 1000 char/sec seems like a rather low achievement, even for a
non-DMA disk.

Both putdata and getdata use $DCLAST to start the transmitter (putdata for
data packets, getdata for ACKs).  In a previous version this was executed
after re-enabling AST delivery with SETAST.  I've now moved the $DCLAST inside
the "protected" region, just before the SETAST.

Clearly the trailblazer is hitting us with ACKs very quickly after
transmission, probably right after getting each packet.  (But I don't see them
after each packet, because during the transmission of each packet I have time
to get the next packet's worth of data from the outgoing file and queue the
packet for transmission.  Even if the transmit window is full I can do
everything but the last step.)  Here is the sequence when the ACKs arrive:

    - I receive an ACK
    - the rcv_hdr routine is entered as an AST procedure
    - rcv_hdr calls rcv_acknak, which releases the first packet from the
      window and calls WAKE
    - rcv_hdr then queues the next read, looking for the next header

The trouble seems to be that we get hit with the next ACK, and rcv_hdr is
entered again, before putdata has a chance to DCLAST for xmt_start.  By moving
putdata's DCLAST of xmt_start inside the "ASTs are disabled" section of code,
I hope to improve things a bit.  If the read AST is already queued, the AST
for xmt_start will still have to wait behind it, but at least we will not have
to go all the way back to process level and then back to the AST delivery
mechanism; upon exit from the rcv_hdr AST, the AST delivery mechanism will
note that another AST (that for xmt_start) is already queued to the process
and can start it.

Update: The uV2000 is simply swamped, with almost no idle time during
transmission.  Nevertheless I have moved the DCLASTs in getdata and putdata
inside the "ASTs are disabled" section.  There does seem to be a *very* slight
improvement on the 8200.  Since this seems to be the right way to do things
anyway, we'll leave it in.
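For reference, a minimal sketch of the ordering described above -- not the
actual putdata code.  It assumes the usual SYS$SETAST/SYS$DCLAST bindings from
starlet.h; the exact prototypes, and xmt_start's real argument list, may
differ:

    #include <starlet.h>        /* sys$setast, sys$dclast (prototypes approximate) */

    extern void xmt_start();    /* this package's transmit-start AST routine */

    void putdata_tail_sketch()
    {
        (void) sys$setast(0);           /* disable AST delivery              */

        /* ... manipulate the transmit window, etc., with ASTs held off ...  */

        (void) sys$dclast(xmt_start, 0, 0); /* queue the transmitter start...*/
        (void) sys$setast(1);               /* ...just before re-enabling    */
                                            /* AST delivery, so the exiting  */
                                            /* rcv_hdr AST hands straight    */
                                            /* off to xmt_start              */
    }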
fastack

This was a set of trials in an attempt to improve performance with
trailblazers.  The uV2000 was sending, the 8200 receiving (DMF32).  With the
normal code, I would invariably see the ACK of each packet sometime within the
reception of the next packet, in a nicely synchronized fashion, i.e.

    Data444444444444444 Data55555555555555 Data6666666666666666
                                           Ack4               Ack5
                        --->ack latency<----

Except for instances where the receiving system was (apparently) busy for a
bit, the "ACK latency" (the time from the beginning of Data5, for example,
which is to say the end of Data4, to the beginning of Ack4) was about 28
milliseconds, sometimes as long as 33.  (The data packets themselves measure
36.5 msec from start to start, so we are sending the ACK almost at the end,
but not quite at the end, of the next packet.)

The "fastack" strategy was as follows: getdata() would set a bit in the
receive table when it's waiting for a packet, i.e. if it's running ahead of
the arriving data (which we hope will be the case most of the time).
rcv_data(), upon getting a packet with a good checksum and placing it in the
receive table, would check this bit.  If it's set, rcv_data() will bump ackreq
and call xmt_start, queueing the ack for transmission without having to go all
the way back to process level and then back to AST.  Further, if the bit is
not set, rcv_data would know that it needn't call SYS$WAKE at all, saving a
bit of overhead.  getdata, meanwhile, would, upon getting a packet from the
table, check the ackreq count.  If it's nonzero it knows that rcv_data has
already requested an ACK, and doesn't bother sending it.

The result of this attempt was quite strange; ack latency INCREASED to about
49 msec, pushing the ACK of packet n well into the time when we were receiving
n+2.  I took the fastack code out and modified getdata to queue the AST to
send the ACK before enabling ASTs (see setast, above).  With this, ACK latency
went back to 28-30 msec, but sometimes went as low as 22!  I put the fastack
code back in, but had rcv_data call rcv_start (to queue the read for the next
packet header) before calling xmt_start (to send the ACK).  Now I started
getting ACKs in pairs:

    Data444444444444444 Data55555555555555 Data6666666666666666
                                               Ack4 Ack5
                        <---- ack 4 latency --->
                                           <-------->  ack 5 latency
                                               <--->  inter-ack time
    Data777777777777777 Data00000000000000 Data1111111111111111
                                               Ack6 Ack7

These timings were obtained from HP line analyzer displays, not the debug log.
The analyzer was connected between the system that was receiving the data and
its modem (Trailblazer, running in PEP /gspoof mode).  Throughput was about
1200 char/sec, apparently limited by how fast the system at the other end
(uV2000) could send data; this showed up as occasional long inter-data-packet
times.  All timing discussions here are for the "bursts" of continuous
transmission.

The time from the end of Data4 to the beginning of Ack4 was much worse (45
msec), but from the end of Data5 to the beginning of Ack5 was much better (18
msec), than with the standard code.  The average, 32 msec, was a bit worse
than the standard code.  I tried moving the xmt_starts and rcv_starts around a
bit more, and in one configuration (my notes are sketchy) I must have hit a
sync error -- the deadman timer expired while getdata was waiting.  (Was the
decision not to call SYS$WAKE made in error?)  Anyway, since this
"optimization" obviously doesn't help, and introduces additional
synchronization problems, I ripped it out.

I think the bottom line here is that this sort of twiddling is beside the
point.  We might eke out a few tenths of a millisecond here or there by
avoiding an extra AST delivery or SYS$WAKE call, but so what?  The right way
to do this is to implement the packet protocol in a terminal class driver,
just as is done for DDCMP; we can then do one, not two, QIOs per read data
packet, get rid of the separate QIO for the typeahead buffer probe, handle all
control packets within the driver, etc., etc.  We might not actually run much
faster, but CPU loading should improve dramatically.

Incidentally, the time between starts of ACKs (inter-ack time) was about 10
msec.  Given that the ACK (six bytes) takes 3 msec of actual line time, and
assuming that the ackreq for the second ack was already set, so that the
second ack was transmitted when xmt_done (for the first ack) called xmt_start,
this implies that it takes us 7 msec to "turn around" a qio to the terminal
driver (this on an 8200).  This delay only applies to acks and other control
messages, for which we have but a single buffer and can only have one write
queued at a time.  For data messages we queue a full window's worth of write
qios without waiting for any of them to finish.
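For the record, the fastack hand-off amounted to something like the following
(field and routine names are reconstructed for illustration, not lifted from
the source, and the SETAST interlocking around the shared fields is omitted):

    #include <starlet.h>            /* sys$wake (prototype approximate)      */

    /* Illustrative receive-table fields; the real table has much more. */
    static struct {
        int waiting;                /* getdata is blocked awaiting a packet  */
        int ackreq;                 /* ACKs requested but not yet sent       */
    } rt;

    extern void xmt_start();        /* package routine: start transmitter    */

    /* AST level: called by rcv_data after a good-checksum packet is stored. */
    void fastack_notify()
    {
        if (rt.waiting) {
            rt.ackreq++;            /* request the ACK ourselves...          */
            xmt_start();            /* ...and queue it now, without dropping */
                                    /* back to process level first           */
            (void) sys$wake(0, 0);  /* getdata is hibernating, so wake it    */
        }
        /* if getdata isn't waiting, it will find the packet (and send the   */
        /* ACK) on its own, so no SYS$WAKE is needed at all                  */
    }

    /* Process level: getdata, having taken a packet from the table. */
    void fastack_maybe_ack()
    {
        if (rt.ackreq == 0) {       /* rcv_data didn't already ask for one   */
            rt.ackreq++;
            xmt_start();
        }
    }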
fulldup

The "full duplex terminal driver" isn't really full-duplex.  It can have a
write running while a read is pending, and a write queued while a read is
running.  If a read is queued but no data is available, a subsequent write
will happen immediately.  If data arrives while the write is happening, it
just gets stashed in the typeahead buffer until the write is done.  As soon as
the read becomes active -- which happens as soon as the first character is
gotten from the mux, when there's a read qio outstanding -- subsequent write
qios are blocked until the read is done.  Some analysis of this would probably
explain the bizarre fastack behavior, above.

Should we be writing with $BRKTHRU, so we can actually move data out while
we're also moving data in?  No; it's more important to move data in --
starving the other guy's receiver momentarily is far better than having our
typeahead buffer overrun and having to ask for a retransmit.

Update: Correction -- the terminal driver book says that, with echo turned
off, which is how we run things, writes aren't inhibited even when a read
request is active.

trailblazer_timing

For this I put the analyzer between the 2000 and its modem.  The analyzer
showed that the trailblazer's ack latency is about 1.5 msec.

uv2000_timing

Even during bursts of "continuous" transmission, packets from the uV2000 to
the trailblazer were separated by an 8.7 msec delay, during which the
trailblazer sent the ACK for the previous packet (1.5 msec delay, plus 3 msec
transmission time).  We assume that, since the receiver is always enabled,
reception of the ack blocked the start of transmission of the next data
packet; this leaves 4.3 msec unaccounted for.  I'm quite prepared to believe
that the terminal driver on a 2000 can take this long to get the next write
going, even if it's already queued.  On the other hand it might not have been
queued; there was no idle time to speak of on the machine.

tfreff

It can be argued that, when calculating effective throughput, we should
multiply the result by pktsize/bufsize.  For instance, at 2400 bps it takes
0.2917 seconds to send a 70-byte packet (data segment plus header).  We could
say that if we transfer a 64-byte file in this amount of time we are getting
an effective 240 byte/sec throughput.  Without this fudge factor the user will
never see a reported throughput of greater than about 219.4 bytes/second,
leading them to wonder what we're doing wrong.  BUT, if we apply the fudge
factor, we will not see the results of experiments with different buffer
sizes.  So the actual throughput is probably a more useful number.  We can
document the effect of the headers so that people won't complain that they're
"only" getting 217 Bps through their 2400 bps links.

(The fudge factor itself is just pktsize / bufsize.)  The following
calculation would return link efficiency, "fudged" for the effect of the data
packet headers:

    return 100.0 * ((bytes * 10.0) / time) /
           ( (float)tt_bps() * ((float)bufsize / (float)pktsize) );
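For concreteness, here is a sketch of the two calculations side by side.  Only
tt_bps() and the formula above come from the package; the function names, the
constants, and the type choices are illustrative assumptions.

    extern int tt_bps();            /* existing routine: nominal line speed  */

    #define PKTSIZE 70              /* data segment plus header              */
    #define BUFSIZE 64              /* data segment (payload) only           */

    /* Raw efficiency: achieved bits/sec as a percentage of the nominal line
     * speed.  With 6 header bytes per 64 data bytes this tops out around
     * 91.4% (64/70), never 100%. */
    float raw_efficiency(long bytes, float time)
    {
        return 100.0 * ((bytes * 10.0) / time) / (float)tt_bps();
    }

    /* "Fudged" efficiency, per the formula above: the same number scaled by
     * pktsize/bufsize, so a transfer that keeps the line completely busy
     * reports 100%. */
    float fudged_efficiency(long bytes, float time)
    {
        return 100.0 * ((bytes * 10.0) / time) /
               ( (float)tt_bps() * ((float)BUFSIZE / (float)PKTSIZE) );
    }

    /* Using the numbers in the note: 64 bytes in 0.2917 sec over a 2400 bps
     * link is 219.4 bytes/sec, i.e. 91.4% raw, or an "effective" 240
     * bytes/sec (100%) once fudged. */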
oneacknak

Unix uucp is very relaxed about sending NAKs -- when we send it a bad packet
it typically just lets us time out and resend.  We only see NAKs when we send
several bad packets in sequence.  This wastes time.  However, blindly sending
a NAK(last_good_packet) every time we get a bad-checksum, out-of-window, or
duplicate packet isn't right either.  Suppose wmax is 3 (typical) and the
following sequence occurs:

    unix                on the link         us
    send data 0         data 0 corrupted!
    send data 1                             rcv data 0, send NAK 7 (#1)
    send data 2                             rcv data 1, send NAK 7 (#2)
                                            rcv data 2, send NAK 7 (#3)
    rcv NAK 7 (#1)
      (restarts transmits for everything after 7)
    send data 0
    send data 1                             rcv data 0, send ACK 0
    send data 2                             rcv data 1, send ACK 1
      (at this point the other NAKs we sent come straggling in)
    rcv NAK 7 (#2)                          rcv data 2, send ACK 2
    rcv NAK 7 (#3)
      (so U. restarts again)
    send data 0
    send data 1                             rcv data 0, send NAK 2
    send data 2                             rcv data 1, send NAK 2
    rcv ACK 0                               rcv data 2, send NAK 2

Things get much worse from here on.  We tried sending just one NAK for a given
number, but this got into trouble when the NAK itself got clobbered (and
therefore never received): Unix timed out awaiting an ACK of the previous
packet and resent it; having already sent our one NAK, we sent nothing back,
so Unix timed out again, and so on.  Now we keep track of the number of NAK
requests, and we actually send the NAK only if that count, modulo the window
size, is equal to one; i.e. we send the NAK the first time we detect an error,
but we don't send another NAK for the same packet until the fourth error.

rcvduppkt

If the valid bit for a received packet is set, this is a dupe of a packet we
already got (unless windowsize is five or greater, in which case it might be a
future packet, with several intervening ones lost somewhere; this appears to
be an ambiguous situation).*  Since uucp typically only uses a windowsize of
3, the right thing to do seems to be to assume that it's a dupe and send an
ack for it.  But this breaks the rule that "packets are only acked in
sequence", and indeed, sending an ACK for an earlier message than the one we
expect confuses Unix uucp.  So the most "universally correct" thing is to
always send a NAK with y = the last packet in the window, as if we got the
expected packet with a bad checksum.

I tried retransmitting the most recent ack (rlastack), but Unix uucp didn't
like it under the following error condition:

    I got data 6, I used it and mistakenly (never mind why) sent NAK 6
        (I should have sent ACK 6)
    I got data 7, I used it and sent ack 7
    I got data 0, I used it and sent ack 0
      (apparently about this time Unix got my NAK 6, so he retransmitted
       everything in his window:)
    I got data 7, I sent ack 0
    I got data 0, I sent ack 0
    I got data 1, I used it and sent ack 1
    I got data 7, I sent ack 1
    I got data 0, I sent ack 1
    I got data 7, I sent ack 1
    I got data 0, I sent ack 1
      (repeat until U. gave up on me)

Observations: U. obviously thought that I'd sent the NAK 6 in response to the
Data 7.  The NAK 6 properly served as the ACK 6 even though I never sent an
ACK 6.  But somehow my subsequent ack 7 got lost.  (Perhaps it arrived before
Unix had retransmitted the Data 7, and so was ignored.)  In any case, once U.
started to retransmit, it obviously wanted nothing to do with ACKs other than
the "expected" one.  I changed the code to explicitly ACK the duplicate packet
(leaving in the code that sent the NAK 6 instead of an ACK 6), and it
recovered correctly.

04 apr 89: I got rid of the "send ACK for duplicate packet" code; Unix always
recovers properly when we just send the NAK for the last good one.

* Consider a windowsize of seven; U has sent 1-7 and we've acked them all.
  Now we see Data 1.  Is it a dupe or a future packet?
    - If all of our acks got lost, U would time out and send Data 1.
    - If U saw our Acks 1 and 2, U would send Data 0 and Data 1.  If Data 0
      got lost, we'd just see Data 1.
  As far as I can tell, these two cases cannot be distinguished on our side.
  Once again, the right thing to do is to send a NAK 7, as if we'd gotten a
  Data 0 with a bad checksum.  In the former case U. should see the NAK 7 as
  an ACK for 1-7, and go on to send us 0, 1, etc.  In the latter, U. should
  respond by resending 0 and 1.
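Taken together, the oneacknak and rcvduppkt decisions boil down to something
like the following sketch (nak_count, last_good, windowsize, and send_nak are
illustrative names, not the actual ones in the code):

    static int nak_count;           /* errors seen while waiting for the     */
                                    /* packet we want                        */

    extern void send_nak();         /* hypothetical: queue a NAK(seq)        */

    /* Called for every bad-checksum, out-of-window, or duplicate data
     * packet.  We always NAK the last packet received in sequence -- never
     * ACK a dupe -- but we throttle: the first error sends a NAK, then we
     * stay quiet until the (windowsize+1)th error for the same packet, so a
     * single lost or garbled packet doesn't trigger a burst of NAKs and the
     * cascading restarts shown in the trace above.  (A good in-sequence
     * packet resets nak_count to zero elsewhere.) */
    void bad_packet(int last_good, int windowsize)
    {
        nak_count++;
        if (nak_count % windowsize == 1)
            send_nak(last_good);
    }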
readtypahd

During the login script stuff, our "user" is just requesting one character at
a time, so we might fall behind if the host is sending us system login
announcement messages, etc.; therefore we try to move stuff out of the
typeahead buffer as fast as possible.  We don't bother once the protocol is
turned on, since the "user" then asks for the appropriate number of characters
(headersize or datasize), so we'll only have to do two QIOs per packet.

QIO sensemodes to read the typeahead buffer take an appreciable amount of
time.  It does not seem reasonable to blindly try to optimize two QIO reads
into a QIO sensemode followed by a QIO read.  Even with 70-byte packets at
19200 bps, we have about 18 milliseconds per read QIO (at two QIOs per
packet).  This is a more than adequate margin on everything but, perhaps, 730s
and MicroVAX I's, so we would likely NOT end up doing just one QIO read per
packet anyway, except on rare occasions; the result would be more kernel mode
time, not less, for the same throughput.

A possible improvement: if readlen = headersize, use readlen.  But if readlen
> headersize (i.e. we're looking for a data segment), see if the typeahead
buffer has readlen + headersize characters (i.e. the data segment plus the
next header).  If yes, read the whole typeahead buffer; else just use readlen.
At worst we end up doing an extra sensemode for every two reads.

On second thought, it is not a good idea to build any knowledge of the
protocol into xgetnc.  Instead we can give it another argument by which the
"user" (giowvms) can specify how many characters are expected with the next
read.  For header reads this would be 0, for data segment reads it would be
headersize.  At worst we do one extra QIO sensemode for every data packet,
i.e. for every two reads.
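A sketch of the "extra argument" idea follows.  It assumes we have some way to
ask the driver for the typeahead count (a QIO sensemode) and to issue a
counted read; tt_typeahead_count() and tt_read_n() below are stand-ins for
whatever giowvms actually uses, and xgetnc's real interface differs.

    extern int tt_typeahead_count();        /* one QIO sensemode             */
    extern int tt_read_n();                 /* one read QIO for n characters */

    /* "expected" is the proposed new argument: 0 for header reads,
     * headersize for data-segment reads.  When the caller says more is
     * expected right behind this read, peek at the typeahead count; if the
     * data segment plus the next header is already buffered, drain the
     * typeahead buffer in a single read QIO instead of two.  (The extra
     * characters must, of course, be held for the next call; buf is assumed
     * big enough.) */
    int xgetnc_sketch(char *buf, int readlen, int expected)
    {
        if (expected > 0) {
            int avail = tt_typeahead_count();
            if (avail >= readlen + expected)
                return tt_read_n(buf, avail);
        }
        return tt_read_n(buf, readlen);
    }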
baddata

It is obvious that when we get a bad header (as indicated by an XOR check
failure) we can't just junk it and look for the next header.  We can junk the
first character, but then we have to rescan the rest for another ctl-P.  The
reason is that the corrupt header might be partly (but only partly) line
noise, so the ctl-P denoting the start of the next real header might be found
within the six characters.

It is less obvious that when we get a bad data packet -- and this includes not
only data packets with bad checksums, but also out-of-window and duplicate
data packets -- we must similarly rescan the data segment of the packet
looking for an imbedded header.  I originally thought that, since the XOR on
the data header passed, we could trust the K field, so it would be okay to
skip the data segment (typically 64 bytes) and resume looking for the next
packet after that.  But...

Suppose the file being received is a copy, obtained from a line analyzer, of
just one direction of a uucp data transfer.  If we pay no attention to what's
inside the data segments we see this (assuming 32-byte data segments, K = 1):

    HHHHHH DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD HHHHHH DDDD...

but since the data contains a uucp packet stream, the data fields contain

    HHHHHH DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD HHHHHH DDDDDDDDDDDDDDDD....
           ddddhhhhhhdddddddddddddddddddddd        ddddddddddhhhhhh
    1          2                            3          4     5

Now, suppose that the header at 1 gets corrupted in transmission, so the XOR
check fails.  Fine, we junk it and start looking for another ctl-P, and we
find the "imbedded header" at 2.  It passes XOR, so we interpret it as a data
header and read the next 32 characters.  If it happens to be the right packet
number it'll fail on checksum, but more likely it's the wrong packet number;
whatever, we queue a NAK to tell the sender what we want, and resume scanning.
We find the ctl-P at 3... and we're back in sync.

If instead we had said "bad data segment, skip 32 characters", we'd be looking
for the next header starting at 4.  We'd then find the next imbedded header at
5...  We'd eventually get back in sync, but it might take a LONG time!
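To make the rule concrete, here is a sketch of where scanning should resume in
each case (names and interfaces are illustrative; CTLP is the ctl-P that
starts every header):

    #define CTLP 0x10

    /* Bad header (XOR check failed): discard only the leading ctl-P and
     * rescan the remaining five header bytes -- the next real ctl-P may be
     * hiding inside them. */
    int resync_after_bad_header(const char *hdr, int hdrlen)
    {
        int i;
        for (i = 1; i < hdrlen; i++)
            if (hdr[i] == CTLP)
                return i;           /* resume parsing here                */
        return hdrlen;              /* no ctl-P: keep scanning new input  */
    }

    /* Bad data packet (bad checksum, out of window, or duplicate): do NOT
     * skip the data segment on the strength of the K field; rescan the
     * segment itself for an imbedded ctl-P, for the reasons illustrated
     * above.  (The NAK for the last good packet is queued separately.) */
    int resync_after_bad_data(const char *seg, int seglen)
    {
        int i;
        for (i = 0; i < seglen; i++)
            if (seg[i] == CTLP)
                return i;
        return seglen;
    }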