D    TCP Specific Programming Information

This appendix contains information about performance aspects of the Transmission Control Protocol (TCP).

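It discusses how programs can use socket options to influence TCP throughput and reliability by controlling window size (Section D.1), error recovery (Section D.2), round-trip time measurement (Section D.3), and protection against wrapped sequence numbers (Section D.4).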

D.1    TCP Throughput and Window Size

TCP throughput depends on the transfer rate, which is the rate at which the network can accept packets, and the round-trip time, which is the delay between the time a TCP segment is sent and the time an acknowledgement arrives for that segment. Together, these factors determine the amount of data that must be buffered (the window) before an acknowledgement is received in order to obtain maximum throughput on a TCP connection.

If the transfer rate, the round-trip time, or both are high, the default window size used by TCP may be insufficient to keep the pipe fully loaded. Under these circumstances, TCP throughput can be limited because the sender is required to stall until acknowledgements for prior data are received.
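The relationship described above is commonly expressed as the product of transfer rate and round-trip time. The following is a minimal sketch of that arithmetic; the function name window_needed_bytes and the choice of units (bits per second, microseconds) are assumptions made for illustration and are not part of any system interface.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate the number of bytes that must be
 * in flight (unacknowledged) to keep a path fully loaded.
 *
 *   window = transfer rate (bytes/second) * round-trip time (seconds)
 *
 * The rate is given in bits per second and the round-trip time in
 * microseconds. Integer division truncates, so the result is a
 * slight underestimate for rates that are not multiples of 8.
 */
uint64_t window_needed_bytes(uint64_t bits_per_second,
                             uint64_t rtt_microseconds)
{
    uint64_t bytes_per_second = bits_per_second / 8;

    return bytes_per_second * rtt_microseconds / 1000000;
}
```

For example, under these assumptions a 1 Gb/s path with a 1 millisecond round-trip time needs about 125000 bytes in flight, which is more than twice the 61440-byte default buffer size.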

The receive socket buffer size determines the maximum receive window for a TCP connection. The transfer rate from a sender can also be limited by the send socket buffer size. The default value is 61440 bytes for TCP send and receive buffers.

D.1.1    Programming the TCP Socket Buffer Sizes

An application can override the default TCP send and receive socket buffer sizes by using the setsockopt system call and specifying the SO_SNDBUF and SO_RCVBUF options, prior to establishing the connection. The largest size that can be specified with the SO_SNDBUF and SO_RCVBUF options is limited by the kernel variable sb_max. See Section D.1.2.1 for information about increasing this value.
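The following sketch shows one way to make the calls described above on a socket that has not yet been connected. The function name set_tcp_buffer_sizes is hypothetical, and the sizes passed in are the caller's choice; the sketch does not check them against sb_max.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: request send and receive socket buffer sizes.
 * Call this after socket() and before connect() or listen(), so that
 * the sizes are in effect when the connection is established.
 *
 * Returns 0 on success, or -1 if either setsockopt call fails
 * (errno is set by setsockopt).
 */
int set_tcp_buffer_sizes(int sock, int send_bytes, int recv_bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &send_bytes, sizeof(send_bytes)) < 0)
        return -1;

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &recv_bytes, sizeof(recv_bytes)) < 0)
        return -1;

    return 0;
}
```

An application can call getsockopt with the same option names afterward to see the sizes the system recorded for the socket.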

For maximum throughput, send and receive socket buffers on both ends of the connection should be of equal size.

When writing programs that use the setsockopt system call to change a TCP socket buffer size (SO_SNDBUF, SO_RCVBUF), note that the actual socket buffer size used for a TCP connection can be larger than the specified value. This situation occurs when the specified socket buffer size is not a multiple of the TCP Maximum Segment Size (MSS) to be used for the connection.

TCP determines the actual size, and the specified size is rounded up to the nearest multiple of the negotiated MSS. For local network connections, the MSS is generally determined by the network interface type and its maximum transmission unit (MTU).
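The rounding described above can be illustrated with simple integer arithmetic. This is a sketch of the calculation only, not the kernel's code; the function name round_up_to_mss is hypothetical, and the MSS of 1460 bytes used in the example that follows is an assumption (a 1500-byte Ethernet MTU less 40 bytes of IPv4 and TCP headers without options).

```c
#include <stddef.h>

/*
 * Hypothetical helper: round a requested socket buffer size up to
 * the nearest multiple of the MSS. A size that is already a multiple
 * of the MSS is returned unchanged. An MSS of zero is treated as
 * "unknown" and the request is returned as is.
 */
size_t round_up_to_mss(size_t requested, size_t mss)
{
    if (mss == 0)
        return requested;

    return ((requested + mss - 1) / mss) * mss;
}
```

With an assumed MSS of 1460 bytes, a request for 61440 bytes rounds up to 62780 bytes (43 segments), while a request for 2920 bytes is already an exact multiple and is unchanged.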

D.1.2    TCP Window Scale Option

Tru64 UNIX implements the TCP window scale option, as defined in RFC 1323: TCP Extensions for High Performance. The TCP window scale option, which allows larger windows to be used, was designed to increase throughput of TCP over high bandwidth, long delay networks. This option may also increase throughput of TCP in local Gigabit Ethernet and FDDI networks.

The window field in the TCP header is 16 bits. Therefore, the largest window that can be used without the window scale option is (2**16)-1 bytes (just under 64KB). When the window scale option is used between cooperating systems, windows up to (2**30)-1 bytes are allowed. The option, transmitted between TCP peers at the time a connection is established, defines a scale factor that is applied to the window size value in each TCP header to obtain the actual window size.

The maximum receive window, and therefore the scale factor offered by TCP during connection establishment, is determined by the maximum receive socket buffer space.

If the receive socket buffer size is greater than 65535 bytes, during connection establishment, TCP will specify the Window Scale option with a scale factor based on the size of the receive socket buffer. Both systems involved in the TCP connection must send the Window Scale option in their SYN segments for window scaling to occur in either direction on the connection. As stated previously, for maximum throughput, send and receive buffers on both ends of the connection should be of equal size.
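To illustrate how a scale factor can follow from the receive buffer size, the sketch below picks the smallest shift count that lets a 16-bit window field cover the buffer, up to the limit of 14 set by RFC 1323. This mirrors a common approach in TCP implementations; it is an assumption that Tru64 UNIX selects the factor in exactly this way, and the function and macro names are hypothetical.

```c
#include <stdint.h>

#define MAX_UNSCALED_WINDOW 65535u  /* largest value of the 16-bit window field */
#define MAX_WINDOW_SHIFT    14      /* largest shift count allowed by RFC 1323 */

/*
 * Hypothetical helper: return the smallest shift count such that
 * (65535 << shift) is at least the receive buffer size, capped at 14.
 * A result of 0 means the buffer fits in an unscaled window.
 */
int window_shift_for_buffer(uint32_t recv_buffer_bytes)
{
    int shift = 0;

    while (shift < MAX_WINDOW_SHIFT &&
           ((uint32_t)MAX_UNSCALED_WINDOW << shift) < recv_buffer_bytes)
        shift++;

    return shift;
}
```

Under this scheme, the default 61440-byte buffer needs no scaling (shift 0), a 65536-byte buffer needs a shift of 1, and a 1048576-byte (1MB) buffer needs a shift of 5.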

D.1.2.1    Increasing the System Socket Buffer Size Limit

The sb_max kernel attribute for the Socket kernel subsystem limits the amount of socket buffer space that can be allocated for each send and receive buffer. The current default is 1048576 bytes (1MB), but you can increase it.

For local Gigabit Ethernet connections, the current value is sufficient. For long delay, high bandwidth paths, values greater than 1MB may be required.

To change the sb_max kernel attribute in the kernel currently in memory, use either the dxkerneltuner utility or the sysconfig -r command. See dxkerneltuner(8) or sysconfig(8), respectively, for more information.

D.2    TCP Performance and Error Recovery

TCP relies on acknowledgements to determine if packets arrive at their destination. In high-speed connections (for example, Gigabit Ethernet) that use large windows, the default recovery mechanism can seriously reduce throughput.

By default, if a packet is lost, TCP retransmits that packet and all packets after it. An application can override the default by using the setsockopt system call and specifying the TCP_SACKENA option prior to establishing the connection. After the option is agreed upon, the data receiver can inform the sender about all segments that have arrived successfully. In this way, the sender need retransmit only those segments that have actually been lost. This option is useful in cases where multiple segments are dropped.
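A minimal sketch of enabling the option follows. The text above names only the option; the option level (IPPROTO_TCP), the header that defines it (netinet/tcp.h), and the use of a nonzero int as the option value are assumptions, so check tcp(7) on your system. The function name enable_sack is hypothetical, and the preprocessor guard lets the sketch compile on systems that do not define TCP_SACKENA.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>

/*
 * Hypothetical helper: ask for selective acknowledgements on a
 * socket that has not yet been connected.
 *
 * Returns 0 on success and -1 on failure. On a system whose headers
 * do not define TCP_SACKENA, it fails with errno set to ENOPROTOOPT.
 */
int enable_sack(int sock)
{
#ifdef TCP_SACKENA
    int on = 1;  /* assumed: a nonzero int enables the option */

    return setsockopt(sock, IPPROTO_TCP, TCP_SACKENA, &on, sizeof(on));
#else
    (void)sock;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

Call the helper after socket() and before connect() or listen(), because the option must be set before the connection is established.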

D.3    TCP Performance and Round-Trip Measurement

TCP bases its round-trip time measurements on only one packet per window. In high-speed connections (for example, Gigabit Ethernet) that use large windows, it is possible for the round-trip time estimates to be seriously flawed, resulting in many retransmissions.

By default, TCP does not send timestamps in the TCP header. An application can override the default by using the setsockopt system call and specifying the TCP_TSOPTENA option prior to establishing the connection. After the option is selected, the sender places a timestamp in each data segment. The receiver, if configured to accept them, sends these timestamps back in ACK segments. This provides the sender with a reliable mechanism with which to measure round-trip time.

D.4    TCP Reliability and Sequence Numbers

TCP relies on sequence numbers to determine the correct sequencing of packets and to determine if duplicate packets have been received. In high-speed connections (for example, Gigabit Ethernet), it is possible for the sequence numbers to wrap around. This means that two packets could have the same sequence number yet contain different information; they are not duplicates, but TCP assumes that they are.
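To give a sense of scale, the sketch below estimates how long a connection takes to consume the entire 32-bit sequence space when sending at a sustained rate. The function name seconds_until_sequence_wrap is hypothetical, and the calculation assumes the full link rate is used for TCP payload, which overstates real throughput somewhat.

```c
/*
 * Hypothetical helper: time, in seconds, to send 2**32 bytes (the
 * size of the TCP sequence number space) at a sustained rate given
 * in bits per second. The rate must be greater than zero.
 */
double seconds_until_sequence_wrap(double bits_per_second)
{
    const double sequence_space_bytes = 4294967296.0;  /* 2**32 */
    double bytes_per_second = bits_per_second / 8.0;

    return sequence_space_bytes / bytes_per_second;
}
```

At 1 Gb/s the result is roughly 34 seconds, compared with close to an hour at 10 Mb/s, which is why wrapped sequence numbers become a practical concern only on fast links.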

By default, TCP does not provide a mechanism for rejecting old duplicate packets. An application can override the default by using the setsockopt system call specifying the TCP_PAWS option, after specifying the TCP_TSOPTENA option, and prior to establishing the connection. When the PAWS (Protect Against Wrapped Sequence numbers) option is enabled, the receiver rejects any old duplicate segments that are received. This option is used on synchronized TCP connections only.
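The ordering described above (timestamps first, then PAWS, both before the connection is established) might be coded as in the following sketch. As with any use of these options, the option level (IPPROTO_TCP), the defining header (netinet/tcp.h), and the nonzero int option value are assumptions to verify against tcp(7). The function name enable_timestamps_and_paws is hypothetical, and the guard lets the sketch compile where the options are not defined.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>

/*
 * Hypothetical helper: enable TCP timestamps and then PAWS on a
 * socket that has not yet been connected. TCP_TSOPTENA is set first
 * because PAWS depends on the timestamps.
 *
 * Returns 0 on success and -1 on failure. On a system whose headers
 * do not define both options, it fails with errno set to ENOPROTOOPT.
 */
int enable_timestamps_and_paws(int sock)
{
#if defined(TCP_TSOPTENA) && defined(TCP_PAWS)
    int on = 1;  /* assumed: a nonzero int enables each option */

    if (setsockopt(sock, IPPROTO_TCP, TCP_TSOPTENA, &on, sizeof(on)) < 0)
        return -1;

    return setsockopt(sock, IPPROTO_TCP, TCP_PAWS, &on, sizeof(on));
#else
    (void)sock;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

An application that wants only round-trip time measurement can set TCP_TSOPTENA alone; the second call is needed only when protection against wrapped sequence numbers is also wanted.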