Winsock Programmer's FAQ
Section 3: Intermediate Winsock Issues

3.1 - How do I speak { HTTP, POP3, SMTP, FTP, Telnet, NNTP, etc. } with Winsock?

Winsock proper does not provide a way for you to speak these protocols, because it only deals with the layers underneath these application-level protocols. However, there are many ways for you to get your program to speak these protocols.

The easiest method is to use a third-party library. The Resources section lists several of these.

If you only need to speak the HTTP, FTP or gopher protocols, you can use the WinInet library exposed by Microsoft's Internet Explorer. Newer versions of Microsoft's development tools include components that make accessing WinInet simple.

Finally, you can always roll your own. You should start by reading the specification for the protocol you want to implement. Most of the Internet's protocols are documented in RFCs. The Important RFCs page links to the most commonly referenced application-level RFCs. The complexity of the protocols varies widely, and the only way to gauge the difficulty of implementing a protocol is to read the relevant RFC(s). HTTP, for example, is a pretty simple protocol, but the authors of its RFC managed to fill 176 pages talking about it. Most RFCs aren't that pretentious, luckily.

If you've read the RFC and still can't figure the protocol out, try asking on Usenet. There are many newsgroups dedicated to particular application protocols: most are in the comp.protocols.* hierarchy. Failing that, you can ask in one of the general Winsock and TCP/IP mailing lists and newsgroups.

3.2 - How can I encrypt my TCP stream with SSL/TLS?

At this time, only Windows NT derivatives and Windows CE have a generic built-in SSL mechanism. For other Windows versions, your options are WinInet (limited in various ways) or a third-party library.

Windows NT derivatives offer SSL through their security APIs. You can find sample code to show how these mechanisms work in the Windows Platform SDK. The SSL samples are underneath the Platform SDK directory in the "Samples\WinBase\Security\SSL" subdirectory.

Windows CE has a different SSL mechanism. There is an article in MSDN that describes how to use the functionality. The article also goes into the WinInet method.

WinInet is a feature in Internet Explorer version 3 and higher that lets you use some of Internet Explorer's networking functionality in your own programs. The main disadvantages of WinInet's SSL feature are that it only works with HTTP, and WinInet is not very flexible. Also, 128-bit IE is not available worldwide. MS Knowledge Base article Q168151 shows how to use this feature.

3.3 - How do I get my IP address from within a Winsock program?

There are three methods, each of which has advantages and disadvantages:

  1. The simplest method is to call getsockname() on a connected socket. If you don't have a connected socket, this method will either fail or will return useless or redundant information.

  2. To get your address without opening a socket first, do a gethostbyname() on the value gethostname() returns. This will return a list of all the host's interfaces, as shown in this example. (See the example page for problems with the method.)

  3. The third method only works on Winsock 2. The new WSAIoctl() API supports the SIO_GET_INTERFACE_LIST option, and one of the bits of information returned is the addresses of each of the network interfaces in the system. [C++ Example] (Again, see the example page for caveats.)

The latter two methods above will return at least two addresses for most TCP/IP-networked machines, and sometimes more. You will usually see one entry for the "normal" network interface and one for the "loopback" network interface. Usually the "normal" network interface is a modem or an Ethernet card. The loopback interface (IP address 127.0.0.1) lets two programs running on the same machine talk to each other without involving the operating system's network hardware layer; talking to the loopback interface is at least as fast as talking to the normal network interface, and on some network stacks it's a lot faster.

It's possible to have more than the two network interfaces on a system. Many servers, for example, have two or more network interface cards, so they will show three or more entries with methods 2 and 3 above. A more complex example is a satellite Internet router, which has a modem connection for uplink to the Internet, the satellite adapter for downlink from the Internet, an Ethernet card for talking to the rest of the LAN, and of course the loopback interface.

If you're simply trying to connect to a server running on the same machine with sockets, use the loopback interface. If instead you have to intelligently pick one of the normal interfaces, there is no programmatic method that works for all purposes. For many programs, method 1 above is sufficient, because it returns the IP address that an existing connection is using. If that doesn't work for you, then you will probably just have to present the list of interfaces to the user and make them pick one.

Sometimes you have a more exacting criterion, like trying to find the PPP interface's address. Method 3 above will work for you because one of the bits of info you get with it is a flag on the PPP interface telling you that it's a "point to point" interface.

3.4 - What's the proper way to impose a packet scheme on a stream protocol like TCP?

The two most common methods are delimiters and length-prefixing.

An example of delimiters is separating packets with, say, a caret (^). Naturally your delimiter must never occur in regular data, or you must have some way of "escaping" delimiter characters.

An example of length-prefixing is prepending a two-byte integer containing the packet length on every packet. See the FAQ article How to Use TCP Effectively for the proper way to send integers over the network. Also see the How to Packetize a TCP Stream example.

There are hybrid methods, too. The HTTP protocol, for example, separates header lines with CRLF pairs (a kind of delimiting), but when an HTTP reply contains a block of binary data, the server also sends the Content-Length header before it sends the data, which is a kind of length-prefixing.

I favor simple length-prefixing, because as soon as you read the length prefix, you know how many more bytes to expect. By contrast, delimiters require that you blindly read until you find the end of the packet.
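
Here is a minimal sketch of the two-byte length-prefix scheme described above, working on plain byte buffers rather than a live socket. The function names frame_encode() and frame_decode() are made up for illustration; they are not part of Winsock. The sketch assumes you append whatever recv() returns to a buffer and call frame_decode() on it until it reports that it needs more data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 2-byte big-endian (network order) length prefix followed by
   the payload into out. Returns the total number of bytes written, or
   0 if the payload is too long for a 16-bit prefix or out is too small. */
size_t frame_encode(const void *payload, size_t len,
                    uint8_t *out, size_t out_cap)
{
    if (len > 0xFFFF || out_cap < len + 2) return 0;
    out[0] = (uint8_t)(len >> 8);       /* high byte first */
    out[1] = (uint8_t)(len & 0xFF);
    if (len > 0) memcpy(out + 2, payload, len);
    return len + 2;
}

/* Look for one complete packet at the front of buf, which holds avail
   bytes accumulated from the stream. On success, stores the payload
   pointer and length and returns the number of bytes consumed (prefix
   plus payload). Returns 0 if more data must arrive first. */
size_t frame_decode(const uint8_t *buf, size_t avail,
                    const uint8_t **payload, size_t *len)
{
    if (avail < 2) return 0;            /* prefix itself is incomplete */
    size_t n = ((size_t)buf[0] << 8) | buf[1];
    if (avail < 2 + n) return 0;        /* payload is incomplete */
    *payload = buf + 2;
    *len = n;
    return 2 + n;
}
```

After each successful frame_decode() call, discard the consumed bytes from the front of your buffer and try again, since one recv() can deliver several packets or only part of one.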

3.5 - I'm writing a server. What's a good network port to use?

If you're writing a server for an existing, popular Internet protocol, it's already got a port number assigned to it. You can find the most common of these numbers at the website for the Internet Assigned Numbers Authority (IANA).

If you're writing a server for a new protocol, there are a few rules and suggestions you should obey when choosing your server's port:

  1. Ports 1-1023 are off-limits to people inventing new protocols. They are reserved by the IANA for standard protocols like POP3 and HTTP (110 and 80, respectively). Until your protocol is granted a port in this range by the IANA, you should use something outside this range. id Software's choice of port 666 for their DOOM game server is cute, but it violates this rule. They cleaned up their act with Quake: it uses port 6112.

  2. Ports 1024 through 49151 are Registered Ports, which are a good range to choose your ports from. Just beware that the entire world is choosing from ports in this range, so it may make sense for you to register your port, or at least check the current list of assigned ports.

  3. Ports 49152 through 65535 are Dynamic Ports, meaning that operating systems use ports in this range when choosing random ports. (The FTP protocol, for example, uses random ports in the data transfer phase.) This is a poor range to choose ports from, because there's a fairly decent chance that your program and the OS will fight over a given port eventually.

  4. Many OSes pick local ports for client programs from the 1024-5000 range. You would do well to pick server ports higher than 5000, but this is not as rigid a rule as the previous ones.

  5. There are plenty of uncontested port numbers to choose from in the "safe" 5000-49151 range. You should avoid port numbers with patterns to them, or a widely-recognized meaning. People tend to pick these since they're easy to remember, but this increases the chances of a collision. Ports 6969, 5150 and 22222 are bad choices, for example.

You should also give some thought to making your program's port configurable, in case your program is run on a machine where another server is already using that port. One way to do this is through Winsock's getservbyname() function: if that function returns a port number, use that, otherwise use the default port number. Then users can change your program's port by editing the SERVICES file, located in %WINSYSDIR%\DRIVERS\ETC on Windows NT derivatives and c:\Windows on Windows 95 derivatives.
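
The getservbyname() fallback described above might look something like this. The helper name lookup_port() is hypothetical, as are any service name and default port you pass to it. The non-Windows branch is only there so the sketch also builds on other sockets implementations.

```c
#ifdef _WIN32
#include <winsock2.h>   /* call WSAStartup() before using this */
#else
#include <netdb.h>      /* getservbyname() */
#include <arpa/inet.h>  /* ntohs() */
#endif
#include <stddef.h>

/* Return the port listed for the service in the SERVICES file, in host
   byte order, or fallback if the file has no matching entry. */
unsigned short lookup_port(const char *service, const char *proto,
                           unsigned short fallback)
{
    struct servent *se = getservbyname(service, proto);
    if (se == NULL) return fallback;    /* no entry: use the default */
    /* s_port is stored in network byte order */
    return ntohs((unsigned short)se->s_port);
}
```

A server would call something like lookup_port("myproto", "tcp", 27431) at startup, where both the service name and the default port are placeholders for your own choices.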

3.6 - What is TCP?

The Transmission Control Protocol is a reliable stream protocol. "Reliable" means that the stack either gets the data to the remote peer or reports an error to your program: TCP can deal with lost, corrupted, duplicated and fragmented packets. "Stream" means that the remote peer sees incoming data as a stream of individual bytes: there is no notion of packets, from the program's viewpoint.

Winsock gives you a TCP socket when you pass SOCK_STREAM as the second argument to socket().

TCP can coalesce sends, for efficiency: if you make four quick send() calls to Winsock with 100, 50, 30 and 120 bytes in each, Winsock is likely to pack all these up into a single 300-byte TCP packet when it decides to send them out on the network. (This is called the Nagle algorithm.) Compare UDP.

3.7 - What is UDP?

The User Datagram Protocol is an alternative to TCP. Sometimes you see the term "TCP/IP" used to refer to all basic Internet technologies, including UDP, but the proper term is UDP/IP, meaning UDP over IP.

Winsock gives you a UDP socket when you pass SOCK_DGRAM as the second argument to socket().

UDP is an "unreliable" protocol: the stack does not make any effort to handle lost, duplicated, or out-of-order packets. UDP packets are checked for corruption, but a corrupt UDP packet is simply dropped silently.

The stack will fragment a UDP datagram when it's larger than the network's MTU. The remote peer's stack will reassemble the complete datagram from the fragments before it delivers it to the receiving application. If a fragment is missing or corrupted, the whole datagram is thrown away. This makes large datagrams impractical: an 8K UDP datagram will be broken into 6 fragments when sent over Ethernet, for example, because it has a 1500 byte MTU. If any of those 6 fragments is lost or corrupted, the stack throws away the entire 8K datagram.
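
To run the numbers for your own datagram sizes, a small calculation like the following will do. It is a sketch that assumes IPv4 with a 20-byte header and no IP options; the function name is made up for this example.

```c
#include <stddef.h>

enum { IP_HEADER_BYTES = 20, UDP_HEADER_BYTES = 8 };

/* Number of IPv4 fragments needed for a UDP datagram carrying
   payload bytes of application data over a link with the given MTU.
   Returns 0 if the MTU is too small to carry anything useful. */
size_t udp_fragment_count(size_t payload, size_t mtu)
{
    if (mtu < IP_HEADER_BYTES + 8) return 0;
    size_t ip_payload = payload + UDP_HEADER_BYTES;
    if (ip_payload <= mtu - IP_HEADER_BYTES) return 1;  /* fits whole */
    /* Every fragment except the last carries a multiple of 8 bytes. */
    size_t per_fragment = ((mtu - IP_HEADER_BYTES) / 8) * 8;
    return (ip_payload + per_fragment - 1) / per_fragment;
}
```

For the 8K example above, udp_fragment_count(8192, 1500) works out to 6: the datagram plus its UDP header is 8200 bytes, and each Ethernet fragment can carry 1480 of them.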

Datagram loss can also occur within the stack at the sender or the receiver, usually due to lack of buffer space. It is even possible for two communicating programs running on the same machine to have data loss if they use UDP. (This actually happens on Windows under high load conditions, because it starts dropping datagrams when the stack buffers get full.) This limits UDP's value as a local IPC mechanism.

If any of these types of loss occur, no notification will be sent to the sender or receiver, even if the loss happens within the network stack.

Duplicated datagrams are not dropped: they are delivered to the receiver. It is up to the application to detect this problem, and it is the program's choice what to do with the duplicate datagram.

UDP datagrams can be delivered in any order. Datagrams often get reordered on the network when two datagrams get delivered via different routes, and the second datagram's route happens to be quicker.

3.8 - What is UDP good for?

From the above discussion, UDP looks pretty useless, right? Well, it does have a few advantages over reliable protocols like TCP:

  1. UDP is a slimmer protocol: its protocol header is fixed at 8 bytes, whereas TCP's is 20 bytes at minimum and can be more.

  2. UDP has no congestion control and no data coalescing. This eliminates the delays caused by the delayed ACK and Nagle algorithms. (This is also a disadvantage in many situations, of course.)

  3. There is less code in the UDP section of the stack than the TCP section. This means that there is less latency between a packet arriving at the network card and being delivered to the application.

  4. Only UDP packets can be broadcast or multicast.

This makes UDP good for applications where timeliness and control are more important than reliability. Also, some applications are inherently tolerant of UDP problems: data loss in a streaming video program just means a frame or two is dropped.

Be careful not to let UDP's advantages blind you to its bad points: too many application writers have started with UDP, and then later been forced to add reliability features. When considering UDP, ask yourself whether it would be better to use TCP from the start than to try to reinvent it. Note that you can't completely reinvent TCP from the Winsock layer. There are some features of TCP like path MTU discovery that require low-level access to the OS's networking layers. Other features of TCP are possible to duplicate over UDP, but difficult to get right. Keep in mind, TCP/IP has been around for about a quarter of a century now. A whole lot of effort has gone into tuning this protocol suite for reliability and performance.

If you need a balance between UDP and TCP, you might investigate RTP (RFC 1889) and SCTP (RFC 2960). RTP is a higher-level protocol that usually runs over UDP and adds packet sequence numbers, as well as other features. SCTP runs directly on top of IP like TCP and UDP; it is a reliable protocol like TCP, but is datagram oriented like UDP.

3.9 - How do I send a broadcast packet?

With the UDP protocol you can send a packet so that all workstations on the network will see it. (TCP doesn't allow broadcasting.)

To send broadcast packets, you must first enable the SO_BROADCAST option with the setsockopt() function. Then you simply send packets out using a special broadcast address.

The universal broadcast address is 255.255.255.255. Its advantage is that it's generic. The disadvantage is that, because it can theoretically refer to every IP-connected machine on the planet, many network nodes will drop universal broadcast packets.

A smarter plan is to use your subnet's "directed broadcast" address. This is an address you calculate using a network interface's IP address and its netmask; packets sent to that address will stay within the subnet, so often routers that would drop a universal broadcast will pass directed broadcasts. To construct the directed broadcast address, do something like this:

                u_long host_addr = inet_addr("172.16.77.88");   // local IP addr
                u_long net_mask = inet_addr("255.255.224.0");   // LAN netmask
                u_long net_addr = host_addr & net_mask;         // 172.16.64.0
                u_long dir_bcast_addr = net_addr | (~net_mask); // 172.16.95.255

Potential Problems: Broadcasts can be useful at times, but keep in mind that this creates a load on all the machines on the network, even on machines that aren't listening for the packet. This is because the part of the stack that can reject the packet is several layers down. To get around this problem, you may want to consider multicasting instead.

3.10 - Is Winsock thread-safe?

The Winsock specification does not mandate that a Winsock implementation be thread-safe, but it does allow an implementor to create a thread-safe version of Winsock.

Bob Quinn says, on this subject:

  • "WinSock, any implementation, is thread safe if the WinSock implementation developer makes it so (it doesn't just happen)."
  • "I don't know of any implementations from Microsoft (or any other vendors) that are not thread safe."
  • "If a WinSock application developer creates a multi-threaded application that shares sockets among the threads, it is that developer's responsibility to synchronize activities between the threads."

By "synchronize activities", I believe Bob means that it may cause problems if, for example, two threads repeatedly call send() on the same socket. There is no guarantee in the Winsock specification about how the data will be interleaved in this situation. Similarly, if one thread calls closesocket() on a socket, it must somehow signal other threads using that socket that the socket is now invalid.

Anecdotal evidence suggests that one thread calling send() and another thread calling recv() on a single socket is safe on recent Microsoft stacks at least.

Instead of multiple threads accessing a single socket, you may want to consider setting up a pair of network I/O queues. Then, give one thread sole ownership of the socket: this thread sends data from one I/O queue and enqueues received data on the other. Then other threads can access the queues (with suitable synchronization).

Applications that use some kind of non-synchronous socket typically have some I/O queue already. Of particular interest in this case is overlapped I/O or I/O completion ports, because these I/O strategies are also thread-friendly. You can tell Winsock about several OVERLAPPED blocks, and Winsock will finish sending one before it moves on to the next. This means you can keep a chain of these OVERLAPPED blocks, each perhaps added to the chain by a different thread. Each thread can also call WSASend() on the block they added, making your main loop simpler.

3.11 - If two threads in an application call recv() on a socket, will they each get the same data?

No. Winsock does not duplicate data among threads.

Note that if you do call recv() at the same time on a single socket from two different threads, havoc may result. See the previous question for more info.

3.12 - Is there any way for two threads to be notified when something happens on a socket?

No. If two threads call WSAAsyncSelect() on a single socket, only the thread that made the last call to WSAAsyncSelect() will receive further notification messages. Similarly, if two threads call WSAEventSelect() on a socket, only the event object used in the last call will be signaled when an event occurs on that socket. You also can't call WSAAsyncSelect() on a socket in one thread and WSAEventSelect() on that same socket in another thread, because the calls are mutually exclusive for any single socket. Finally, you cannot reliably call select() on a single socket from two threads and get the same notifications in each, because one thread could clear or cause an event, which would change the events that the other thread sees.

3.13 - How do I detect if there is an Internet connection?

It is sometimes useful for a Winsock program to only do its thing if the computer is already connected to the Internet. In many cases, "connected to the Internet" means having a dial-up networking connection. See this example for code that checks for such a connection.

This doesn't work in all situations, however. The first problem is, not everyone uses a modem to connect to the Internet. Often a computer is hooked to a LAN, and one of the stations on the LAN acts as a gateway to the Internet. You could poke around in the system's network configuration to see if they have a gateway configured, but then you run into the problem that gateways are used for things other than simply connecting a LAN to the Internet. Even if the LAN is sometimes gatewayed to the Internet, the gateway's Internet connection might not always be up, or it might be configured to block access to some sites.

Another issue is that even if the PC does have a modem for connecting to the Internet, it might be disconnected but configured to auto-dial. In this case, the fact that the modem is currently disconnected is not a problem: your program should blindly try to connect, which will bring the connection up.

The moral of the story is, it's usually best not to even check for an Internet connection. Simply assume that the user knows what they're doing by launching your program. Try the connection, and if it fails because there is no Internet connection, you can tell the user about it and leave fixing the problem up to the user. You might also consider making your program's connection handling user-configurable: let the user tell you whether it's correct to check for a dial-up networking connection or not, and whether your program should blindly try the connection or not. Often the user knows more about their system than your program can guess.

3.14 - How can I get the local user name?

Use the Win32 function GetUserName(). [C++ Example].

3.15 - Windows 9x's Dial Up Networking keeps popping up an automatic dial window, even when it isn't necessary. Can I make it stop?

On some PCs running a Windows 95 derivative, Dial Up Networking (DUN) sometimes pops up an automatic-dial window even when it is obviously not required. The most common time this happens is when the machine has both a LAN adapter and a modem for connecting to the Internet.

The most common trigger for the DUN dial window is a Winsock program calling the gethostbyname() function, which initiates a DNS lookup. Even if the name is that of a LAN machine and there's a DNS server on the LAN, DUN will still try to bring up the Internet link to try that first. This problem is due to limitations in Win9x's ability to handle multiple network interfaces.

The best solution is to just use straight IP addresses, and write your programs to recognize an IP address, so they don't have to call gethostbyname().

3.16 - I've heard that asynchronous sockets are unreliable. Is this true?

Asynchronous sockets are reliable if your program obeys the letter of the Winsock specification.

Every so often, you hear stories about a program that loses asynch notification messages. As far as I can tell, it's always due to a bug in the complainer's program, due to misunderstanding Winsock's parsimonious notification policy.

Consider the FD_WRITE notification. That only gets sent when a client's connection is accepted by the remote peer, and from then on only when output buffer space becomes available after Winsock gives you a WSAEWOULDBLOCK error. To put it another way, FD_WRITE only gets sent to say, "Before now, it was not okay to write data on this socket; now it's okay." The conservative way to handle this is to always try to send data when you have it, whether you've received an FD_WRITE or not. You might get a WSAEWOULDBLOCK error, but that's harmless and easy to handle. Your handler for FD_WRITE then just tries to send everything queued up until it sends it all or gets another WSAEWOULDBLOCK.
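
The conservative send strategy above can be sketched without any real sockets. In this sketch, try_send_fn is a hypothetical stand-in for send(): it returns the number of bytes the stack accepted, or -1 to play the part of a WSAEWOULDBLOCK error. The outq names and fake_send() are likewise invented for the example, and a real program would also have to tell WSAEWOULDBLOCK apart from genuine errors.

```c
#include <stddef.h>
#include <string.h>

#define OUTQ_CAP 4096

/* Stand-in for send(): bytes accepted, or -1 for "would block". */
typedef int (*try_send_fn)(const char *data, int len);

struct outq {
    char buf[OUTQ_CAP];
    size_t used;                /* bytes waiting to be sent */
};

/* Push as much queued data to the stack as it will take. Call this from
   the FD_WRITE handler. Returns the number of bytes still queued. */
size_t outq_flush(struct outq *q, try_send_fn try_send)
{
    size_t sent = 0;
    while (sent < q->used) {
        int n = try_send(q->buf + sent, (int)(q->used - sent));
        if (n <= 0) break;      /* would block: wait for FD_WRITE */
        sent += (size_t)n;
    }
    memmove(q->buf, q->buf + sent, q->used - sent);
    q->used -= sent;
    return q->used;
}

/* Queue data and try to send right away, whether or not an FD_WRITE
   has been seen. Returns 0 on success, -1 if the queue is full. */
int outq_send(struct outq *q, try_send_fn try_send,
              const char *data, size_t len)
{
    if (len > OUTQ_CAP - q->used) return -1;
    memcpy(q->buf + q->used, data, len);
    q->used += len;
    outq_flush(q, try_send);
    return 0;
}

/* Fake stack for trying the sketch out: accepts at most fake_room
   bytes in total, then reports "would block". */
size_t fake_room = 0;
int fake_send(const char *data, int len)
{
    (void)data;
    if (fake_room == 0) return -1;
    int n = (size_t)len < fake_room ? len : (int)fake_room;
    fake_room -= (size_t)n;
    return n;
}
```

The point of the design is that outq_send() never waits for permission: it just tries, and anything the stack refuses stays queued until the FD_WRITE handler calls outq_flush().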

Win16 message queues are fixed-length and fairly short, so it is at least possible to lose notifications in 16-bit programs. If Winsock fails to send you a notification because the message queue is full, it is supposed to keep trying, but empirical evidence suggests that this does not always happen. Keep in mind that when we speak of "16-bit Winsock" we're talking about stacks from a dozen different vendors, each with many versions spanning many years.

I've been using asynchronous sockets almost exclusively for many years now with no problems. Others who've been using asynchronous notification for years longer than I have agree. If you believe you're losing notifications, you have to ask yourself whether it's more likely that we've overlooked a bug in the stack or that there's a bug in your program.

3.17 - What is the Nagle algorithm?

The Nagle algorithm is an optimization to TCP that makes the stack wait until all data is acknowledged on the connection before it sends more data. The exception is that Nagle will not cause the stack to wait for an ACK if it has enough enqueued data that it can fill a network frame. (Without this exception, the Nagle algorithm would effectively disable TCP's sliding window algorithm.) For a full description of the Nagle algorithm, see RFC 896.

So, you ask, what's the purpose of the Nagle algorithm?

The ideal case in networking is that each program always sends a full frame of data with each call to send(). That maximizes the percentage of useful program data in a packet.

The basic TCP and IPv4 headers are 20 bytes each. The worst case protocol overhead percentage, therefore, is 40/41, or 98%. Since the maximum amount of data in an Ethernet frame is 1500 bytes, the best case protocol overhead percentage is 40/1500, less than 3%.
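
If you want the overhead figure for payload sizes in between, the arithmetic is easy to wrap up. This sketch assumes the basic 40 bytes of TCP and IPv4 headers with no options; the function name is made up for this example.

```c
enum { TCP_IP_HEADER_BYTES = 40 };  /* 20 bytes TCP + 20 bytes IPv4 */

/* Percentage of a packet taken up by TCP and IPv4 headers when it
   carries payload_bytes of application data. */
double header_overhead_percent(unsigned payload_bytes)
{
    return 100.0 * TCP_IP_HEADER_BYTES
                 / (TCP_IP_HEADER_BYTES + payload_bytes);
}
```

A one-byte payload gives 40/41, and a 1460-byte payload (which fills a 1500-byte Ethernet frame) gives 40/1500, matching the worst and best cases above.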

While the Nagle algorithm is causing the stack to wait for data to be ACKed by the remote peer, the local program can make more calls to send(). Because TCP is a stream protocol, it can coalesce the data in those send() calls into a single TCP packet, increasing the percentage of useful data.

Imagine a simple Telnet program: the bulk of a Telnet conversation consists of sending one character, and receiving an echo of that character back from the remote host. Without the Nagle algorithm, this results in TCP's worst case: one byte of user data wrapped in dozens of bytes of protocol overhead. With the Nagle algorithm enabled, the TCP stack won't send that one Telnet character out until the previous characters have all been acknowledged. By then, the user may well have typed another character or two, reducing the relative protocol overhead.

This simple optimization interacts with other features of the TCP protocol suite, too:

  • Most stacks implement the delayed ACK algorithm: this causes the remote stack to delay ACKs under certain circumstances, which allows the local stack a bit of time to "Nagle" some more bytes into a single packet.

  • The Nagle algorithm tends to improve the percentage of useful data in packets more on slow networks than on fast networks, because ACKs take longer to come back.

  • TCP allows an ACK packet to also contain data. If the local stack decides it needs to send out an ACK packet and the Nagle algorithm has caused data to build up in the output buffer, the enqueued data will go out along with the ACK packet.

The Nagle algorithm is on by default in Winsock, but it can be turned off on a per-socket basis by enabling the TCP_NODELAY option with setsockopt(). The algorithm should not be turned off except in a very few situations.

Beware of depending on the Nagle algorithm too heavily. send() is a kernel function, so every call to send() takes much more time than for a regular function call. Your application should coalesce its own data as much as is practical to minimize the number of calls to send().

3.18 - When should I turn off the Nagle algorithm?

Almost never.

Inexperienced Winsockers usually try disabling the Nagle algorithm when they are trying to impose some kind of packet scheme on a TCP data stream. That is, they want to be able to send, say, two packets, one 40 bytes and the other 60, and have the receiver get a 40-byte packet followed by a separate 60-byte packet. (With the Nagle algorithm enabled, TCP will often coalesce these two packets into a single 100 byte packet.) Unfortunately, this is futile, for the following reasons:

  1. Even if the sender manages to send its packets individually, the receiving TCP/IP stack may still coalesce the received packets into a single packet. This can happen any time the sender can send data faster than the receiver can deal with it.

  2. Winsock Layered Service Providers (LSPs) may coalesce or fragment stream data, especially LSPs that modify the data as it passes.

  3. Turning off the Nagle algorithm in a client program will not affect the way that the server sends packets, and vice versa.

  4. Routers and other intermediaries on the network can fragment packets, and there is no guarantee of "proper" reassembly with stream protocols.

  5. If a packet arrives that is larger than the available space in the stack's buffers, the stack may fragment the packet, queuing up as many bytes as it has buffer space for and discarding the rest. (The remote peer will resend the remaining data later.)

  6. Winsock is not required to give you all the data it has queued on a socket even if your recv() call gave Winsock enough buffer space. It may require several calls to get all the data queued on a socket.

Aside from these problems, disabling the Nagle algorithm almost always causes a program's throughput to degrade. The only time you should disable the algorithm is when some other consideration, such as packet timing, is more important than throughput.

Often, programs that deal with real-time user input will disable the Nagle algorithm to achieve the snappiest possible response, at the expense of network bandwidth. Two examples are X Window servers and multiplayer network games. In these cases, it is more important that there be as little delay between packets as possible than it is to conserve network bandwidth.

For more on this topic, see the Lame List and the FAQ article How to Use TCP Effectively.

3.19 - What is TCP's sliding window?

In a naïve implementation of TCP, every packet is immediately acknowledged with an ACK packet. Until the ACK arrives from the receiver (in this naïve implementation, at any rate), the sender does not send another packet. If the ACK does not arrive within some particular time frame, the sending stack retransmits the packet.

The problem with this is that all that waiting limits network throughput drastically. The minimum time between packets with such a scheme must be at least one full round trip time for that network: the time to send the packet plus the time for the receiver to send back an ACK. Add in processing time on each end, temporary hardware faults (e.g. Ethernet collisions), retransmissions, routing delays, and who knows what else: the stacks end up spending more time waiting for ACKs than sending data. This is a problem because it means you can't effectively fill a network pipe with a single socket.

The limit of data throughput over a network link is the maximum amount of data it is possible to have in transit at once divided by the round trip time. Imagine a naive TCP/IP implementation running over a 100BaseT Ethernet. The maximum payload size for TCP over Ethernet is 1460 bytes, and the 100BaseT round trip time is roughly 0.3 ms. 1460 divided by 0.0003 seconds comes out to 4.8 MB/s. If you've done any speed testing on a 100BaseT Ethernet, you know you can hit 6 MB/s easily, 9 MB/s with switched Ethernet, and with good hardware and software you can approach the theoretical maximum of 12.5 MB/s. That's two to three times the data rate we calculated above. We owe that speed jump to TCP's "sliding window".

A sliding window means that the stack can have several unacknowledged packets "in flight" before it stops and waits for the remote peer to acknowledge the first packet. When the TCP connection is established, the stacks tell each other how much buffer space they've allocated for this connection: this is the maximum window size. Since each peer knows how big the remote peer's buffer is and how many unacknowledged bytes it has sent, it will stop sending data when it calculates that the remote peer's buffer is full. Each peer then sends window size updates in each ACK packet, telling the remote peer that stack buffer space has become available.

Aside: "Why is it called a sliding window," you ask? Imagine a TCP data stream as a long line of bytes. The sliding window is how the sender sees the receiver's buffer: as a fixed-size "window" sliding along the stream of bytes. One edge of the window is between the last byte the receiver has read and the next byte to be read, and the other edge is between the last byte in the receiver's input buffer and the first byte to be sent from the sender's output buffer. As the receiver reads bytes out of the network buffers, the window slides down the stream; any time it slides into the sender's buffer, the sender sends more data to fill up the window.

In Microsoft Winsock stacks, the sliding window defaults to 8 KB. That means that if the stack sends 8 KB of data without receiving an acknowledgement for the first packet, it won't send any more data until the first packet is acknowledged or the retry timer goes off, at which point it will try to send the first packet again. As each packet at the front of the "window" gets acknowledged, the 8 KB window "slides" along the data stream, allowing the stack to send more data.
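
The sender's side of this bookkeeping can be sketched in a few lines of C. This is a simplified model, not how any real stack is written: sequence numbers are plain byte counts, and retransmission and congestion control are ignored entirely. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sender-side view of a sliding window (illustrative only). */
struct send_window {
    size_t window_size;  /* peer's advertised window, e.g. 8192 */
    size_t acked;        /* bytes acknowledged so far (left edge) */
    size_t sent;         /* bytes sent so far */
};

/* How many more bytes may be sent before the sender must stop and wait. */
size_t sendable(const struct send_window *w)
{
    size_t in_flight = w->sent - w->acked;
    return in_flight >= w->window_size ? 0 : w->window_size - in_flight;
}

/* Try to send `n` bytes; returns how many the window actually allowed. */
size_t on_send(struct send_window *w, size_t n)
{
    size_t room = sendable(w);
    if (n > room)
        n = room;
    w->sent += n;
    return n;
}

/* Record a cumulative ACK covering every byte before offset `upto`. */
void on_ack(struct send_window *w, size_t upto)
{
    assert(upto <= w->sent);
    if (upto > w->acked)
        w->acked = upto;   /* the window slides forward */
}
```

With an 8192-byte window, a request to send 10000 bytes only gets 8192 out the door. Once the first 1460-byte packet is acknowledged, sendable() reports 1460 bytes of room again.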

Dividing Microsoft's 8 KB value by 0.0003 seconds gives about 26 MB/s, which means you hit the medium's maximum data rate (~12 MB/s) before you hit the limit imposed by the round trip time.

Some networks have long round trip times, which require large TCP windows if your application needs to be able to fill the entire pipe with a single TCP stream. Satellite systems are the most common example of this: the minimum round trip time we see on our satellite Internet connection at work is about 600 ms! Some DSL systems have pretty long round trip times, too, though not nearly as bad as satellite systems. You need to run the numbers to find out what the situation is for your system.
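
"Running the numbers" here means turning the throughput formula around: the window you need is the link's data rate multiplied by the round trip time. A minimal sketch, with a made-up function name and a link speed you'd have to supply yourself:

```c
#include <assert.h>

/* Bytes that must be in flight to keep a link of `bits_per_second` busy
   for one full round trip. (Hypothetical helper for illustration only.) */
double window_needed(double bits_per_second, double rtt_seconds)
{
    assert(bits_per_second >= 0.0);
    assert(rtt_seconds >= 0.0);
    return bits_per_second / 8.0 * rtt_seconds;   /* bytes */
}
```

As an example, assume a 1 Mbit/s link (an invented figure, not a measurement) with a 600 ms round trip time. window_needed(1e6, 0.6) comes out to 75,000 bytes, well beyond an 8 KB window.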

For what it's worth, typical modem round trip times are in the 100-250 ms range. Calculating for 250 ms comes out to 32 KB/s, about five times the data rate of the fastest modem connections you're likely to see. In other words, an 8 KB window is plenty large for modems, despite the long round trip times.

The MS Knowledge Base has articles that show how to change the TCP window size for Windows NT derivatives (Q120642) and Windows 95 derivatives (Q158474).

See the next two items for related discussion.

3.20 - What is the silly window syndrome?

The silly window syndrome results when the sender can send data faster than the receiver can handle it, and the receiver calls recv() with very small buffer sizes.

The fast sender will quickly fill the receiver's TCP window. The receiver then reads N bytes, N being a relatively small number compared to the network frame size. A naïve stack will immediately send an ACK to the sender to tell it that there are now N bytes available in its TCP window. This will cause the sender to send N bytes of data; since N is smaller than the frame size, there's relatively more protocol overhead in the packet compared to a full frame. Because the receiver is slow, the TCP window stays very small, and thus hurts throughput because the ratio of protocol overhead to application data goes up.
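
To put a number on that overhead, here is a small sketch. It assumes 40 bytes of headers per packet (a 20-byte IPv4 header plus a 20-byte TCP header, no options) and ignores link-layer framing, so real-world figures will be somewhat worse. The names are invented for the example.

```c
#include <assert.h>

/* Assumed per-packet overhead: 20-byte IPv4 header plus 20-byte TCP
   header, no options. Link-layer framing is ignored. */
#define HEADER_BYTES 40.0

/* Fraction of each packet that is application data. */
double payload_efficiency(double payload_bytes)
{
    assert(payload_bytes > 0.0);
    return payload_bytes / (payload_bytes + HEADER_BYTES);
}
```

A full 1460-byte payload is about 97% application data. A 10-byte payload is only 20% data; the other 80% of the packet is headers.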

The solution to this problem is the delayed ACK algorithm. This causes the window advertisement ACK to be delayed a bit, hopefully allowing the slow receiver to read more of the enqueued data before the ACK goes out. This results in a larger window advertisement, so the fast sender can send more data in a single frame.

Note that the delayed-ACK solution doesn't mean your program can safely use small recv() buffers. You should still read as much as is reasonable in a single call, if only to minimize the number of context switches between kernel and user space.
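
As a rough sketch of what that looks like, the loop below reads until the peer closes the connection and counts how many recv() calls it took. The function name drain() and its parameters are made up for this example, the data is simply thrown away, and the code assumes a blocking socket. On Windows it also assumes WSAStartup() has already been called.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
typedef int sock_t;
#endif

/* Read from `s` until the peer closes the connection, asking for up to
   `cap` bytes per recv() call. Returns the total bytes read, or -1 on
   error. `*calls` receives the number of recv() calls made. A real
   program would process buf[0..n) after each successful call. */
long drain(sock_t s, char *buf, size_t cap, int *calls)
{
    long total = 0;

    assert(buf != NULL && cap > 0 && calls != NULL);
    *calls = 0;
    for (;;) {
        long n = (long)recv(s, buf, (int)cap, 0);
        ++*calls;
        if (n < 0)
            return -1;      /* error (SOCKET_ERROR on Winsock) */
        if (n == 0)
            break;          /* peer closed the connection */
        total += n;
    }
    return total;
}
```

Run it once with a 16-byte buffer and once with a buffer of several kilobytes against the same data, and the second run needs far fewer calls. Each call you save is a trip into the kernel you didn't have to make.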

3.21 - What is the delayed ACK algorithm?

In a simpleminded implementation of TCP, every data packet that comes in is immediately acknowledged with an ACK packet. (ACKs help to provide the reliability TCP promises.)

In modern stacks, ACKs are delayed for a short time (up to 200 ms, typically) for three reasons: a) to avoid the silly window syndrome; b) to allow ACKs to piggyback on a reply frame if one is ready to go when the stack decides to do the ACK; and c) to allow the stack to send one ACK for several frames, if those frames arrive within the delay period.

The stack is only allowed to delay ACKs for up to 2 frames of data.
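
The decision a stack makes for each incoming frame can be sketched as a tiny state machine. This is a toy model under the rules described above (piggyback when a reply is ready, ACK at once when 2 frames are pending, otherwise wait for the timer), not the code of any actual stack. All names are invented, and the timer itself is left to the caller.

```c
#include <assert.h>

enum ack_action { ACK_NOW, ACK_PIGGYBACK, ACK_WAIT };

struct delayed_ack {
    int unacked_frames;   /* data frames received but not yet ACKed */
};

/* Decide what to do when a data frame arrives. `reply_ready` is nonzero
   if outgoing data is already queued for the peer. */
enum ack_action on_frame(struct delayed_ack *d, int reply_ready)
{
    /* Invariant: at most one frame can be pending between calls. */
    assert(d->unacked_frames >= 0 && d->unacked_frames < 2);

    d->unacked_frames++;
    if (reply_ready) {              /* ACK rides along on the reply */
        d->unacked_frames = 0;
        return ACK_PIGGYBACK;
    }
    if (d->unacked_frames >= 2) {   /* 2-frame limit reached */
        d->unacked_frames = 0;
        return ACK_NOW;
    }
    return ACK_WAIT;                /* caller starts the delay timer */
}

/* Called when the delay timer (typically up to 200 ms) expires. */
enum ack_action on_timer(struct delayed_ack *d)
{
    if (d->unacked_frames == 0)
        return ACK_WAIT;            /* nothing pending */
    d->unacked_frames = 0;
    return ACK_NOW;
}
```

Two back-to-back frames with no reply queued produce ACK_WAIT followed by ACK_NOW: the second frame forces the ACK out without waiting for the timer.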

3.22 - What platform should I deploy my server on?

Assuming that you've decided to use Windows, your only real choice for handling high loads is one of the Server class versions of Windows.

It has been shown that Windows NT Workstation uses an identical kernel to NT Server. However, at startup time, NT Workstation's kernel cripples itself with respect to NT Server's run-time behavior. The same thing happens on the Win2K variants. More recently, Microsoft has completely separated their personal and server operating systems with Windows XP on the one side and Windows Server 2003 on the other.

The most important difference is that the connection backlog on the workstation-class OSes is limited to 5 slots. This means that your program has to call accept() fast enough that not more than 5 connections build up in the network stack's connection backlog. The stack rejects new connections as long as the queue is full. For a well-written server, this is not normally a problem, but it does mean that a concerted attack (a SYN flood, for example) can fill the queue, denying service to legitimate users. The server-class OSes have much higher connection backlog limits and also have features specifically designed to minimize the impact of a SYN attack.

A less important difference from a practical standpoint is that the EULA for Microsoft's workstation-class operating systems prohibits running a program that handles more than 10 connections concurrently. I don't know of any recent version of Windows that enforces this limit in the kernel.

The Windows 95 derivatives are also unsuitable for use as servers, for a number of reasons:

  1. They share the 5-slot backlog limit of the workstation-class Windows NT derivatives.

  2. The performance of their stacks is objectively inferior to that of the stacks in the NT derivatives. Simple tests to show this are timing the connection accept time and throughput of a single connection. It gets worse as the number of concurrent connections goes up.

  3. Their kernels are much less stable.

  4. Their kernels lack overlapped I/O support. (It's emulated out in user space.)

  5. I/O completion ports are completely missing.

  6. The networking subsystem doesn't handle multiple network cards very well.


Updated Fri Mar 24 2006 04:12 MST