3.1 - How do I speak { HTTP, POP3, SMTP, FTP, Telnet, NNTP, etc. } with Winsock?
Winsock proper does not provide a way for you to speak these protocols,
because it only deals with the layers underneath these application-level
protocols. However, there are many ways for you to get your program to
speak these protocols.
The easiest method is to use a third-party library. The
Resources section lists several of
these.
If you only need to speak the HTTP, FTP or gopher protocols, you can
use the WinInet library exposed by Microsoft's Internet Explorer. Newer
versions of Microsoft's development tools include components that make
accessing WinInet simple.
Finally, you can always roll your own. You should start by reading
the specification for the protocol you want to implement. Most of
the Internet's protocols are documented in RFCs. The
Important RFCs page links to the most commonly
referenced application-level RFCs. The complexity of the protocols varies
widely, and the only way to gauge the difficulty of implementing the
protocol is to read the relevant RFC(s). HTTP, for example, is a pretty
simple protocol, but the authors of its RFC managed to fill 176 pages
talking about it. Most RFCs aren't that pretentious, luckily.
If you've read the RFC and still can't figure the protocol
out, try asking on Usenet. There are many newsgroups dedicated to
particular application protocols: most are in the comp.protocols.*
hierarchy. Failing that, you can ask in one of the general
Winsock and TCP/IP mailing lists
and newsgroups.
3.2 - How can I encrypt my TCP stream with SSL/TLS?
At this time, only Windows NT derivatives and Windows CE have a
generic built-in SSL mechanism. For other Windows versions, you can
either use WinInet (limited in various ways) or get a third-party
library.
Windows NT derivatives offer SSL through their security APIs. You
can find sample code to show how these mechanisms work in the Windows
Platform SDK. The SSL samples are underneath the Platform SDK
directory in the "Samples\WinBase\Security\SSL" subdirectory.
Windows CE has a different SSL mechanism. There is an
article in MSDN that describes how to use the functionality. The
article also goes into the WinInet method.
WinInet is a feature in Internet Explorer version 3 and higher that
lets you use some of Internet Explorer's networking functionality in
your own programs. The main disadvantages of WinInet's SSL feature are
that it only works with HTTP, and WinInet is not very flexible. Also,
128-bit IE is not available worldwide. MS Knowledge Base article
Q168151 shows how to use this feature.
3.3 - How do I get my IP address from within a Winsock program?
There are three methods, each of which has advantages and disadvantages:
- The simplest method is to call
getsockname() on a
connected socket. If you don't have a connected socket, this
method will either fail or will return useless or redundant
information.
- To get your address without opening a socket first,
do a
gethostbyname() on the value gethostname()
returns. This will return a list of all the host's
interfaces, as shown in this
example. (See the example page for problems with the
method.)
- The third method only works on Winsock
2. The new
WSAIoctl() API supports the
SIO_GET_INTERFACE_LIST option, and one of the bits
of information returned is the addresses of each of the network
interfaces in the system. [C++ Example]
(Again, see the example page for caveats.)
The latter two methods above will return at least two addresses for
most TCP/IP-networked machines, and sometimes more. You will usually see
one entry for the "normal" network interface and one for the "loopback"
network interface. Usually the "normal" network interface is a modem or
an Ethernet card. The loopback interface (IP address 127.0.0.1) lets two
programs running on the same machine talk to each other without involving
the operating system's network hardware layer; talking to the loopback
interface is at least as fast as talking to the normal network interface,
and on some network stacks it's a lot faster.
It's possible to have more than two network interfaces on a
system. Many servers, for example, have two or more network interface
cards, so they will show three or more entries with methods 2 and 3
above. A more complex example is a satellite Internet router, which has
a modem connection for uplink to the Internet, the satellite adapter for
downlink from the Internet, an Ethernet card for talking to the rest of
the LAN, and of course the loopback interface.
If you're simply trying to connect to a server running on the same
machine with sockets, use the loopback interface. If instead you have to
intelligently pick one of the normal interfaces, there is no programmatic
method that works for all purposes. For many programs, method 1 above is
sufficient, because it returns the IP address that an existing connection
is using. If that doesn't work for you, then you will probably just have
to present the list of interfaces to the user and make them pick one.
Sometimes you have a more exacting criterion, like trying to find
the PPP interface's address. Method 3 above will work for you because
one of the bits of info you get with it is a flag on the PPP interface
telling you that it's a "point to point" interface.
3.4 - What's the proper way to impose a packet scheme on a stream protocol like TCP?
The two most common methods are delimiters and length-prefixing.
An example of delimiters is separating packets with, say, a caret
(^). Naturally your delimiter must never occur in regular data, or you
must have some way of "escaping" delimiter characters.
An example of length-prefixing is prepending a two-byte integer
containing the packet length on every packet. See the FAQ article
How to Use TCP Effectively for the proper way to send integers over the
network. Also see the How to Packetize a TCP Stream example.
There are hybrid methods, too. The HTTP protocol, for example,
separates header lines with CRLF pairs (a kind of delimiting), but when
an HTTP reply contains a block of binary data, the server also sends the
Content-Length header before it sends the data, which is a kind
of length-prefixing.
I favor simple length-prefixing, because as soon as you read the length
prefix, you know how many more bytes to expect. By contrast, delimiters
require that you blindly read until you find the end of the packet.
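Here is a minimal sketch of the two-byte length-prefixing scheme described above. The function names frame_packet() and next_packet() are made up for this illustration; they work on plain memory buffers, so you would feed next_packet() whatever bytes recv() has accumulated so far. The prefix is written high byte first (network byte order), and the sketch assumes no packet is larger than 65535 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Write a 2-byte big-endian length prefix followed by the payload.
   Returns the framed size, or 0 if the payload is too long for a
   2-byte prefix or the output buffer is too small. */
size_t frame_packet(const unsigned char *payload, size_t len,
                    unsigned char *out, size_t out_cap)
{
    if (len > 0xFFFF || out_cap < len + 2) return 0;
    out[0] = (unsigned char)(len >> 8);    /* high byte first */
    out[1] = (unsigned char)(len & 0xFF);  /* then low byte */
    memcpy(out + 2, payload, len);
    return len + 2;
}

/* Look for one complete packet at the start of buf.
   On success, sets *payload and *len and returns the number of bytes
   consumed (prefix plus payload). Returns 0 if the buffer does not yet
   hold a whole packet, meaning the caller must read more data. */
size_t next_packet(const unsigned char *buf, size_t avail,
                   const unsigned char **payload, size_t *len)
{
    size_t n;
    if (avail < 2) return 0;               /* prefix not complete yet */
    n = ((size_t)buf[0] << 8) | buf[1];
    if (avail < 2 + n) return 0;           /* payload not complete yet */
    *payload = buf + 2;
    *len = n;
    return 2 + n;
}
```

The receiver calls next_packet() in a loop after each recv(), removing the consumed bytes from the front of its buffer each time, until it returns 0.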
3.5 - I'm writing a server. What's a good network port to use?
If you're writing a server for an existing, popular Internet
protocol, it's already got a port number assigned to it. You can
find the most common of these numbers at the website for the Internet Assigned Numbers Authority
(IANA).
If you're writing a server for a new protocol, there are a few rules
and suggestions you should obey when choosing your server's port:
- Ports 1-1023 are off-limits to people inventing new
protocols. They are reserved by the IANA for standard protocols
like POP3 and HTTP (110 and 80, respectively). Until your
protocol is granted a port in this range by the IANA, you should
use something outside this range. id Software's choice of port
666 for their DOOM game server is cute, but it violates this
rule. They cleaned up their act with Quake: it uses port 26000.
- Ports 1024 through 49151 are Registered Ports,
which are a good range to choose your ports from. Just
beware that the entire world is choosing from ports
in this range, so it may make sense for you to register
your port, or at least check the current
list of assigned ports.
- Ports 49152 through 65535 are Dynamic Ports, meaning that
operating systems use ports in this range when choosing random
ports. (The FTP protocol, for example, uses random ports in the
data transfer phase.) This is a poor range to choose ports from,
because there's a fairly decent chance that your program and
the OS will fight over a given port eventually.
- Many OSes pick local ports for client programs from the
1024-5000 range. You would do well to pick server ports higher
than 5000, but this is not as rigid a rule as the previous ones.
- There are plenty of uncontested port numbers to choose from in
the "safe" 5000-49151 range. You should avoid port numbers with
patterns to them, or a widely-recognized meaning. People tend to
pick these since they're easy to remember, but this increases
the chances of a collision. Ports 6969, 5150 and 22222 are bad
choices, for example.
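The ranges in the list above can be sketched as a small classifier. The names classify_port() and is_reasonable_server_port() are invented for this example, and the second function only encodes the "higher than 5000, no higher than 49151" suggestion; it cannot tell you whether someone else has already claimed a port.

```c
typedef enum {
    PORT_INVALID,     /* outside 1-65535 */
    PORT_WELL_KNOWN,  /* 1-1023: reserved by the IANA */
    PORT_REGISTERED,  /* 1024-49151 */
    PORT_DYNAMIC      /* 49152-65535: used by the OS for random ports */
} port_class;

/* Classify a port number according to the IANA ranges. */
port_class classify_port(long port)
{
    if (port < 1 || port > 65535) return PORT_INVALID;
    if (port <= 1023) return PORT_WELL_KNOWN;
    if (port <= 49151) return PORT_REGISTERED;
    return PORT_DYNAMIC;
}

/* Nonzero if the port is in the "safe" range suggested above:
   above the 1024-5000 client range and below the dynamic range. */
int is_reasonable_server_port(long port)
{
    return port > 5000 && port <= 49151;
}
```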
You should also give some thought to making your program's port
configurable, in case your program is run on a machine where another
server is already using that port. One way to do this is through Winsock's
getservbyname() function: if that function returns a port
number, use that, otherwise use the default port number. Then users
can change your program's port by editing the SERVICES file, located
in %WINSYSDIR%\DRIVERS\ETC on Windows NT derivatives and c:\Windows
on Windows 95 derivatives.
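The getservbyname() fallback described above might look something like this. The function name lookup_port() and the service name you pass to it are your own choices, not anything Winsock defines. The sketch assumes a TCP service, and on Windows it assumes you have already called WSAStartup(). The non-Windows branch is there only so the same code builds against BSD sockets.

```c
#ifdef _WIN32
#include <winsock2.h>
#else
#include <netdb.h>
#include <arpa/inet.h>
#endif

/* Return the port listed for `service` in the SERVICES file, in host
   byte order, or default_port if the service is not listed there. */
unsigned short lookup_port(const char *service, unsigned short default_port)
{
    struct servent *se = getservbyname(service, "tcp");
    if (se == 0) return default_port;
    /* s_port is stored in network byte order */
    return ntohs((unsigned short)se->s_port);
}
```

A server would call something like lookup_port("myproto", 8642) once at startup, where both the service name and the default port are placeholders, and then bind() to the result.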
3.6 - What is TCP?
The Transmission Control Protocol is a reliable stream
protocol. "Reliable" means that the stack either delivers the data to
the remote peer intact and in order, or reports an error when it
cannot: TCP can deal with lost, corrupted,
duplicated and fragmented packets. "Stream" means that the remote peer
sees incoming data as a stream of individual bytes: there is no notion
of packets, from the program's viewpoint.
Winsock gives you a TCP socket when you pass SOCK_STREAM
as the second argument to socket().
TCP can coalesce sends, for efficiency: if you make four quick
send() calls to Winsock with 100, 50, 30 and 120 bytes in each,
Winsock is likely to pack all these up into a single 300-byte TCP
packet when it decides to send them out on the network. (This is
called the Nagle algorithm.) Compare UDP.
3.7 - What is UDP?
The User Datagram Protocol is an alternative
to TCP. Sometimes you see the term "TCP/IP" used to refer
to all basic Internet technologies, including UDP, but the proper term
is UDP/IP, meaning UDP over IP.
Winsock gives you a UDP socket when you pass SOCK_DGRAM
as the second argument to socket().
UDP is an "unreliable" protocol: the stack does not make
any effort to handle lost, duplicated, or out-of-order packets. UDP
packets are checked for corruption, but a corrupt UDP packet is simply
dropped silently.
The stack will fragment a UDP datagram when it's larger than the
network's MTU. The remote peer's stack will reassemble
the complete datagram from the fragments before it delivers it to the
receiving application. If a fragment is missing or corrupted, the whole
datagram is thrown away. This makes large datagrams impractical: an 8K
UDP datagram will be broken into 6 fragments when sent over Ethernet,
for example, because it has a 1500 byte MTU. If any of those 6 fragments
is lost or corrupted, the stack throws away the entire 8K datagram.
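The fragment count above can be checked with a little arithmetic. This sketch assumes IPv4 with a 20-byte header and no options, and it takes "8K" to mean 8192 bytes of application data. The function name udp_fragment_count() is made up for the example.

```c
/* Number of IPv4 fragments needed to carry a UDP datagram with the
   given application payload over a link with the given MTU.
   Returns 0 if the MTU is too small to carry any fragment data. */
unsigned udp_fragment_count(unsigned payload_bytes, unsigned mtu)
{
    unsigned ip_payload, per_fragment;

    if (mtu < 28) return 0;
    ip_payload = payload_bytes + 8;        /* add the 8-byte UDP header */
    /* Each fragment carries its own 20-byte IP header, and fragment
       data must be a multiple of 8 bytes (except in the last one). */
    per_fragment = ((mtu - 20) / 8) * 8;
    return (ip_payload + per_fragment - 1) / per_fragment;  /* round up */
}
```

With a 1500 byte MTU each fragment carries 1480 bytes, so 8192 + 8 = 8200 bytes needs 6 fragments, while anything up to 1472 bytes of application data fits in a single unfragmented packet.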
Datagram loss can also occur within the stack at the sender or the
receiver, usually due to lack of buffer space. It is even possible for
two communicating programs running on the same machine to have data
loss if they use UDP. (This actually happens on Windows under high load
conditions, because it starts dropping datagrams when the stack buffers
get full.) This limits UDP's value as a local IPC mechanism.
If any of these types of loss occurs, no notification will be sent to
the sender or receiver, even if the loss happens within the network
stack.
Duplicated datagrams are not dropped: they are delivered to the
receiver. It is up to the application to detect this problem, and it is
the program's choice what to do with the duplicate datagram.
UDP datagrams can be delivered in any order. Datagrams often get
reordered on the network when two datagrams get delivered via different
routes, and the second datagram's route happens to be quicker.
3.8 - What is UDP good for?
From the above discussion, UDP looks pretty
useless, right? Well, it does have a few advantages over reliable
protocols like TCP:
- UDP is a slimmer protocol: its protocol header is fixed
at 8 bytes, whereas TCP's is 20 bytes at minimum and can be
more.
- UDP has no congestion control and no data coalescing. This
eliminates the delays caused by the delayed
ACK and Nagle algorithms. (This is
also a disadvantage in many situations, of course.)
- There is less code in the UDP section of the stack
than the TCP section. This means that there is less latency
between a packet arriving at the network card and being delivered
to the application.
- Only UDP packets can be broadcast
or multicast.
This makes UDP good for applications where timeliness and control are
more important than reliability. Also, some applications are inherently
tolerant of UDP problems: data loss in a streaming video program just
means a frame or two is dropped.
Be careful not to let UDP's advantages blind you to its bad points: too many application writers have started
with UDP, and then later been forced to add reliability features. When
considering UDP, ask yourself whether it would be better to use TCP from
the start than to try to reinvent it. Note that you can't completely
reinvent TCP from the Winsock layer. There are some features of TCP like
path MTU discovery that require low-level access to the OS's networking
layers. Other features of TCP are possible to duplicate over UDP, but
difficult to get right. Keep in mind, TCP/IP has been around for about
a quarter of a century now. A whole lot of effort has gone into tuning
this protocol suite for reliability and performance.
If you need a balance between UDP and TCP, you might investigate RTP
(RFC 1889) and SCTP (RFC 2960). RTP
is a higher level protocol that usually runs over UDP and adds packet
sequence numbers, as well as other features. SCTP runs directly on top of
IP like TCP and UDP; it is a reliable protocol like TCP, but is datagram
oriented like UDP.
3.9 - How do I send a broadcast packet?
With the UDP protocol you can send a packet so that all workstations
on the network will see it. (TCP doesn't allow broadcasting.)
To send broadcast packets, you must first enable the
SO_BROADCAST option with the setsockopt()
function. Then you simply send packets out using a special broadcast
address.
The universal broadcast address is 255.255.255.255. Its advantage is
that it's generic. The disadvantage is that, because it can theoretically
refer to every IP-connected machine on the planet, many network nodes
will drop universal broadcast packets.
A smarter plan is to use your subnet's "directed broadcast" address.
This is an address you calculate using a network interface's IP address
and its netmask; packets sent to that address will stay within the
subnet, so often routers that would drop a universal broadcast will
pass directed broadcasts. To construct the directed broadcast address,
do something like this:
u_long host_addr = inet_addr("172.16.77.88"); // local IP addr
u_long net_mask = inet_addr("255.255.224.0"); // LAN netmask
u_long net_addr = host_addr & net_mask; // 172.16.64.0
u_long dir_bcast_addr = net_addr | (~net_mask); // 172.16.95.255
Potential Problems: Broadcasts can be useful at times,
but keep in mind that they create a load on all the machines on the
network, even on machines that aren't listening for the packet. This is
because the part of the stack that can reject the packet is
several layers down. To get around this problem, you may want to consider
multicasting
instead.
3.10 - Is Winsock thread-safe?
The Winsock specification does not mandate that a Winsock
implementation be thread-safe, but it does allow an implementor
to create a thread-safe version of Winsock.
Bob Quinn says, on this subject:
- "WinSock, any implementation, is thread safe if the
WinSock implementation developer makes it so (it doesn't just
happen)."
- "I don't know of any implementations from Microsoft (or any
other vendors) that are not thread safe."
- "If a WinSock application developer creates a multi-threaded
application that shares sockets among the threads, it is that
developer's responsibility to synchronize activities between
the threads."
By "synchronize activities", I believe Bob means that it may cause
problems if, for example, two threads repeatedly call send()
on the same socket. There is no guarantee in the Winsock specification
about how the data will be interleaved in this situation. Similarly,
if one thread calls closesocket() on a socket, it must somehow
signal other threads using that socket that the socket is now invalid.
Anecdotal evidence suggests that one thread calling send()
and another thread calling recv() on a single socket is safe on
recent Microsoft stacks at least.
Instead of multiple threads accessing a single socket, you may
want to consider setting up a pair of network I/O queues. Then, give
one thread sole ownership of the socket: this thread sends data from
one I/O queue and enqueues received data on the other. Then other
threads can access the queues (with suitable synchronization).
Applications that use some kind of non-synchronous socket typically
have some I/O queue already. Of particular interest in this case is
overlapped I/O or I/O completion ports, because these I/O strategies
are also thread-friendly. You can tell Winsock about several OVERLAPPED
blocks, and Winsock will finish sending one before it moves on to the
next. This means you can keep a chain of these OVERLAPPED blocks, each
perhaps added to the chain by a different thread. Each thread can also
call WSASend() on the block it added, making your main loop
simpler.
3.13 - How do I detect if there is an Internet connection?
It is sometimes useful for a Winsock program to only do its
thing if the computer is already connected to the Internet. In many
cases, "connected to the Internet" means having a dial-up networking
connection. See this example for code
that checks for such a connection.
This doesn't work in all situations, however. The first problem
is, not everyone uses a modem to connect to the Internet. Often a
computer is hooked to a LAN, and one of the stations on the LAN acts
as a gateway to the Internet. You could poke around in the
system's network configuration to see if they have a gateway configured,
but then you run into the problem that gateways are used for things other
than simply connecting a LAN to the Internet. Even if the LAN is
sometimes gatewayed to the Internet, the gateway's Internet connection
might not always be up, or it might be configured to block access to
some sites.
Another issue is that even if the PC does have a modem for
connecting to the Internet, it might be disconnected but configured to
auto-dial. In this case, the fact that the modem is currently disconnected
is not a problem: your program should blindly try to connect, which will
bring the connection up.
The moral of the story is, it's usually best not to even check for an
Internet connection. Simply assume that the user knows what they're doing
by launching your program. Try the connection, and if it fails because
there is no Internet connection, you can tell the user about it and leave
fixing the problem up to the user. You might also consider making your
program's connection handling user-configurable: let the user tell you
whether it's correct to check for a dial-up networking connection or not,
and whether your program should blindly try the connection or not. Often
the user knows more about their system than your program can guess.
3.15 - Windows 9x's Dial Up Networking keeps popping up an automatic dial window, even when it isn't necessary. Can I make it stop?
On some PCs running a Windows 95 derivative, Dial Up Networking (DUN)
sometimes pops up an automatic-dial window even when it is obviously not
required. The most common time this happens is when the machine has both
a LAN adapter and a modem for connecting to the Internet.
The most common trigger for the DUN dial window is a Winsock program
calling the gethostbyname() function, which initiates a DNS
lookup. Even if the name is that of a LAN machine and there's a DNS
server on the LAN, DUN will still try to bring up the Internet link to
try that first. This problem is due to limitations in Win9x's
ability to handle multiple network interfaces.
The best solution is to just use straight IP addresses, and write
your programs to recognize an IP address, so they don't have to call
gethostbyname().
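One way to recognize an IP address is to hand the string to inet_addr() first and only fall back to gethostbyname() when that fails. This is a minimal sketch; the name is_ip_literal() is invented here, and the non-Windows branch exists only so the code also builds against BSD sockets. Be aware that inet_addr() accepts some odd forms too, such as "1.2.3" or a bare number, so a stricter check may be appropriate if your users type such things as host names.

```c
#ifdef _WIN32
#include <winsock2.h>
#else
#include <arpa/inet.h>
#include <netinet/in.h>
#endif
#include <string.h>

/* Nonzero if `host` parses as a numeric IPv4 address, in which case
   there is no need to call gethostbyname() on it. */
int is_ip_literal(const char *host)
{
    /* inet_addr() returns INADDR_NONE both on failure and for the
       valid address 255.255.255.255, so handle that one specially. */
    if (strcmp(host, "255.255.255.255") == 0) return 1;
    return inet_addr(host) != INADDR_NONE;
}
```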
3.16 - I've heard that asynchronous sockets are unreliable. Is this true?
Asynchronous sockets are reliable if your program obeys the letter
of the Winsock specification.
Every so often, you hear stories about a program that loses asynch
notification messages. As far as I can tell, it's always due to a bug in
the complainer's program, caused by a misunderstanding of Winsock's parsimonious
notification policy.
Consider the FD_WRITE notification. That only gets sent when a client's
connection is accepted by the remote peer, and from then on only when
output buffer space becomes available after Winsock gives you a
WSAEWOULDBLOCK error. To put it
another way, FD_WRITE only gets sent to say, "Before now, it was not
okay to write data on this socket; now it's okay." The conservative way
to handle this is to always try to send data when you have it, whether
you've received an FD_WRITE or not. You might get a WSAEWOULDBLOCK error,
but that's harmless and easy to handle. Your handler for FD_WRITE then
just tries to send everything queued up until it sends it all or gets
another WSAEWOULDBLOCK.
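The "send until done or would-block" logic above can be sketched without any Winsock calls at all. Everything here is hypothetical scaffolding: out_queue, flush_queue() and the send_fn callback are names made up for the example, and demo_send() is a stand-in for a wrapper around send() that returns -1 where the real call would fail with WSAEWOULDBLOCK.

```c
#include <stddef.h>

/* Callback that tries to send bytes: returns the number accepted,
   or -1 to mean "would block". */
typedef int (*send_fn)(const char *data, size_t len, void *ctx);

typedef struct {
    const char *data;  /* bytes still waiting to go out */
    size_t len;        /* how many of them there are */
} out_queue;

/* Send as much of the queue as possible. Returns 1 if the queue was
   drained, or 0 if the send would block; in that case the caller
   waits for the next FD_WRITE and calls this again. */
int flush_queue(out_queue *q, send_fn try_send, void *ctx)
{
    while (q->len > 0) {
        int n = try_send(q->data, q->len, ctx);
        if (n <= 0) return 0;          /* would block: stop for now */
        q->data += n;
        q->len -= (size_t)n;
    }
    return 1;
}

/* Stand-in for send(): accepts bytes until the budget pointed to by
   ctx runs out, then reports "would block". */
int demo_send(const char *data, size_t len, void *ctx)
{
    int *budget = (int *)ctx;
    int n;
    (void)data;
    if (*budget <= 0) return -1;
    n = (len < (size_t)*budget) ? (int)len : *budget;
    *budget -= n;
    return n;
}
```

You would call flush_queue() both when new data is queued and from the FD_WRITE handler, which matches the conservative approach described above.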
Win16 message queues are fixed-length and fairly short, so it is
at least possible to lose notifications in 16-bit programs. If Winsock
fails to send you a notification because the message queue is full, it is
supposed to keep trying, but empirical evidence suggests that this does
not always happen. Keep in mind that when we speak of "16-bit Winsock"
we're talking about stacks from a dozen different vendors, each with
many versions spanning many years.
I've been using asynchronous sockets almost exclusively for many
years now with no problems. Others who've been using asynchronous
notification for years longer than I have agree. If you believe you're
losing notifications, you have to ask yourself whether it's more likely
that we've overlooked a bug in the stack or that there's a bug in your
program.
3.17 - What is the Nagle algorithm?
The Nagle algorithm is an optimization to TCP that makes the stack
wait until all data is acknowledged on the connection before it sends
more data. The exception is that Nagle will not cause the stack to
wait for an ACK if it has enough enqueued data that it can fill a
network frame. (Without this exception, the Nagle algorithm
would effectively disable TCP's sliding window
algorithm.) For a full description of the Nagle algorithm, see RFC 896.
So, you ask, what's the purpose of the Nagle algorithm?
The ideal case in networking is that each program always sends a full
frame of data with each call to send(). That maximizes
the percentage of useful program data in a packet.
The basic TCP and IPv4 headers are 20 bytes each. The worst
case protocol overhead percentage, therefore, is 40/41, or 98%. Since
the maximum amount of data in an Ethernet frame is 1500 bytes, the best
case protocol overhead percentage is 40/1500, less than 3%.
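The overhead figures above are easy to reproduce. This sketch assumes the basic 20-byte TCP and IPv4 headers with no options; overhead_percent() is a name made up for the example.

```c
/* Percentage of a TCP/IPv4 packet taken up by the two basic 20-byte
   headers, given the number of bytes of program data it carries. */
double overhead_percent(unsigned payload_bytes)
{
    const double headers = 20.0 + 20.0;   /* TCP + IPv4, no options */
    return 100.0 * headers / (headers + payload_bytes);
}
```

One byte of data gives 40/41, or about 97.6%. A full Ethernet frame carries 1460 bytes of data, which gives 40/1500, or about 2.7%.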
While the Nagle algorithm is causing the stack to wait for data to
be ACKed by the remote peer, the local program can make more calls to
send(). Because TCP is a stream protocol,
it can coalesce the data in those send() calls into a single TCP
packet, increasing the percentage of useful data.
Imagine a simple Telnet program: the bulk of a Telnet conversation
consists of sending one character, and receiving an echo of that character
back from the remote host. Without the Nagle algorithm, this results
in TCP's worst case: one byte of user data wrapped in dozens of bytes
of protocol overhead. With the Nagle algorithm enabled, the TCP stack
won't send that one Telnet character out until the previous characters
have all been acknowledged. By then, the user may well have typed another
character or two, reducing the relative protocol overhead.
This simple optimization interacts with other features of the TCP
protocol suite, too:
- Most stacks implement the delayed
ACK algorithm: this causes the remote stack to delay ACKs
under certain circumstances, which allows the local stack a bit
of time to "Nagle" some more bytes into a single packet.
- The Nagle algorithm tends to improve the percentage of useful
data in packets more on slow networks than on fast networks,
because ACKs take longer to come back.
- TCP allows an ACK packet to also contain data. If the local
stack decides it needs to send out an ACK packet and the Nagle
algorithm has caused data to build up in the output buffer,
the enqueued data will go out along with the ACK packet.
The Nagle algorithm is on by default in Winsock, but it can
be turned off on a per-socket basis by enabling the TCP_NODELAY option
with setsockopt(). The algorithm should not be
turned off except in a very few situations.
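For reference, toggling the option might look like the sketch below. The wrapper name set_nagle() and the sock_t typedef are invented for this example; setsockopt(), IPPROTO_TCP and TCP_NODELAY are the real names. The non-Windows branch is there only so the code also builds against BSD sockets, and on Windows the sketch assumes WSAStartup() has already been called.

```c
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
typedef int sock_t;
#endif

/* Turn the Nagle algorithm on (nonzero) or off (zero) for one TCP
   socket. Returns 0 on success, like setsockopt() itself. */
int set_nagle(sock_t s, int enable_nagle)
{
    /* TCP_NODELAY is the inverse of "Nagle enabled" */
    int nodelay = enable_nagle ? 0 : 1;
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                      (const char *)&nodelay, sizeof(nodelay));
}
```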
Beware of depending on the Nagle algorithm too heavily. send()
is a kernel function, so every call to send() takes much more
time than a regular function call. Your application should coalesce
its own data as much as is practical to minimize the number of calls
to send().
3.18 - When should I turn off the Nagle algorithm?
Almost never.
Inexperienced Winsockers usually try disabling the Nagle algorithm when
they are trying to impose some kind of packet
scheme on a TCP data stream. That is, they want to be able to send,
say, two packets, one 40 bytes and the other 60, and have the receiver
get a 40-byte packet followed by a separate 60-byte packet. (With the
Nagle algorithm enabled, TCP will often coalesce these two packets
into a single 100 byte packet.) Unfortunately, this is futile, for the
following reasons:
- Even if the sender manages to send its packets individually,
the receiving TCP/IP stack may still coalesce the received packets
into a single packet. This can happen any time the sender can
send data faster than the receiver can deal with it.
- Winsock Layered Service Providers (LSPs) may coalesce or
fragment stream data, especially LSPs that modify the data as it
passes.
- Turning off the Nagle algorithm in a client program will
not affect the way that the server sends packets, and vice versa.
- Routers and other intermediaries on the network can fragment
packets, and there is no guarantee of "proper" reassembly with
stream protocols.
- If a packet arrives that is larger than the available space
in the stack's buffers, it may fragment a packet, queuing up
as many bytes as it has buffer space for and discarding the
rest. (The remote peer will resend the remaining data later.)
- Winsock is not required to give you all the data it has
queued on a socket even if your
recv() call gave Winsock
enough buffer space. It may require several calls to get all
the data queued on a socket.
Aside from these problems, disabling the Nagle algorithm almost always
causes a program's throughput to degrade. The only time you should disable
the algorithm is when some other consideration, such as packet timing,
is more important than throughput.
Often, programs that deal with real-time user input will disable
the Nagle algorithm to achieve the snappiest possible response, at the
expense of network bandwidth. Two examples are X Window servers and
multiplayer network games. In these cases, it is more important that
there be as little delay between packets as possible than it is to
conserve network bandwidth.
For more on this topic, see the Lame
List and the FAQ article How to Use TCP Effectively.
3.19 - What is TCP's sliding window?
In a naïve implementation of TCP, every packet
is immediately acknowledged with an ACK packet. Until the ACK arrives
from the receiver (in this naïve implementation, at any rate), the
sender does not send another packet. If the ACK does not arrive within
some particular time frame, the sending stack retransmits the packet.
The problem with this is that all that waiting limits network
throughput drastically. The minimum time between packets with such a
scheme must be at least the minimum round trip time for that
network: the time to send the packet plus the time for the
receiver to send back an ACK. Add in processing time on each end,
temporary hardware faults (e.g. Ethernet collisions), retransmissions,
routing delays, and who knows what else: the stacks end up spending more
time waiting for ACKs than sending data. This is a problem because it
means you can't effectively fill a network pipe with a single socket.
The limit of data throughput over a network link is the maximum
amount of data it is possible to have in transit at once divided by the
round trip time. Imagine a naive TCP/IP implementation running over a
100BaseT Ethernet. The maximum payload size for TCP over Ethernet is
1460 bytes, and the 100BaseT round trip time is roughly 0.3 ms. 1460
divided by 0.0003 seconds comes out to about 4.9 MB/s. If you've done any
speed testing on a 100BaseT Ethernet, you know you can hit 6 MB/s easily,
9 MB/s with switched Ethernet, and with good hardware and software you
can approach the theoretical maximum of 12.5 MB/s. That's two to three
times the data rate we calculated above. We owe that speed jump to TCP's
"sliding window".
A sliding window means that the stack can have several unacknowledged
packets "in flight" before it stops and waits for the remote peer to
acknowledge the first packet. When the TCP connection is established, the
stacks tell each other how much buffer space they've allocated for this
connection: this is the maximum window size. Since each peer knows how big
the remote peer's buffer is and how many unacknowledged bytes it has sent,
it will stop sending data when it calculates that the remote peer's buffer
is full. Each peer then sends window size updates in each ACK packet,
telling the remote peer that stack buffer space has become available.
Aside: "Why is it called a sliding window," you
ask? Imagine a TCP data stream as a long line of bytes. The sliding window
is how the sender sees the receiver's buffer: as a fixed-size "window"
sliding along the stream of bytes. One edge of the window is between
the last byte the receiver has read and the next byte to be read, and
the other edge is between the last byte in the receiver's input buffer
and the first byte to be sent from the sender's output buffer. As the
receiver reads bytes out of the network buffers, the window slides down
the stream; any time it slides into the sender's buffer, the sender
sends more data to fill up the window.
In Microsoft Winsock stacks, the sliding window defaults to 8 KB. That
means that if it sends 8 KB of data without receiving an acknowledgement
for the first packet, the stack won't send any more data until the first
packet is acknowledged or the retry timer goes off, at which point it
will try to send the first packet again. As each packet at the front of
the "window" gets acknowledged, the 8 KB window "slides" along the data
stream, allowing the stack to send more data.
Dividing Microsoft's 8 KB value by 0.0003 seconds gives about 27 MB/s,
which means you hit the medium's maximum data rate (~12 MB/s) before
you hit the limit imposed by the round trip time.
Some networks have long round trip times which require large TCP
windows if your application needs to be able to fill the entire pipe
with a single TCP stream. Satellite systems are the most common example
of this: the minimum round trip time we see on our satellite Internet
connection at work is about 600ms! Some DSL systems have pretty long round
trip times, too, though not nearly as bad as satellite systems. You need
to run the numbers to find out what the situation is for your system.
For what it's worth, typical modem round trip times are in the 100-250
ms range. Calculating for 250 ms comes out to 32 KB/s, about five times
the data rate of the fastest modem connections you're likely to see. In
other words, an 8 KB window is plenty large for modems, despite the long
round trip times.
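If you want to run the numbers for your own network, the arithmetic used throughout this item boils down to two one-line formulas. The function names are made up for this sketch, and "8 KB" is taken to mean 8192 bytes.

```c
/* Upper limit on throughput, in bytes per second, when at most
   window_bytes can be in flight per round trip. */
double max_throughput(double window_bytes, double rtt_seconds)
{
    return window_bytes / rtt_seconds;
}

/* The same relation turned around: the window, in bytes, needed to
   keep a link of the given speed full at the given round trip time. */
double required_window(double bytes_per_second, double rtt_seconds)
{
    return bytes_per_second * rtt_seconds;
}
```

max_throughput(1460, 0.0003) is the one-packet-at-a-time case on 100BaseT, max_throughput(8192, 0.0003) is the 8 KB window on the same network, and max_throughput(8192, 0.25) is the modem case, which comes out to 32768 bytes per second.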
The MS Knowledge Base has articles that show how to change the TCP
window size for Windows NT derivatives (Q120642) and
Windows 95 derivatives (Q158474).
See the next two items for related discussion.
3.20 - What is the silly window syndrome?
The silly window syndrome results when the sender can send data faster
than the receiver can handle it, and the receiver calls recv()
with very small buffer sizes.
The fast sender will quickly fill the receiver's TCP window. The receiver then reads
N bytes, N being a relatively small number compared to the
network frame size. A naïve stack
will immediately send an ACK to the sender to tell it that there are
now N bytes available in its TCP window. This will cause the sender to
send N bytes of data; since N is smaller than the frame size, there's
relatively more protocol overhead in the packet compared to a full
frame. Because the receiver is slow, the TCP window stays very small,
and thus hurts throughput because the ratio of protocol overhead to
application data goes up.
The solution to this problem is the delayed
ACK algorithm. This causes the window advertisement ACK to be
delayed a bit, hopefully allowing the slow receiver to read more of
the enqueued data before the ACK goes out. This results in a larger
window advertisement, so the fast sender can send more data in a single
frame.
Note that the delayed-ACK solution doesn't mean your program can
safely use small recv() buffers. You should still read as much
as is reasonable in a single call, if only to minimize the number of
context switches between kernel and user space.
3.22 - What platform should I deploy my server on?
Assuming that you've decided to use Windows, your only real
choice for handling high loads is one of the Server class versions of
Windows.
It has been shown
that Windows NT Workstation uses an identical kernel to NT Server.
However, at startup time, NT Workstation's kernel cripples itself with
respect to NT Server's run-time behavior. The same thing happens on the
Win2K variants. More recently, Microsoft has completely separated their
personal and server operating systems with Windows XP on the one side
and Windows Server 2003 on the other.
The most important difference is that the
connection backlog on the
workstation-class OSes is limited to 5 slots. This means that your program
has to call accept() fast enough that not more than 5 connections
build up in the network stack's connection backlog. The stack rejects new
connections as long as the queue is full. For a well-written server,
this is not normally a problem, but it does mean that a concerted
attack (a SYN flood, for example) can fill the queue, denying service
to legitimate users. The server-class OSes have much higher connection
backlog limits and also have features specifically designed to minimize
the impact of a SYN attack.
A less important difference from a practical standpoint is that the
EULA for Microsoft's workstation-class operating systems prohibits running
a program that handles more than 10 connections concurrently. I
don't know of any recent version of Windows that enforces this limit in
the kernel.
The Windows 95 derivatives are also unsuitable for use as servers, for
a number of reasons:
- They share the 5-slot backlog limit of the workstation-class
Windows NT derivatives.
- The performance of their stacks is objectively inferior
to those in the NT derivatives. Simple tests to show this are
timing the connection accept time and throughput of a single
connection. It gets worse as the number of concurrent connections
goes up.
- Their kernels are much less stable.
- Their kernels lack overlapped I/O support. (It's emulated
out in user space.)
- I/O completion ports are completely missing.
- The networking subsystem doesn't handle multiple network
cards very well.