Newcomers to network programming almost always run into problems
early on where it looks like the network or the TCP/IP stack is munging
their data. This usually comes as quite a shock, because the newcomer
has usually just been told that TCP is a reliable data transport
protocol. In fact, TCP and Winsock are quite reliable if you use them
properly. This tutorial discusses the most common problems people
come across when learning to use TCP.
I think that understanding this issue is one of TCP/IP's rites of
passage.
The easiest method is to prefix each packet with a length value. For
example, you could prefix every packet with a 2-byte unsigned integer
that tells how long the packet is. Length prefixes are most effective
when the data in each protocol packet has no particular structure,
such as raw binary data. See this example
for code that reads length-prefixed packets from a TCP stream.
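As a rough sketch of the length-prefix idea, the function below pulls one packet out of a buffer of bytes that have already been received. The name extract_packet, its return convention, and the choice to send the prefix most significant byte first are my assumptions for illustration; the text above only fixes the prefix at 2 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Try to pull one length-prefixed packet out of `stream`, which holds
 * `avail` bytes accumulated from successive recv() calls.
 * Assumed wire format: 2-byte unsigned length, most significant byte
 * first, followed by that many payload bytes.
 * Returns the number of bytes consumed (prefix + payload), 0 if the
 * packet is not complete yet, or (size_t)-1 if `out` is too small. */
size_t extract_packet(const unsigned char *stream, size_t avail,
                      unsigned char *out, size_t out_cap, size_t *out_len)
{
    size_t len;

    if (avail < 2)
        return 0;                   /* the prefix itself is incomplete */

    len = ((size_t)stream[0] << 8) | stream[1];
    if (avail < 2 + len)
        return 0;                   /* payload is still arriving */
    if (len > out_cap)
        return (size_t)-1;          /* caller's buffer is too small */

    memcpy(out, stream + 2, len);
    *out_len = len;
    return 2 + len;
}
```

A caller would append each recv() result to the stream buffer, call this function until it returns 0, and then slide any leftover bytes to the front of the buffer before the next recv().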
Another method for setting up packets on top of a stream protocol is
called "delimiting". Each packet you send in such a scheme is followed
by a unique delimiter. The trick is to think of a good delimiter;
it must be a character or string of characters that will never
occur inside a packet. Some good examples of delimited protocols are
NNTP, POP3, and SMTP, all of which use a carriage-return/line-feed
("CRLF") pair as their delimiter. Delimiting generally only works well
with text-based protocols, because by design they limit themselves to
a subset of all the legal characters; that leaves plenty of possible
delimiters to choose from.
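Here is a minimal sketch of the receiving side of a CRLF-delimited protocol. It only locates the delimiter in bytes that have already been received; the function name find_crlf_line and its return convention are hypothetical, not part of any of the protocols named above.

```c
#include <stddef.h>

/* Look for the first CRLF in buf[0..len).
 * Returns the length of the line before the CRLF (the caller then
 * consumes that many bytes plus 2 for the delimiter), or -1 if no
 * complete line has arrived yet. */
long find_crlf_line(const char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == '\r' && buf[i + 1] == '\n')
            return (long)i;
    }
    return -1;
}
```

When this returns -1, the program simply keeps the bytes it has and waits for more data; a lone carriage return at the end of the buffer may still turn out to be the first half of a delimiter.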
It's also possible to have a mixed approach. HTTP, for example, has
CRLF-delimited headers, one of which can be "Content-length", which is
a length prefix for the data following the headers.
Of these two methods, I prefer length-prefixing, because delimiting
requires your program to blindly read until it finds the end of the
packet, whereas length prefixing lets the program start dealing with the
packet just as soon as the length prefix comes in. On the other hand,
delimiting schemes lend themselves to flexibility, if you design the
protocol like a computer language; this implies that your protocol's
parsers will be complex.
There are a couple of other concerns for properly handling packets
atop TCP. First, always check the return value of recv(),
which indicates how many bytes it placed in your buffer —
it may well return fewer bytes than you expect. Second, don't try
to peek into the Winsock stack's
buffers to see if a complete packet has arrived. For various reasons,
peeking causes problems. Instead, read all the data directly into your
application's buffers and process it there.
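The first point can be sketched as a loop that keeps reading until the expected number of bytes has arrived. So that the sketch runs without a live socket, it reads through a function pointer; read_fn, read_exactly, and the mem_source demo reader are all hypothetical names of mine. In a Winsock program the reader would be a thin wrapper around recv(sd, buf, len, 0).

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for recv(): returns the number of bytes placed in buf
 * (possibly fewer than len), 0 when the peer has closed the
 * connection, or a negative value on error. */
typedef int (*read_fn)(void *ctx, char *buf, int len);

/* Keep reading until exactly `want` bytes are in buf.
 * Returns 1 on success, 0 if the connection closed first, -1 on error. */
int read_exactly(read_fn reader, void *ctx, char *buf, int want)
{
    int got = 0;

    while (got < want) {
        int n = reader(ctx, buf + got, want - got);
        if (n < 0)
            return -1;      /* error */
        if (n == 0)
            return 0;       /* closed before the full amount arrived */
        got += n;           /* short read: go around again */
    }
    return 1;
}

/* Demo reader that hands out at most `chunk` bytes per call, imitating
 * the short reads a real socket can produce. */
struct mem_source {
    const char *data;
    int left;
    int chunk;
};

int mem_read(void *ctx, char *buf, int len)
{
    struct mem_source *src = ctx;
    int n = len;

    if (n > src->chunk) n = src->chunk;
    if (n > src->left)  n = src->left;
    memcpy(buf, src->data, (size_t)n);
    src->data += n;
    src->left -= n;
    return n;               /* 0 once the data runs out, like a close */
}
```

The important part is the `got += n` line: the loop never assumes that one call delivers everything it asked for.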
Problem 2: Byte Ordering
You have undoubtedly noticed all the ntohs() and htonl()
calls required in Winsock programming, but you might not know
why they are required. The reason is that there are two common
ways of storing integers on a computer: big-endian and
little-endian. Big-endian numbers are stored with the
most significant byte in the lowest memory location ("big-end first"),
whereas little-endian systems reverse this. Obviously two computers
must agree on a common number format if they are to communicate, so the
TCP/IP specification defines a "network byte order" that the headers
(and thus Winsock) all use.
The end result is, if you are sending bare integers as part of your
network protocol, and the receiving end is on a platform that uses a
different integer representation, it will perceive the data as garbled. To
fix this, follow TCP's lead and use network byte order,
always.
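A small sketch of what "use network byte order, always" can look like for a 32-bit integer. The helper names put_u32 and get_u32 are my own; htonl() and ntohl() are the standard calls, pulled in from winsock2.h on Windows (link with ws2_32) or arpa/inet.h elsewhere.

```c
#include <stdint.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>
#else
#include <arpa/inet.h>
#endif

/* Copy a 32-bit value into a wire buffer in network byte order. */
void put_u32(unsigned char *wire, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(wire, &net, sizeof net);
}

/* Read a 32-bit network-order value back into host byte order. */
uint32_t get_u32(const unsigned char *wire)
{
    uint32_t net;
    memcpy(&net, wire, sizeof net);
    return ntohl(net);
}
```

Going through memcpy() rather than casting the buffer pointer to an integer pointer also avoids alignment trouble when the integer sits at an odd offset in the packet.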
The same principles apply to other platform-specific data formats,
such as floating-point values. Winsock does not define functions to create
platform-neutral representations of data other than integers, but there
is a protocol called the External
Data Representation (XDR) which does handle this. XDR formalizes
a platform-independent way for two computers to send each other
various types of data. XDR is simple enough that you can probably
implement it yourself; alternately, you might take a look at the
Libraries page to find libraries that
implement the XDR protocol.
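To give a feel for how small such an implementation can be, here is a sketch of sending a single-precision float the way XDR does: as its 4-byte IEEE 754 bit pattern, most significant byte first. It assumes the host's float is already in IEEE 754 single-precision format and is stored with the same byte order as the host's integers, which holds on common platforms but is not guaranteed by C. The names put_f32 and get_f32 are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>
#else
#include <arpa/inet.h>
#endif

/* Write a float to the wire as 4 bytes, most significant byte first.
 * Assumes the host float is IEEE 754 single precision. */
void put_f32(unsigned char *wire, float value)
{
    uint32_t bits, net;

    memcpy(&bits, &value, sizeof bits);     /* grab the raw bit pattern */
    net = htonl(bits);
    memcpy(wire, &net, sizeof net);
}

/* Read a float back from its 4-byte wire form. */
float get_f32(const unsigned char *wire)
{
    uint32_t bits, net;
    float value;

    memcpy(&net, wire, sizeof net);
    bits = ntohl(net);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```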
For what it's worth, network byte order is big-endian, though you
should never take advantage of this fact. Some programmers working on
big-endian machines ignore byte ordering issues, but this makes your
code non-portable, and it can become a bad habit that will bite you
later. The most common little-endian CPUs are the Intel x86 and the
Digital Alpha. Most everything else is big-endian. There are a few
"bi-endian" devices that can operate in either mode, like the PowerPC
and the HP PA-RISC 8000. Most PowerPCs always run in big-endian mode,
however, and I suspect that the same is true of the PA-RISC.
Problem 3: Structure Padding
To illustrate the structure padding problem, consider this C
declaration:
struct foo {
    char a;
    int b;
    char c;
} foo_instance;
Assuming 32-bit ints, you might guess that the structure
occupies 6 bytes, but this is not so. For efficiency reasons, compilers
"pad" structures to align the data members in a way that is convenient
for the CPU. Most CPUs can access 32-bit integers faster if they are at
addresses evenly divisible by 4, so the above structure would probably
take up 12 bytes on these systems. This issue rears its head when you
try to send a structure over Winsock whole, like this:
send(sd, (char*)&foo_instance, sizeof(foo_instance), 0);
Unless the receiving program was compiled on the same machine
architecture with the same compiler and the same compiler options,
you have no guarantee that the other machine will receive the data
correctly.
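If you want to see what layout your own compiler chose for this structure, sizeof and offsetof will tell you. This is only a diagnostic sketch; the function names are mine, and the numbers it prints depend on your compiler, its options, and the CPU, which is exactly the problem.

```c
#include <stddef.h>
#include <stdio.h>

struct foo {
    char a;
    int b;
    char c;
};

/* Number of bytes in struct foo that hold no member data (padding). */
size_t foo_padding(void)
{
    return sizeof(struct foo) - (sizeof(char) + sizeof(int) + sizeof(char));
}

/* Print where the compiler placed each member. */
void print_foo_layout(void)
{
    printf("sizeof(struct foo) = %lu\n", (unsigned long)sizeof(struct foo));
    printf("offset of a        = %lu\n", (unsigned long)offsetof(struct foo, a));
    printf("offset of b        = %lu\n", (unsigned long)offsetof(struct foo, b));
    printf("offset of c        = %lu\n", (unsigned long)offsetof(struct foo, c));
    printf("padding bytes      = %lu\n", (unsigned long)foo_padding());
}
```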
The solution is to always send structures "packed" by sending the
data members one at a time. Alternatively, you can force your compiler
to pack the structures for you, at the cost of a speed penalty in the
code that accesses those structures. Visual C++ can do this with the
/Zp command line option or the #pragma pack directive, and Borland
C++ can do this with the -a command line option. Keep the byte
ordering problem in mind, however: if you send a packed structure in
place, be sure to convert its multi-byte members to network byte order
before you send it.
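Sending the members one at a time can be sketched as copying them into a byte buffer with a fixed layout and sending that buffer instead of the structure. The 6-byte layout, FOO_WIRE_SIZE, foo_pack, and foo_unpack are my own illustration, and the code assumes 32-bit ints as the discussion above does.

```c
#include <stdint.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>
#else
#include <arpa/inet.h>
#endif

struct foo {
    char a;
    int b;      /* assumed to be 32 bits wide */
    char c;
};

#define FOO_WIRE_SIZE 6     /* 1 + 4 + 1 bytes, no padding */

/* Copy the members into `wire` one at a time, integer in network order. */
void foo_pack(const struct foo *f, unsigned char *wire)
{
    uint32_t net_b = htonl((uint32_t)f->b);

    wire[0] = (unsigned char)f->a;
    memcpy(wire + 1, &net_b, sizeof net_b);
    wire[5] = (unsigned char)f->c;
}

/* Rebuild the structure from its 6-byte wire form. */
void foo_unpack(const unsigned char *wire, struct foo *f)
{
    uint32_t net_b;

    f->a = (char)wire[0];
    memcpy(&net_b, wire + 1, sizeof net_b);
    f->b = (int)ntohl(net_b);
    f->c = (char)wire[5];
}
```

The sender would then pass the wire buffer and FOO_WIRE_SIZE to send() rather than the structure and its sizeof, so the bytes on the network no longer depend on how either compiler lays the structure out in memory.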
The Moral of the Story
Trust Winsock to send your data correctly, but don't assume that it
works the way you think that it ought to!
Copyright © 1998-2004 by Warren Young. All rights
reserved.