Sunday, August 1, 2004

Network Protocols and Vectorization

Doing things in parallel is one of the older performance tricks.  Vector SIMD machines -- like the Cray supercomputers -- attack problems that benefit from doing the same thing to lots of different pieces of data simultaneously.  It's just a performance trick, but it drove the design and even the physical shape of those machines, because the problems they were built to tackle -- airflow simulation, weather prediction, nuclear explosion simulation, etc. -- are both important and difficult to scale up.  (More recently, we're seeing massively parallel machines built out of individual commodity PCs; conceptually the same, but limited mostly by network latency/bandwidth.)

So what does this have to do with network protocols?  Just as the problem of doing things like a matrix-vector multiply very, very fast drove the designs of supercomputers, the problem of moving data from one place to another very quickly, on demand, drives the designs of today's network services.  The designs of network APIs (whether REST, SOAP, XML-RPC, or whatever) need to take these demands into account.

In particular, transferring lots of small pieces of data in serial fashion over a network can be a big problem.  Lots of protocols that are perfectly fine when run locally or over a LAN fail miserably when expected to deal with 100-200ms latencies on a WAN or the Internet.  HTTP does a decent job of balancing out performance/latency issues for retrieving human-readable pages -- a page comes down as a medium-sized chunk of data, followed by, if necessary, associated resources such as scripts, style sheets, and binary images, which can all be retrieved in parallel/behind the scenes.  Note that this is achieved only through lots of work on the client side and deep knowledge of the interactions between HTML, HTTP, and the final UI.  The tradeoff is complexity of protocol and implementation.

How does this apply to network protocols in general?  One idea is to carefully scrutinize protocol requests that transfer a single small piece of data.  Often a single small piece of data isn't very useful on its own.  Are there common use cases where a system will do this in a loop, perhaps serially, to get enough data to process or present to a user?  If so, perhaps it would be a good idea to think of "vectorizing" that part of the protocol.  Instead of returning a single piece of data, for example, return a variable-length collection of those pieces of data.  The semantics of the request may change only slightly -- from "I return an X" to "I return a set of X".  Ideally, the length should be dynamic and the client should be able to ask for "no more than N" on each request.
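As a concrete sketch of that "I return an X" to "I return a set of X" shift, here's a toy example in Python.  The function names and the dictionary standing in for the remote store are invented for illustration -- the point is the shape of the interface, not any real API:

```python
# Hypothetical sketch: turning a scalar request into a vectorized one.
# 'store' stands in for the remote service; in real life each call below
# would be a network round trip.

def get_item(store, key):
    """Scalar form: one round trip per key -- painful at 100-200ms each."""
    return store[key]

def get_items(store, keys, max_items=50):
    """Vectorized form: "I return a set of X", with the client bounding
    the response size via max_items ("no more than N")."""
    batch = keys[:max_items]
    return {k: store[k] for k in batch}

store = {"a": 1, "b": 2, "c": 3}
# One round trip instead of three:
get_items(store, ["a", "b", "c"])  # -> {"a": 1, "b": 2, "c": 3}
```

Note that the client, not the server, picks the bound; that way a client on a slow or flaky link can ask for small batches without a protocol change.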

For example, imagine a protocol that requires a client to first retrieve a set of handles (say, mailboxes for a user) then query each one in turn to get some data (say, the number of unread messages).  If this is something that happens often -- for example, automatically every two minutes -- there are going to be a lot of packets hitting servers.  If multiple mailboxes are on one server, it would be fairly trivial to vectorize the second call and effectively combine the two queries into one -- call it "get mailbox state(s)".  This would let a client retrieve the state for all mailboxes on a given server, with better latency and far less bandwidth than the serial approach.  Of course there's no free lunch; if a client is dealing with multiple servers, it now has to group the mailboxes by server for purposes of retrieving state.  But conceptually, it's not too huge of a leap.
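The client-side grouping step can be sketched in a few lines of Python.  Everything here is made up for illustration (the server names, the "get mailbox states" call, the in-memory dictionary standing in for the servers); it just shows the bookkeeping the client takes on in exchange for fewer round trips:

```python
# Illustrative sketch of the mailbox example.  Server and mailbox names,
# and the get_mailbox_states() call, are hypothetical.
from collections import defaultdict

def group_by_server(handles):
    """Group (server, mailbox) handles so each server gets one query."""
    groups = defaultdict(list)
    for server, mailbox in handles:
        groups[server].append(mailbox)
    return dict(groups)

def get_mailbox_states(server, mailboxes, unread_counts):
    """The vectorized call: one "get mailbox states" per server,
    instead of one "get mailbox state" per mailbox."""
    return {m: unread_counts[(server, m)] for m in mailboxes}

# Stand-in for remote state: (server, mailbox) -> unread count.
unread = {("mail1", "inbox"): 4, ("mail1", "lists"): 0, ("mail2", "work"): 7}
handles = list(unread.keys())

states = {}
for server, boxes in group_by_server(handles).items():
    states.update(get_mailbox_states(server, boxes, unread))
# Two round trips (one per server) instead of three (one per mailbox).
```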

There are other trade-offs.  If the "extra" data is large -- like a binary image -- it might well be better to download it separately, perhaps in parallel with other things.  If it's cacheable, but the main data isn't, it may again be better to separate it out so you can take advantage of things like HTTP caching. 

To summarize, one might want to vectorize part of a network protocol if:
  • Performance is important, and network latency is high and/or variable;
  • The data to be vectorized are always or often needed together in common use cases;
  • It doesn't overcomplicate the protocol;
  • There's no other way to achieve similar performance (parallel requests, caching, etc.).
Of course, this applies to the Atom API.  There's a fair amount of vectorization in the Atom API from the start, since it's designed to deal with feeds as collections of entries.  I think there's a strong use case for being able to deal with collections of feeds as part of the Atom API as well, for all the reasons given above.  Said collections of feeds might be feeds I publish (so I want to know about things like recent comments...) or perhaps feeds I'm tracking (so I want to be able to quickly determine which feeds have something interesting, before downloading all of the most recent data).  It would be interesting to model this information as a synthetic feed, since of course that's already nicely vectorized.  But there are plenty of other ways to achieve the same result.
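For illustration, a synthetic "feed of feeds" might look something like the sketch below, with one entry per tracked feed.  This is purely hypothetical -- the markup loosely follows Atom 0.3 conventions, but the particular way of encoding "what's new" in the title is invented for the example, not part of any Atom draft:

```xml
<!-- Hypothetical synthetic feed: one entry summarizes each tracked feed,
     so a client learns the state of many feeds in a single request. -->
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <title>Feeds I'm tracking</title>
  <modified>2004-08-01T12:00:00Z</modified>
  <entry>
    <title>Example Weblog (3 new entries)</title>
    <link rel="alternate" type="application/atom+xml"
          href="http://example.org/feed.atom"/>
    <modified>2004-08-01T11:42:00Z</modified>
  </entry>
  <!-- ...one entry per tracked feed... -->
</feed>
```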