Wednesday, December 15, 2004

Spam spam spam spam...

Back in June, apparently, the FTC said that a do-not-email list (like the do-not-call list) would not work, and would generate more spam because spammers would use it as a source of new email addresses.  Though it's a bit late now, I have to wonder about the latter point.  Why not simply map each address into its MD5 checksum before storing it?

So foo@example.com would become "a0b6e8fd2367f5999b6b4e7e1ce9e2d2", which is useless for sending email.  However, spammers could use any of many available tools to check for "hits" on their email lists, so it's still perfectly usable for filtering out email addresses.  Of course it would also tell spammers that they have a 'real' email address on their list, but only if they already had it -- so I don't think that would be giving them much information at all.
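
As a rough sketch of the scrubbing step this would enable (Python's hashlib provides MD5; lowercasing and trimming the address first is my assumption about how a registry would normalize addresses):

    import hashlib

    def hash_address(email):
        """MD5 hex digest of a normalized (trimmed, lowercased) address."""
        return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

    # The registry stores only digests; a sender scrubbing a mailing list
    # drops any address whose digest appears in the registry.
    registry = {hash_address("foo@example.com")}
    mailing_list = ["foo@example.com", "bar@example.com"]
    scrubbed = [a for a in mailing_list if hash_address(a) not in registry]
    print(scrubbed)  # ['bar@example.com']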

I still think the list would be useless because spammers would simply ignore it.  But it wouldn't generate new spam, and it would drive up the cost of spamming by making the threat of legal action a bit more credible.

Tuesday, December 14, 2004

The Noosphere Just Got Closer

Of course it'll take several years, but Google's just-announced project to digitize major university library collections means that the print-only "dark matter" of the noosphere is about to be mapped out and made available to anyone with an Internet connection.  Well, at least the parts that have passed into the public domain will be fully available; the rest will be indexed.

I'm clearly a geek -- my toes are tingling.

Monday, December 13, 2004

The "5th Estate"

Interesting quote, from my point of view, in this article:

Jonathan Miller, Head of AOL in the US, testifies to the popularity of Citizen's Media. He says that 60 - 70 per cent of the time people spend on AOL is devoted to 'audience generated content'.

(Though he's talking mostly about things like message boards and chat rooms, of course, rather than blogs.)

Monday, December 6, 2004

Welcome MSN Spaces!

A surprise to welcome me back from sabbatical: Microsoft released the beta of MSN Spaces (congratulations guys!).  I've been playing with it a bit over the past few days; there's some very cool stuff there, especially the integration between Microsoft applications.

(I've seen a few comments about the instability of the Spaces service; come on folks, it's a beta.  And they're turning around bug fixes in 48 hours while keeping up with what has got to be a ton of traffic.)

Wednesday, November 24, 2004

The Atom Publishing Protocol Summarized

The slides from Joe Gregorio's XML 2004 talk about the Atom Publishing Protocol are online.  It's an excellent summary, and makes a good case for the document literal and addressable web resource approaches.  The publishing protocol is where Atom really starts to get exciting.

Tuesday, November 23, 2004

Software Patents Considered Harmful

This post by Paul Vick is, I think, a very honest and representative take on software patents -- and in particular the over-the-top IsNot patent -- from the point of view of an innovator.  I find myself agreeing with him wholeheartedly:

Microsoft has been as much a victim of this as anyone else, and yet we're right there in there with everyone else, playing the game. It's become a Mexican standoff, and there's no good way out at the moment short of a broad consensus to end the game at the legislative level.

And we all know how Mexican standoffs typically end.  Paul, my name is on a couple of patents which I'm not proud of either.  But in the current environment, there really isn't a choice: We're all locked in to locally 'least bad' courses, which together work to guarantee the continuation of the downward spiral (and in the long run, make all companies worse off -- other than Nathan Myhrvold's, of course.)

Monday, November 22, 2004

Web Services and KISS

Adam Bosworth argues eloquently for the 'worse is better' philosophy of web services in his ISCOC talk and blog entry.  I have a lot of sympathy for this point of view.  I'm also skeptical about the benefits of the WS-* paradigm; the specs seem to me to be well designed to sell development tools and enterprise consulting services.

Sunday, November 14, 2004

Why Aggregation Matters

Sometimes, I feel like I'm banging my head against a wall trying to describe just why feed syndication and aggregation are important.  In an earlier post, I tried to expand the universe of discourse by throwing out as many possible uses as I could dream up.  Joshua Porter has written a really good article about why aggregation is a big deal, even just considering its impact on web site design: Home Alone? How Content Aggregators Change Navigation and Control of Content

Monday, November 1, 2004

Prediction is Difficult, Especially the Future

My second hat at AOL is development manager for the AOL Polls system. This means I've had the pleasure of watching the conventions and debates in real time while sitting on conference calls, monitoring the performance of our instant polling systems -- which had some potential issues but which, after a lot of work, seem to be just fine now. Anyway: the interesting thing about the instant polling during the debates was how different the results were from the conventional instant phone polls. For example, after the final debate the AOL Instapoll respondents gave the debate win to Kerry by something like 60% to 40%; the ABC News poll was more like 50%/50%. Frankly, I don't believe any of these polls. However, I'll throw this thought out: the online instapolls are taken by a self-selected group of people who are interested in the election and care about making their opinions known. Hmmm... much like the polls being conducted tomorrow.
I'll go out on a limb and make a prediction based on the various poll results and on a lot of guesswork: Kerry will win the popular vote by a significant margin. And, he'll win at least half of the "battleground" states by a margin larger than the last polls show. But, I make no predictions about what hijinks might ensue in the Electoral College.

Update 11/11: Well, maybe not...

Monday, October 18, 2004

Random Note: DNA's Dark Matter

Scientific American's The Hidden Genetic Program of Complex Organisms grabbed my attention last week.  This could be the biological equivalent of the discovery of dark matter.  Basically, the 'junk' or intron DNA that forms a majority of our genome may not be junk at all, but rather control code that regulates the expression of other genes. 

The programming analogy would be, I think, that the protein-coding parts of the genome would be the firmware or opcodes while the control DNA is the source code that controls when and how the opcodes are executed.  Aside from the sheer coolness of understanding how life actually works, there's a huge potential here for doing useful genetic manipulation.  It's got to be easier to tweak control code than to try to edit firmware... (Free link on same subject: The Unseen Genome.)

Monday, October 11, 2004

Things in Need of a Feed

Syndicated feeds are much bigger than blogs and news stories; they're a platform.  A bunch of use cases, several of which actually exist in some form, others just things I'd like to see:
Addendum 11/11:

Tuesday, October 5, 2004

Niche Markets

Niche markets are where it's at: Chris Anderson's The Long Tail is exactly right. The Internet not only eliminates the overhead of physical space but also, more importantly, reduces the overhead of finding what you want to near-zero. When your computer tracks your preferences and auto-discovers new content that you actually want, it enables new markets that couldn't otherwise exist.

Update 10/11: Joi Ito's take.

Sunday, August 1, 2004

Network Protocols and Vectorization

Doing things in parallel is one of the older performance tricks.  Vector SIMD machines -- like the Cray supercomputers -- attack problems that benefit from doing the same thing to lots of different pieces of data simultaneously.  It's just a performance trick, but it drove the design and even the physical shape of those machines because the problems they're trying to tackle -- airflow simulation, weather prediction, nuclear explosion simulation, etc. -- are both important and difficult to scale up.  (More recently, we're seeing massively parallel machines built out of individual commodity PCs; conceptually the same, but limited mostly by network latency/bandwidth.)

So what does this have to do with network protocols?  Just as the problems of doing things like a matrix-vector multiply very, very fast drove the designs of supercomputers, the problems of moving data from one place to another very quickly, on demand, drive the designs of today's network services.  The designs of network APIs (whether REST, SOAP, XML-RPC, or whatever) need to take these demands into account.

In particular, transferring lots of small pieces of data in serial fashion over a network can be a big problem.  Lots of protocols that are perfectly fine when run locally or over a LAN fail miserably when expected to deal with 100-200ms latencies on a WAN or the Internet.  HTTP does a decent job of balancing out performance/latency issues for retrieving human-readable pages -- a page comes down as a medium-sized chunk of data, followed by, if necessary, associated resources such as scripts, style sheets, and binary images, which can all be retrieved in parallel/behind the scenes.  Note that this is achieved only through lots of work on the client side and deep knowledge of the interactions between HTML, HTTP, and the final UI.  The tradeoff is complexity of protocol and implementation.

How does this apply to network protocols in general?  One idea is to carefully scrutinize protocol requests that transfer a single small piece of data.  Often a single small piece of data isn't very useful on its own.  Are there common use cases where a system will do this in a loop, perhaps serially, to get enough data to process or present to a user?  If so, perhaps it would be a good idea to think of "vectorizing" that part of the protocol.  Instead of returning a single piece of data, for example, return a variable-length collection of those pieces of data.  The semantics of the request may change only slightly -- from "I return an X" to "I return a set of X".  Ideally, the length should be dynamic and the client should be able to ask for "no more than N" on each request.

For example, imagine a protocol that requires a client to first retrieve a set of handles (say, mailboxes for a user) and then query each one in turn to get some data (say, the number of unread messages).  If this is something that happens often -- for example, automatically every two minutes -- there are going to be a lot of packets hitting servers.  If multiple mailboxes are on one server, it would be fairly trivial to vectorize the second call and effectively combine the two queries into one -- call it "get mailbox state(s)".  This would let a client retrieve the state for all mailboxes on a given server, with better latency and far less bandwidth than the one-query-per-mailbox approach.  Of course there's no free lunch; if a client is dealing with multiple servers, it now has to group the mailboxes by server for purposes of retrieving state.  But conceptually, it's not too huge of a leap.
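
To make the bookkeeping concrete, here is a minimal sketch in Python; the client calls (get_mailboxes, get_unread_count, get_mailbox_states) are hypothetical stand-ins for whatever the real protocol would provide:

    from collections import defaultdict

    # Serial version: one round trip per mailbox.
    def poll_serial(client, user):
        counts = {}
        for mailbox in client.get_mailboxes(user):                   # 1 request
            counts[mailbox.name] = client.get_unread_count(mailbox)  # +1 request each
        return counts

    # Vectorized version: one "get mailbox state(s)" round trip per server.
    def poll_vectorized(client, user):
        by_server = defaultdict(list)
        for mailbox in client.get_mailboxes(user):
            by_server[mailbox.server].append(mailbox)
        counts = {}
        for server, mailboxes in by_server.items():
            # Returns {mailbox name: unread count} for every mailbox at once.
            counts.update(client.get_mailbox_states(server, mailboxes))
        return counts

The only extra work on the client side is the grouping step described above; everything else stays the same.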

There are other trade-offs.  If the "extra" data is large -- like a binary image -- it might well be better to download it separately, perhaps in parallel with other things.  If it's cacheable, but the main data isn't, it may again be better to separate it out so you can take advantage of things like HTTP caching. 

To summarize, one might want to vectorize part of a network protocol if:
  • Performance is important, and network latency is high and/or variable;
  • The data to be vectorized are always or often needed together in common use cases;
  • It doesn't over-complicate the protocol;
  • There's no way to achieve similar performance through other means (parallel requests, caching, etc.).
Of course, this applies to the Atom API.  There's a fair amount of vectorization in the Atom API from the start, since it's designed to deal with feeds as collections of entries.  I think there's a strong use case for being able to deal with collections of feeds as part of the Atom API as well, for all the reasons given above.  Said collections of feeds might be feeds I publish (so I want to know about things like recent comments...) or perhaps feeds I'm tracking (so I want to be able to quickly determine which feeds have something interesting, before downloading all of the most recent data).  It would be interesting to model this information as a synthetic feed, since of course that's already nicely vectorized.  But there are plenty of other ways to achieve the same result.
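
For illustration only, here's a sketch of what a synthetic "feed of feed states" might look like when built with Python's xml.etree.ElementTree; the unseen element is invented for the example, and namespaces and other required Atom elements are omitted:

    from xml.etree.ElementTree import Element, SubElement, tostring

    def feed_state_summary(states):
        """Build a synthetic feed with one entry per tracked feed."""
        feed = Element("feed")
        SubElement(feed, "title").text = "Feeds I'm tracking"
        for s in states:
            entry = SubElement(feed, "entry")
            SubElement(entry, "title").text = s["title"]
            SubElement(entry, "link", href=s["url"])
            SubElement(entry, "updated").text = s["updated"]
            SubElement(entry, "unseen").text = str(s["new_entries"])
        return tostring(feed, encoding="unicode")

    print(feed_state_summary([
        {"title": "Example blog", "url": "http://example.com/atom",
         "updated": "2004-08-01T12:00:00Z", "new_entries": 3},
    ]))

A client could then poll one summary document instead of re-fetching every tracked feed.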

Sunday, July 4, 2004

Office Space

How important is the physical workspace to knowledge workers generally, and software developers specifically?  Everybody agrees it's important.  Talk to ten people, though, and you'll get nine different opinions about which aspects matter and how much they affect effectiveness.  But there are some classic studies that shed some light on the subject; looking around recently, I haven't found anything that refutes them.  At the same time, a lot of people in the software industry don't seem to have heard of them.

Take the amount and kind of workspace provided to each knowledge worker.  You can quantify this (number of square feet, open/cubicle/office options).  What effects should you expect from, say, changing the number of square feet per person from 80 to 64?  What would this do to your current project's effort and schedule?

There's no plug-in formula for this, but based on the available data, I'd guesstimate that the effort would expand by up to 30%.  Why?

"Programmer Performance and the Effects of the Workplace" describes the Coding War Games, a competition in which hundreds of developers from dozens of companies compete on identical projects.  (Also described in Peopleware: Productive Projects and Teams.) The data is from the 1980's, but hasn't been replicated since as far as I can tell. The developers were ranked according to how quickly they completed the projects, into top 25%, middle 50%, and bottom 25%.  The competition work was done in their normal office environments.
  • The top 25% had an average of 78 square feet of dedicated office space.
  • The bottom 25% had an average of 46 square feet of dedicated office space.
  • The top 25% finished 2.6 times faster, on average, than the bottom 25%, with a lower defect rate.
  • They ruled out the idea that top performers tended to be rewarded with larger offices.
Now, whether larger workspaces improve productivity, or whether more productive people tend to gravitate to companies with larger workspaces, doesn't really matter to me as a manager.  Either way, the answer is the same: Moving from 46 square feet per person to 78 square feet per person can reduce the time to complete a project by a factor of up to 2.6.  That's big.  (Of course there were other differences between the environment of the top 25% and the bottom 25%, but they are largely related to issues like noise, interruptions, and privacy.  It seems reasonable to assume these are correlated with people density.)

In itself, this doesn't give us an answer for the question we started out with (changing from 80 square feet to 64 square feet per person, and bumping up the people density commensurately).  As a first approximation, let's assume a linear relationship between dedicated area per person and productivity ratios.  64 is just over halfway between 46 and 78, so it seems reasonable to use half of the 2.6 factor, or 1.3, as a guesstimate.  So using this number, a project that was going to take two weeks in the old environment would take 1.3 times as long, or around two and a half weeks, in the new environment.  (In the long term, of course.)

To put this into perspective, it appears that increasing an organization's CMM level by one generally results in an 11% increase in productivity, and that the ratio of effort between the worst and best real-world processes is no more than about 1.43.

You can't follow the numbers blindly here.  This probably depends a lot on the kind of work you actually do, and I can think of dozens of caveats.  My gut feeling is that the penalty is likely to be more like 10% than 30%, assuming you're really holding everything else (noise, interruptions, etc.) as constant as possible.  I suspect that the organizations which are squeezing people into ice-cube-sized cubicles are likely to be destroying productivity in other ways as well.  But these numbers do provide some guidance as to what to expect in terms of costs and consequences of changing the workplace environment.

Links and references:


Thursday, July 1, 2004

Community, social networks, and technology at Supernova 2004

Some afterthoughts from the Supernova conference, specifically about social networks and community, though it's difficult to separate the different topics.

A quick meta-note here: Supernova is itself a social network of people and ideas, specifically about technology -- more akin to a scientific conference than an industry conference.  And, it's making a lot of use of various social tools: http://www.socialtext.net/supernova/, http://supernova.typepad.com/moblog/.

Decentralized Work (Thomas Malone) sounds good, but I think there are powerful entrenched stakeholders that can work against or reverse this trend (just because it would be good doesn't mean it will happen).  I'm taking a look at The Future of Work right now; one first inchoate thought is how some of the same themes are treated differently in The Innovator's Solution.

The Network is People - a panel with Christopher Allen, Esther Dyson, Ray Ozzie, and Mena Trott.  Interesting/new thoughts:
  • Chris Allen on spreadsheets:  They are a social tool for convincing people with numbers and scenarios, just like presentation software is for convincing people with words and images.  So if you consider a spreadsheet social software, well, what isn't social software?
  • "43% of time is spent on grooming in large monkey troupes."  (But wait, what species of monkeys are we talking about here?  Where are our footnotes?)  So, the implication is that the amount of overhead involved in maintaining true social ties in large groups is probably very high.  Tools that would actually help with this (as opposed to just growing the size of your 'network' to ridiculous proportions) would be a true killer app. 
  • Size of network is not necessarily a good metric, just one that's easy to measure.  Some people really only want a small group.
Syndication Nation - panel with Tim Bray, Paul Boutin, Scott Rosenberg, Kevin Marks, Dave Sifry.  I felt that this panel had a lot of promise but spent a lot of time on background and/or ratholing on imponderables (like business models).  Kevin and Tim tried to open this up a bit to talk about some of the new possibilities that automatic syndication offers.  At the moment, it's mostly about news stories and blogs and cat pictures.  Some interesting/new thoughts:
  • Kevin stated that # of subscribers to a given feed follows a power law almost exactly, all the way down to 1.  So even having a handful of readers is an accomplishment.  One might also note that this means the vast majority of subscriptions are in this 'micropublishing' area.
  • New syndication possibilities mentioned: Traffic cameras for your favorite/current route. 
  • The Web is like a vast library; syndicated feeds are about what's happening now (stasis vs. change).  What does this mean?
  • The one interesting thing to come out of the how-to-get-paid-for-this discussion: What if you could subscribe to a feed of advertising that you want to see?  How much more would advertisers pay for this?  (Reminds me of a discussion I heard recently about radio stations going back to actually playing more music and less talk/commercials: They actually get paid more per commercial-minute because advertisers realize their ad won't be buried in a sea of crap that nobody is listening to.)
More on some of the other topics later. 

Friday, June 25, 2004

Supernova 2004 midterm update

I'm at the Supernova 2004 conference at the moment.  I'm scribbling notes as I go, and plan to go back and pull the highlights together into a post-conference writeup.  First impressions: Lots of smart and articulate people here, both on the panels and in the 'audience'.  I wish there were more time for audience participation, though there is plenty of time for informal interactions between and after sessions.  The more panel-like sessions are better than the formal presentations.

The Syndication Nation panel had some good points, but it ratholed a bit on standard issues and would have benefited from a longer-term, wider vision.  How to pay for content is important, but it's a well-trodden area.  We could just give it a code name, like a chess opening, and save a lot of discussion time...

I am interested in the Autonomic Computing discussion and related topics, if for no other reason than we really need to be able to focus smart people on something other than how to handle and recover from system issues.  It's addressing the technical complexity problem.

Next problem: The legal complexity problem (IP vs. IP: Intellectual Property Meets the Internet Protocol) - I think this problem is far harder because it's political.  There's no good solution in sight for how to deal with the disruptions technology is causing to business models and the structure of IP law.

And, on a minor note, I learned the correct pronunciation of Esther Dyson's first name.



Sunday, June 20, 2004

Atom Proposal: Simple resource posting

On the Atom front, I've just added a proposal to the Wiki: PaceSimpleResourcePosting. The abstract is:

This proposal extends the AtomAPI to allow for a new creation URI, ResourcePostURI, to be used for simple, efficient uploading of resources referenced by a separate Atom entry. It also extends the Atom format to allow a "src" attribute of the content element to point to an external URI as an alternative to providing the content inline.

This proposal is an alternative to PaceObjectModule, PaceDontSyndicate, and PaceResource. It is almost a subset of and is compatible with PaceNonEntryResources, but differs in that it presents a very focused approach to the specific problem of efficiently uploading the parts of a compound document to form a new Atom entry. This proposal does not conflict with WebDAV but does not require that a server support WebDAV.
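
As a rough sketch of how a client might use this, in Python with the requests library; the URIs, the Location header carrying the new resource's address, and the omission of namespaces and other required Atom elements are all simplifications for illustration, not details specified by the abstract above:

    import requests

    # Step 1: upload the binary resource (say, a picture) to the
    # hypothetical ResourcePostURI; assume the server returns the new
    # resource's address in the Location header.
    with open("cat.jpg", "rb") as f:
        resp = requests.post("http://example.com/atom/resources",
                             data=f, headers={"Content-Type": "image/jpeg"})
    picture_uri = resp.headers["Location"]

    # Step 2: create the Atom entry whose content element points at the
    # uploaded resource via a src attribute instead of inlining the bytes.
    entry = """<entry>
      <title>My cat</title>
      <content type="image/jpeg" src="%s"/>
    </entry>""" % picture_uri
    requests.post("http://example.com/atom/entries", data=entry,
                  headers={"Content-Type": "application/atom+xml"})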

Saturday, June 5, 2004

Atom: Cat picture use case

To motivate discussion about some of the basic needs for the Atom API, I've documented a use case that I want Atom to support: Posting a Cat Picture. This use case is primarily about simple compound text/picture entries, which I think are going to be very common.  It's complicated enough to be interesting but it's still a basic usage.

The basic idea here is that we really want compound documents that contain both text and pictures without users needing to worry about the grungy details; that (X)HTML already offers a way to organize the top-level part of this document; and that Atom should at least provide a way to create such entries in a simple way.

Friday, June 4, 2004

Who am I?

Technorati Profile

I'm currently a tech lead/manager at Google, working on Blogger engineering.

I was formerly a system architect and technical manager for web-based products at AOL. I last managed development for Journals and Favorites Plus.  I've helped launch Public & Private Groups, Polls, and Journals for AOL.

History:

Around 1991, before the whole Web thing, I began my career at a startup which intended to compete with Intuit's Quicken software on the then-new Windows 3.0 platform.  This was great experience, especially in terms of what not to do[*]. In 1993 I took a semi-break from the software industry to go to graduate school at UC Santa Cruz.  About this time Usenet, ftp, and email started to be augmented by the Web.  I was primarily interested in machine learning, software engineering, and user interfaces rather than hypertext, though, so I ended up writing a thesis on the use of UI usability analysis in software engineering.

Subsequently, I worked for a startup that essentially attempted to do Flash before the Web really took hold, along with a few other things. We had plugins for Netscape and IE in '97.  I played a variety of roles -- API designer, technical documentation manager, information designer, project manager, and development manager.  In '98 the company was acquired by CA and I moved shortly thereafter to the combination of AtWeb/Netscape/AOL.  (While I was talking to a startup called AtWeb, they were acquired by Netscape and Netscape was in turn acquired by AOL -- an employment trifecta.)

At AtWeb I transitioned to HTML UIs and web servers, working on web and email listserver management software before joining the AOL Community development group.  I worked as a principal software engineer and then engineering manager.  I've managed the engineering team for the AOL Journals product from its inception in 2003 until the present time; I've also managed the Groups@AOL, Polls, Rostering, and IM Bots projects.

What else have I been doing? I've followed and promoted the C++ standardization process and contributed a tiny amount to the Boost library effort.  On a side note, I've taught courses in object-oriented programming, C++, Java, and template metaprogramming for UCSC Extension, and published two articles in the C++ Users Journal.

I'm interested in software engineering, process and agile methods, Web standards, language standards, generic programming, information architectures, user interface design, machine learning, evolution, and disruptive innovation.

First Post

The immediate purpose of this blog is to publish thoughts about web technologies, particularly Atom.  Of course that suffers from the recursive blogging-about-blogging syndrome, so I'll probably expand it to talk about software in general.

What does the name stand for?  Mostly, it stands for "something not currently indexed by Google".  Hopefully in a little while it will be the only thing you get when you type "Abstractioneer" into Google.  Actually, it's a contraction of "Abstract Engineering", which is a meme I'm hoping to propagate.  More on that later.