dsandler.org

One year at Google: a random sample.

August 10th, 2010




Keyboard features.

April 5th, 2010

March 13th, 2010

If I were looking to sneak up behind someone pumping gas and hold them up or steal their car, I’d definitely do it at a gas station with NBC At The Pump television screens embedded in each island. I spent about 5 minutes at the Natick Service Plaza in the slack-jawed thrall of one of these bastards—completely and utterly oblivious to my surroundings—before I realized what was going on. [Photo credit: Adam “Pass the Goddamn Butter” Lisagor.]

tl;dw

November 21st, 2009

Forgive me, Internet, I have sinned.

It has been six months since my last post. So much has happened since then that I have wanted to tell you about.

And I tried!

Drafts piling up.

Oh, I tried, several times—I have the WordPress drafts to prove it. But somehow anything too large for a tweet ended up a five-page essay with footnotes.1 (I would have written less, but you know how it goes.)

So. Let me be very brief:

  • I finished my doctoral studies and got my degree. You can read my thesis if such things interest you.

Mushface.

Ducks by the river - 30 Boston, from the harbor

  • My wife, The Boy, and I now live in the Boston area. We love it here.

Bugdroid and the Cupcake

  • I now work at Google (with some dear old friends) on the Android operating system. It’s pretty exciting. There is a lot of very hard work yet to do.

  • I am still interested in research topics, including distributed systems, security, and social networks. I also continue to maintain my Mac software on the side. These pursuits have taken something of a back seat to the above bullets, however.

Whew.

Now I can start making notes (of the longer-than-140-character variety) again. Thanks, Internet, for your kind indulgence.


  1. No, seriously. 

Re: Twitter spam.

May 5th, 2009

Abstract

People are suddenly concerned about @reply spam on Twitter. It’s not here yet, exactly, but it will show up—along with its worse cousin, search spam—and then we’ll be in trouble unless Twitter stops showing us messages from strangers quite so readily.

Here comes everybody else

Via Scoble, Loïc Le Meur last week sounded the alarm about Twitter @reply spam:

It had to happen. I just got my first @loic spam and there is no way to filter it. Spammers got it. It’s very simple for them, they just tweet your @username, show up in your replies timeline and bingo that’s the one you pay so much attention at so you won’t miss it. There is no way for you to filter it. None. Yet.

So, it’s true: anyone can create a message that appears in your “Mentions” view, just by, er, mentioning you. Let’s see what that means, and if it’s really that dire.

Asking for it

First, let me say that I don’t think spam will be the death of Twitter; all complex ecosystems have parasites, and spam is a sign that Twitter is a success. In fact, abuse of a public communication system is like crime (or terrorism) in the physical world: you can’t eliminate it entirely, but you can reduce it from a crisis to a nuisance.

This has been a bitter struggle in the world of email, which was forged in the good old days of mutually-trusting research, military, and corporate networks and is therefore every bit as happy to deliver you a message from a deposed prince of Nigeria or vendor of pharmaceuticals as it is from a close friend or colleague.

Twitter, on the other hand, is remarkably resistant to abuse because of its subscription-based information flow: users get only what they ask for. In general, you don’t see messages from strangers, so there’s no way for spammers to get arbitrary text in front of your eyeballs.1

No way, that is, unless you’re in the habit of routing around the subscription system and explicitly looking for messages from strangers. You can do this in one of three ways:

  1. Viewing the public timeline (which nobody actually does);
  2. Searching across Twitter; or
  3. Looking at your mentions (messages containing @username, which—when they were confined to messages beginning with @username—used to be called “replies”).

So far, Twitter spam has been safely held at bay—a nuisance, not a crisis. Mention/reply spam is just not that common. Poking through my handy Twitter data set from last September2, I didn’t find much in the way of @replies that obviously looked like spam.3 I admit to being surprised that this kind of spam isn’t more common, but I think my colleague Mike has hit upon the reason: you can’t reach very many people at once. There’s a finite amount of space available in a tweet, and the spammer has to divide it between recipients and payload. Assuming the average Twitter nickname is 9 characters4 (plus a space and an @) and the average TinyURL is 25, a single spam tweet5 can target at most (140 – 25 – 1) / (9 + 2) = 114 / 11, or about 10 recipients.

Rate limiting by Twitter caps the number of messages that can be sent out by a particular user to just over one per minute. As a result, in order to reach a million Twitter users, a spammer would need to send out 100 messages/hour, each mentioning 10 users, from 1000 junk accounts. This is certainly practical, although not quite the same scale as botnet-fueled email junk. As far as I know, it’s not in common use just yet (although there exist apps like TwitterHawk that occupy a gray area between obviously legitimate and spammy use of Twitter).
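The arithmetic above is easy to check in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, the constants are the averages quoted in the text, and the "one hour" campaign window is an assumption implied by the 100 messages/hour figure.

```python
import math

TWEET_LEN = 140  # maximum characters in a tweet
URL_LEN = 25     # average TinyURL length assumed in the text
NICK_LEN = 9     # average nickname length assumed in the text


def recipients_per_tweet(tweet_len=TWEET_LEN, url_len=URL_LEN, nick_len=NICK_LEN):
    """How many @mentions fit alongside one shortened URL."""
    # One character separates the URL from the mentions; each mention
    # costs the nickname plus an "@" and a space.
    room = tweet_len - url_len - 1
    return room // (nick_len + 2)


def accounts_needed(targets, msgs_per_hour, per_tweet):
    """Junk accounts required to mention `targets` users within one hour."""
    return math.ceil(targets / (msgs_per_hour * per_tweet))


per_tweet = recipients_per_tweet()
print(per_tweet, accounts_needed(1_000_000, 100, per_tweet))
```

Shorter nicknames or a shorter URL shift the numbers a little, but not by an order of magnitude.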

Twitter search (and spam) for everyone!

The most effective vector (most eyeballs; least effort) attacks user behavior #2 above by spamming search terms; when Twitter launched its election.twitter.com live feed, it wasn’t long before people figured out that this was a soft target. And URL shorteners (er, obfuscators) exacerbate the problem by making it very difficult to know whether to click a URL provided by a stranger in search results.

This is where I think we may now be in for some real trouble. On Thursday, Twitter added real-time search (including a smattering of trending topics) to the main web interface. In a blog post entitled Twitter search for everyone!, Biz Stone explains:

What was that loud noise outside your apartment? Did you just feel an earthquake? What do people think about your company, your product, or your city? With this newly launched feature,6 Twitter has become something unexpectedly important—a discovery engine for finding out what is happening right now.

Twitter teaches us new and amazing things every day and a big lesson learned is that search is so much more than a box and a button. As public tweets fly in from around the globe, we analyze them to detect when certain words or phrases occur with higher frequency. These trending phrases are surfaced in the Twitter home page just under the new search box and they’re updated throughout the day. Built on our search technology, trends are a compelling if rudimentary way to explore a collective global consciousness.

Real-time search, and its potential to rapidly spread very, very current information (whether accurate or not, and usually lacking in analysis) is without a doubt one of the most interesting7 features of Twitter. By choosing to feature it so prominently in the main interface, Twitter has dramatically raised the profile of search results, and therefore considerably upended the economics of search spam (in favor of spammers).

A plan for Twitter spam?

Loïc proposes a solution (that applies equally well to search spam and @reply spam):

… add a “report as spam” button. This button would report the spammer to Twitter (or to a separate database of users) that we could exclude from the clients after a sufficient number of users report them as such and maybe some manual checking. Twitter could then just delete those users.

Others propose creating a large central tweet-filtering system akin to Akismet, the blog comments spam clearinghouse.

The problems with these approaches are manifold. Blocking/reporting suspicious users doesn’t scale (requiring users to take the time to manually identify and report junk) and opens the door for abuse (how many spurious reports would be necessary to get a legitimate user canned?). The Akismet approach—content-based filtering—continues to be an arms race, and even when you have access to millions of different users’ messages (as Twitter would, and Gmail does) it’s still hard to know what’s legit and what isn’t. And even if you take a guess (er, a statistical inference), do we now have to create a “Junk” folder for each Twitter user, including the option to “Report as not spam” to move things back into the inbox? I already have one overgrown garden to tend, thanks.

Flees a crowd

Security geeks are notorious for complaining about others’ proposals without offering a better one. Here’s my take: I think the only long-term, scalable approach is to remove “strangers” from common views of the system, forcing the user to take extra steps to go beyond his own subscriptions. In this world, the only junk most users will ever see is junk from friends and family, which can safely be classified as “not spam.” The mentions/replies view, which has become an important Twitter inbox (particularly for users following more than a handful of others), would show a subset of your main view—messages from random users don’t belong in either place. I believe this is in the spirit of how most individuals use Twitter: primarily, to stay in contact with a limited number of friends and colleagues, and only secondarily to engage the public at large.

So how do you meet new people on Twitter if you can’t see their messages? First, it should be possible to see tweets from strangers—just not the default. A “show messages from everyone” checkbox in the mentions or search views ought to suffice. Second, Twitter users are pretty good at passing along interesting stuff (via retweets), so there still ought to be plenty of cross-pollination between circles.

Finally, Twitter can exploit the social graph to show us messages from friends-of-friends (or friends-of-friends-of-friends, and so on, to any desired depth). Messages from users with some indirect connection to me are still unlikely to be spam, and moreover, if the goal is to discover new users I’m most likely to be interested in people connected to my existing contacts (crowdsourcing the task of crowd-finding, if you will). The CS community is really starting to get into this sort of thing, and although I haven’t yet published my work along these lines, it is this approach that I plan to take in FETHR, my proposed distributed microblogging platform.
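To make the friends-of-friends idea concrete, here is a minimal sketch of filtering a mentions view by social-graph distance. It is not how Twitter or FETHR does it; the `follows` mapping, the shape of a mention (a dict with an `author` key), and both function names are assumptions made up for illustration.

```python
from collections import deque


def within_distance(follows, me, max_depth):
    """Users reachable from `me` in at most `max_depth` follow hops (breadth-first)."""
    seen = {me}
    frontier = deque([(me, 0)])
    while frontier:
        user, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in follows.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return seen


def filter_mentions(mentions, follows, me, max_depth=2):
    """Keep only mentions written by someone within `max_depth` hops of `me`."""
    trusted = within_distance(follows, me, max_depth)
    return [m for m in mentions if m["author"] in trusted]


follows = {"me": ["alice"], "alice": ["bob"], "bob": ["carol"]}
mentions = [{"author": a} for a in ("alice", "bob", "carol", "spammer")]
print([m["author"] for m in filter_mentions(mentions, follows, "me")])
```

A depth of 1 reproduces the strict "subscriptions only" view; raising it trades a little spam exposure for more discovery.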

Somewhat ironic postscript

As usual, blog comments and Twitter comments are open below. Of course, culling responses from Twitter relies on searching the entire unfiltered public timeline, which I just burned about a thousand words inveighing against. As I mentioned when I started the experiment, this is easily abused; for now I’m explicitly courting tweets from strangers, and we’ll just have to see how long it lasts.


  1. Except new-follower notifications. But because the attacker can’t control the content of this message (beyond his own username), you (the recipient) have to do some extra work to actually see spam content: click through to the user’s page and possibly on through to some arbitrary URL. This is known as follow spam.

  2. I still intend to make that data set available; watch this space. 

  3. I looked for spammy words and URLs addressed as @replies; there were a few hits, but nowhere near the overwhelming proportion you’d need to declare spam a real problem. However, as discussed elsewhere in this piece, content-based spam identification is a real challenge, and I might certainly be missing something. You might be able to do better; see previous footnote. 

  4. Based on my data, the average is 8.96 chars. 

  5. It is standard procedure to coin an awkward portmanteau for every new type of spam encountered (spim, spit, spamdex, to name a few). It is also de rigueur to create Twitter-related neologisms in a similar fashion (typically centering around use of “Tw” as a sort of charming speech impediment). Convention would therefore dictate that we call spam tweets something like “twam” or “sweets.” Instead, I will break with established practice and propose that we simply refer to Twitter spam as “bird poop.” 

  6. It’s hardly “newly-launched”—many of us have been using search.twitter.com since before Twitter bought it from Summize. But it’s still new to many users, so I’ll let this slide. 

  7. Interesting for users, of course, but possibly also for business reasons: (1) it might be something Twitter could charge for in some form; (2) it might make Twitter worth buying. This may explain why search has been pushed in front of users’ noses: it’s useful for users, but it’s valuable to the company. (Too cynical?) 

FETHR roadmap.

April 30th, 2009

I’ve gotten some really excellent feedback about FETHR since I unveiled it at IPTPS last week. There’s been a steady hum of RTs and hosannahs on Twitter, a handful of thoughtful emails, and a few FriendFeed discussions (notably Chris Messina and Andy Baio1).

The FETHR slide deck has been particularly well-received. In it I compare microblogging in 2009 to email in 1983, an analogy which seems to resonate with people (at least, people of a certain age); I think it succinctly summarizes where we are today and what needs to happen in order for microblogging to become, in fact, a communication utility alongside email, IM, blogs, and so on.

FETHR, Laconica, and OpenMicroBlogging

I’ve gotten a lot of useful criticism and feedback as well. Several people asked whether I was aware of the open-source Laconica project and how what I’m doing differs. In short: yes, I’m aware of Laconica and OpenMicroBlogging (OMB). I started my work in April ’08, and when Laconica launched in July I was gratified to see someone else pursuing open microblogging.

I think FETHR and OMB are cousins. Evan Prodromou (the creator of Laconica and the owner of identi.ca, a Twitter-like multi-user microblogging service) and I both see the same need: for an ecosystem of microblogging systems that seamlessly2 interoperate. Each is a RESTful protocol designed to be used exclusively over HTTP; participants are uniquely identified by a canonical URL, which is used as a rendezvous point for API calls.

Substantial technical differences exist between the two protocols, however:3

  • Message distribution. A lesson I took away from my MS work on RSS feeds is that being popular is a curse: you have to satisfy all those hungry new readers. Unlike RSS, FETHR (and OMB) are push protocols, so those readers aren’t periodically making (redundant, useless) requests, but a popular microblogger (e.g. Heather Champ) might still have to make half a million HTTP requests every time she wants to post a message. FETHR addresses this problem by allowing a publisher to ask subscribers to assist with message dissemination by gossiping updates among one another.

  • Security. Each applies security techniques to prevent abuse, particularly spam (messages from sources the user isn’t subscribed to). OMB uses OAuth4 to secure individual connections, but this won’t fly in a true p2p environment like FETHR (in which you might receive a new message from someone other than its author). FETHR therefore secures data rather than channels; that is, individual messages are signed and hash chained in order to authenticate messages and a publisher’s timeline as a whole. (There are some other useful properties of this approach; see §3.2 of the IPTPS paper for details.)

  • Message content. Finally, as currently written, OMB is naturally focused on replicating the Twitter experience, including some maximum field lengths set at 140 characters (e.g. the user’s “location”) and a specified size (96x96 pixels) for avatar images. It expands on Twitter in certain areas (including a sorely-needed “seealso” property to allow a message to refer to some external resource), but these too are fixed in the protocol. FETHR attempts to remain agnostic on these data, choosing instead to specify a bare minimum of necessary properties and allowing applications to superimpose arbitrary key-value data (read: a JSON object). Essentially, FETHR anticipates that application developers will create their own micro-formats5 to describe whichever micropublishing service model suits them best.
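The "secure the data, not the channel" idea from the Security bullet can be illustrated with a toy hash chain. This is a sketch under loud assumptions, not the FETHR wire format: the entry fields, the `GENESIS` placeholder, and the function names are invented here, and an HMAC with a shared key stands in for a real public-key signature only because the Python standard library has none.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first message


def _digest(body, prev):
    # Hash the message content together with the previous entry's hash,
    # so each entry commits to everything that came before it.
    payload = json.dumps({"body": body, "prev": prev}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append(timeline, body, key):
    """Add a signed entry to the end of the publisher's timeline."""
    prev = timeline[-1]["hash"] if timeline else GENESIS
    h = _digest(body, prev)
    # ASSUMPTION: HMAC stands in for a public-key signature.
    sig = hmac.new(key, h.encode(), hashlib.sha256).hexdigest()
    timeline.append({"body": body, "prev": prev, "hash": h, "sig": sig})


def verify(timeline, key):
    """Check every link and signature, no matter who relayed the entries."""
    prev = GENESIS
    for entry in timeline:
        h = _digest(entry["body"], prev)
        if entry["prev"] != prev or entry["hash"] != h:
            return False  # altered, reordered, or missing entry
        expected = hmac.new(key, h.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["sig"], expected):
            return False  # not produced by the key holder
        prev = h
    return True
```

Because verification depends only on the entries themselves, a subscriber can accept a message gossiped by a third party and still detect tampering or a dropped update.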

Birdfeeder roadmap

There’s lots of work to do on FETHR, and now that I’ve gotten another large blocking project out of the way, I can resume work on both the protocol and the Birdfeeder prototype. In rough priority order:

  1. Protocol docs. Many (including Dave Winer in a comment on Messina’s FriendFeed) have asked for a public specification of the FETHR protocol. I absolutely intend to provide this, although the exact wire protocol (as spoken by Birdfeeder) is something of a moving target at the moment. If you’re desperate, you can suss it out from the source, which brings me to…

  2. Refactoring. There’s a lot of work to do on the code; I grew it organically and unpredictably as my ideas about FETHR developed, and as a result, parts of it bear a strong resemblance to a pasta dinner. Aside from general cleanups (more comprehensive tests, documentation, etc.) I’d like to better separate the Birdfeeder front-end from the FETHR transport so that it’s easier to develop new applications without all the HTML.

  3. Local Twitter support in Birdfeeder. Currently, Twitter connectivity is handled by a single FETHR node, twittergw, which acts as a gateway between the two networks. (I first described the gateway in a previous blog post.) Long-term, Twitter should support the FETHR API directly; twittergw was created to provide a crude replica of such functionality. I’m running up against the limitations of this approach, however; I now plan to move Twitter support into the Birdfeeder client itself, which should overcome the limitations of twittergw and improve the out-of-box experience for people looking to try out the system.

  4. More refactoring. Birdfeeder makes use of the web.py application framework, but it currently uses a number of worker threads, which requires that web.py be used as a standalone server. I want Birdfeeder to operate as a stateless CGI, which means moving background tasks out of threads and into special URL handlers (suitable for tickling with a cron job). This would pave the way for…

  5. Google App Engine. The above improvements will allow Birdfeeder to run on GAE, which is a big step toward a large multi-user system that non-technical users can sign up for (similar to identi.ca). Note, however, that the point of this project is not to create a public microblogging service, but to lay the foundation for many public micropublishing services; still, it can be beneficial to have a public example as an “ambassador” for the technology.
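The thread-to-handler move in item 4 boils down to replacing a long-running loop with a function that does a bounded amount of work each time it is poked. The sketch below is framework-free and purely illustrative: `run_pending` and the `(due_time, callable)` task shape are hypothetical, not Birdfeeder code, and a real stateless deployment would keep the task list in a datastore rather than in memory.

```python
import time


def run_pending(tasks, now=None, budget=10):
    """Run up to `budget` tasks that are due; return the ones still waiting.

    `tasks` is a list of (due_time, callable) pairs. A URL handler hit by a
    cron job would call this once per request instead of a thread looping.
    """
    now = time.time() if now is None else now
    remaining, ran = [], 0
    # Oldest tasks first; sort on the timestamp only, since callables
    # cannot be compared with one another.
    for due, func in sorted(tasks, key=lambda t: t[0]):
        if due <= now and ran < budget:
            func()
            ran += 1
        else:
            remaining.append((due, func))
    return remaining
```

The `budget` cap matters on hosts that limit request duration: anything left over simply waits for the next tick.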

Research

This is just the “development” side of the R&D equation. I’ve got a stack of research topics to tackle as well (some of which can be found toward the end of my slide deck); I expect to pick these threads back up a little later in the summer.

Thoughts? Requests? I’ll leave the regular blog comments open so you can write longer notes than the Twitter watercooler will allow (although tweets are welcome as well).


  1. I owe particular thanks to Andy for giving the FETHR talk a huge bump by noting it on his widely-read links page for April 21st. 

  2. Well, almost seamlessly. In particular, the user experience for subscribing to a user becomes much more complex: you can’t simply create a “Follow!” button, because the subscriber may be using a different service. This is an open problem, related to that of RSS subscription, with similar solutions (copy/pasting addresses, bookmarklets, or new URI schemes). 

  3. This discussion is based on my reading of the current draft of the OpenMicroBlogging specification; there are portions of the document that are ambiguous and some important terms (such as “user”) are ill-defined. If you think I’ve misunderstood something, please leave a note below. 

  4. OAuth was designed to allow a user to grant a limited amount of authority from one service (of which he is a user) to another (which he also uses). For example, Twitter might want to look through your Gmail contacts to see if any of them is already a Twitter user; it could do this by asking for your Gmail username and password, but this would give Twitter access to all your email as well (which is not good). Instead, Twitter makes an OAuth request (on your behalf) to Gmail, which then asks you (interactively) if this is in fact what you want before handing a special authorization token back to Twitter for this specific purpose only (think of it as a “valet key”). Because it’s essentially a capability system for Web resources, OAuth doesn’t have to be used this way, and OMB chooses to use it as a way to allow one user (the subscriber) to delegate a right (the ability to send him messages) to another user (the publisher). 

  5. Not to be confused with Microformats, which are (X)HTML-based metadata that allow a human-readable webpage to be seamlessly interpreted by “semantic Web” agents. 

April 17th, 2009
Three years; 284 sheets of 20lb. bond; $120.00 (for binding) …and it’s DONE.