Real-world RSS throttling on Slashdot.
Slashdot is the grand-daddy of early-adopter websites, and as such has the most-subscribed-to RSS feed in the known universe. (If you look at individual metrics such as the top Bloglines feeds, Slashdot wins handily with over 20,000 subscribers.) Slashdot may also have been the first website to actually implement RSS throttling, having been the first website to need it; the policy blocks clients not on instantaneous polling frequency, but on the number of requests they make within a sliding one-hour window.
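Slashdot's actual throttling code isn't public, but the sliding-window idea is easy to sketch. Here's a minimal Python version, assuming one deque of request timestamps per client and an invented limit of 60 hits per hour (the real threshold isn't stated):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # the sliding one-hour window
HOURLY_LIMIT = 60       # invented threshold; Slashdot's real number isn't given

hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this client is still under the sliding one-hour limit."""
    now = time.time() if now is None else now
    window = hits[ip]
    # Evict timestamps that have slid out of the one-hour window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= HOURLY_LIMIT:
        return False  # throttled: too many hits in the trailing hour
    window.append(now)
    return True
```

The point of a sliding window, as opposed to a fixed per-minute rate check, is that a client can't evade the limit by bursting and then going quiet: 120 fetches in ten minutes counts the same as 120 fetches spread across the hour.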
I recently wrote about RSS throttling techniques, and expressed my concern at the scalability of tracking RSS hogs on the server. Slashdot’s Jamie McCarthy has just written a thoughtful response, including details of Slashdot’s implementation.
I’ll grant that our accesslog traffic is pretty I/O intensive. But if you were only talking about logging RSS hits and nothing else, it’d be a piece of cake.
[…]
How many RSS hits can you get in an hour? A hundred thousand? That’s peanuts, especially since each row is fixed size.
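For scale (invented numbers, not Jamie's): at 100,000 hits an hour, a fixed-size row of, say, 40 bytes, enough for a packed IP, a timestamp, and some overhead, works out to under 4 MB of raw data an hour:

```python
HITS_PER_HOUR = 100_000   # Jamie's hypothetical volume
ROW_BYTES = 40            # invented fixed row size: packed IP + timestamp + overhead

print(HITS_PER_HOUR * ROW_BYTES / 2**20)  # ~3.8 MiB per hour
```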
Jamie goes on to explain that the access logs are only periodically (every two minutes) sifted into a separate RSS-tracking table, which is then combed for abusive clients. Offenders are blacklisted in a third table (actually a file read by a PerlAccessHandler) to enforce the block.
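The table and file names below are invented, but this is the shape of the pipeline Jamie describes: a raw access log, a batch job every two minutes that sifts RSS hits into a tracking table and combs it for offenders, and a blacklist file that the front-end access handler consults on every request. A sketch in Python, with SQLite standing in for Slashdot's real database:

```python
import sqlite3
import time

WINDOW_SECONDS = 3600   # the sliding one-hour window
HOURLY_LIMIT = 60       # invented threshold; the real number isn't given
BLACKLIST_FILE = "rss_blacklist.txt"  # invented name; Slashdot's is read by a PerlAccessHandler

SCHEMA = """
CREATE TABLE IF NOT EXISTS accesslog (ip TEXT, uri TEXT, ts REAL, sifted INTEGER DEFAULT 0);
CREATE TABLE IF NOT EXISTS rss_hits (ip TEXT, ts REAL);
"""

def sift_and_blacklist(db):
    """The every-two-minutes job: sift RSS hits out of the raw access log
    into a tracking table, then rewrite the blacklist of clients that are
    over the hourly limit."""
    now = time.time()
    # Stage 1: copy unsifted RSS hits into the tracking table.
    db.execute("""INSERT INTO rss_hits (ip, ts)
                  SELECT ip, ts FROM accesslog
                  WHERE sifted = 0 AND uri LIKE '%.rss'""")
    db.execute("UPDATE accesslog SET sifted = 1 WHERE uri LIKE '%.rss'")
    # Stage 2: comb the tracking table for clients over the hourly limit.
    offenders = db.execute("""SELECT ip FROM rss_hits WHERE ts > ?
                              GROUP BY ip HAVING COUNT(*) > ?""",
                           (now - WINDOW_SECONDS, HOURLY_LIMIT)).fetchall()
    # Stage 3: rewrite the blacklist file the access handler checks per request.
    with open(BLACKLIST_FILE, "w") as f:
        for (ip,) in offenders:
            f.write(ip + "\n")
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    sift_and_blacklist(db)
```

Doing the expensive aggregation in a batch every two minutes, rather than on every request, is what keeps the per-hit cost down to a quick blacklist lookup.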
The point, though, is that these computations are a drop in the bucket compared to all the other information Slashdot is collecting about each hit:
Slashdot’s resource requirements are actually a lot higher than this, since we log every hit instead of just RSS, we log the query string, user-agent, and so on — and also because we’ve voluntarily taken on the privacy burden of MD5’ing incoming IP addresses so we don’t know where users are coming from. That makes our IP address field 28 bytes longer than it has to be. But even so, we don’t have performance issues. Slashdot’s secondary table processing takes about 10-15 seconds every 2 minutes.
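The "28 bytes longer" arithmetic checks out: a packed IPv4 address is 4 bytes, a hex-encoded MD5 digest is 32, so the hashed field costs 28 extra bytes per row. A quick Python sketch (whether Slashdot salts the hash isn't stated, and this doesn't):

```python
import hashlib
import socket

def anonymize_ip(ip):
    """Store the MD5 hex digest of the IP instead of the IP itself."""
    return hashlib.md5(ip.encode("ascii")).hexdigest()

packed = socket.inet_aton("192.0.2.1")  # 4 bytes, the compact representation
digest = anonymize_ip("192.0.2.1")      # 32 hex characters
assert len(digest) - len(packed) == 28  # Jamie's 28 extra bytes per row
```

Worth noting that an unsalted MD5 over the 32-bit IPv4 space can be brute-forced, so this is a politeness measure more than strong anonymity.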
So if you’re Slashdot, and you’re already spinning your database hard for each HTTP fetch, RSS blocking is no big deal. Jamie agrees that it still has issues with IP address uniqueness (and points to this subthread of the recent RSS thread on /. full of disgruntled users), but it definitely appears to be a workable stopgap solution for websites with big iron and fast databases.