Validator foiled!
A couple of years ago I worked on a TrackBack Validator, which identified and rejected TrackBacks posted to your blog from sites that didn’t actually link back to it.
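For the curious, the core check is straightforward. Here’s a minimal sketch in Python of what that kind of validation amounts to; it is not the actual Validator code (the function names and the urllib-based fetching are purely illustrative): fetch the page named in the TrackBack’s source URL and look for a link back to the post being pinged.

import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gathers every href found in the fetched page.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def validate_trackback(source_url, post_url):
    # Accept the TrackBack only if the claimed source page really links to post_url.
    with urllib.request.urlopen(source_url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    return any(post_url in href for href in collector.hrefs)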
In our 2006 tech report on the subject (co-authored by my advisor and a number of undergrads in his computer security class), we speculated that—given sufficiently widespread use of inbound-link validation—spammers would be forced to either (a) close up shop, moving on to some other exploitable technology, or (b) start actually linking to their victims. To wit:
cite="http://seclab.cs.rice.edu/proj/trackback/papers/taking-trackback-back.pdf">Spammers who wish to overcome our mechanism
are forced to indefinitely maintain reciprocal links from their own
web sites, effectively increasing their necessary investment of time
and resources. Furthermore, the spammer’s site, by linking to its
victims, will actually benefit the victims’ search engine rankings by
sharing part of the spammer’s ranking with each of its victims.
Best of all, if the spammer is effectively publishing a list of its
victims, that list would provide compelling evidence that could be used
against the spammer in legal proceedings.

In the limit, we are effectively pushing spammers to run
“legitimate” weblogs. If spammers’ weblogs are following the TrackBack
protocol correctly and are legitimately providing reciprocal links,
then we face a more fundamental question: is such a TrackBack message
actually spam? If a “real” blog is linking to the victim, regardless
of any spam-like content it might contain, then the
TrackBack the victim receives could well be defined as “legitimate.”
At that point, the issue is not one of spam vs. non-spam, but rather
one of relevance.
Well, we were right and not right. I just received some TrackBack spam (probably not coincidentally, on a blog post about trackback spam) that fooled the Validator and yet can’t really be considered legitimate.
The inbound link is included, but hidden from the user with CSS tricks! Here’s an excerpt of the source of the page:
<style type="text/css" media="screen"> .trackback { position:absolute; top:0px; left:0px; visibility:hidden; } </style> <div> <div class="trackback"> [...] <p> [...] far out site now comment this synopsis <a href='http://dsandler.org/wp/archives/2005/11/14/trackback-spammers-upping-the-ante'>http://dsandler.org/wp/archives/2005/11/14/trackback-spammers-upping-the-ante</a> and give comments [...] </p>
As you can see, all the inbound links are surrounded with irrelevant content, but what’s more, they’re children of the <div class="trackback"> and hence invisible to readers. In our paper we point to readers as one of two “last resorts” to help weed out irrelevant but otherwise Validated TrackBacks; obviously they won’t be able to help here. (The other technique, which would still work in this case, is the same sort of statistical classification currently used for email; see §5 of the TR for details.)
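For a sense of what that fallback looks like in practice, here’s a toy sketch in Python of the kind of content-based filter meant by “statistical classification”: a word-count naive Bayes scorer over TrackBack excerpts. Nothing below is from the paper; the tokenizer, the add-one smoothing, and the training snippets are all illustrative.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    # Train on labeled excerpts, then score new text as the log-odds of
    # "spam" vs. "ham" (positive means spam-like).
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.docs[label] += 1
        self.words[label].update(tokenize(text))

    def score(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) or 1
        totals = {c: sum(self.words[c].values()) for c in ("spam", "ham")}
        logodds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for w in tokenize(text):
            p_spam = (self.words["spam"][w] + 1) / (totals["spam"] + vocab)
            p_ham = (self.words["ham"][w] + 1) / (totals["ham"] + vocab)
            logodds += math.log(p_spam / p_ham)
        return logodds

nb = NaiveBayes()
nb.train("far out site now comment this synopsis and give comments", "spam")
nb.train("some actual thoughts in response to your post on trackback validation", "ham")
print(nb.score("far out synopsis, give comments"))  # positive: smells like spam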
In the end, this “break” of the Validator may not yield much for this spammer aside from the satisfaction of successfully defacing my blog. Google has been known to apply a PageRank penalty to websites with large regions of hidden text, so the currency gained by inbound links may very well be more than offset. What’s more, like most modern blogs and CMSes, dsandler.org applies rel="nofollow" to any links found in comments or TrackBacks, so the spammer gets zero Google-juice in this situation.
But since spam is so cheap, the spammer probably doesn’t care. That’s why the Validator was so important: it proved remarkably effective at reducing the “collateral damage” of spam, namely, blog defacement. In order to continue to be effective against this sort of attack, it would probably need to include some sort of CSS/DOM interpreter.
(Yuck.)
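That said, this particular trick could probably be caught short of a full interpreter by a cruder heuristic: flag any inbound link whose enclosing markup is hidden by an inline style or by a class that an in-page <style> block hides. Here’s a sketch of that idea in Python; it’s my own guess at how one might bolt this onto a validator, not anything the Validator actually does, and it only handles the simple same-page class-selector case.

import re
from html.parser import HTMLParser

HIDDEN = re.compile(r"visibility\s*:\s*hidden|display\s*:\s*none", re.I)

class HiddenLinkChecker(HTMLParser):
    # Tracks open elements plus any class selectors hidden by in-page
    # <style> blocks, and flags links to target_url buried in hidden markup.
    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.hidden_classes = set()
        self.stack = []          # (tag, classes, inline style) of open elements
        self.in_style = False
        self.visible_link = False
        self.hidden_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "style":
            self.in_style = True
        classes = set((attrs.get("class") or "").split())
        self.stack.append((tag, classes, attrs.get("style") or ""))
        if tag == "a" and self.target_url in (attrs.get("href") or ""):
            buried = any(HIDDEN.search(style) or (cls & self.hidden_classes)
                         for _, cls, style in self.stack)
            if buried:
                self.hidden_link = True
            else:
                self.visible_link = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        # Pop back to the matching open tag, tolerating sloppy nesting.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.in_style:
            # Remember class selectors whose rules hide their contents.
            for selector, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", data):
                if HIDDEN.search(body):
                    self.hidden_classes.add(selector)

# Hypothetical usage against a page like the excerpt above:
page_source = """<style>.trackback { visibility:hidden; }</style>
<div class="trackback"><a href="http://dsandler.org/wp/archives/2005/11/14/trackback-spammers-upping-the-ante">link</a></div>"""
checker = HiddenLinkChecker("dsandler.org/wp/archives/2005/11/14")
checker.feed(page_source)
print(checker.hidden_link, checker.visible_link)  # True False: the link exists but is invisible

Of course, a spammer can dodge this by moving the rule to an external stylesheet, using compound selectors, or simply positioning the block off-screen, which is exactly why the honest answer still points at a CSS/DOM interpreter.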
For more on all these icky edge cases in TrackBack (and other forms of Web) spam, read the report at http://seclab.cs.rice.edu/proj/trackback/papers/taking-trackback-back.pdf. (It’s just a six-pager.)