
How Twitter’s new "BotMaker" filter flushes spam out of timelines

Sifting spam from ham at scale and in real time is a hard problem to solve.

Spam be gone!

To work at Ars is to interact constantly with Twitter, both as a source for developing news and as a way to goof off with coworkers and other tech journalists (folks who follow the Ars staff on Twitter should be more than familiar with our long-winded, late-night, multi-tweet antics). But as with any electronic medium, spam on Twitter is a nagging problem: Twitter’s real-time messaging means crafty spammers can blast their messages out to large numbers of people before getting hammered by spam reports.

However, several months back, Twitter went on the offensive against spammers, rolling out a set of anti-spam features collectively referred to as "BotMaker." In a blog post today, Twitter explained that the various components of BotMaker have been operational for about six months, and in that time Twitter has recorded a significant drop in tweetspam—up to 40 percent by its internal metrics.

Twitter’s real-time nature poses trouble for a traditional monolithic spam-checking system that might add many seconds onto the delivery of a tweet to followers. Rather than maintaining such a monolithic system (something akin to SpamAssassin, a widely deployed e-mail anti-spam application), Twitter’s BotMaker lets Twitter engineers quickly establish simple sets of conditional rule-based actions (which they call "bots"—hence "BotMaker") and apply them to tweets both during and after the posting process.
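To make the idea concrete, here is a minimal sketch, in Python, of what a set of conditional rule-based "bots" might look like. Everything in it (the rule names, fields, and thresholds) is a hypothetical illustration, not Twitter’s actual BotMaker code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tweet:
    user_id: int
    text: str
    links: list[str]

@dataclass
class Bot:
    """One conditional rule: if condition(tweet) holds, take the action."""
    name: str
    condition: Callable[[Tweet], bool]
    action: str  # "deny", "challenge", or "allow"

# Hypothetical rules of the kind an engineer might deploy quickly.
RULES = [
    Bot("link-burst", lambda t: len(t.links) > 3, "challenge"),
    Bot("known-spam-phrase",
        lambda t: "free followers" in t.text.lower(), "deny"),
]

def evaluate(tweet: Tweet) -> str:
    """Apply each bot in order; the first matching rule decides."""
    for bot in RULES:
        if bot.condition(tweet):
            return bot.action
    return "allow"
```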

Diagram of the BotMaker components and how they fit into Twitter's processes. (Credit: Twitter)

It’s the selective application of rules, rather than a big all-in-one solution, that lets BotMaker function in a way that’s transparent to Twitter users.

"Real time" tweet checking uses a BotMaker component nicknamed "Scarecrow." Scarecrow is a low latency synchronous component of the Twitter posting process, meaning that a tweet can’t proceed down the posting path until Scarecrow finishes processing it. When write events come in from clients, Scarecrow parses the contents against its current set of rules and can either pass the tweet on, challenge the posting client with a CAPTCHA, or deny the tweet.

But Scarecrow, being synchronous, has only milliseconds to do its job before it starts to impact Twitter’s real-time nature. An asynchronous tool named "Sniper" fills in when Scarecrow can’t get the job done in time, applying more tests to tweets after they’ve been posted. Tweetspam that sneaks past Scarecrow for one reason or another gets a second chance at detection by Sniper, which uses more complex "machine learning models [which] cannot be evaluated in real time."
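A Sniper-style worker could then drain that queue and run the heavier checks off the write path. The scoring function and the retraction step here are stand-ins; Twitter’s actual models and cleanup machinery aren’t described in that level of detail:

```python
import threading

def heavy_spam_score(tweet: Tweet) -> float:
    """Stand-in for an ML model too slow for the synchronous write path."""
    return 0.9 if "free followers" in tweet.text.lower() else 0.1

def sniper_worker() -> None:
    while True:
        tweet = posted_tweets.get()        # this tweet is already live
        if heavy_spam_score(tweet) > 0.8:
            # Hypothetical cleanup: pull the tweet back down after the fact.
            print(f"retracting spam from user {tweet.user_id}")

threading.Thread(target=sniper_worker, daemon=True).start()
```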

Third, Twitter runs periodic offline jobs that look at data over a longer term, applying tests that require large amounts of processing and data. However, as the company points out in the blog post, doing all spam detection offline isn’t practical: the whole point of keeping spam off a service like Twitter is to be able to do it live.
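An offline job in this spirit might sweep a window of historical tweets for patterns no single-tweet rule can see, such as one account repeating the same message over and over. The duplicate-rate heuristic below is purely an assumed example:

```python
from collections import Counter, defaultdict

def offline_spam_scan(history: list) -> set:
    """Flag users whose recent tweets are mostly repeats of one another."""
    by_user = defaultdict(Counter)
    for tweet in history:
        by_user[tweet.user_id][tweet.text] += 1
    flagged = set()
    for user_id, texts in by_user.items():
        total = sum(texts.values())
        # Assumed heuristic: 20+ tweets, over half of them identical.
        if total >= 20 and max(texts.values()) / total > 0.5:
            flagged.add(user_id)
    return flagged
```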

The blog post dives further into describing the actual syntax of the rules used by BotMaker, emphasizing that the tool uses human-readable syntax and that rules can be deployed across the entire Twitter network in seconds, without any kind of recompilation. This type of fast response is often necessary to counter large-scale spam attacks, which can take advantage of Twitter’s API and shovel out huge numbers of automated tweets.
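Twitter doesn’t publish BotMaker’s actual rule language, but the hot-deployment idea can be sketched with any declarative format that’s parsed at runtime rather than compiled in. Here the rules live in a JSON string (a made-up stand-in for whatever BotMaker really uses) and get swapped into the live rule set from the first sketch without restarting anything:

```python
import json

# A made-up declarative format; BotMaker's real syntax is its own language.
RULES_SOURCE = """
[
  {"name": "known-spam-phrase", "contains": "free followers", "action": "deny"},
  {"name": "shortener-flood",   "contains": "bit.ly",          "action": "challenge"}
]
"""

def load_rules(source: str) -> list:
    """Turn declarative rule text into executable Bot objects at runtime."""
    rules = []
    for spec in json.loads(source):
        phrase = spec["contains"]
        rules.append(Bot(spec["name"],
                         lambda t, p=phrase: p in t.text.lower(),
                         spec["action"]))
    return rules

RULES[:] = load_rules(RULES_SOURCE)  # hot-swap the live rule set in place
```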

What about those bots?

The topic of bots on Twitter made some headlines last week, too, when an SEC filing by Twitter appeared to demonstrate that tens of millions of Twitter accounts are run by bots instead of humans. The filing specifically stated that "up to approximately 8.5% of all active users used third-party applications that may have automatically contacted our servers without any discernible additional user-initiated action."

The implication is that these "third-party applications" are spambots auto-posting things, but Twitter quickly clarified that's not the case: these applications are mostly programs that aggregate different social media feeds into a single notification area. Ars reached out to Twitter for some additional information, and a spokesperson told us that Twitter is perfectly fine with these kinds of aggregation tools and that BotMaker is designed to let them keep working while still stamping out both manual and automated spam accounts.

Actions a-go-go

Interestingly, the blog post closes by noting that BotMaker isn’t just a spam-fighting technology—it’s a "fundamental interposition layer" in the distributed Twitter network. Twitter’s network is essentially a massive event processing system, and the company expects to use BotMaker and its ability to poke fingers in real time into massive event-based systems for lots of other things.

Now that Twitter is a publicly traded company (with $250 million in revenue reported for the first quarter of 2014), the most obvious non-spam use of BotMaker would be for better targeting of advertisements. Being able to quickly and accurately deliver ad inventory to Twitter users based on real-time tweet processing would potentially allow Twitter to charge advertisers more money for its customers’ eyeballs.
