Can Big Data Be Poisoned by Botnets?
The Chameleon botnet is generating billions of ad impressions in a click-fraud scheme. It's a red flag for companies making business decisions based on analysis of Web data.
A small botnet working covertly from 120,000 American households is generating so much fake Web traffic that it's having a significant impact on the online display advertising industry's overall revenues for the first time, according to the U.K.-based Web analytics company that discovered it.
Web traffic analyses from Spider.io show that the Chameleon botnet uses its base of bots to blast page requests to more than 200 advertising-supported Web sites.
Each hit raises the total volume of traffic to the target sites, which make an average of 69 cents per thousand impressions served. The botnet generates more than 9 billion fake ad impressions per month, accounting for "at least" 65% of the traffic on targeted sites at a cost to advertisers of $6.2 million per month, according to Spider.io.
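The monthly cost follows directly from the two Spider.io figures. As a quick sanity check, here is a minimal Python sketch of that arithmetic; the function and constant names are illustrative, and it treats "more than 9 billion" as exactly 9 billion, so the result is a lower bound.

```python
# Figures quoted from Spider.io in this article.
IMPRESSIONS_PER_MONTH = 9_000_000_000  # "more than 9 billion", so a lower bound
CPM_USD = 0.69                         # average revenue per thousand impressions


def monthly_cost(impressions: int, cpm_usd: float) -> float:
    """Return the advertiser cost in dollars for a number of impressions."""
    return impressions / 1000 * cpm_usd


cost = monthly_cost(IMPRESSIONS_PER_MONTH, CPM_USD)
print(f"${cost / 1e6:.2f} million per month")
```

At 69 cents per thousand, 9 billion impressions come to about $6.21 million, which matches the $6.2 million figure Spider.io reports.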
That is vastly higher than the amount of click fraud on most sites, which averages about 0.16%, according to a March 2012 report from digital marketing consultancy comScore.
Why is all this important to people who run data centers? Because Chameleon is a red flag to any company that analyzes Web traffic to identify the interests and activities of its customers, and to make business decisions based on those analyses.
Fake traffic that slips into Web-traffic data and, potentially, into other streams of information such as the M2M feeds that make up the "Internet of things" could undermine the integrity and credibility of any results based on that data--especially for companies that pride themselves on the quality of their data, the sophistication of their analytics and the reliability of the projections they use to plan their future businesses.
Chameleon also represents a step up in sophistication of click-fraud schemes, both in its ability to camouflage itself and in the amount of money it generates.
For instance, a separate botnet, the Bamital network, included as many as 1.8 million PCs that delivered an average of 3 million fake clicks a day on specifically designated ads on search sites. But it was responsible for only about $1 million worth of fraudulent ad clicks, according to the U.K.'s Guardian newspaper.
By contrast, advertisers paid ghost sites about $6.2 million per month as part of the Chameleon scheme, and the Chameleon botnet is just one-fifteenth the size of Bamital. This is a major step up in the effectiveness and sophistication of fraud based on high volumes of fake traffic.
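The size comparison can be checked from the numbers already quoted. The short Python sketch below uses only figures from this article; the variable names are illustrative. It does not compute a per-bot figure for Bamital because the article does not say what period Bamital's $1 million covers.

```python
# Figures quoted in this article.
CHAMELEON_BOTS = 120_000
BAMITAL_BOTS = 1_800_000
CHAMELEON_MONTHLY_USD = 6_200_000

# How many times larger Bamital is than Chameleon.
size_ratio = BAMITAL_BOTS / CHAMELEON_BOTS

# Average monthly ad spend attributable to each Chameleon bot.
per_bot_usd = CHAMELEON_MONTHLY_USD / CHAMELEON_BOTS

print(f"Bamital is {size_ratio:.0f}x the size of Chameleon")
print(f"Chameleon draws about ${per_bot_usd:.2f} per bot per month")
```

By these figures, each infected household accounts for a little over $50 of advertiser spending every month.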
Chameleon also takes steps to make its fake traffic look real. Display ads are placed by algorithms that ad network owners build to find the ideal audience. Those owners also try to detect, where possible, anomalies in a website's traffic that could indicate click fraud by botnets or other means.
Chameleon bots don't just send page requests; they also generate click traces that make it look as if a user is actually clicking on links. Those fake mouse clicks produce a 0.02% click-through rate, and the bots paint mouse traces on 11% of all the fake ad impressions they generate.
Spider.io posted a list of 5,000 IP addresses that can be pasted into blacklists to block the worst of the Chameleon bots.
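For teams that want to keep blacklisted traffic out of their analytics as well as their firewalls, here is a minimal Python sketch of filtering hit records against such a list. The record layout (a dict with an "ip" key), the function names and the sample addresses are all assumptions for illustration; the addresses come from reserved documentation ranges and are not real Chameleon IPs, and the format of Spider.io's published list is not described here.

```python
import ipaddress


def load_blacklist(lines):
    """Parse IP address strings into a set, skipping blanks and # comments."""
    blocked = set()
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        blocked.add(ipaddress.ip_address(text))
    return blocked


def filter_hits(hits, blocked):
    """Split hit records into (kept, dropped) by blacklisted client IP."""
    kept, dropped = [], []
    for hit in hits:
        is_blocked = ipaddress.ip_address(hit["ip"]) in blocked
        (dropped if is_blocked else kept).append(hit)
    return kept, dropped


# Placeholder addresses from documentation ranges, not real bot IPs.
blacklist = load_blacklist(["203.0.113.7", "# comment", "", "198.51.100.23"])
hits = [
    {"ip": "203.0.113.7", "url": "/home"},
    {"ip": "192.0.2.10", "url": "/pricing"},
    {"ip": "198.51.100.23", "url": "/home"},
]
kept, dropped = filter_hits(hits, blacklist)
print(len(kept), "kept,", len(dropped), "dropped")
```

Keeping the dropped records rather than discarding them lets an analyst measure how much of the raw traffic the blacklist actually removed.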
The 202 sites that benefited from Chameleon visits are mostly so-called "ghost sites," whose URLs look like ordinary consumer sites but that contain minimal content and are owned by an ad network called AlphaBird, which may or may not have known about the volume of fraud, according to PaidContent, a media blog owned by GigaOM.
Botnets have plagued data center operators for years as a source of spam and DDoS attacks. This latest twist has implications for data analysis. Garbage in, garbage out--remember? The first thing programmers have to learn is how to keep the garbage out of their data and their results. Chameleon shows garbage is getting a lot sneakier about getting into data that looks perfectly good.