One of the challenges of big data is real-time processing, especially in dynamic data environments such as financial trading and social media, Biddick says. "Many queries are difficult to pre-compute and too intense to compute in real time on a single machine. Traditionally, you have to do an approximation to keep the cost of such a query down." He says that Storm, open-source software from BackType, which Twitter bought last summer, performs distributed real-time processing that lets Twitter users track trends and figure out how many unique people see a tweet.
"Storm’s architecture uses distributed remote procedure calls, so as you run a processing topology, it implements the RPC function and waits for RPC invocations," says Biddick. "An RPC invocation is a message containing the parameters of the RPC request and information telling Storm where to send the results. The topology picks up messages, does the necessary computations in parallel on several machines and returns the results to the request originator."
He says Storm’s distributed, fault-tolerant approach operates at a higher level of abstraction than message queues. Yahoo’s S4 and Amazon Web Services take similar approaches, Biddick adds. AWS is also developing a stream processing capability that it says will process more than 2 million records per second at launch and will eventually scale to handle more than 100 times that traffic. The company describes the platform as providing near-real-time, highly available and reliable data processing.
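The distributed RPC flow Biddick describes for Storm can be illustrated with a small, self-contained Python sketch. It is not Storm's API: the names `Invocation`, `run_topology`, `partial_reach` and `call_reach`, the sample follower data, and the use of threads in place of separate machines are all assumptions made for illustration. The example computes the query mentioned earlier, how many unique people see a tweet, by fanning the work out and sending the answer back to the address carried in the request.

```python
import queue
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical sample data: who follows each account, and who tweeted a URL.
FOLLOWERS = {
    "alice": ["bob", "carol", "dave"],
    "erin": ["carol", "frank"],
    "gina": ["bob", "frank", "heidi"],
}
TWEETERS = {"http://example.com/story": ["alice", "erin", "gina"]}


@dataclass
class Invocation:
    """An RPC invocation: the parameters plus where to send the result."""
    request_id: str
    args: str                # here, the URL whose reach is requested
    reply_to: queue.Queue    # stands in for the originator's return address


def partial_reach(tweeter):
    """One parallel unit of work: the followers of a single tweeter."""
    return set(FOLLOWERS.get(tweeter, []))


def run_topology(pending, workers=3):
    """Pick up pending invocations, compute in parallel, return results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            try:
                inv = pending.get_nowait()
            except queue.Empty:
                break
            # Fan out: one task per tweeter (threads stand in for machines).
            partials = pool.map(partial_reach, TWEETERS.get(inv.args, []))
            # Merge the partial sets so each person is counted once.
            unique_people = set().union(*partials)
            inv.reply_to.put((inv.request_id, len(unique_people)))


def call_reach(url):
    """Client side: send an invocation, then wait for the matching reply."""
    pending = queue.Queue()
    reply_to = queue.Queue()
    request_id = str(uuid.uuid4())
    pending.put(Invocation(request_id, url, reply_to))
    run_topology(pending)
    returned_id, reach = reply_to.get(timeout=1)
    assert returned_id == request_id
    return reach


if __name__ == "__main__":
    print(call_reach("http://example.com/story"))
```

In this sketch the union of follower sets gives an exact count. At real scale the per-tweeter lookups would run on different machines, which is the part a single machine cannot do in real time.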
Another issue companies need to think about is the ability to access big data, and to access it quickly. "Before thinking about big data architectures, make sure your data policies are clear and accepted throughout the organization," advises Biddick. "They must define the types of data that will be stored, for how long, how quickly you need to access it, and how it will be accessed. These policies will form the basis of storage governance and help define your technology requirements."
Without this foundation, he says, companies will just be throwing storage dollars at problems and end up with a depleted budget, underutilized technology and an inability to plan for future growth. "Big data management," says Biddick, "is challenging enough without worrying about whether you’re managing the right data set."
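The four policy elements Biddick lists (type of data, retention period, required access speed and access method) could be captured in a simple record before any technology is chosen. The following Python sketch is one hypothetical way to do that; the `DataPolicy` class, the `suggest_tier` helper, the tier names and the latency thresholds are illustrative assumptions, not recommendations from Biddick.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPolicy:
    """One policy entry covering the four elements named in the article."""
    data_type: str             # what kind of data is stored
    retention_days: int        # for how long it is kept
    max_access_seconds: float  # how quickly it must be accessible
    access_method: str         # how it will be accessed


def suggest_tier(policy):
    """Map the required access speed to a storage tier.

    The thresholds are arbitrary placeholders for illustration only.
    """
    if policy.max_access_seconds <= 1:
        return "fast"
    if policy.max_access_seconds <= 3600:
        return "standard"
    return "archive"


policies = [
    DataPolicy("trade ticks", 30, 0.5, "streaming query"),
    DataPolicy("web logs", 365, 600, "batch job"),
    DataPolicy("closed accounts", 2555, 86400, "manual restore"),
]

for p in policies:
    print(p.data_type, "->", suggest_tier(p))
```

Writing the policies down in this form makes the technology requirement a consequence of the policy rather than the other way around.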