"Storm’s architecture uses distributed remote procedure calls, so as you run a processing topology, it implements the RPC function and waits for RPC invocations," says Biddick. "An RPC invocation is a message containing the parameters of the RPC request and information telling Storm where to send the results. The topology picks up messages, does the necessary computations in parallel on several machines and returns the results to the request originator."
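The flow Biddick describes, a request message carrying the RPC parameters plus a return address, fanned out across parallel workers and gathered back to the originator, can be sketched in Python. This is an illustrative simulation of the pattern, not Storm's actual Java API; the `RpcRequest` and `word_count` names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class RpcRequest:
    # Mirrors the invocation message Biddick describes: the RPC's
    # parameters plus information on where to send the result.
    args: list
    return_address: str

results = {}  # stands in for delivery back to the request originator

def word_count(chunk: str) -> int:
    # One worker's share of the computation.
    return len(chunk.split())

def handle(request: RpcRequest) -> None:
    # The "topology": split the work, compute the pieces in parallel,
    # then combine and return the result to the originator's address.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = pool.map(word_count, request.args)
    results[request.return_address] = sum(partials)

handle(RpcRequest(args=["storm does", "distributed rpc", "in parallel"],
                  return_address="client-1"))
print(results["client-1"])  # 6
```

In real Storm the fan-out and gather happen across machines via the topology's spouts and bolts; the thread pool here only stands in for that parallelism.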
He says Storm’s distributed, fault-tolerant approach operates at a higher level of abstraction than message queues; Yahoo’s S4 and Amazon Web Services take similar approaches, Biddick adds. AWS is developing a stream processing capability that it says will process more than 2 million records per second at launch and will eventually scale to handle more than 100 times that traffic. The company describes the platform as providing near-real-time, highly available, reliable data processing.
Another issue companies need to think about is the ability to access big data, and to access it quickly. "Before thinking about big data architectures, make sure your data policies are clear and accepted throughout the organization," advises Biddick. "They must define the types of data that will be stored, for how long, how quickly you need to access it, and how it will be accessed. These policies will form the basis of storage governance and help define your technology requirements."
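The four dimensions Biddick lists (data type, retention period, required access speed, access method) can be captured as a simple policy record that then drives technology choices. This is a hypothetical sketch; the field names and the latency-to-tier mapping are illustrative, not drawn from any particular governance tool.

```python
# Illustrative shape for a storage-governance policy record.
retention_policies = [
    {
        "data_type": "clickstream events",
        "retention_days": 90,
        "access_latency": "seconds",   # how quickly it must be readable
        "access_method": "stream query",
    },
    {
        "data_type": "financial records",
        "retention_days": 2555,        # roughly seven years
        "access_latency": "hours",
        "access_method": "batch export",
    },
]

def tier_for(policy: dict) -> str:
    # A policy like this can define technology requirements directly,
    # e.g. mapping the required access latency to a storage tier.
    return "hot" if policy["access_latency"] == "seconds" else "cold"

print([tier_for(p) for p in retention_policies])  # ['hot', 'cold']
```

Writing policies down in a machine-readable form like this is one way to keep them "clear and accepted throughout the organization" rather than scattered across documents.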
Without this foundation, he says, companies will just be throwing storage dollars at problems and end up with a depleted budget, underutilized technology and an inability to plan for future growth. "Big data management," says Biddick, "is challenging enough without worrying about whether you’re managing the right data set."
Learn more about Research: The Big Data Management Challenge by subscribing to Network Computing Pro Reports (free, registration required).