Build The Infrastructure For Big Data's 3 Stages

  • Much of the narrative around big data and analytics focuses on the "magic" that happens when sufficiently sophisticated software heuristics are applied to sufficiently large and useful datasets. Some writing on the subject almost gives the impression that you just have to wave a wand over your data and say "Hadoop!" to get insights that will transform your business. 

    But software doesn't happen without hardware. And not all datasets or analytic techniques are created equal. So for companies to successfully apply big data analytics to diverse business challenges across the enterprise, they need infrastructure that can:

    1. Perform capacity-intensive operations at sufficient speed to meet end-user needs, and
    2. Support a diversity of use cases with significantly varying workload profiles.

    Oh, and one more thing: IT has to build the infrastructure for an analytics pipeline without breaking the bank.

    An enterprise-level strategy for big data infrastructure is especially important because analytic initiatives are often very fragmented. Marketing embarks on a project here; compliance embarks on another one there. If each of these groups builds its own analytic environment, there's bound to be a lot of redundant spending -- and a lot of redundant performance/capacity problem solving.

    That's why it typically makes sense to construct a shared-service pipeline that leverages a common set of infrastructure resources across the business while still providing the flexibility to address analytic workloads with highly variable characteristics.

  • Pipeline stage 1: Data intake and rationalization

    The first important component of any big data pipeline -- and one that is often neglected because of all the attention given to analytic processing -- is the pre-analytic processing of diverse data sets. Sure, in some analytic models, massive volumes of heterogeneous data can simply be dumped into a large "data lake" with little or no pre-analytic processing. But in many cases, it is extremely important to appropriately validate, de-duplicate, and otherwise rationalize incoming data. In fact, the accuracy and usefulness of analytic results on the back end of a big data pipeline is often largely contingent upon how well incoming data has been cleaned and screened on the front end.

    The amount of I/O and CPU capacity this work requires is non-trivial. In fact, in many cases, data intake and rationalization can be more infrastructure-intensive than the subsequent analytic processing itself. And, while intake and rationalization workloads can vary significantly by use-case attributes such as the number of data sources and total volume of data, infrastructure strategists can typically count on certain commonalities. For example, a company with a significant percentage of its operational data on a mainframe will probably have to execute a lot of EBCDIC-to-ASCII conversions as part of intake/rationalization.
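
    To make this concrete, here's a minimal Python sketch of an intake/rationalization step. It assumes records arrive as pipe-delimited, EBCDIC-encoded byte strings (Python's cp037 codec covers EBCDIC-US); the field names and validation rules are purely illustrative.

    ```python
    # Minimal intake/rationalization sketch (hypothetical field names and rules).
    # Assumes records arrive as EBCDIC-encoded byte strings, e.g. from a mainframe extract.
    import hashlib

    REQUIRED_FIELDS = {"customer_id", "txn_date", "amount"}  # assumed schema

    def rationalize(raw_records):
        """Convert EBCDIC to ASCII, validate required fields, and de-duplicate."""
        seen = set()
        for raw in raw_records:
            text = raw.decode("cp037")  # EBCDIC (CP037) -> str
            fields = dict(
                pair.split("=", 1) for pair in text.strip().split("|") if "=" in pair
            )
            if not REQUIRED_FIELDS.issubset(fields):  # validation: drop incomplete records
                continue
            digest = hashlib.sha256(text.encode("ascii", "ignore")).hexdigest()
            if digest in seen:                        # de-duplication by content hash
                continue
            seen.add(digest)
            yield fields

    # Example: a single pipe-delimited record, EBCDIC-encoded for illustration
    sample = "customer_id=42|txn_date=2016-01-15|amount=19.99".encode("cp037")
    print(list(rationalize([sample])))
    ```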

    Big data success on an enterprise-wide basis thus requires a data intake and rationalization engine that is sufficiently big and fast to handle the analytic use cases that have the largest and most diverse datasets.

  • Pipeline stage 2: Analytic operations

    Once data has been properly conditioned, it is ready for analytic processing. Here is where workload attributes vary most significantly. Some highly experimental approaches take narrower datasets and put them through extremely exploratory analytics designed to surface any patterns and anomalies of note. It is then left to data analysts and business stakeholders to determine how those patterns and anomalies may relate to the business.

    Others follow a well-defined analytic method that has already been established through experimentation to deliver a specific set of pre-defined analytic results to business users. Examples of these include multi-dimensional purchasing analyses based on pricing, demographics, and geographies -- or executive dashboards that provide color-coded risk ratings across various specific operational categories.

    These variations in analytic method translate to different requirements for CPU, I/O, memory, and other infrastructure parameters. Those parameters may vary somewhat over time as, say, historical datasets become progressively larger. But the requirements for each of those parameters relative to each other tend to maintain a certain profile based on analytic method.

    In other words, extremely I/O-intensive methods tend to remain distinguishable from more CPU-intensive methods even as data volume fluctuates incrementally.
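
    One way to think about a workload profile is as the relative mix of resources a method consumes. The Python sketch below uses made-up numbers and a hypothetical classify_profile helper; the point is that the ratios, not the absolute values, distinguish an I/O-bound method from a CPU-bound one.

    ```python
    # Hypothetical workload profiles: the absolute numbers are illustrative,
    # but the relative mix is what distinguishes one analytic method from another.
    from dataclasses import dataclass

    @dataclass
    class WorkloadProfile:
        name: str
        cpu_core_hours: float
        io_gb_read: float
        memory_gb: float

    def classify_profile(w: WorkloadProfile) -> str:
        """Classify by the resource that dominates the normalized mix."""
        # Normalize each dimension against a rough per-dimension baseline (assumed).
        cpu = w.cpu_core_hours / 100.0
        io = w.io_gb_read / 1000.0
        mem = w.memory_gb / 256.0
        dominant = max(
            ("CPU-intensive", cpu), ("I/O-intensive", io), ("memory-intensive", mem),
            key=lambda pair: pair[1],
        )
        return dominant[0]

    exploratory = WorkloadProfile("exploratory pattern discovery", 40.0, 8000.0, 128.0)
    dashboard = WorkloadProfile("multi-dimensional purchasing analysis", 350.0, 500.0, 64.0)

    for w in (exploratory, dashboard):
        print(f"{w.name}: {classify_profile(w)}")
    ```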

    There are also obvious differences between how infrastructure capacity is consumed by real-time analytics that require continuous processing and how it is consumed by batch jobs that run daily, weekly, quarterly, and so on.

  • Pipeline stage 3: Visualization

    The third and last stage in the big data analytic pipeline is visualization. This stage also tends to get overlooked, because many people assume that analytic results are inherently consumable by end users.

    But they're not. Raw analytic results, in fact, can be completely indecipherable to human consumers of information -- and often require fairly intensive visualization techniques to have any business utility at all. This is especially true for trends and anomalies with multi-dimensional characteristics that can only be understood through techniques such as 3-D rendering, bubble charts, radial trees, and the like.
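
    As a small illustration of what the visualization stage entails, the matplotlib sketch below draws the kind of bubble chart mentioned above. The purchasing data is entirely made up, and the axis choices are assumptions.

    ```python
    # A minimal bubble-chart sketch with matplotlib; all data points are made up.
    # x = average price, y = purchase frequency, bubble size = segment revenue.
    import matplotlib.pyplot as plt

    avg_price = [12, 35, 60, 90, 140]
    purchase_freq = [48, 30, 22, 15, 6]
    segment_revenue = [580, 1050, 1320, 1350, 840]   # drives bubble area
    regions = ["NE", "SE", "MW", "SW", "W"]

    fig, ax = plt.subplots()
    ax.scatter(avg_price, purchase_freq,
               s=[r * 2 for r in segment_revenue],   # scale area for visibility
               alpha=0.5)
    for x, y, label in zip(avg_price, purchase_freq, regions):
        ax.annotate(label, (x, y), ha="center")

    ax.set_xlabel("Average price ($)")
    ax.set_ylabel("Purchases per customer per year")
    ax.set_title("Purchasing behavior by region (illustrative data)")
    plt.show()
    ```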

    While visualization is not nearly as infrastructure-intensive as intake and analytic processing, it does require non-trivial capacity -- especially if use cases entail giving end users capabilities such as model rotation, zoom-in, and drill-down. Also, utilization levels for visualization tend to be somewhat independent of underlying use case attributes.

    For example, a relatively simple analytics use case run as a bi-weekly batch might require sophisticated visualization that many users in multiple locations access constantly throughout their workday. Conversely, a processing-intensive use case may generate a relatively simple dashboard that is primarily used by a single high-level decision-maker.

  • What should you build?

    Given these significant variables, it may seem daunting to try to construct a single infrastructure pipeline capable of handling all big data analytics use cases across the enterprise with performance that fulfills end-user expectations. And it probably is.

    Between the extreme of building a single monolithic pipeline for every use case and the extreme of having every use case consume infrastructure independently, however, are a number of reasonable approaches. These approaches typically involve some limited number of pipelines that are aligned with certain categories of workload.

    For example, some IT organizations have found it practical to build two primary pipelines. One is reserved for experimental workloads that require lots of data sources and highly exploratory analytic operations. The other is dedicated to more clearly defined analytic operations.

    This approach has two advantages. The first is that each pipeline can be better tailored to its respective workload profile. The second is that it allows dataops teams to focus on different pipeline goals. With experimental workloads, those goals tend to revolve more around appropriately filtering data inputs and improving accuracy of results. With more established workloads, end users typically look for snappier performance and richer visualization. Segmenting analytics along the experimental/established dimension thus allows for more efficient use of infrastructure and staff time.

    There are other ways to categorize analytic pipelines as well. Some IT organizations find it practical to segregate real-time from batch workloads, because they can schedule batch jobs in ways that minimize overall infrastructure requirements. Others find it necessary to segregate pipelines handling data subject to certain compliance constraints from those that aren't.
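
    To make the segmentation concrete, the sketch below routes a use case to a pipeline based on a few attributes. The pipeline names and attribute fields are hypothetical, not a prescription.

    ```python
    # Hypothetical routing of analytic use cases to shared pipelines.
    # Pipeline names and attribute fields are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class UseCase:
        name: str
        experimental: bool           # exploratory vs. well-defined analytic method
        real_time: bool              # continuous processing vs. scheduled batch
        compliance_restricted: bool  # e.g., data subject to regulatory constraints

    def route(use_case: UseCase) -> str:
        if use_case.compliance_restricted:
            return "compliance-restricted pipeline"
        if use_case.real_time:
            return "real-time pipeline"
        if use_case.experimental:
            return "experimental pipeline"
        return "established batch pipeline"

    print(route(UseCase("fraud scoring", experimental=False, real_time=True,
                        compliance_restricted=True)))
    print(route(UseCase("quarterly purchasing analysis", experimental=False,
                        real_time=False, compliance_restricted=False)))
    ```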

    Whichever approach -- or combination of approaches -- makes sense for your organization, the key is to both improve service levels and drive down costs by achieving economies of scale across as many different analytic use cases as possible.

  • The role of the cloud

    Any contemporary discussion of IT infrastructure must, of course, address the cloud. This is especially true in the context of analytics, since the cloud allows for on-demand access to additional infrastructure capacity in order to accommodate spikes in workload requirements.

    Several cautions are in order, though, when considering the use of XaaS resources in conjunction with big data operations:

    • The cloud is not free. While cloud services enable IT organizations to avoid capital costs -- and often enable them to obtain "like-for-like for less" -- there can still be significant costs involved. So use of the cloud does not justify a failure to consolidate analytic workloads.
    • Spikes in utilization can be severe. Because cloud services typically entail a usage-based cost model, they leave organizations vulnerable to extreme, unanticipated cost overruns. Without proper governance, lines of business can run up bills that significantly undermine the ROI of pervasive big data analytics.
    • Spinning up is easier than spinning down. As most of us have discovered in both public and private cloud environments, it's very easy and convenient to spin up VMs and other resources when a capacity crunch hits -- but we tend to be less diligent about deactivating that capacity after the crunch passes. This is definitely a danger when it comes to analytic pipelines; a rough sketch of one way to flag lingering capacity follows this list.
    • Compliance chaos. As the use of big data analytics becomes more pervasive across the enterprise, the risk of having the wrong data end up in the wrong places increases. So IT has to be careful about ensuring that compliance-restricted data doesn't wind up on a non-compliant cloud -- and that all uses of that data can be fully accounted for to regulatory auditors.
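
    As a rough guard against the spin-up-but-never-spin-down problem flagged in the list above, the sketch below uses boto3 to list EC2 instances that have been running longer than a threshold and carry no "keep" tag. The tag convention and 14-day threshold are assumptions, and equivalent checks exist for other providers.

    ```python
    # Rough sketch: flag long-running EC2 instances with no "keep" tag.
    # The tag convention and 14-day threshold are assumptions, not a standard.
    from datetime import datetime, timedelta, timezone
    import boto3

    MAX_AGE = timedelta(days=14)

    def find_stragglers():
        ec2 = boto3.client("ec2")
        now = datetime.now(timezone.utc)
        stragglers = []
        for reservation in ec2.describe_instances()["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["State"]["Name"] != "running":
                    continue
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if "keep" in tags:
                    continue
                if now - instance["LaunchTime"] > MAX_AGE:
                    stragglers.append(instance["InstanceId"])
        return stragglers

    if __name__ == "__main__":
        for instance_id in find_stragglers():
            print(f"Review for shutdown: {instance_id}")
    ```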

  • A big data pipeline checklist

    Based on the preceding overview, infrastructure strategists should consider the following best practices for unifying analytics pipelines across the enterprise:

    1. Inventory all internal and third-party datasets likely to be included in analytics use cases by size, type, and the data attributes that impact intake and rationalization (e.g., quality, duplication, masking/encryption requirements). Assess current and planned use cases to understand both extreme and typical data intakes. (A simple inventory-record sketch follows this list.)
    2. Inventory current and planned analytic operations to assess workload profiles by scale, CPU and/or I/O intensiveness, etc., with an eye towards commonalities that can provide a basis for shared infrastructure.
    3. Survey end users regarding their performance expectations. Look for correlations between those expectations and use case categories in order to allocate infrastructure where it's most needed.
    4. Map use cases to business value to ensure close alignment of infrastructure investments with potential financial returns.
    5. Assign fully burdened costs to both internal and XaaS-based infrastructure, including all management, monitoring, security, and compliance reporting operations.
    6. Consider other ways to reduce general pipeline costs. For example, a better enterprise approach to mobile device management may reduce the volume of data rationalization operations that have to be performed on every individual dataset for every individual analytic use case.
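
    For checklist item 1, a simple per-dataset inventory record can capture the attributes that drive intake and rationalization load. The Python sketch below is illustrative; the field names and sample figures are assumptions.

    ```python
    # Illustrative dataset-inventory record for checklist item 1; fields are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetInventoryRecord:
        name: str
        owner: str
        source: str                   # e.g., "mainframe extract", "third-party feed"
        size_gb: float
        data_type: str                # e.g., "transactional", "clickstream", "sensor"
        estimated_duplication: float  # fraction of records expected to be duplicates
        requires_masking: bool
        requires_encryption: bool
        compliance_regimes: list = field(default_factory=list)  # e.g., ["PCI-DSS"]

    inventory = [
        DatasetInventoryRecord("card_transactions", "payments", "mainframe extract",
                               1200.0, "transactional", 0.03, True, True, ["PCI-DSS"]),
        DatasetInventoryRecord("web_clickstream", "marketing", "third-party feed",
                               4800.0, "clickstream", 0.10, False, False),
    ]

    # A quick view of total intake volume and how much of it is compliance-restricted.
    total_gb = sum(d.size_gb for d in inventory)
    restricted_gb = sum(d.size_gb for d in inventory if d.compliance_regimes)
    print(f"Total intake: {total_gb} GB, compliance-restricted: {restricted_gb} GB")
    ```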

    With the right data and the right analytic tools, businesses can better anticipate customer desires, more aggressively optimize supply-chain spending, and more easily detect potential fraud. But no business has an infinite budget for these analytic operations. Every IT architect should therefore look for ways to conserve both the capital and operational costs associated with the diverse analytic pipelines required across today's enterprise.
