In an earlier blog post, I discussed some architectural options in deploying a big data environment, including cloud vs. on-premises, and dedicated vs. shared infrastructure. In this post, I'll examine topics that may be even more divisive: open vs. proprietary software and commodity vs. purpose-built hardware. These choices seem to reflect personal philosophies as much as technological differences.
Don’t think so? Try this: Quickly explain your justifications for sending your children to public school or private school. Or consider your preferences between public transit and driving your own car. Do you pay for music, or do you stream or download it for free? We have some opinions on these choices, don’t we? The same emotions seem to come into play when making decisions about the model of hardware or software used for big data.
Open source software is cool. I love the idea of developers working away for the betterment of all mankind, not expecting anything in return beyond an occasional complimentary comment on a clever bit of code. The Apache Software Foundation does a brilliant job of guiding collaboration on major software projects. Sites like GitHub promote free sharing, and do this so well you forget they are actually commercial enterprises. Without popular open source data platforms like Hadoop, Cassandra, and MongoDB, it’s arguable that the big data market simply wouldn’t exist as it does today.
Many developers and architects are rightly passionate about open source software, and will naturally seek to stay “pure” to the approach. Sometimes this feeling is about the pragmatic democratization of free software available for anyone interested, but often it’s a philosophical bias.
Yet open source software still hasn’t driven proprietary independent software vendors out of business. If anything, it’s created more opportunities for them to thrive. Just as open source is great because it belongs to everyone, it’s a risk for those who need someone to “own” the responsibility of enterprise support and services.
Many companies aren’t looking for big data to remain a small pilot; instead, they want to democratize in another way by bringing big data applications to far more of their staff. The more people depend on these new platforms, the more important it is that the environment is robust, predictable, and reliable. Most businesses want to be able to call for help and have guarantees that the software will meet their operational demands.
Cloudera, Hortonworks, MapR, and Pivotal all support Hadoop, while MongoDB Inc. supports MongoDB, DataStax supports Cassandra, and Databricks supports Spark. These companies are winning over many large enterprises as a result. And they all enhance the core open source components with differentiated, proprietary tools to improve performance, manageability, and security. Most businesses are quite happy to pay for a vendor’s help and improvements.
Commodity vs. purpose-built
The decision between commodity and purpose-built hardware may be more nuanced. In fact, it’s worth explaining what I mean by “commodity”: a relatively standard piece of kit, for which one could easily buy an equivalent unit from another vendor or even build one from scratch. Please note that commodity does not necessarily mean low-cost or low-end equipment.
In the big data space, this is usually the basic server node with its embedded storage and networking ports. Commodity servers are most often used in the scale-out model, which can become massively parallel processing (MPP), but may begin with just a few nodes. Other parts of the technology stack, like the network switches, also may be commodity hardware. Hyperconvergence is starting to push these elements even closer together in a single unit. The fundamental advantages of commodity hardware are the easy scalability and interchangeability of nodes, and perhaps the price negotiation power that accompanies these characteristics.
On the other side of this coin are purpose-built systems, called appliances or engineered systems. These may have a higher starting price per unit, but also may offer better performance, unique functionality, and a more complete, pre-integrated stack of hardware and software. This last feature saves a lot of effort and can deliver additional assurance that the whole system will work well, without blame games between different vendors when something goes sideways.
Many hardware vendors sell both commodity and purpose-built systems for big data, including Dell, HP, Oracle, and IBM. Some vendors, such as EMC/Isilon, IBM, HDS, and Cray, also suggest that the server and storage elements should be separated for independent scalability and centralized multi-protocol access. Customers will often choose to pay a premium for well-designed, proven systems and a single provider for support, rather than having to perform the integration themselves.
For many customers, big data deployment choices aren’t necessarily “either/or” decisions. A complete big data environment will have many moving parts, and it’s possible to combine different options as desired for each segment of the technology stack. What’s most important is that customers take a conscious approach in evaluating the ideal design for their enterprise requirements. The justification should be better thought out than just saying, “We’ve always done it this way.”