Storing data in a hybrid cloud requires careful planning. Here's what to keep in mind.
For most IT professionals, the cloud is today’s greatest challenge and also its greatest opportunity. Moving mission-critical tasks into the cloud is still viewed as risky, both from a loss-of-control standpoint and -- let’s be honest -- from a job security perspective. This is leading companies to hybrid cloud solutions, where the public cloud complements a private cloud, which might be either in-house or hosted.
Learning how to build a hybrid cloud is an issue in its own right, bringing new skills challenges to the game, but the toughest problems are in cloud storage: Where to put it, how to move it around, how to protect it from the black hats, and how to ensure adequate performance.
Where data is stored and how to sync it across the clouds are closely related issues. We don’t live in an ideal world. WAN links are generally slow and telcos continue to resist the roll-out of fiber. This means we need to segregate and manage storage datasets much more carefully, while looking for alternatives to brute-force stacking of WAN links. This issue may be the greatest inhibitor of hybrid cloud deployments.
Alternatives worth exploring are data storage at a third-party datacenter with dedicated fiber links to major public clouds, such as telcos can provide, or the use of data compression on all communications between clouds. But the first step in addressing the issue is to adopt a data-centric posture that determines what data is static or quasi-static, what is dynamic but can be handled asynchronously and, finally, what must be kept in sync for all users. Hopefully, the last category is small.
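That data-centric triage can be expressed as a simple policy. The sketch below is illustrative only: the function name, the change-rate threshold, and the three buckets are assumptions chosen to mirror the categories above, not any standard classification scheme.

```python
from enum import Enum

class SyncClass(Enum):
    STATIC = "static"          # rarely changes; replicate once, refresh on a schedule
    ASYNC = "asynchronous"     # changes, but eventual consistency is acceptable
    SYNCED = "synchronous"     # must be identical in all clouds at all times

def classify(change_rate_per_day: float, consistency_required: bool) -> SyncClass:
    """Toy policy: bucket a dataset by how it must move between clouds.
    The 0.01 threshold is a placeholder -- tune it to your own workloads."""
    if consistency_required:
        return SyncClass.SYNCED
    if change_rate_per_day < 0.01:   # effectively static or quasi-static
        return SyncClass.STATIC
    return SyncClass.ASYNC

# e.g. a product catalog updated weekly, with no strict consistency requirement:
print(classify(0.001, False).value)  # → "static"
```

The point of formalizing the policy, even crudely, is that the synchronous bucket becomes visible and measurable, which makes it easier to keep small.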
Cloud storage security is a massive issue, but it’s no longer just a cloud problem. There’s a real reluctance to secure data properly at the source. Disk-based encryption is inadequate on its own; the underlying issue is that servers aren’t yet equipped to encrypt at line speed. The industry will fix this in the next couple of years.
Performance is a bit of a mystery in clouds, which use shared I/O exclusively. Instances that use solid-state drives will help, but there is still a premium for these. Private clouds have a lot more flexibility in balancing performance between compute, networking and storage and are roughly on par with in-house virtual clusters.
Cost, of course, is the major consideration for cloud storage. Using data compression can save a lot of cost, especially for colder data.
Building storage for a hybrid cloud
Cloud storage is reliable, whether block or object, but only if you pay attention to geographical zoning. Don’t keep your eggs in one zone! When selecting instances and storage, public clouds offer many alternatives, and your choice has to balance cost and performance. One issue is that there is a transfer cost for accessing data, but not for the initial loading of data, so avoid retrieving large datasets unnecessarily.
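That pricing asymmetry is worth modeling before committing data to a public cloud. The rates below are placeholders, not any provider's actual pricing -- check your provider's current rate card.

```python
# Hypothetical per-GB rates -- substitute your provider's actual pricing.
EGRESS_USD_PER_GB = 0.09    # charged when data is retrieved from the cloud
INGRESS_USD_PER_GB = 0.0    # initial loading is typically free

def monthly_transfer_cost(upload_gb: float, download_gb: float) -> float:
    """Estimate transfer charges: ingress is free, egress is not."""
    return upload_gb * INGRESS_USD_PER_GB + download_gb * EGRESS_USD_PER_GB

# Loading 10 TB in costs nothing; pulling 2 TB back out does.
print(f"${monthly_transfer_cost(10_000, 2_000):.2f}")  # → $180.00
```

The asymmetry is deliberate on the providers' part: data is cheap to get in and expensive to get out, which is another reason to decide up front which datasets belong in which cloud.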
Private cloud storage also needs careful consideration prior to building. OpenStack has its own block (Cinder) and object (Swift) storage models, but Ceph is a strong alternative for object storage. The one you choose may depend on which integrator you use, since vendors have their preferences.
Pre-integrated vs. custom
We are seeing complete cloud storage solutions pre-packaged for both OpenStack and Azure, but these target an average use case, and a better solution might be to specify your own storage units. There are Ceph appliances entering the mainstream, with a lot of flexibility in configuration. These will be inexpensive, especially if white-box, but self-integration needs some systems and tuning expertise.
An alternative is to buy servers and follow the virtual SAN model. These are available as hyperconverged systems, but again, buying software and white-box hardware gives you much more flexibility while saving cost.
The key to a good hybrid cloud deployment is solid storage management. Not all data is equal: We have hot and cold data. Archives, backups and other cold data are easy to manage. Store them compressed in an object store in at least two clouds -- public or private -- with the archives in the cheapest storage tier the cloud provider offers.
Hot data needs active storage and not all storage is equal. Picking SSD instances improves performance, but at a price. Determining if the data has to be always synced across the clouds is crucial, since it implies a primary copy in the in-house cloud that must always be updated on write. For performance purposes, secondary copies in the public clouds work well, but they too must be synced atomically.
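The primary-copy write path described above can be sketched in a few lines. This is a minimal illustration using in-memory dicts as stand-ins for the in-house and public cloud stores; a production system would need a transaction or two-phase commit to make the update genuinely atomic.

```python
# In-memory stand-ins for the stores; names are illustrative only.
primary = {}                    # authoritative copy in the in-house cloud
secondaries = [{}, {}]          # read replicas in the public clouds

def synced_write(key: str, value: bytes) -> None:
    """Update the primary first, then every secondary, before acknowledging.
    A real implementation would roll back (or retry) on partial failure."""
    primary[key] = value
    for replica in secondaries:
        replica[key] = value

synced_write("orders/1001", b"...")
assert all(r["orders/1001"] == primary["orders/1001"] for r in secondaries)
```

Even this toy version shows why the synchronous category should be kept small: every write now pays the WAN round-trip to each public cloud replica before it can complete.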
While the myth that 24 Mbps was the fastest WAN link the IT industry needed has been utterly busted, WAN speed remains the biggest cloud storage issue. From initial load to sync and migration, slow links are a pain. There are bulk loading options, some requiring shipping drives or tapes, but day-to-day operations traffic still has to be kept to a minimum. There is no way, for example, to upload a complete ERP system on the fly for cloud-bursting.
To minimize startup time for new instances, data has to be classified, with the objective of maintaining as much replication as possible in a public cloud. Quasi-static data can reside in both clouds, while it's good practice to limit hot synced data as much as possible. Larger organizations should formulate policies for this.
Co-locating data at a telco is one compromise worth considering. It will minimize latency to either cloud so, combined with good data management, might speed up operations overall.
Public clouds are at least as secure as in-house operations. Both are vulnerable if data isn’t encrypted, but using cloud drive encryption isn’t adequate. Best practices require encrypting data before it leaves the server, protecting data in transit as well as at rest.
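To make "encrypt before it leaves the server" concrete, here is a deliberately toy sketch. The XOR one-time pad below is a stdlib-only stand-in for a real cipher -- in practice you would use authenticated encryption such as AES-GCM (for instance via the `cryptography` package). The function and variable names are illustrative.

```python
import secrets

def xor_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy one-time-pad stand-in for a real cipher like AES-GCM.
    The point: ciphertext is produced on the server, so the WAN link
    and the cloud store only ever see encrypted bytes."""
    return bytes(p ^ k for p, k in zip(plaintext, key))

record = b"customer: ACME, balance: 1200"
key = secrets.token_bytes(len(record))   # kept on-premises, never uploaded

ciphertext = xor_encrypt(record, key)    # this is what crosses the WAN
assert xor_encrypt(ciphertext, key) == record   # decrypt on the way back
```

Because the key never leaves the premises, the same ciphertext is safe both in transit and at rest in the public cloud -- which is exactly the property cloud-side drive encryption cannot give you.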
The downside of all this encryption is a performance hit. This will get better as next-gen processors with support for encryption arrive in the next two years. Workflows in both public and private clouds will need to include decode steps.
Data governance is another major problem. Data sprawl is easier in clouds and keeping track of all datasets is crucial, together with lifecycle policies.
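A lifecycle policy is the usual tool for keeping sprawl in check. The sketch below encodes the hot/cold/archive tiering discussed earlier as a simple age-based rule; the thresholds and tier names are illustrative assumptions, not a standard.

```python
from datetime import date, timedelta

# Hypothetical lifecycle policy: the ages are illustrative, not prescriptive.
POLICY = [
    (timedelta(days=30),  "hot"),    # under 30 days since last access: SSD tier
    (timedelta(days=365), "cold"),   # under a year: compressed object store
]
ARCHIVE_TIER = "archive"             # anything older: the cheapest tier offered

def tier_for(last_accessed: date, today: date) -> str:
    """Return the storage tier a dataset belongs in, by access age."""
    age = today - last_accessed
    for limit, tier in POLICY:
        if age < limit:
            return tier
    return ARCHIVE_TIER

print(tier_for(date(2024, 1, 1), date(2024, 1, 15)))  # → hot
```

Running a rule like this against a dataset inventory on a schedule is a lightweight way to turn governance policy into something auditable rather than aspirational.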
Running a cloud well requires some new skills. We’ve noted systems and tuning expertise already, and obviously any software stack implies training. Data management is a crucial skill to add to the IT portfolio. On the security side, encryption and other security skills are necessary, while setting up and auditing data governance systems is valuable.
Clouds have their own ways of setting up networks and associated security, which is becoming more automated as we move towards policy-driven approaches such as software-defined infrastructure. Software-defined storage will heavily impact cloud designs a year or two from now, so that’s an expertise to build.
Public clouds are built with the latest, lowest-cost COTS storage gear. This, combined with their huge buying power, makes cloud service providers super-aggressive on pricing. White-box systems hitting the market are redressing the balance by making in-house clouds cheaper, too. The trade-off in costs between public and private usage will need constant monitoring, though we can expect the downward trend for computing costs to continue.
SSDs are closer to replacing HDDs than most people think. This changes the cost/performance equation in systems in radical ways, meaning generally more work done on many fewer servers. Public clouds will likely drive this faster than the private sector, impacting the balance between the two for expenditures.