Deduplication has conquered the backup realm and is quickly becoming a requirement in primary storage. The technology has been a boon to beleaguered storage administrators who have watched storage volumes balloon almost out of control. But once every system has deduplication in some form and the technology is broadly available, what do we do next to optimize storage? After all, storage growth isn't going to stop; files will continue to get larger and the number of files that are created will increase.
Storage optimization can't stop at putting deduplication and compression everywhere. If it does, then once the excess storage capacity is consumed, we will see growth accelerate again. I don't think the answer is strict retention policies that delete files at the end of their lifecycle. As I argue in my "Keep It Forever" series on Information Week, the only way for companies to meet expanding compliance requirements is to essentially save everything forever. So what do we do next?
First, we are going to have to maximize the benefits of deduplication and compression. As the hardware that surrounds these systems gets more powerful, it can do more in-depth inspection of the data being optimized to find more redundancy. Deduplication is also likely to be integrated up and down the storage stack, leading to greater overall storage efficiency by leveraging the same dedupe metadata for primary storage as for backup.
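At its core, block-level deduplication finds redundancy by hashing chunks of data and storing each unique chunk only once, keeping a "recipe" of hashes to rebuild the original. A minimal sketch in Python (the 4 KB fixed chunk size and the in-memory store are illustrative assumptions, not any vendor's implementation):

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed block size

def dedupe_store(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, keep one copy of each
    unique chunk in `store`, and return the recipe (list of hashes)
    needed to reconstruct the original data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # duplicate chunks stored once
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its recipe of chunk hashes."""
    return b"".join(store[d] for d in recipe)

store = {}
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096  # highly redundant data
recipe = dedupe_store(data, store)
assert restore(recipe, store) == data
# 16384 logical bytes, but only 8192 stored (two unique chunks)
print(len(data), sum(len(c) for c in store.values()))
```

Sharing that same `store` (the dedupe metadata) between primary and backup tiers is exactly the kind of stack-wide integration described above.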
We will also need to advance the state of the software so it can move beyond the limitations that some deduplication engines suffer from. We are going to need cross-volume, and even cross-manufacturer, deduplication support, as well as the ability to deduplicate single, very large volumes. Many deduplication engines today are limited in how much deduplicated data they can handle, which leads to deduplicated silos that hold duplicate data between them.
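The silo effect can be shown with a toy model: two volumes deduplicated independently each keep their own copy of a chunk they share, while a single cross-volume index stores it once. (The volumes, chunk contents, and sizes below are purely illustrative.)

```python
import hashlib

def stored_bytes(volumes, shared_index=False):
    """Total bytes kept when each volume's chunks are deduplicated
    either per-volume (silos) or through one shared index."""
    if shared_index:
        index = {hashlib.sha256(c).hexdigest(): c
                 for vol in volumes for c in vol}
        return sum(len(c) for c in index.values())
    total = 0
    for vol in volumes:  # each silo builds its own private index
        index = {hashlib.sha256(c).hexdigest(): c for c in vol}
        total += sum(len(c) for c in index.values())
    return total

common = b"shared OS image block" * 100  # chunk present on both volumes
vol_a = [common, b"app data A" * 100]
vol_b = [common, b"app data B" * 100]

silo = stored_bytes([vol_a, vol_b])
cross = stored_bytes([vol_a, vol_b], shared_index=True)
print(silo, cross)  # the shared index stores `common` only once
```

The difference between the two totals is exactly one extra copy of the shared chunk; multiply that across petabytes of siloed backup appliances and the motivation for cross-volume dedupe is clear.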
There may also be lossy forms of data optimization that become more acceptable as a way to curb data growth. For example, a PowerPoint presentation that is going to be shown on a relatively low-resolution video projector doesn't need to be loaded with images that only a high-end photo-quality printer could reproduce.
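The idea behind that kind of lossy optimization can be sketched as simple image downsampling: averaging blocks of pixels discards detail the display device could never show anyway. A toy grayscale example (the 2x2 averaging factor and pixel values are arbitrary illustrations):

```python
def downsample(pixels, factor=2):
    """Lossy 'optimization': average factor x factor blocks of a
    grayscale image, cutting storage by factor**2 at the cost of
    detail a low-resolution display would not show anyway."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(0, h, factor):
        row = []
        for c in range(0, w, factor):
            block = [pixels[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) // len(block))  # block average
        out.append(row)
    return out

image = [[10, 12, 200, 202],
         [11, 13, 201, 203],
         [90, 92, 50, 52],
         [91, 93, 51, 53]]
small = downsample(image)
print(small)  # 2x2 result; a quarter of the original pixel count
```

Unlike deduplication, the original image cannot be reconstructed afterward, which is why this class of optimization needs explicit policy decisions about which data can tolerate the loss.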
After we have made deduplication as efficient as possible, we are either going to have to learn to live with storage growth, or someone is going to have to come up with a new technology to further optimize storage (let's hope a few storage entrepreneurs are wrestling with this issue as I write this). This new technology will need to leverage or complement deduplication and compression, because those technologies will be an embedded part of every storage system. If it doesn't appear, we are going to have to deal with data growth all over again.
Finding a way to store all of this data is not the problem. Storage systems have been, or are being, designed to support PBytes of storage, and those capabilities increase each year. We also have the ability to better link independent systems, so even if the capacity of one system is reached, we can add another and still manage the whole to some extent. The challenge is how we fit all that capacity in the data center, and how we power it.
One potential solution that I can see is incredibly dense, power-efficient MAID (massive array of idle disks) systems that integrate with deduplication technology for the ultimate archive. That way, we can store everything we need in the available data center floor space without having to power it until it is accessed. The other solution is to send all of this old data to a cloud storage provider and make it their problem, since such providers are already likely to have storage systems that can scale to hundreds of PBytes.