Open-source Ceph and Red Hat Gluster are mature technologies, but both are poised for a kind of rebirth. As the storage industry shifts toward scale-out storage and clouds, appliances based on these low-cost software technologies are entering the market, complementing the self-integrated solutions that have emerged in the last year or so.
There are fundamental differences in approach between Ceph and Gluster. At its base, Ceph is an object-store system, called RADOS, fronted by a set of gateway APIs that present the data in block, file, and object modes. The topology of a Ceph cluster is designed around replication and information distribution, both of which are intrinsic to the design and provide data integrity.
Red Hat describes Gluster as a scale-out NAS and object store. It uses a hashing algorithm to place data within the storage pool, much as Ceph does. This is the key to scaling in both cases. The hashing algorithm is distributed to all the servers, allowing each of them to work out where a particular data item should be kept. As a result, data can be replicated easily, and the absence of central metadata files means there is no access bottleneck, as can occur with Hadoop's central NameNode.
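The core idea, shared in spirit by Ceph's CRUSH algorithm and Gluster's elastic hashing, can be sketched in a few lines: placement is a pure function of the key, so any client or server computes the same answer without consulting a metadata service. This is a deliberately simplified illustration, not either system's real algorithm; the server names and replica count are made up for the example.

```python
import hashlib

# Hypothetical server pool; names and replica count are illustrative,
# not real Ceph or Gluster configuration.
SERVERS = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2

def placement(key: str, servers=SERVERS, replicas=REPLICAS):
    """Deterministically map a data item to `replicas` distinct servers.

    Every node runs the same pure function over the same server list,
    so no central metadata lookup is needed -- the property the article
    credits for the linear scaling of both systems.
    """
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]

# Any client computes the same answer for the same key:
print(placement("/vol/alice/report.doc"))
```

A real placement function must also cope with servers joining and leaving (which is where consistent hashing and CRUSH's weighted hierarchy come in), but the no-central-lookup property is the same.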
Ceph and Gluster have similar data distribution capabilities. Ceph stripes data across large node-sets, like most object storage software. This aims to prevent bottlenecks in storage accesses.
Because the default block size for Ceph is small (64KB), the data stream fragments into a lot of random IO operations. Hard drives can sustain only a limited number of random IOs per second (typically 150 or fewer for an HDD). Just as important, that figure changes little as the transfer size increases, so larger IOs move far more data in aggregate than small ones.
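The arithmetic behind this is worth making explicit. Assuming the roughly 150-IOPS ceiling mentioned above, and treating every transfer as one random IO (a simplification for illustration), per-drive throughput scales directly with block size:

```python
# Back-of-the-envelope throughput for a single HDD limited by random IOPS.
# 150 IOPS is the ceiling cited in the text; block sizes are illustrative.
IOPS = 150

def throughput_mb_s(block_size_kb: int, iops: int = IOPS) -> float:
    """Aggregate MB/s when every transfer is one random IO of the given size."""
    return block_size_kb * iops / 1024

for size_kb in (64, 128, 256, 1024):
    print(f"{size_kb:>5} KB blocks -> {throughput_mb_s(size_kb):6.1f} MB/s")
```

At 64KB blocks the drive tops out under 10 MB/s, while 1MB blocks saturate at 150 MB/s from the same 150 IOs per second, which is why transfer size dominates this kind of benchmark.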
Gluster uses a default block size of 128KB. That larger default is the primary reason Red Hat claims to outperform Ceph by three to one in benchmark tests, but the result is an artifact of configuration and setup rather than architecture. A little tuning would have brought the two much closer together: Ceph's chunk size can be raised from 64KB to 256KB or even 1MB, which would probably have given Ceph the performance edge.
The art of benchmarking is complex. Enough said. The choice of transfer size alone could account for Ceph running faster or slower than Gluster. The only honest way to measure performance is through an independent third party, with tuning input from both teams. That hasn't happened yet, and Red Hat's report is misleading as a result.
We must also look at scale-out performance. Both systems avoid centralized metadata, so both should scale nearly linearly. Data deduplication should not differ much in performance between them. Server-side compression makes equal sense for both, too: it reduces storage space used and network traffic, and lowers the amount of disk IO needed for each file.
Ceph's file journals can be written to SSD, which speeds up performance significantly. Caching and tiering are supported, allowing flexibility and economy in configurations.
Ceph has an advantage in recovering from failed disk drives. Because its data is distributed over larger node-sets than Gluster's, many more drives can feed replica data into the rebuild in parallel. This shortens rebuild time while not loading down any one drive. In large clusters, this is a significant issue.
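A simple model shows why wide distribution matters here. The disk capacity and per-disk rebuild bandwidth below are illustrative assumptions, not measured Ceph or Gluster figures; the point is how rebuild time falls as the number of drives holding replica data grows.

```python
# Toy model of rebuild time after a disk failure. All figures are
# illustrative assumptions, not measured Ceph or Gluster numbers.
FAILED_DISK_TB = 4.0
PER_DISK_REBUILD_MB_S = 50.0  # rebuild bandwidth each surviving disk can spare

def rebuild_hours(peer_disks: int) -> float:
    """Hours to re-replicate a failed disk when its data is spread across
    `peer_disks` drives that can all stream replica copies in parallel."""
    total_mb = FAILED_DISK_TB * 1024 * 1024
    return total_mb / (peer_disks * PER_DISK_REBUILD_MB_S) / 3600

print(f"narrow replication (1 peer):   {rebuild_hours(1):.1f} h")
print(f"wide distribution (100 peers): {rebuild_hours(100):.2f} h")
```

With a single peer holding the replica, the rebuild takes the better part of a day; spread across a hundred drives, it finishes in minutes, and each surviving drive barely notices the extra load.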
Installation and management are easy for both technologies, but planning a good long-term deployment can take some time. Storage managers will find Ceph, with Inktank's support, the more sophisticated approach, because it carries file system, block access, and remote replication as intrinsic functions rather than add-ons (as is the case with Gluster). This gives Ceph a major advantage, and may be why it is leading Gluster in installs: it eases migration for block IO and provides management of a single storage pool.
Still, both technologies provide strong options at reasonable prices. The base code is open-source and free, but Inktank and Red Hat offer support licenses and management tool packages. Compared with traditional storage, Ceph and Gluster provide good value, since the underlying hardware in both cases is inexpensive off-the-shelf gear, with commodity-priced drives.
With good feature sets and decent performance at an excellent price point, both Ceph and Gluster provide a viable alternative to expensive proprietary storage. They are sure to gain market share and may ruffle the feathers of industry stalwarts like EMC and NetApp.
Jim O'Reilly is a former IT executive and currently a consultant focused on storage and cloud computing.