• 01/27/2014
    8:06 AM

Gluster Vs. Ceph: Open Source Storage Goes Head-To-Head

Storage appliances using open-source Ceph and Gluster offer similar advantages with great cost benefits. Which is faster and easier to use?

Open-source Ceph and Red Hat Gluster are mature technologies, but will soon experience a kind of rebirth. With the storage industry starting to shift to scale-out storage and clouds, appliances based on these low-cost software technologies will be entering the market, complementing the self-integrated solutions that have emerged in the last year or so.

There are fundamental differences in approach between Ceph and Gluster. Ceph is at base an object-store system, called RADOS, with a set of gateway APIs that present the data in block, file, and object modes. The topology of a Ceph cluster is designed around replication and information distribution, which are intrinsic and provide data integrity.

Red Hat describes Gluster as a scale-out NAS and object store. It uses a hashing algorithm to place data within the storage pool, much as Ceph does. This is the key to scaling in both cases. The hashing algorithm is exported to all the servers, allowing them to figure out where a particular data item should be kept. As a result, data can be replicated easily, and the lack of central metadata files means that there is no bottleneck in accessing, as might occur with Hadoop.
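
To illustrate how hash-based placement removes the need for a central metadata lookup, here is a minimal sketch. It is not Gluster's elastic hashing or Ceph's CRUSH algorithm; it is a simplified modulo scheme, and the function name `place`, the server names, and the replica count are all hypothetical. The point is only that any server or client holding the same server list computes the same location from the item's name.

```python
import hashlib


def place(name: str, servers: list[str], replicas: int = 2) -> list[str]:
    """Return the servers that should hold `name`, derived from the name alone.

    Simplified illustration: hash the name, pick a starting server by modulo,
    then take the next `replicas` servers in order (wrapping around).
    Assumes replicas <= len(servers).
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


# Hypothetical four-node pool; every node can run this independently
# and arrive at the same answer without asking a metadata server.
servers = ["node-a", "node-b", "node-c", "node-d"]
print(place("photos/cat.jpg", servers))
```

A plain modulo scheme like this reshuffles most data when a server is added or removed, which is one reason production systems use more elaborate placement algorithms.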

Ceph and Gluster have similar data distribution capabilities. Ceph stripes data across large node-sets, like most object storage software. This aims to prevent bottlenecks in storage accesses. 

Because the default block size for Ceph is small (64KB), the data stream fragments into a lot of random IO operations. Disk drives can generally do a maximum number of random IOs per second (typically 150 or less for HDD). Just as important, that number doesn't change much as the size of the transfer increases, so a larger IO size will move more data in aggregate than a small block size.

Gluster uses a default value of 128KB. The larger default size is the primary reason that Red Hat claims to outperform Ceph by three to one in benchmarking tests. The results are an artifact of configuration and setup. The testers could have used a little bit of tuning to bring them close together. Ceph can change chunk size from 64KB to 256KB or even 1MB, and doing so would probably have given Ceph the performance edge.
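
The arithmetic behind the block-size argument can be sketched as follows. This is a rough model, assuming a drive that delivers a fixed 150 random IOs per second regardless of transfer size (the article's own figure, which in practice holds only approximately); the function name is hypothetical.

```python
def random_io_throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Aggregate throughput in MB/s for a drive limited by random IOPS.

    Assumes IOPS stays constant as the IO size grows, which is only
    approximately true for real hard drives.
    """
    return iops * io_size_kb / 1024.0


# 64KB is the Ceph default cited above, 128KB the Gluster default;
# 256KB and 1MB are the larger Ceph chunk sizes mentioned.
for size_kb in (64, 128, 256, 1024):
    mb_s = random_io_throughput_mb_s(150, size_kb)
    print(f"{size_kb:>5} KB per IO -> {mb_s:7.2f} MB/s")
```

Under this model, doubling the IO size doubles the data moved per drive, so the two defaults alone produce a two-to-one difference before any other factor is considered.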

The art of benchmarking is complex. Enough said. The decision on transfer sizes could itself account for Ceph running faster or slower than Gluster. The only honest way to measure performance is through an independent third party, with tuning input from both teams. This hasn't happened yet, and Red Hat's report is misleading.

We must look at scale-out performance. Both systems avoid single-node metadata, and so should scale nearly linearly. Data deduplication should not be very different in performance. Compression at the server makes equal sense, too, reducing both storage space used and network traffic and lowering the amount of disk IO needed for each file.

Ceph file journals can write to SSD, which speeds up performance significantly. Caching or tiering is supported, allowing flexibility and economy in configurations.

Ceph has an advantage in recovering from failed disk drives. Because data is distributed over larger node-sets than Gluster, many more drives are able to supply data from the replica copies in parallel. This reduces rebuild time, while not loading down any one drive. In large clusters, this is a significant issue.
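
A back-of-the-envelope model shows why the number of source drives matters for rebuild time. The figures here are illustrative assumptions, not measurements of either product: a 4TB failed drive, 50MB/s of spare bandwidth per surviving drive, and no bottleneck on the network or on the drives receiving the rebuilt data.

```python
def rebuild_hours(data_tb: float, source_drives: int, per_drive_mb_s: float) -> float:
    """Estimate hours to re-read `data_tb` of replica data in parallel.

    Assumes the read side is the only bottleneck and that every source
    drive contributes `per_drive_mb_s` to the rebuild.
    """
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = total_mb / (source_drives * per_drive_mb_s)
    return seconds / 3600


# Hypothetical comparison: a narrow replica set versus a wide node-set.
for drives in (2, 10, 40):
    print(f"{drives:>3} source drives -> {rebuild_hours(4, drives, 50):6.2f} hours")
```

In this simplified model, rebuild time falls in direct proportion to the number of drives that hold replica data for the failed drive.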

Installation and management are easy for both technologies, but planning for a good long-term deployment can take some time. Storage managers will find Ceph with Inktank to offer a more sophisticated approach because it carries file system, block access, and remote replication as intrinsic functions, rather than add-ons (as is the case with Gluster). This gives Ceph a major advantage, and may be why it is leading Gluster in installs. It eases migration problems for block IO and provides management of a single storage pool.

Still, both technologies provide strong options at reasonable prices. The base code is open-source and free, but Inktank and Red Hat offer support licenses and management tool packages. Compared with traditional storage, Ceph and Gluster provide good value, since the underlying hardware in both cases is inexpensive off-the-shelf gear, with commodity-priced drives.

With good feature sets and decent performance at an excellent price point, both Ceph and Gluster provide a viable alternative to expensive proprietary storage. They are sure to gain market share and may ruffle the feathers of industry stalwarts like EMC and NetApp.

Jim O'Reilly is a former IT executive and currently a consultant focused on storage and cloud computing.


Who's misleading?

Hi, I'm a GlusterFS developer, and I take issue with several of the points you make.

(1) You claim that Red Hat's performance report was misleading.  Perhaps.  Perhaps not.  At least it's *data*.  Part of the reason behind that study is that people were claiming Ceph must be faster with *absolutely no evidence* behind it.  No comparison that I've ever seen or made myself has ever shown that result.  It's a little disingenuous to claim that our report is misleading without (a) identifying specific issues or (b) noting that Inktank has failed to come up with their own report showing anything else.

(2) Your own claim about recovery times is itself misleading.  The limitation on how many other disks will be used to repair one that has failed only applies *within one replica set*, which is our lowest-level way of distributing data.  If you're striping M ways, repair will be from M drives.  If you're repairing N replica/stripe sets at once, those will all proceed in parallel.  Thus, repair at any given moment might be reading from dozens of drives.  Note also that the Ceph repair story only refers to the RADOS layer; there's a whole different kind of repair that has to happen at the metadata level as well, and that might not be as well distributed - hard to know, since even Inktank admits that the file system code isn't ready for prime time yet.  The proof is in the pudding, and I challenge you to find any realistic scenario where GlusterFS's recovery time is more of an issue than Ceph's.

(3) Installation and management is *not* equally easy for the two systems.  GlusterFS has long had a robust single-point-of-control CLI supporting all manner of configuration and maintenance operations.  More recently we've added non-disruptive upgrade.  Ceph has only very recently developed an infrastructure that does *some* of these things in a more elegant manner than editing and distributing text config files by hand.

(4) Citing greater integration for block/file/object in Ceph is also misleading.  The file system metadata is handled in a *separate layer* from RADOS.  The radosgw object-store component is also separate.  Other protocols such as NFS and SMB are even more separate than in GlusterFS, where all of these interfaces (plus block via qemu or iSCSI) use a common "gfapi" interface to the common parts.  We've thought long and hard about how to apply that good old software engineering principle known as modularity to balance integration with maintainability and extensibility.  Again, the proof is in the pudding and I challenge you to find any scenario where Ceph's supposedly greater integration is more of a benefit than a liability.

(5) You claim that Ceph is leading GlusterFS in installs.  Please cite your source for that.  Ceph might or might not lead GlusterFS in people kicking the tires out of idle curiosity, but with the Red Hat name and sales force behind it you can be *quite* sure that GlusterFS leads in actual production-level usage.

GlusterFS and Ceph are a lot more similar than they are different, and more allies than enemies.  We both have a goal to displace both proprietary storage and half-assed point solutions like HDFS being misrepresented as general purpose.  It's good for people to explore the differences between the two, but I'd prefer for that to be done on the basis of actual data rather than empty theorizing based on a cursory look or an Inktank white paper.

Re: Who's misleading?

@anon9497820322, I appreciate your detailed reply. Some comments:

1)  "data" isn't good enough. The benchmark as run has biases that favor Gluster. As stated in the article, just the default blocking is enough to put Gluster in the lead on this setup. IO optimization in the drives adds to the lead. Put simply, Ceph creates a set of randomized 64KB blocks, while Gluster makes contiguous files, at least in this setup.

On the issue of claims that Ceph is faster than Gluster, I believe they stem from the number of drives that operate in parallel in a scaled-out Ceph system versus the narrower set of drives in Gluster. Before you jump down my throat, I'll add I'm not convinced either on this! I think the numbers for random IO will be close, constrained by drive performance limits.

2) I think you said "If we make Gluster spread data like Ceph it will be as fast". Probably correct, but Ceph does it even in default mode, so they take the point.

3) Ceph may be later than Gluster bringing easy management to the picture, but the online explanations are thorough and easy to follow. Again, they get the point.

4) I guess Ceph read the same modularity books. However, the solution is presented as a whole, well-structured entity and again that's worth a point.

5) Ceph's lead is a consensus of appliance makers I've talked to. You may be right about production installs having Red Hat a bit ahead, or again you may not. I haven't seen any believable numbers on that yet.

Finally, as I said in the article, both Gluster and Ceph are strong offerings and will change the face of storage, so there's something we definitely agree on!

Re: Who's misleading?

Seems like you're trying to have it both ways on performance, Jim.  On the one hand, you claim that Ceph must be faster because it uses more spindles, but then you say it's unfair when a test makes it . . . use more spindles.  Wait, what?  Let's talk a bit about striping and performance testing to see how absurd that is.

Striping is a tradeoff.  On the one hand, it can improve single-file performance because of greater parallelism.  On the other hand, it can make performance worse because each request to the drives will be smaller, plus overhead from breaking up and recombining requests, managing more active connections, etc.  We've supported striping just about forever, because of those few cases where it improves performance, but it turns out that those cases are few indeed and that's why we also haven't turned it on by default.  That might actually hurt us in a few single-stream tests, but single-stream is the wrong way to measure the performance of a distributed filesystem anyway.  Is there something bad or unfair about having defaults that reflect knowledge learned from running more realistic tests?  Whose fault is it that Ceph suffers from its out-of-the-box configuration making a tradeoff that's bad in a realistic test?  Not the testers'.  Ceph's. 

By the way, it should also be clear that the "narrower set of drives" claim is bogus.  Between the fact that a real test or a real deployment will involve concurrent I/O across many replica/stripe sets, and that we *can* do striping as well, GlusterFS is just as capable of using every single drive in a cluster simultaneously as Ceph is.  It's irresponsible to make assumptions of which "should" perform better without considering the workload, or directly addressing (ideally measuring) the effects of greater parallelism vs. smaller requests etc.  I stand by my assertion that even flawed data is better than no data.  There is *no evidence* that Ceph is or ever will be faster.  There's only speculation, which is not only unsupported by empirical data but doesn't even stand up to a theoretical analysis.

Awarding Ceph two points on performance based on *nothing at all* is egregious enough, but giving them a point for management is even worse.  First, good examples and tutorials on a website are no substitute for a true and authoritatively documented single-point-of-control CLI.  Second, neither of those things can make the software magically capable of online upgrade.  You either have it or you don't.  Third, the examples and tutorials *aren't actually that good*.  They contradict each other all over the place, even (last time I had to set up Ceph for testing myself) on something as basic as whether to use mkcephfs or ceph-deploy.  Yes, use the new hotness, says one document.  That's not quite ready, says another.  Basic options, such as the one to use an existing filesystem instead of ruining a well tuned system by building new ones, are barely documented at all and only in the most obscure places.  If I hadn't known about them from having set up Ceph multiple times over the years I wouldn't even have found anything suggesting I should look for them.  I'm certainly not going to say the GlusterFS management layer is perfect, but it set the bar that Ceph has been trying to reach and they're just not there yet.

You can keep awarding points to Ceph all you like, Jim.   It doesn't mean anything.  As I've said many times, Ceph is a fine project.  I have the utmost respect for everyone involved.  Nonetheless, anyone who looks at the facts can see that the list of areas where GlusterFS has managed to pull ahead is much longer than the list of areas where the converse is true.  Go ahead and ask your Ceph friends for some real performance numbers.  Ask them about their roadmap for filesystem-independent snapshots, or geo-replication, or encryption, or for that matter ask whether the filesystem metadata part is ready for production yet.  If you look at what they can each do in the real world today, not what they can do theoretically or in the future, I don't think the result would look at all like the picture you've painted.

Re: Who's misleading?

@anon9497820322, perhaps you missed the point that I'm not convinced on Ceph being faster because of more spindles? I agree with you on this one! I'm glad you agree that spindle count and structure will define performance, and that this should be a level playing field.

On the other issues you mention, I'd be happy to have the Ceph camp open up on their position a bit more. This could be an interesting discussion.


Re: Who's misleading?

I agree with Jim that the design and performance tradeoffs would make for an interesting conversation.  Unfortunately it's not a conversation that I (as a Ceph developer) am interested in having on a comment thread--particularly one with this sort of tone.

I've tried to avoid commenting on the Red Hat benchmarks because I don't think they advance the conversation in a meaningful way.  Ceph and Gluster are both complex systems whose performance cannot be reduced to a single test.  They do things differently for a reason, and those reasons have many implications--not just on performance but also on things like repair behavior, consistency, code complexity, and user experience.

Personally, I'd love to sit down in person with a Gluster engineer (maybe someone like Jeff Darcy) to get a better understanding of what Gluster is doing and why so that I can speak credibly about how Ceph's decisions compare to Gluster's.

[Sage Weil, Ceph developer]

Re: Who's misleading?

The idea of general purpose storage is highly misleading.  It makes customers think they can have a Storage Unicorn and it leads to poor performance and architecture:


See here:


I wish we would be 

Re: Who's misleading?

This is what normally happens when the tester falls in love with the product while evaluating it. :-)