Notes on a Nine Year Study of File System and Storage Benchmarking
Posted by Avishay Traeger, IBM Haifa Research Lab, Haifa, Israel, Erez Zadok, Stony Brook University, Stony Brook, NY, USA on July 16, 2009
Driven by a general sense that benchmarking practices in the areas of file and storage systems are lacking, we conducted an extensive survey of the benchmarks that were published in relevant conference papers in recent years. We decided to evaluate the evaluators, if you will. Our May 2008 ACM Transactions on Storage article, entitled "A Nine Year Study of File System and Storage Benchmarking", surveyed 415 file system and storage benchmarks from 106 papers that were published in four highly regarded conferences (SOSP, OSDI, USENIX, and FAST) between 1999 and 2007.
Our suspicions were confirmed. We found that most popular benchmarks are flawed, and many research papers used poor benchmarking practices and did not provide a clear indication of the system's true performance. We evaluated benchmarks qualitatively as well as quantitatively: we conducted a set of experiments to show how some widely used benchmarks can conceal or overemphasize overheads. Finally, we provided a set of guidelines that we hope will improve future performance evaluations. An updated version of the guidelines is available.
Benchmarks are most often used to provide an idea of how fast some piece of software or hardware runs. The results can significantly add to, or detract from, the value of a product (be it monetary or otherwise). For example, they may be used by potential consumers in purchasing decisions, or by researchers to help determine a system's worth.
Systems benchmarking is a difficult task, and many of the lessons learned from this article are general enough that they can be applied to other system fields. However, file and storage systems have special properties. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is rather difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable. Lastly, the large variety of workloads that these systems experience in the real world also adds to this difficulty.
When a performance evaluation of a system is presented, the results and implications must be clear to the reader. The evaluation must accurately depict behavior under realistic workloads and in worst-case scenarios, and it must explain the reasoning behind the benchmarking methodology. In addition, the reader should be able to verify the benchmark results and compare the performance of one system with that of another. To accomplish these goals, much thought must go into choosing suitable benchmarks and configurations, and accurate results must be properly conveyed.
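As a rough illustration of what conveying results properly might look like in practice, the sketch below repeats a workload several times and reports the mean together with the standard deviation and an approximate 95% confidence interval, rather than a single number. Everything here is an assumption made for illustration and is not taken from the article: the helper names `time_workload` and `summarize`, the toy `small_file_write` workload, the choice of 10 runs, and the normal approximation used for the interval (with few runs, a Student's t interval would be wider).

```python
import math
import os
import statistics
import tempfile
import time


def time_workload(workload, runs=10):
    """Run `workload` several times; return the elapsed seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples, confidence=0.95):
    """Return (mean, sample standard deviation, confidence-interval half-width).

    Assumption: a normal approximation is used for the interval. It is
    optimistic for small sample counts.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev / math.sqrt(len(samples))
    return mean, stdev, half_width


def small_file_write(size=1 << 20):
    """Hypothetical workload: write and fsync one 1 MiB temporary file."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    samples = time_workload(small_file_write, runs=10)
    mean, stdev, half_width = summarize(samples)
    print(
        f"{mean * 1e3:.2f} ms +/- {half_width * 1e3:.2f} ms "
        f"(95% CI, n={len(samples)}, stdev {stdev * 1e3:.2f} ms)"
    )
```

A toy workload like this one says little about a real system. The point is only the reporting style: the number of runs and the spread are stated alongside the mean so that a reader can judge how stable the result is.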
Comment by Mike Young on July 16, 2009 4:53 PM
I have certainly been frustrated by the benchmarking trends in our industry. And I'm probably also guilty of doing some of the things you've mentioned. One of the problems with benchmarking, and it goes back quite a few years, is the use of proprietary tools and/or 3rd parties. SpecFS testing for NFS performance is one example. Even if you could get your hands on the software, you couldn't publish numbers without a valid license. Then there's Netbench testing. Not everyone can afford to pay for the testing or to put in a true >100 client lab. So, how does one validate or invalidate these bona fide claims?
I've been a huge fan of leveraging open source tools for benchmarking, much to my criticism. The reason is simple. I can provide the tool, the scripts, the methodology, etc. so that anyone can test my particular claim. Further, I tend to use tools that can be utilized across heterogeneous environments and even on others' products. That way we can get a good idea of the relative difference for comparative studies.
Lately, the kind of benchmarking that I like seeing is the kind that measures concurrency. It's one thing to have really great performance for one thread, but show me how you do with many threads. Of course, methodology and rationale come into play here when you're seriously trying to improve a product. If you create a RAID5 on just 4 disks, then create a bunch of iSCSI volumes on top of that RAID5, don't expect it to perform wonderfully when hooked up to several high-output databases. There are too few disks and the abstractions create too much head thrashing. But with a slight re-layout of the volumes and RAID, performance can really pick up. Sometimes, it's not fair when we pick on a product's performance when we stick it in a situation it wasn't optimized for. I too have a habit of trying to design my products around the 90% scenario. And I trust people try to use them for what they're intended. When that isn't the case, I hope they at least allow me a chance to reconfigure and optimize it for the new scenario.
Lastly, I'm not sure how well standards work in this environment. Storage is quite a large industry. I can certainly test performance of my block devices and get incredible results. It's not all that relevant if you're writing to it through my file system. And what about the application that is really the tool that does the writing? Then let's not forget that you're probably doing this through an encrypted tunnel. And for further security, we may be fencing things off into completely isolated processes. Now let's measure the performance when we're doing all of these things. How many concurrent operations can you now sustain? And on how many physical resources, and at what cost?
As we continue to abstract the hardware and still call it "storage", it's going to get more difficult to drive standards. Standard tools are one thing. Standard test suites are another.
I hope this makes sense. This is just based on my personal experience with the subject and with requests for standardized results.
Mike Young
CEO, Cachengo
http://cachengo.com
Comment by eitan bachmat on July 17, 2009 4:44 PM
There is a simple solution to the complex problem of benchmarking in academia. The solution is to avoid benchmarks altogether if one can come up with any reasonably good reasons why the ideas which are presented in the paper should work. Only if the analysis is too complicated or the ideas are based on a wild hunch should benchmarks be performed. In any case, the section containing benchmark results is usually the least interesting part of the paper, the one with the most silly mistakes and the most useless.
The whole idea of putting benchmark results in academic papers comes from the false pretense of system researchers, who think they are scientists. As a system researcher, I consider myself at best to be a bricoleur, in the sense of Claude Levi-Strauss.
We are scientists in the same way that economists are scientists.
A famous economist once said that economic prediction theory makes astrology look good (and we got proof recently). In the same way, storage system benchmarking makes witchcraft look good.
I am saying this both as an academic researcher and as a developer of actual commercially available storage systems.
eitan bachmat, Ben-Gurion university, Israel.
Comment by Avishay Traeger on July 19, 2009 6:10 AM
Mike,
Thanks for your input.
First, I certainly agree with your approach of using open-source tools and making everything available for use in comparative studies. I think if everyone did at least this we would be in better shape.
Second, I also agree that multi-process workloads are more interesting, because that is what we see in the "real world".
As for your last point about standard benchmarks, I am more split. I think that _good_ industry standard benchmarks are necessary, because companies need one number that summarizes performance of a storage system (this is what customers often look for). Unfortunately, I'm sure we're all familiar with cases where all of the effort goes into performing well on the benchmark, while real-world performance suffers. Hopefully this can be avoided if you can make the scope of the test broad enough and generic enough. Or maybe I'm just naive :-). As for academia and research, I agree that standard suites won't work, and that researchers simply need to put more effort into their evaluations. But you are right that with all of the file system and application levels on top of the storage you won't be able to get an accurate performance estimate. I think the next goal that we can strive for as a community is to be able to compare two systems and say which will be better for a given situation, and by "a little" or "a lot".
Thanks,
Avishay