Chris Evans assesses the storage-related fallout from the CPU vulnerabilities.
IT news over the last few weeks has been dominated by stories of vulnerabilities found in Intel x86 chips and almost all modern processors. The two exposures, Spectre and Meltdown, are a result of the speculative execution that most modern CPUs use to anticipate the flow of execution of code and keep their internal instruction pipelines as full as possible. It’s been reported that the fixes for Spectre/Meltdown can have an impact on I/O performance, and that means storage products could be affected. So, what are the impacts and what should data center operators and storage pros do?
Speculative execution is a performance-improvement technique used by modern processors in which instructions are executed before the processor knows whether they will be needed. Imagine some code that branches as the result of a logic comparison. Without speculative execution, the processor needs to wait for that comparison to complete before it can continue reading ahead, resulting in a drop in performance. Speculative execution lets the processor carry on down the path it predicts will be taken (or, in some designs, more than one path); work on paths that turn out not to be needed is simply discarded and the processor is kept active.
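The performance argument above can be put into rough numbers. The following is a toy model only, not a description of any real processor: it assumes every branch costs a fixed number of stall cycles when the processor has to wait, and that with speculation only wrongly guessed branches pay that cost. The function names, the 15-cycle figure and the 95% accuracy figure are all hypothetical.

```python
def stall_cycles_without_speculation(branches, resolve_cycles):
    """Every branch waits for its comparison to resolve."""
    return branches * resolve_cycles


def stall_cycles_with_speculation(branches, resolve_cycles, guess_accuracy):
    """Only wrongly guessed branches pay the wait; discarded work is treated as free."""
    if not 0.0 <= guess_accuracy <= 1.0:
        raise ValueError("guess_accuracy must be between 0 and 1")
    return branches * (1.0 - guess_accuracy) * resolve_cycles


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic.
    branches = 1_000_000
    waiting = stall_cycles_without_speculation(branches, resolve_cycles=15)
    guessing = stall_cycles_with_speculation(branches, resolve_cycles=15, guess_accuracy=0.95)
    print(f"stall cycles, no speculation:   {waiting:,.0f}")
    print(f"stall cycles, with speculation: {guessing:,.0f}")
```

The gap between the two figures is the reason processors speculate at all, and it is also why removing or fencing off speculation costs performance.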
Both Spectre and Meltdown pose the risk of unauthorized access to data through this speculative execution process. A more detailed breakdown of the problem is available in two papers covering the vulnerabilities (here and here). Vendors have released O/S and BIOS workarounds for the exposures. Meltdown fixes have noticeably impacted performance on systems with high I/O activity, due to the extra code needed to isolate user and system memory on each transition between user and kernel mode (such as a system call). Reports range from 5% to 50% additional CPU overhead, depending on the specific platform and workload.
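On Linux systems whose kernel includes the fixes, the kernel reports its own view of each exposure under /sys/devices/system/cpu/vulnerabilities. The sketch below simply reads whatever files are present there; it assumes a Linux host, and older kernels or other operating systems will not have the directory at all. The function name is a hypothetical choice for this example.

```python
from pathlib import Path

# Present only on Linux kernels that expose mitigation status (an assumption
# about the host; the directory is absent on older kernels and other systems).
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: status_text}, or {} if the directory is absent."""
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    report = read_mitigation_status()
    if not report:
        print("This kernel does not expose vulnerability status")
    for name, text in report.items():
        print(f"{name}: {text}")
```

This only tells you what the kernel believes about the host it runs on. It says nothing about a closed appliance you cannot log in to, which is where the vendor's own statement has to be relied on.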
How could this impact storage appliances and software? Over the last few years, almost all storage appliances and arrays have migrated to the Intel x86 architecture. Many are now built on Linux or Unix kernels, and that means they are directly impacted by the processor vulnerabilities; patching them results in increased system load and higher latency.
Software-defined storage products are also potentially impacted, as they run on generic operating systems like Linux and Windows. The same applies to virtual storage appliances run in VMs, to hyperconverged infrastructure, and of course to public cloud storage instances and high-intensity I/O cloud applications. Quantifying the impact is difficult, as it depends on the number of system calls the storage software has to make. Some products may be more affected than others.
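Why the system call count matters so much can be seen with some back-of-envelope arithmetic: the extra CPU load is roughly the system call rate multiplied by the extra time each call now takes. The sketch below assumes a fixed added cost per call, which is a simplification, and every number in it is hypothetical rather than measured.

```python
def added_cpu_fraction(syscalls_per_sec, extra_seconds_per_syscall, cores=1):
    """Fraction of total CPU capacity consumed by the extra per-syscall cost."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return syscalls_per_sec * extra_seconds_per_syscall / cores


if __name__ == "__main__":
    # Hypothetical inputs: same added cost per call, very different call rates.
    extra = 0.5e-6  # assumed extra half a microsecond per system call
    light = added_cpu_fraction(50_000, extra, cores=8)
    heavy = added_cpu_fraction(2_000_000, extra, cores=8)
    print(f"light syscall load: {light:.2%} extra CPU")
    print(f"heavy syscall load: {heavy:.2%} extra CPU")
```

With the same per-call penalty, the second product pays forty times the overhead of the first purely because it makes forty times as many calls, which is consistent with the wide spread in reported figures.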
Storage vendors have had mixed responses to the CPU vulnerabilities. For appliances or arrays deemed to be “closed systems” that cannot run user code, the vendors’ stance is that these systems are unaffected and won’t be patched.
Where an appliance can run external code, there will be a need to patch; Pure Storage’s FlashArray, for example, can execute user code via a feature called Purity Run. Similarly, end users running SDS solutions on generic operating systems will need to patch. HCI and hypervisor vendors have already started to make announcements about patching, although the results have been varied. VMware, for instance, released a set of patches only to recommend not installing them due to customer issues. Intel's advisory earlier this week warning of problems with its patches has added to the confusion.
Some vendors, such as Dell EMC, haven’t made public statements about the impact of the vulnerabilities for all of their products. For example, Dell legacy storage product information is openly available, while information about Dell EMC products is only available behind support firewalls. If you’re a user of those platforms, you will presumably have access. For wider market context, however, it would have been helpful to see a consolidated response in order to assess the risk.
So far, the patches released don’t seem to be very stable. Some have been withdrawn; others have crashed machines or made them unbootable. Vendors are in a difficult position, because the details of the vulnerabilities weren’t widely circulated in the community before they were made public. Some storage vendors only found out about the issue when the news broke in the press. This means some of the patches may have been rushed to market without full testing of their impact.
To patch or not?
What should end users do? First, it’s worth evaluating the risk and impact of either applying or not applying patches. Computers that are regularly exposed to the internet, like desktops and public cloud instances (including virtual storage appliances running in a cloud instance), are likely to be most at risk, whereas storage appliances behind a corporate firewall on a dedicated storage management network are at lowest risk. Weigh this risk against the impact of applying the patches and what could go wrong. Applying patches to a storage platform supporting hundreds or thousands of users, for example, is a process that needs thinking through.
Start by talking to your storage vendors. Ask them why they believe their platforms are exposed or not. Ask what testing of patching has been performed, from both a stability and performance perspective. If you have a lab environment, do some before/after testing with standard workloads. If you don’t have a lab, ask your vendor for support.
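As a rough illustration of before/after testing, the sketch below times a batch of small synchronous writes and summarizes the latencies, so that one run on an unpatched system can be compared with one on a patched system. It is a minimal stand-in for a proper benchmark tool and a real workload; the block size, write count and function names are arbitrary choices made for this example.

```python
import os
import statistics
import tempfile
import time


def measure_write_latencies(path, block_size=4096, count=200):
    """Time `count` synchronous writes of `block_size` bytes; return seconds per write."""
    payload = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush, so every iteration issues system calls
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return the median and an approximate 99th percentile of the samples."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {"median": statistics.median(ordered), "p99": ordered[p99_index]}


def percent_change(before, after):
    """Percentage change per metric; positive means latency went up."""
    return {key: (after[key] - before[key]) / before[key] * 100 for key in before}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        samples = measure_write_latencies(os.path.join(workdir, "probe.bin"))
    result = summarize(samples)
    print(f"median {result['median'] * 1e6:.0f} us, p99 {result['p99'] * 1e6:.0f} us")
```

Save the summary from the unpatched run, repeat the test after patching on the same hardware and volume, and pass both summaries to percent_change. Run each test several times, because storage latency varies from run to run even when nothing has changed.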
As there are no known exploits in the wild for Spectre/Meltdown, a wise approach is probably to wait a little before applying patches. Let the version 1 fixes be tested in the wild by other folks first. Invariably issues are found that then get corrected by another point release. Waiting a little also gives time for vendors to develop more efficient patches, rather than ones that simply act as a workaround. In any event, your approach will depend on your particular set of circumstances.