• 06/05/2014
    7:00 AM
    Jim O'Reilly
  • Commentary

Compromising Virtualization

Features such as instance storage are taking the virtual out of virtualization. The performance gains are tremendous, but the techniques open cloud security risks.

Virtualization in its pure form has totally stateless servers, with user data stored on networked drives. This is a great computer science concept, with a good deal of appeal, but for a while now it has been running up against the realities of shared storage farms, forcing implementation compromises that considerably reduce the level of virtualization and add some risk to cloud usage.

It was recognized early on that storage performance was going to be a problem, since most server connections even a couple of years back were just 1 Gigabit Ethernet, which just about matched the throughput of a single disk drive. With as many as 64 virtual instances per server, this clearly was too slow, and it led to situations such as virtual desktop boot storms that created huge levels of dissatisfaction among end-users.
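As a rough back-of-envelope check on those figures, the sketch below divides a 1 Gigabit Ethernet link evenly among 64 instances. The even split, decimal units, and absence of protocol overhead are simplifying assumptions, not measurements.

```python
# Back-of-envelope: share of a 1 GbE link available to each virtual instance.
# Assumptions: decimal units, no protocol overhead, perfectly even sharing.

LINK_GBPS = 1    # 1 Gigabit Ethernet, from the text
INSTANCES = 64   # upper bound on instances per server, from the text


def link_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second."""
    return gbps * 1000 / 8


def per_instance_mb_per_s(gbps: float, instances: int) -> float:
    """Bandwidth per instance if the link is shared evenly."""
    return link_mb_per_s(gbps) / instances


if __name__ == "__main__":
    print(f"whole link:   {link_mb_per_s(LINK_GBPS):.0f} MB/s")
    print(f"per instance: {per_instance_mb_per_s(LINK_GBPS, INSTANCES):.2f} MB/s")
```

Under these assumptions the whole link carries about 125 MB/s, which leaves each of 64 instances with roughly 2 MB/s.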

To speed up performance, hypervisor developers looked to a variety of deduplication options, all aimed at reducing the number of times common code is loaded into a server. The hypervisor itself is often loaded from a USB thumb drive, and a logical extension is to keep a single common version of the operating system used in each server. While this makes the servers stateful, it can be argued that this is only true at the platform level, not at the user level, so the system is still truly virtual.

Many application environments, such as a LAMP stack, also are very standard, and a logical extension of the thought process is to add these to the image cache in each server, again without compromising user state. If the application had low levels of user I/O to storage, this essentially resolved the problems.

It's worth noting that a complementary approach uses all-flash arrays for the user storage, with 10GbE allowing apps with high I/O levels to perform much better.

However, with cloud-based services for big data analytics looking to be a high-growth segment, instances that could handle the storage performance issues were needed. Large memory sizes for in-memory databases were no problem, but adequate I/O to feed that memory was. Most in-memory database systems use PCIe SSD to load data and require a low latency for write operations.

The cloud service providers' answer to this problem is instance storage. Since most in-memory instances use a whole server, and the rest are limited to two or four per server, why not add an "instance store" to the virtual instance? In other words, this is an SSD that gives the same performance as a direct-attached drive in an in-house server.

This is a very elegant compromise, and performance-wise, offers the same levels as a dedicated SSD. However, there's a real risk to security that has to be addressed. An early security "hole" in virtualization was the fact that DRAM held the state, and data, of a user when it was torn down. An astute hacker could have opened a new instance on any machine and seen the previous user's data as a result; memory needs to be purged between users. That problem has been fixed for a while, but the instance storage concept re-opens the problem, with the added issue that purging a terabyte of instance storage is non-trivial.
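To give a feel for why purging a terabyte is non-trivial, here is a minimal estimate of the time for a full overwrite. The 500 MB/s sustained write rate is a hypothetical figure chosen for illustration; it is not taken from the article or from any specific drive.

```python
# Rough estimate of how long a full overwrite of an instance store takes.
# ASSUMED_WRITE_RATE is a hypothetical sustained rate, not a measured value.

TB = 10**12                        # one terabyte, decimal
ASSUMED_WRITE_RATE = 500 * 10**6   # hypothetical: 500 MB/s sustained writes


def purge_seconds(capacity_bytes: int, write_bytes_per_s: float, passes: int = 1) -> float:
    """Seconds needed to overwrite the full capacity `passes` times."""
    return passes * capacity_bytes / write_bytes_per_s


if __name__ == "__main__":
    secs = purge_seconds(TB, ASSUMED_WRITE_RATE)
    print(f"one pass over 1 TB: {secs:.0f} s (~{secs / 60:.0f} min)")
```

At that assumed rate, a single pass keeps the drive busy for about half an hour between tenants, and a slower drive or multiple passes scale the delay proportionally.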

There are two primary concerns with instance stores. One is where the server fails -- the SSD records the state of the user's app at the point of failure. Unlike networked storage, the option to encrypt data at rest is problematic due to performance considerations. With a zero-repair approach to servers, data could remain for several years in the CSP's shop, but the likely outcome will be the drive getting crushed prior to the servers being sold on, or at least overwritten.

The other concern is more serious. If power fails or instance tear-down occurs, the SSD is still stateful. It has to be purged before re-use. This isn't trivial, since SSDs continually recycle deleted blocks into the spares pool, and only overwrite the data when the block is up for re-use. Malware could find those un-erased blocks and read the data out, compromising security in a big way.

On a hard drive, the traditional fix is to write random data to the deleted blocks. Irrespective of whether this is done by the user or by the hypervisor, this will not work with SSD because of the over-provisioning, especially if multiple users share the drive. Ideally, a vendor-provided utility would be needed to do the job properly, but having this work with proprietary partitioning from the hypervisor vendor complicates things.
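The toy model below illustrates why overwriting logical blocks does not scrub an SSD. It is a deliberately simplified sketch: the `ToySSD` class and its page-remapping policy are hypothetical and stand in for a real flash translation layer, which is far more complex and vendor-specific.

```python
# Toy flash translation layer (FTL): every logical write lands on a fresh
# physical page, and the stale page returns to the spare pool un-erased.
# Hypothetical model for illustration only; real firmware differs.

class ToySSD:
    def __init__(self, logical_pages: int, spare_pages: int):
        total = logical_pages + spare_pages
        self.logical_pages = logical_pages
        self.physical = [b""] * total    # raw flash contents
        self.mapping = {}                # logical page -> physical page
        self.free = list(range(total))   # pages available for new writes

    def write(self, lpage: int, data: bytes) -> None:
        if not 0 <= lpage < self.logical_pages:
            raise IndexError("logical page out of range")
        new = self.free.pop(0)           # take the next free physical page
        old = self.mapping.get(lpage)
        self.physical[new] = data        # old contents vanish only on reuse
        self.mapping[lpage] = new
        if old is not None:
            self.free.append(old)        # stale data stays until reused

    def read(self, lpage: int) -> bytes:
        page = self.mapping.get(lpage)
        return self.physical[page] if page is not None else b""

    def raw_scan(self) -> list:
        """What something with raw access to the flash could see."""
        return [data for data in self.physical if data]


def overwrite_demo():
    ssd = ToySSD(logical_pages=4, spare_pages=4)
    for i in range(4):
        ssd.write(i, b"secret-%d" % i)   # first tenant's data
    for i in range(4):
        ssd.write(i, b"random")          # attempted scrub via overwrite
    logical = [ssd.read(i) for i in range(4)]
    leftovers = [d for d in ssd.raw_scan() if d.startswith(b"secret")]
    return logical, leftovers


if __name__ == "__main__":
    logical, leftovers = overwrite_demo()
    print("logical view:", logical)
    print("still on flash:", leftovers)
```

In this model the logical view shows only the overwrite pattern, yet all four original pages still sit in the spare pool, which is the exposure described above.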

One solution is the method used in Microsoft Azure, where metadata indicates if the block is a valid read for the new user. If the block hasn't been written, the data is returned as zeroes, protecting the original user's security. However, most other CSPs appear to be silent on this issue.
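The idea of masking unwritten blocks with metadata can be sketched as follows. This is not Azure's implementation; the `MaskedVolume` class and its in-memory set of written blocks are hypothetical, chosen only to show the principle.

```python
# Sketch of read masking: blocks the current tenant has never written are
# returned as zeroes, whatever the backing store still physically holds.
# Hypothetical illustration; not any provider's actual code.

from typing import List


class MaskedVolume:
    def __init__(self, backing: List[bytes], block_size: int):
        self.backing = backing        # may still hold a previous tenant's data
        self.block_size = block_size
        self.written = set()          # metadata: blocks written by this tenant

    def write(self, block: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.backing[block] = data
        self.written.add(block)

    def read(self, block: int) -> bytes:
        if block in self.written:
            return self.backing[block]
        return bytes(self.block_size)  # zeroes for never-written blocks


if __name__ == "__main__":
    stale = [b"OLDDATA!", b"OLDDATA!"]       # left behind by a prior tenant
    vol = MaskedVolume(stale, block_size=8)
    print(vol.read(0))                       # zeroes, not the old data
    vol.write(1, b"NEWDATA!")
    print(vol.read(1))                       # the new tenant's own data
    print(stale[0])                          # old data is still physically there
```

Note that the previous tenant's bytes remain in the backing store until overwritten; the metadata check only hides them, and each read pays for the extra lookup.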

At PCIe SSD speeds, Azure's method will compromise performance. Further, the Azure solution leaves users' data on the (former) instance storage and out of their control for an extended time. It also does not explicitly discuss the over-provisioned spare pool with SSD and, as I'm writing this column, these appear to be unresolved problems.

With big data including healthcare information and financials, un-erased data could well be a problem for compliance with HIPAA and SOX. You should request a detailed explanation of your CSP's approach to the issue. In-house cloud operations also should explore any exposures.

Long term, we'll continue this debate when persistent NVDIMM and flash's successors enter the picture. These and the Hybrid Memory Cube approach increase bandwidth, which will force CSPs to use the technologies in a stateful way. This issue won't go away soon!



These are some pretty serious security/compliance problems Jim. Do you know if there are any industry efforts underway to address them?

Re: solutions?

I haven't seen much debate yet, partly because this is relatively new, and partly because the NSA disclosures are taking up bandwidth in security circles.

Azure seems to be responding though.

Wipe that memory device

This is actually a known problem. Cloud user Kenneth White has publicly related how he logged into his new DigitalOcean account and, out of curiosity, was able to find 18 GB of the previous user's data on the solid state drives assigned to him (cited here: When the user of the memory device, whether hard disk, RAM or solid state, changes, the memory needs to be wiped. It's not a hard problem to solve, but it's one that's tending to go unattended to in periods of rapid growth. Wipes interfere with maximizing efficiency and profit, but cloud suppliers had better learn to do them.

Re: Wipe that memory device

That's a real horror story!

Re: Wipe that memory device

I can't say I'm surprised that cloud users are finding other people's data in the space they rent. And we aren't doing potential users any favors by burying those accounts in stories without having someone like Charlie who can go find the details for them :(

As a user closing an account, I wonder if there's anything you can do to make sure your data gets wiped?

Re: Wipe that memory device

I think the usual recommendation is for cloud users to make sure the SLA has a data deletion requirement upon contract termination. Enforcing that might be another issue.

Re: Wipe that memory device

Right -- I wonder if the original problem is that the providers are not complying with the SLAs, or that the SLAs are not in place in the first place? Two very different problems to solve...

Re: Wipe that memory device

I think in some cases, only what's called a "click-wrap" agreement is available, which doesn't give the customer the ability to negotiate an SLA with specific requirements.

Re: Wipe that memory device

Aha, time to look for a provider that will promise to wipe your data, in that case. Doesn't seem like too much to ask!

Re: Wipe that memory device

If you have that "promise" how do you get proof? Also, how many cloud users are even aware this is a potential problem?

Re: Wipe that memory device

Most sites just don't discuss the issue, and sign-up is pretty well a standard take it or leave it agreement.

Azure has a partial answer, but they don't wipe data; they just proactively prevent access to it, which is better than nothing, but it still means the data is out there.

We need the CSPs to own up to how they resolve this issue, and also to state how they clear your user space in DRAM memory after you close an instance.

Keeping up

Great article, Jim. One of the reasons I like to participate on these sites is that it keeps me up to date on new things coming out and issues that may arise as a result. This is a real problem. Performance should never take priority over security, but it often does. If performance suffers because of security, then find a new way to secure the data that won't affect performance as much. Easier said than done, I know.