• 01/15/2014
    8:06 AM

Ethernet Interfaces Transform Object Storage

New direct Ethernet interfaces for object-oriented storage will change the rules of storage, allowing for large performance gains.

The idea of direct Ethernet drive interfaces dates back to at least 2001, although it generated little interest among the conservative storage clientele. A product platform using this technology finally arrived just a few months ago, and it may end up being one of the profound game-changers in the industry.

At its simplest, the new interface means a lower-cost solution for small network-attached storage and backup appliances in the SMB sector of the market. It eliminates the need for controllers and multi-drive boxes, likely replacing them with single external drives that connect directly to the LAN. With Ethernet comes a simple object interface called Kinetic, which acknowledges that disk drives generally follow a write-once operation mode and don't erase data often. (Seagate Technology launched the Kinetic Open Storage Platform in October.)
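The flavor of such a key/value drive interface can be sketched in a few lines. This is a mock, not the real libkinetic API; the class and method names are illustrative, and a real drive would be reached over a TCP connection rather than an in-memory dictionary:

```python
# Minimal sketch of a Kinetic-style key/value drive interface.
# All names here are illustrative, not Seagate's actual API.

class KineticDriveSketch:
    """Mock of a drive storing opaque values under byte-string keys."""

    def __init__(self):
        self._store = {}  # stands in for the drive's on-disk key/value store

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes) -> bytes:
        return self._store[key]

    def delete(self, key: bytes) -> None:
        self._store.pop(key, None)

drive = KineticDriveSketch()
drive.put(b"object/0001", b"payload bytes")
print(drive.get(b"object/0001"))
```

The point of the interface is what it omits: no LBAs, no partitions, no file system, just named objects written once and read back by name.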

Those savvy in the storage game can get RAID-like protection by plugging two drives into the LAN and using host software, even with the object interface -- though this might be called replication in object store parlance.
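Host-side replication of that sort is simple to express. The following sketch (class names invented for illustration) writes every object to both drives and falls through to the surviving replica on a read:

```python
# Hedged sketch of host-side replication across two key/value drives:
# every put goes to both, so either one can serve a get if the other
# fails. MockDrive and ReplicatedStore are illustrative names.

class MockDrive:
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store[key]

class ReplicatedStore:
    def __init__(self, drives):
        self.drives = drives

    def put(self, key, value):
        for d in self.drives:          # synchronous two-way replication
            d.put(key, value)

    def get(self, key):
        for d in self.drives:          # fall through to a surviving replica
            try:
                return d.get(key)
            except KeyError:
                continue
        raise KeyError(key)

store = ReplicatedStore([MockDrive(), MockDrive()])
store.put(b"k", b"v")
del store.drives[0]._store[b"k"]       # simulate losing one replica
print(store.get(b"k"))                 # still readable from the other drive
```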

But life gets interesting when you assemble a larger configuration. One drawback of enterprise-grade object stores is the need for many nodes, each with a relatively small amount of storage. That can get expensive. With an Ethernet switch replacing the SAS expander typically used, a JBOD (just a bunch of disks) can deliver 24 or 64 data stores, greatly simplifying the storage configuration.

Several notable vendors have projects underway to build storage appliances using JBODs and object storage code running on a server core, distributing the data blocks and replicas onto the JBODs. It's quite possible that the server core could be a virtual engine on an x86 server, pointing to some very slim storage systems. 

The servers handle deduplication and compression, so there's no loss of functionality. In time, this code will probably migrate to every server in the cluster, allowing data to be moved directly to the data stores. Compression saves on storage space, but host-based compression has the added advantage of reducing network bandwidth. It's like having 40 Gbit/s Ethernet for the price of 10 Gbit/s!
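The bandwidth claim is simple arithmetic. Assuming a 4:1 compression ratio (an illustrative figure; real ratios depend entirely on the data), a 10 Gbit/s link carries 40 Gbit/s of logical data:

```python
# Back-of-envelope illustration of the bandwidth claim above.
# The 4:1 compression ratio is an assumption for illustration only.

link_gbps = 10
compression_ratio = 4          # assumed; real ratios vary with the data
effective_gbps = link_gbps * compression_ratio
print(effective_gbps)          # 40
```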

Direct Ethernet connection to the drives removes a layer of hardware, and all the associated latencies, from the storage stack. In addition, the object-oriented interface on the drive removes the layers of indirect data paths we've built into traditional SCSI-based block layer protocols.

For example, block IO completes up to four address translations before reaching the SCSI interface; a typical IO then goes through LUN address translation, drive SCSI address translation, and drive physical address translation. Each step takes time, especially if a data lookup is involved.
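A toy model makes the cost visible: each translation is a table lookup, and every extra layer adds latency. The table contents below are invented purely for illustration:

```python
# Toy model of the layered address lookups described above.
# Mapping values are fabricated; only the layering is the point.

lun_map = {0x200: 0x1200}         # host block address -> LUN address
scsi_map = {0x1200: 0x3400}       # LUN address -> drive SCSI LBA
physical_map = {0x3400: (7, 12)}  # SCSI LBA -> (cylinder, sector) stand-in

def resolve(addr):
    lookups = 0
    for table in (lun_map, scsi_map, physical_map):
        addr = table[addr]
        lookups += 1              # each lookup costs time on a real stack
    return addr, lookups

location, lookups = resolve(0x200)
print(lookups)                    # 3 lookups after the block layer
```

An object interface on the drive collapses these layers: the drive itself resolves a name to media, and the host never sees an LBA.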

In an object storage system, the whole stack right up to the top layer in the host can be replaced by the object store code, which can figure the placement of the required data using the same algorithm that placed it in a data store in the first place. A native interface for object store drives would be much thinner and more efficient than the current approach. SSD and all-flash arrays would benefit from the new interface as well.
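The "same algorithm that placed it" idea is usually deterministic hashing: writer and reader independently derive the target drive from the object name, so no lookup table is consulted. A minimal sketch (drive names and scheme are illustrative; real systems use consistent hashing such as Ceph's CRUSH):

```python
# Sketch of deterministic placement by hashing the object name.
# DRIVES and the modulo scheme are illustrative simplifications.

import hashlib

DRIVES = ["drive-0", "drive-1", "drive-2", "drive-3"]

def place(object_name: str) -> str:
    digest = hashlib.sha256(object_name.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(DRIVES)
    return DRIVES[index]

# Writer and a later reader independently agree on the location:
print(place("photo-123") == place("photo-123"))   # True
```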

There are still some challenges to overcome. Most object stores spread data over a set of nodes. This helps overall efficiency of storage and prevents hotspots, but the algorithm typically breaks data into 64-KB chunks that are too small for today's storage, reducing system performance. Larger chunk sizes will improve the situation.
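The chunk-size trade-off is easy to quantify: striping a 1 MB object in 64 KB chunks costs 16 I/Os, while 256 KB chunks need only 4. The 256 KB figure is just an illustrative alternative:

```python
# Illustration of the chunking trade-off described above.

def chunk_count(object_size: int, chunk_size: int) -> int:
    return -(-object_size // chunk_size)   # ceiling division

MB = 1024 * 1024
print(chunk_count(1 * MB, 64 * 1024))      # 16 chunks
print(chunk_count(1 * MB, 256 * 1024))     # 4 chunks
```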

Advanced data protection systems, such as erasure coding (where data is striped over many drives), also face issues. This is easier when the physical drives are owned by a single controller such as a RAID card, rather than a globally shared resource as in the new system.
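A minimal single-parity sketch (RAID-5-style XOR; production systems typically use Reed-Solomon codes) shows why a coordinating owner helps: the encoder must see all data shards of a stripe together to compute and maintain parity:

```python
# Single-parity erasure-coding sketch using XOR, for illustration only.

def xor_parity(shards):
    parity = bytes(len(shards[0]))
    for s in shards:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return parity

def recover(surviving_shards, parity):
    # Any single lost shard is the XOR of the parity with the survivors.
    return xor_parity(surviving_shards + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
p = xor_parity(data)
lost = data.pop(1)                      # lose shard b"BBBB"
print(recover(data, p) == lost)         # True
```

With globally shared drives, that stripe-wide coordination must happen somewhere above the drives, which is exactly the open question.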

Security concerns could emerge, given that third parties could access the Ethernet "storage fabric." This is no more of a problem than we already face in iSCSI and FCoE networks, however.

Ethernet object storage looks very promising, especially when you think about the interfaces in storage converging on Ethernet as a single solution in the longer term. We can expect alternatives to Kinetic to appear as the compute power of drives increases. This could be the beginning of something as big as the advent of SCSI, which changed drives forever.

Jim O'Reilly is a former IT executive and currently a consultant focused on storage and cloud computing.


Seagate Kinetic Innovation is Ethernet AND Key/Value API

Well, putting Ethernet interfaces on HDDs is not new, but Seagate's James Hughes and his team combined Ethernet and a key/value API to create the innovation of the Kinetic HDD. The Seagate Kinetic Open Storage Platform does away with the need for POSIX file systems and RAID hardware, which basically means no more storage servers. What you will have are application servers "managing" some number of JBOK (Just a Bunch of Kinetics) trays that could contain up to 60 Kinetic HDDs in a 4U enclosure, or up to 2.4PB per 42U cabinet (600 HDDs x 4TB). The application servers and JBOKs could be in the same data center, or they could be separated from each other. The Kinetic HDDs use the same SAS/SATA connector to plug into an Ethernet Layer 2 switched backplane, and the Kinetic JBOK uplinks to a top-of-rack switch using multiple 10GbE interfaces. The connection layer between the application and the Kinetic HDDs consists of the Seagate Libkinetic library, Google Protocol Buffers, and Gigabit Ethernet.

The Kinetic Open Storage Platform is not suitable for all types of data storage, but it is suitable for applications that make use of object storage. The Kinetic SDK will be available in Q1 2014 from Seagate. Object storage software vendors like Basho, Scality, and SwiftStack are already working with it, but it could take six months or so before they reach production-ready use.

Re: Seagate Kinetic Innovation is Ethernet AND Key/Value API

Kinetic appears to create a way of connecting a disk drive flexibly, but it still leaves the question of where all the services such as dedup and Ceph distribution run. That makes it more akin to another way to connect a disk drive than a replacement for a SAN or Ceph object store.

Removing the POSIX stack makes a lot of sense, but RAID needs to be replaced by replication or the like, and that doesn't seem to fit Kinetic itself; it is more a higher-level question.

That to me puts Kinetic drives in the role of Object Store Drives (OSD) rather than storage system replacements. A bit humbler, but still important!

Re: Seagate Kinetic Innovation is Ethernet AND Key/Value API

From the Ceph wiki with regard to using Seagate Kinetic on the backend...looks like the matter is being addressed by interested parties.

Current Status

  • a KeyValueStore backend exists that is based on GenericObjectMap and the generic KeyValueDB
  • a C/C++ kinetic API will be available shortly
  • the k/v backends are key-granularity: byte object data is striped across keys of some size, and small updates require a read/modify/write
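The striping behavior in the last bullet can be sketched as follows. Key naming and stripe size are illustrative, not Ceph's actual scheme; the patch is assumed to fit inside one stripe for simplicity:

```python
# Sketch of striping object bytes across fixed-size keys: updating a
# few bytes means reading the affected key's whole value, patching it,
# and writing the whole value back (read/modify/write).

STRIPE = 8                      # tiny stripe size for the example
store = {}                      # stands in for the k/v backend

def write_object(name, data):
    for i in range(0, len(data), STRIPE):
        store[f"{name}.{i // STRIPE}"] = data[i:i + STRIPE]

def update_bytes(name, offset, patch):
    # Assumes the patch does not cross a stripe boundary.
    key = f"{name}.{offset // STRIPE}"
    value = bytearray(store[key])          # read
    pos = offset % STRIPE
    value[pos:pos + len(patch)] = patch    # modify
    store[key] = bytes(value)              # write back the whole key

write_object("obj", b"0123456789abcdef")
update_bytes("obj", 10, b"XY")             # patch 2 bytes inside stripe 1
print(store["obj.1"])                      # b'89XYcdef'
```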

Detailed Description

There are several challenges to running Ceph backed by Kinetic:

  • the Kinetic API does not expose a transaction interface. For the first-pass prototype we will ignore this. For a second pass, we will either do some form of write-ahead logging, come up with something clever that fits well with GenericObjectMap, or convince the Kinetic folks to expose transactions through their API. Interestingly, other backends that do have some 'batch put' capabilities still limit the size of the transaction, so adding a fallback that writes the entire transaction and then applies it, and then setting the supported transaction size to 1, should capture the degenerate case.

Work items

Coding tasks

  1. make KeyValueDB wrapper for libkinetic
  2. make a transactional fallback
    1. a KeyValueDB implementation that sits on top of a NonTransactionalKeyValueDB which includes a batch_put() operation and a get_max_batch_size()
  3. build/adapt a simple caching layer that sits between KeyValueDB and GenericObjectMap
    1. or one that is integrated into GenericObjectMap to avoid encode/decode overheads
    2. evaluate under object and block workloads
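Coding task 2 can be sketched as a wrapper over a non-transactional store that offers `batch_put()` with a size cap. When a transaction exceeds the cap, the wrapper journals the whole transaction under a single key before applying it, which is the "degenerate case" the wiki mentions. All class and method names here are illustrative, not Ceph's actual interfaces:

```python
# Hedged sketch of a transactional fallback over a non-transactional
# k/v store. NonTransactionalKV / TransactionalKV are invented names.

import pickle

class NonTransactionalKV:
    def __init__(self, max_batch=4):
        self.data = {}
        self._max_batch = max_batch
    def get_max_batch_size(self):
        return self._max_batch
    def batch_put(self, items):
        assert len(items) <= self._max_batch  # backend's hard limit
        self.data.update(items)
    def put(self, key, value):
        self.data[key] = value

class TransactionalKV:
    JOURNAL_KEY = "__txn_journal__"
    def __init__(self, backend):
        self.backend = backend
    def commit(self, txn: dict):
        if len(txn) <= self.backend.get_max_batch_size():
            self.backend.batch_put(txn)       # fits: apply in one batch
        else:
            # Too big: journal the entire transaction first, then apply
            # key by key; replaying the journal would recover a crash.
            self.backend.put(self.JOURNAL_KEY, pickle.dumps(txn))
            for k, v in txn.items():
                self.backend.put(k, v)
            self.backend.put(self.JOURNAL_KEY, None)  # journal cleared

kv = TransactionalKV(NonTransactionalKV(max_batch=4))
kv.commit({f"k{i}": i for i in range(10)})    # exceeds the batch size
print(kv.backend.data["k7"])                  # 7
```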