Businesses need constant access to inventory, and to customer, employee and external Web databases. They must be able to get to files instantly, with zero downtime. Obviously, some technologies are closer to the heart of this than others, and those technologies are our focus here. Our definition of data management and storage technology encompasses business, technology and management issues associated with database management and data warehouse systems, as well as high-performance, fault-tolerant storage system architectures; these include backup systems and services, SAN/NAS (storage area network/network-attached storage) solutions, and disaster-recovery procedures.
Data-management systems can be divided into two general categories: first, systems that must produce instant results on actions taken, such as most e-commerce transactions; and second, data warehousing systems, which gather, collate, splice and make sense of large amounts of data.
Databases are built on the foundation of storing, updating and querying data. When developing a high-performance e-commerce site, you want updates and queries to be quick and failure-free. At the back end, database vendors have included clustering and failover technologies to distribute the work among several machines, at the same time guaranteeing data integrity. OLTP (online transaction processing) is more or less built into the major database products.
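To make the idea of quick, failure-free updates concrete, here is a minimal sketch of a transactional order, using Python's built-in sqlite3 module as a stand-in for a production RDBMS. The table names, the item and the place_order function are hypothetical, invented for illustration. The point is that the order record and the inventory update either both succeed or both disappear.

```python
import sqlite3

def place_order(conn, item, qty):
    """Record an order and decrement stock as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
            cur = conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE item = ? AND stock >= ?",
                (qty, item, qty),
            )
            if cur.rowcount == 0:
                # Not enough stock: raising here also undoes the INSERT above.
                raise ValueError("insufficient stock")
        return True
    except ValueError:
        return False

# Hypothetical schema and starting data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()

first = place_order(conn, "widget", 3)   # succeeds, stock drops to 2
second = place_order(conn, "widget", 3)  # fails, and no order row is left behind
```

A clustered commercial database adds failover and distribution on top of this, but the all-or-nothing transaction is the piece that protects data integrity.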
In contrast, data warehousing is the process of pulling very large amounts of data from different sources (usually in batches, with extraction, transformation and loading, or ETL, tools), storing it centrally, and using software and tools to make decisions based on experiences documented by the data. This data usually is not dynamic and contains lots of statistical information, so the tools often use complex queries to glean important trends and analysis from it. The databases themselves are sometimes more specialized than the popular RDBMSes (relational database management systems) such as Oracle 8i or IBM's DB2, but the RDBMS vendors recognize the market and are adding more warehousing functionality directly to their databases, or providing separate database products optimized for warehousing applications.
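The extract, transform and load steps can be sketched in a few lines. This is only an illustration under stated assumptions: the two source feeds, their layouts and the sales_fact table are all hypothetical, and Python's built-in sqlite3 module stands in for a warehouse database. Commercial ETL tools do the same job at far larger scale.

```python
import sqlite3

# Extract: hypothetical batch dumps from two source systems with different layouts.
WEB_ORDERS = [("2000-09-01", "east", "19.99"), ("2000-09-01", "west", "5.00")]
STORE_SALES = [{"day": "09/02/2000", "region": "EAST", "cents": 1250}]

def transform(web_rows, store_rows):
    """Normalize both feeds to (iso_date, lowercase_region, amount_in_cents)."""
    for day, region, dollars in web_rows:
        yield day, region.lower(), round(float(dollars) * 100)
    for row in store_rows:
        month, dom, year = row["day"].split("/")
        yield f"{year}-{month}-{dom}", row["region"].lower(), row["cents"]

def load(conn, rows):
    """Append the normalized rows to a central fact table in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (day TEXT, region TEXT, cents INTEGER)"
    )
    with conn:
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(WEB_ORDERS, STORE_SALES))

# A typical warehouse-style query: aggregate across everything that was loaded.
totals = dict(
    warehouse.execute("SELECT region, SUM(cents) FROM sales_fact GROUP BY region")
)
```

Once the feeds share one layout, a single query can summarize data that originally lived in separate systems, which is the payoff of storing it centrally.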
Data marts are basically smaller versions of data warehouses. They're more often purchased as off-the-shelf products that require less configuration and less time to manage.
They can also be built alongside existing data warehouses to solve smaller problems, or at physical locations apart from a central data warehouse, with each data mart pulling the data that's pertinent to its site from the central warehouse.
At the front end of the data warehouse, business intelligence tools and suites analyze and present the data, often using OLAP (online analytical processing) or more complicated data-mining techniques to dig into it. Business intelligence systems include decision-support systems and management information systems; the older terms have simply been gathered under the new business intelligence label.
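At its simplest, an OLAP operation is a roll-up: summing a measure across some dimensions while keeping others. The sketch below shows that idea in plain Python. The fact rows, dimension names and roll_up function are hypothetical, and real OLAP engines precompute and index these aggregates rather than scanning rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, revenue)
FACTS = [
    ("east", "disk", "Q1", 100),
    ("east", "tape", "Q1", 40),
    ("west", "disk", "Q1", 70),
    ("east", "disk", "Q2", 120),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Sum revenue over every dimension that is not named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # revenue is the last field of each row
    return dict(totals)

by_region = roll_up(FACTS, ["region"])
by_product_quarter = roll_up(FACTS, ["product", "quarter"])
```

The same facts answer different questions depending on which dimensions are kept, which is what lets an analyst move from a regional summary to a product-by-quarter view.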
What's Happening?
According to a June DBMS market analysis by Gartner Group's Dataquest unit, the overall DBMS market grew 18 percent in 1999, up from about 12 percent growth in 1998 and 6 percent growth in 1997. And in its September "Data Warehousing Infrastructure Software Market Share" report, Dataquest noted that the data warehousing market saw a 32 percent increase from 1997 to 1998 and a 45 percent increase from 1998 to 1999. From this information alone, you can see that data management is becoming more crucial.
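Compounding the yearly figures shows how quickly those increases add up. The short calculation below uses only the growth rates quoted above; the function name is ours, and the result is simple arithmetic, not a figure reported by Dataquest.

```python
def cumulative_growth(yearly_rates):
    """Compound a sequence of year-over-year growth rates into one total rate."""
    factor = 1.0
    for rate in yearly_rates:
        factor *= 1.0 + rate
    return factor - 1.0

# DBMS market: 6 percent (1997), 12 percent (1998), 18 percent (1999)
dbms_total = cumulative_growth([0.06, 0.12, 0.18])

# Data warehousing: 32 percent (1997 to 1998), 45 percent (1998 to 1999)
warehousing_total = cumulative_growth([0.32, 0.45])

print(f"DBMS market over three years: {dbms_total:.1%}")
print(f"Data warehousing, 1997 to 1999: {warehousing_total:.1%}")
```

On these numbers, the DBMS market grew roughly 40 percent over three years, while the data warehousing market nearly doubled in two.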
Actually, several things are happening at once, making this a fairly complicated category of solutions. First, the databases themselves are maturing. At this point, small speed differences among the databases don't matter as much as what the databases offer in terms of reliability and scalability. Because business data is so critical, especially within e-commerce, it's important to perform well, but it's more important to perform correctly. Clustered and distributed solutions are necessary, and the big vendors are providing them (see "Jockeying for the Lead," on page 148).
Second, database vendors have broadened their offerings to include data warehousing software, financial applications and consulting services to link the two. Large database vendors have acquired smaller, more specialized vendors to fill these needs. The smaller competitors that remain must carefully determine what services they'll offer.
Third, conflicting data warehousing standards are finally starting to merge. In September, the Object Management Group (OMG) and the Meta Data Coalition (MDC) agreed to combine forces and build a single data warehousing standard for metadata; previously, the MDC's Open Information Model (OIM) and the OMG's Common Warehouse Metamodel (CWM) competed. The MDC will merge into the OMG, and the new standard will take the best portions of each and use the CWM name. This will bring benefits both for software vendors, which need to interoperate, and for users, who will have the flexibility of using different vendors for different portions of a data warehousing project.
Because of all this, enterprises need to weigh the costs and benefits of going with a single vendor. From a business standpoint, it might make sense to use Oracle for everything from the company's Web database to financial applications to the application server. You'll have fewer contracts to sign, a simpler support system and a better chance that one set of applications will be able to talk to another.
However, you'll probably pay more for this solution than for something similar you would integrate on your own. You may also find yourself making some concessions in functionality and flexibility. And if you decide you've chosen the wrong company to do business with, it will be more difficult to break away and use a different solution.
As we've suggested, data management and storage technologies are tightly bound. Especially for e-commerce and interactive solutions, it's important to build checks into databases to make sure they cope when storage runs short: by failing gracefully, by warning the administrator before space runs out, or by automatically modifying the way storage is partitioned to accommodate growth.
For example, Tivoli Systems is building hooks between its Storage Network Manager and databases that automatically expand tables when they reach their set limits, letting the storage system determine where it can expand them. And although clustering is often built into databases, many high-availability clustering solutions provide APIs that let database vendors and customers build cluster awareness into their systems.
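The "warn the administrator before it happens" check can be sketched generically. The example below is not Tivoli's mechanism or any vendor's API. The function names and the 80 percent and 95 percent thresholds are assumptions chosen for illustration, and the only real call used is shutil.disk_usage from Python's standard library.

```python
import shutil

def storage_status(used_bytes, total_bytes, warn_at=0.80, critical_at=0.95):
    """Classify a volume by how full it is (thresholds are illustrative)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= critical_at:
        return "critical"  # time to expand or repartition storage
    if fraction >= warn_at:
        return "warn"      # alert the administrator while there is still room
    return "ok"

def check_volume(path):
    """Apply the same classification to a real mounted volume."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return storage_status(usage.used, usage.total)

current = check_volume(".")
```

A database-aware version would run the same comparison against table or tablespace limits and hand the "critical" case to the storage system to decide where to expand.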
As databases and storage technologies become more enmeshed, it's essential to understand the state of the technology. Whether the subject is databases or storage technology, one point is inarguable: Data is the most important asset your company owns. And that asset is growing faster than any other component on your enterprise network.
Dealing with the importance of the data and its increasing size is becoming a complex problem that requires an intelligent strategy. An understanding of the basics is critical.
NAS Was First
Network-attached storage is a fairly recent development that now seems like a no-brainer. It was a solution to a simple problem. Let's say you needed to quickly add disk space for a project, but you didn't have the time to take your server down, add more hard disks and reformat the entire volume. In all probability, your server's drive bays were full, and you didn't have the resources to invest in a new high-end PC. Standalone NAS devices provided their own operating systems, Ethernet connection and gigabytes of easily accessible storage. You could place them anywhere on your network and, in seconds, share data with multiple platforms, including Unix-, Microsoft Windows-, and Apple Computer Mac OS-based clients.
Some of the first NAS devices were inexpensive, relatively low-capacity (a few gigabytes at most), and small enough to fit under a desk. Cobalt, Hewlett-Packard and Quantum were among the most popular vendors of these lower-end systems. Over the past few years, we've seen the capacity, size and performance of these devices skyrocket. Now, vendors such as Network Storage Solutions offer cabinet-size products that can run faster than most servers and store terabytes of information.
SANs Fill a Need
But network-attached storage has its limitations. With the exception of some of the larger, $100,000 solutions, NAS systems don't usually provide effective backup, nor do they off-load storage and backup traffic from the data network. Storage-area networks were introduced to eradicate these shortcomings. SAN devices won't ever replace NAS offerings; they'll simply continue to complement them. And NAS devices will actually become more important as integral components of SANs.
The term storage area network implies a network separate from the LAN, but what we're really talking about is an architecture. A storage area network is a separate computer network, typically based on a "fabric" of Fibre Channel switches and hubs, that connects storage devices to a heterogeneous set of servers on a many-to-many basis. SANs can provide increased performance, scalability, data protection, resiliency, availability and manageability. Performance is heightened because servers talk directly to storage and other servers on a separate 1-Gbps network. The storage area network doesn't have to contend with standard LAN traffic, such as SAPs (Service Advertising Protocol packets) and ACKs (acknowledgments), and the LAN doesn't have to deal with storage traffic, such as unattended backup and restore.
And SANs are scalable. They use Fibre Channel, which, unlike SCSI, can handle dozens of devices stretched across a long distance. This means the storage doesn't have to connect directly to the server or even be in the same room. Standalone devices, such as NAS boxes, can also hold many more drives than a PC-based file server could. Data protection is increased, because you can place tape devices on a Fibre Channel network and have them back up multiple servers.
You can also use a SAN to add mirroring and fault-tolerant devices for better resiliency. And availability improves, because you can create a single storage area to which all servers and users can point. In fact, a single SAN can store the data for a complete WAN as long as the links are fast enough to handle the load. SANs are also easier to manage, because the storage devices are all part of the same network and located in the same area. For companies with huge databases full of customer records, a SAN can shave seconds off record retrieval, which ultimately boosts customer satisfaction.
Accessibility of data is crucial, as is speed. While 1 Gbps seems like a lot today, plans are already under way to reach 2 Gbps in the next year or so. The most heated debate over the future of SANs centers on Layers 2 and 3 of the OSI model. Fibre Channel is the accepted transport for SAN products today, but many network vendors claim that Gigabit Ethernet is a better solution. After all, Ethernet is ubiquitous, so why not use it for every part of your network, including the SAN? This would allow you to manage a single flat network, without switching transports along the way.
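A back-of-the-envelope calculation shows what the jump from 1 Gbps to 2 Gbps would mean for a large backup. The sketch below works from the raw line rate only. The 100-GB backup size and the efficiency parameter are assumptions for illustration, and real throughput will be lower once encoding and protocol overhead are counted.

```python
def transfer_seconds(gigabytes, link_gbps, efficiency=1.0):
    """Estimate the time to move data over a link at its nominal line rate.

    `efficiency` is a hypothetical fraction (0 to 1) of the line rate that is
    actually available for payload; 1.0 ignores all overhead.
    """
    bits = gigabytes * 8e9  # decimal gigabytes to bits
    return bits / (link_gbps * 1e9 * efficiency)

backup_gb = 100  # assumed backup size
at_1_gbps = transfer_seconds(backup_gb, 1)
at_2_gbps = transfer_seconds(backup_gb, 2)

print(f"1 Gbps: {at_1_gbps / 60:.1f} minutes")
print(f"2 Gbps: {at_2_gbps / 60:.1f} minutes")
```

At the nominal rates, a 100-GB backup takes a little over 13 minutes at 1 Gbps and under 7 minutes at 2 Gbps, before any overhead is applied.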
Gigabit Ethernet might seem like a logical choice for the back end of SANs, but the industry will likely stick with Fibre Channel for several reasons. First and foremost, disk vendors don't want to integrate Ethernet controllers into their storage devices; that would raise cost and complexity. Second, Fibre Channel offers significantly more fault tolerance and faster failover times than Ethernet.
And while some of this could be skirted with proprietary protocols over Ethernet, why should incumbent vendors interfere with a proven, working enterprise technology?
Another trend in storage is the move toward IP, and away from proprietary solutions, as the protocol of choice. In fact, an SoIP (storage over IP) initiative is already in the works. With SoIP, in conjunction with Ethernet-equipped storage, you'll be able to attach and access a storage device just as easily as you can a client PC. SoIP offers a much more reasonable alternative for unifying Fibre Channel and Ethernet networks. Rather than trying to change the status quo, vendors can build interoperable SAN components that are connectable to the back-end network through a variety of network interfaces. Even better, SoIP will let you locate parts of your disk farm remotely across an IP MAN or WAN, and even run SAN traffic inside a VPN (virtual private network).
To be realistic, most SAN vendors are relying solely on Fibre Channel, and, for now, an Ethernet-based SAN is just a pipe dream. Likewise, IP still carries a lot of overhead that's unwanted for storage, and that must be fixed before the protocol can become viable in this area. But IP does rule, and Ethernet is the medium of choice, so stay tuned to see what develops.