|
|
|
|
Business Intelligence with Smarts
|
 |
|
September 30, 2002
By Lori MacVittie
|
>> continued from previous page
|
Data Clean and Normal
|
|
|
Normalizing data is an important part of database design. And cleaning data is critical for effective analysis. Essentially, normalization is the process of removing duplicate tuples (a tuple is a collection of attributes). This procedure reduces database size and helps ensure data integrity.
Notice that several cells in the non-normalized chart below contain the same value multiple times. To normalize this data, the values are removed and placed in a separate tableÑcommonly referred to as a lookup tableÑand then referenced from the original table.
When a data query is performed on the second set of tables, a join must be performed. Data residing in a data warehouse is often non-normalized because of the amount of data stored and the performance degradation resulting from joins across more than one table when dealing with excessively large data sets. The data used in testing was non-normalized.
It was also very, very dirty. But clean data is essential to ensuring that the information a business-intelligence tool generates is valid and useful. An enterprise applicationÕs native database almost never contains a clean data set. Changes from migration, upgrades and day-to-day interaction introduce errors. As an example, imagine a database in whose cells an X is supposed to represent "yes" and blank spaces represent "no." If, instead, "Y" or "N" appears in place of X or the empty cellÑa common occurrenceÑyour data is dirty.
|
|
|
 |
 |
|
|
|
 |
|