Data de-duplication, popularized by Avamar and Data Domain in the backup space, is the identification of redundant segments of information within files and across volumes. The comparison should always happen at a sub-file level, so that two consecutive backups of a database, for example, store only the components of that database that changed. Even with this, there is room for suppliers to differentiate: there is source-side de-dupe vs. target-side, and of course the now-famous inline vs. post-process debate.
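To make the sub-file idea concrete, here is a purely illustrative sketch (not any vendor's actual implementation) of a target-side de-dupe store. It splits incoming data into fixed-size chunks, fingerprints each chunk with SHA-256, and keeps only one copy of each unique chunk; a second backup of a database with one small change adds only the changed chunk. The chunk size and class names are assumptions for the example; real products typically use variable-size, content-defined chunking.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; products often use variable-size


class DedupeStore:
    """Toy target-side de-dupe store: keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def ingest(self, data: bytes) -> list:
        """Split data into chunks, store only chunks not seen before,
        and return a manifest (ordered list of chunk digests)."""
        manifest = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk  # new segment: store it once
            manifest.append(digest)          # duplicate segment: reference only
        return manifest

    def restore(self, manifest: list) -> bytes:
        """Rebuild the original data from its manifest of chunk digests."""
        return b"".join(self.chunks[d] for d in manifest)
```

Ingesting two nightly "backups" that differ in one byte shows the effect: both manifests reference eight chunks, but the store holds only nine unique chunks in total rather than sixteen.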
Data de-duplication is moving beyond the backup space and into the archive space, with suppliers like Permabit and Nexsan. There are even content-aware solutions, like those from Ocarina Networks, that work at a sub-file level and can provide data reduction on data that looks similar but differs at the byte level, such as rich media, Windows file shares, and audio and video, prior to moving that data to an archive.
A device that can only identify duplicate files, i.e., the same copy of a file stored across multiple volumes, should really be called "single instancing," not "de-duplication." While this technology can reduce the size of backup storage somewhat, it would still have to store each night's full copy of the database in the example above. This type of technology makes sense for email, email archive solutions, and, in some cases, primary storage, since it should have minor performance implications.
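The distinction is easy to show in a few lines. This hypothetical single-instance store fingerprints whole files: identical copies across volumes collapse to one stored instance, but a database that changes by a single byte each night produces a brand-new fingerprint and a full new copy every time. The class and names are invented for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level single instancing: one stored copy per unique whole file."""

    def __init__(self):
        self.files = {}  # SHA-256 hex digest of the whole file -> file bytes

    def ingest(self, data: bytes) -> str:
        """Store the file only if its whole-file hash is new; return the digest."""
        digest = hashlib.sha256(data).hexdigest()
        self.files.setdefault(digest, data)
        return digest
```

Two identical files from different volumes are stored once; two nightly database copies that differ by one byte are stored in full, twice, because the whole-file hash changes.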
Somewhere in here belongs the conversation about block-level incremental backup and replication technologies, but the term needs a refresh. Most of these technologies work by making a block-level copy, volume by volume, of a server onto a secondary storage target. Now, however, some of these suppliers can snapshot the secondary storage and either automatically roll that snapshot to tape or present it as a read/writeable volume. Clearly, this is more than backup.
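The mechanics can be sketched as follows, with the caveat that this is a toy model under assumed names, not any supplier's product: each sync compares the primary volume to the secondary copy block by block and transfers only the blocks that changed, and a snapshot captures a point-in-time copy of the secondary that could then be rolled to tape or mounted as a volume.

```python
BLOCK = 512  # assumed block size for the sketch


class BlockIncremental:
    """Toy block-level incremental replication to a secondary storage target."""

    def __init__(self, nblocks: int):
        self.secondary = [b"\x00" * BLOCK] * nblocks  # secondary starts empty
        self.snapshots = []

    def sync(self, primary: bytes) -> int:
        """Copy only the blocks that differ from the secondary; return the count."""
        changed = 0
        for i in range(len(self.secondary)):
            blk = primary[i * BLOCK:(i + 1) * BLOCK]
            if blk != self.secondary[i]:
                self.secondary[i] = blk
                changed += 1
        return changed

    def snapshot(self):
        """Point-in-time copy of the secondary; blocks are immutable bytes,
        so a shallow copy of the list is sufficient."""
        self.snapshots.append(list(self.secondary))
```

The first sync is a full copy; subsequent syncs move only changed blocks, and a snapshot taken before a change is unaffected by later syncs.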