Relational Concepts

Lesson 6: Data Denormalization
Objective: Describe data denormalization and when it should be used.

Data Denormalization


As you might expect, data denormalization is the opposite of data normalization. In data normalization, you break up the data into multiple entities, or tables. When you denormalize data, you intentionally repeat data within the same entity. Denormalized data occupies more disk space, but it can enable the software to execute queries more efficiently because a query does not have to retrieve data from many different tables. Unless you are building a specialized database, such as a data warehouse, you would not generally denormalize data. It is generally a good idea to normalize your database whenever possible.
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data. In some cases, denormalization is a means of addressing performance or scalability within the context of relational database software.
A normalized design will often store different but related pieces of information in separate logical tables known as relations. If these relations are stored physically as separate disk files, completing a database query that draws information from several relations through a join operation can be slow. There are two strategies for dealing with this. The preferred method is to keep the logical design normalized but allow the database management system (DBMS) to store additional redundant information on disk to optimize query response; in this case, it is the DBMS software's responsibility to ensure that any redundant copies are kept consistent. The alternative, which the rest of this lesson focuses on, is to denormalize the logical design itself, in which case it becomes the designer's responsibility to keep the redundant data consistent.
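To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer, orders, and orders_denorm tables are hypothetical; the point is only that repeating the customer's name in every order row lets a frequent query run without a join, at the cost of extra storage and of keeping the copies consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: customer data lives in one relation, orders in another.
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customer, total REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob')")
cur.execute("INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0)")

# Reading an order together with the customer's name requires a join.
print(cur.execute(
    "SELECT o.order_id, c.name, o.total "
    "FROM orders o JOIN customer c ON c.customer_id = o.customer_id"
).fetchall())

# Denormalized alternative: the customer name is repeated in every order row,
# so the same question can be answered without a join -- at the cost of extra
# storage and the obligation to keep the copies consistent on every update.
cur.execute("""
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.customer_id, c.name AS customer_name, o.total
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""")
print(cur.execute("SELECT order_id, customer_name, total FROM orders_denorm").fetchall())
conn.close()
```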

Normalization

Normalization is a technique for deciding which attributes belong together in a relation. One of the basic aims of relational database design is to group attributes together in a relation because there is a functional dependency between them. The result of normalization is a logical database design that is structurally consistent and has minimal redundancy. However, it is sometimes argued that a normalized database design does not provide maximum processing efficiency. Consequently, there may be circumstances where it is necessary to accept the loss of some of the benefits of a fully normalized design in favor of performance. This should be considered only when it is estimated that the system will not otherwise be able to meet its performance requirements.
This is not to say that normalization should be omitted from logical database design: normalization forces us to understand completely each attribute that has to be represented in the database, and this may be the most important factor contributing to the overall success of the system.
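As a brief, hypothetical illustration of grouping attributes by functional dependency (again using sqlite3): in the first table below, branch_address depends on branch_no rather than on the key staff_no, so the normalized design moves it into its own branch relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized: branch_address is determined by branch_no, not by the key
# staff_no, so it is repeated for every member of staff at the same branch.
conn.execute("""CREATE TABLE staff_unnormalized (
    staff_no TEXT PRIMARY KEY, name TEXT, branch_no TEXT, branch_address TEXT)""")

# Normalized: each attribute is grouped with the determinant it depends on
# (staff_no -> name, branch_no; branch_no -> branch_address).
conn.execute("CREATE TABLE branch (branch_no TEXT PRIMARY KEY, branch_address TEXT)")
conn.execute("""CREATE TABLE staff (
    staff_no TEXT PRIMARY KEY, name TEXT,
    branch_no TEXT REFERENCES branch(branch_no))""")
conn.close()
```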

Denormalization

Before denormalizing, keep the following drawbacks in mind:
  1. denormalization makes implementation more complex;
  2. denormalization often sacrifices flexibility;
  3. denormalization may speed up retrievals, but it slows down updates.
Formally, the term denormalization refers to a refinement to the relational schema such that the degree of normalization for a modified relation is less than the degree of at least one of the original relations. We also use the term more loosely to refer to situations where we combine two relations into one new relation, and the new relation is still normalized but contains more nulls than the original relations. Some authors refer to denormalization as usage refinement.
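The looser sense of the term can be pictured with the following sketch, in which a client may or may not have a single interview record (an optional 1:1 relationship); the table and column names are hypothetical. The combined relation still has a single key, so it remains normalized, but clients without an interview now carry NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original design: not every client has an interview record (optional 1:1).
cur.execute("CREATE TABLE client (client_no TEXT PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE interview (
    client_no TEXT PRIMARY KEY REFERENCES client(client_no),
    interview_date TEXT, comments TEXT)""")

# Combined relation: still determined by the single key client_no, so it is
# still normalized, but clients without an interview carry NULLs.
cur.execute("""CREATE TABLE client_interview (
    client_no TEXT PRIMARY KEY, name TEXT, interview_date TEXT, comments TEXT)""")
cur.execute("INSERT INTO client_interview VALUES ('C1', 'Dana', '2024-05-12', 'lease ends soon')")
cur.execute("INSERT INTO client_interview VALUES ('C2', 'Omar', NULL, NULL)")
print(cur.execute("SELECT * FROM client_interview").fetchall())
conn.close()
```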
As a general rule of thumb, if performance is unsatisfactory and a relation has a low update rate and a very high query rate, denormalization may be a viable option.

When to use Denormalization

Consider denormalization in the following situations, specifically to speed up frequent or critical transactions (the second technique is sketched in code after the list):
  1. Combining one-to-one (1:1) relationships
  2. Duplicating non-key attributes in one-to-many (1:*) relationships to reduce joins
  3. Duplicating foreign key attributes in one-to-many (1:*) relationships to reduce joins
  4. Duplicating attributes in many-to-many (*:*) relationships to reduce joins
  5. Introducing repeating groups
  6. Creating extract tables
  7. Partitioning relations
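As a sketch of the second technique, again using sqlite3 with hypothetical owner and property tables, the non-key attribute owner_name is duplicated into the many-side relation so that a frequent listing query no longer needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1:* relationship: one owner rents out many properties.
cur.execute("CREATE TABLE owner (owner_no TEXT PRIMARY KEY, owner_name TEXT)")
cur.execute("""CREATE TABLE property (
    property_no TEXT PRIMARY KEY, street TEXT,
    owner_no TEXT REFERENCES owner(owner_no))""")
cur.execute("INSERT INTO owner VALUES ('O1', 'Joe Keogh'), ('O2', 'Carol Farrel')")
cur.execute("INSERT INTO property VALUES ('P1', '16 Holhead', 'O1'), ('P2', '6 Lawrence St', 'O2')")

# Frequent query: list each property with its owner's name (needs a join).
print(cur.execute("""SELECT p.property_no, o.owner_name
                     FROM property p JOIN owner o ON o.owner_no = p.owner_no""").fetchall())

# Denormalized: duplicate the non-key attribute owner_name into property, so
# the frequent query needs no join. Any change to an owner's name must now be
# propagated to all of that owner's property rows.
cur.execute("ALTER TABLE property ADD COLUMN owner_name TEXT")
cur.execute("""UPDATE property SET owner_name =
               (SELECT owner_name FROM owner WHERE owner.owner_no = property.owner_no)""")
print(cur.execute("SELECT property_no, owner_name FROM property").fetchall())
conn.close()
```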

Transaction/Relation Cross-Reference Matrix

It is often not possible to analyze all transactions, so concentrate on the most important ones. To help identify these, you can use:
  1. a transaction/relation cross-reference matrix, which shows the relations that each transaction accesses (a small sketch follows this list), and/or
  2. a transaction usage map, which indicates which relations are potentially heavily used.
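The sketch below shows what a tiny cross-reference matrix might look like when built from a list of transactions; the transaction names, relations, and access codes (I = insert, R = read, U = update, D = delete) are hypothetical placeholders.

```python
# Each transaction is listed with the relations it accesses and the kind of
# access it performs on each one.
transactions = {
    "(A) Create property listing":    {"property": "I", "owner": "R"},
    "(B) List properties with owner": {"property": "R", "owner": "R"},
    "(C) Update owner details":       {"owner": "U"},
}

relations = sorted({rel for access in transactions.values() for rel in access})

# Print the matrix: transactions down the side, relations across the top,
# access type (or '.') in each cell.
print(f"{'Transaction':35}" + "".join(f"{rel:>12}" for rel in relations))
for name, access in transactions.items():
    print(f"{name:35}" + "".join(f"{access.get(rel, '.'):>12}" for rel in relations))

# Relations accessed by many transactions (here: owner and property) are the
# candidates to examine for possible denormalization.
```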
To focus on areas that may be problematic:
  1. Map all transaction paths to relations.
  2. Determine which relations are most frequently accessed by transactions.
  3. Analyze the data usage of selected transactions that involve these relations.
If a transaction/relation cross-reference matrix has been produced, it provides useful information for this step: it summarizes, in a visual way, the access patterns of the transactions that will run on the database. It can be used to highlight possible candidates for denormalization and to assess the effects this would have on the rest of the model. More specifically, in this step we consider duplicating certain attributes or joining relations together to reduce the number of joins required to perform a query.

In the next lesson, you will learn how to enforce entity integrity with primary and unique keys.