Reduce Data Redundancy

An existing ER diagram was presented in a previous lesson and illustrated how the entities in the diagram could be represented as tables. Now I will show you how to examine your tables to ensure they reduce data redundancy and allow records to be

added,
updated, and
deleted

without inadvertently changing or removing important information. If you are stuck and do not know how to proceed with your database design, focus on creating tables where each table represents a single entity. This is the easiest way to create a solid database design.

What is Data Redundancy?
Data redundancy occurs in database systems which have a field in a table that is repeated in two or more tables. When customer data is duplicated and attached with each product bought, this will create a source of inconsistency, since the entity "customer" may appear with different values for a given attribute. Data redundancy leads to data anomalies and corruption and should be avoided when creating a relational database consisting of several entities. Database normalization prevents redundancy and makes the best possible usage of storage. The proper use of foreign keys can minimize data redundancy and reduce the chance of destructive anomalies appearing. Concerns with respect to the efficiency and convenience can sometimes result in redundant data design despite the risk of corrupting the data.

Examine Database Tables for reducing data redundancy

To examine database tables for reducing data redundancy and ensuring only unique records are added, you can follow these steps:

Understand Normalization Levels
- First Normal Form (1NF): Each column should hold atomic (indivisible) values, and each column should have a unique name. There should be no repeating groups in the table.
- Second Normal Form (2NF): In addition to 1NF, all non-key attributes must be fully dependent on the primary key. This eliminates partial dependencies.
- Third Normal Form (3NF): In addition to 2NF, all attributes must be non-transitively dependent on the primary key. This removes transitive dependencies.
Higher normal forms (like BCNF, 4NF, and 5NF) deal with more specific types of dependencies but are less commonly used unless necessary.
Analyze Existing Tables
- Check for Redundancy: Look for repeated data across rows. If you see the same information (like a customer's address) repeated in many rows, it suggests redundancy.
- Dependency Analysis: Determine if all columns in a table are directly related to the primary key. If not, consider splitting the table or adding foreign keys.
Use SQL Commands for Analysis
- SELECT DISTINCT: Use this to identify if there are duplicate records:
```
SELECT DISTINCT column1, column2 FROM table_name;
        
```
  If the number of rows returned is less than the total number of rows in the table, there are duplicates.
- COUNT and GROUP BY: Check for exact duplicates:
```
SELECT column1, column2, COUNT(*) 
FROM table_name 
GROUP BY column1, column2 
HAVING COUNT(*) > 1;
        
```
Implement Constraints
- Primary Keys: Ensure each table has a primary key to uniquely identify records.
- Unique Constraints: Add UNIQUE constraints on columns or sets of columns where you want to enforce uniqueness beyond the primary key:
```
ALTER TABLE table_name ADD CONSTRAINT 
unique_constraint_name UNIQUE (column1, column2);
        
```
- Foreign Keys: Use foreign keys to enforce referential integrity, which helps in maintaining data consistency across related tables.
Normalization Techniques
- Decompose Tables: Split tables into smaller ones where each table focuses on a single aspect or entity of the data model. For instance:
  - Split a customer orders table into Customers, Orders, and OrderDetails.
- Create Relationships: Use foreign keys to link these tables, ensuring data integrity and reducing redundancy.

Data Integrity Checks

Triggers: Implement database triggers to check for uniqueness or other rules before data insertion or update:

CREATE TRIGGER check_unique BEFORE INSERT ON table_name
FOR EACH ROW
BEGIN
    IF EXISTS (SELECT 1 FROM table_name WHERE column1 = NEW.column1 AND column2 = NEW.column2) THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Duplicate record detected';
    END IF;
END;

Regular Audits
- Periodically run scripts or use database tools to audit tables for duplicates or inconsistencies. Use SQL queries or tools like data profiling software to automate this process.
Education and Documentation
- Document the data model and normalization strategy. Ensure all developers and database administrators understand the normalization rules applied to the database.

By following these steps, you can systematically examine and adjust your database design to minimize redundancy and ensure data uniqueness. Remember, while normalization is key, sometimes denormalization might be used for performance reasons in specific scenarios, but this should be approached with caution.

Difference between Redundant Data and Duplicate Data

Care must be taken to distinguish between 1) duplicated data and 2) redundant data. Duplicated data is present when an attribute has two (or more) identical values. A data value is redundant if you can delete it without information being lost. In other words, redundancy is unnecessary duplication. Table Part in Fig. 3.1 a) contains duplicated data, because the value nut occurs twice for the attribute partDescription. The table does not, however, contain redundant data. If, as in Fig. 3.1 b), the value nut is deleted from the row P2 nut, you will no longer be able to tell from the table what the description of part P2 should be.

The image contains two tables labeled (a) and (b), both representing a Part dataset with two columns:
**Columns:**

**partNo**: Represents the part number, which uniquely identifies each part.

**partDescription**: Describes the part (e.g., nut, bolt, washer).

**Rows (Entries):**
Table (a)
| partNo | partDescription| |--------|----------------| | P2 | nut | | P1 | bolt | | P3 | washer | | P4 | nut |
- This table provides full part descriptions for all entries.
Table (b)

| partNo | partDescription| |--------|----------------| | P2 | — | | P1 | bolt | | P3 | washer | | P4 | nut |

In this table, the **partDescription** for `P2` is missing (represented as a blank or dash `—`).

The rest of the values remain the same as in Table (a).

**Key Differences:**

The main difference between (a) and (b) is that in Table (b), the **partDescription** for `P2` is missing, whereas in Table (a), it is correctly labeled as "nut."

Table (b) could represent an incomplete dataset or an instance where certain values are intentionally hidden, null, or missing.

Fig. 3.1 a): Non-redundant duplication
(a) Duplication but no redundancy (b) Deletion of duplicated but non-redundant data value causes loss of information

In contrast, table SupplierPart in Fig. 3.1 b) contains duplicated data which is redundant. This table shows which suppliers supply which parts, and also the descriptions of the parts. Notice that partNo P1 is a bolt and is duplicated, and that the duplication is redundant because, even if the value bolt were deleted from one of the occurrences of P1 bolt, you could still tell from the table that P1 is a bolt. In Fig. 3.1 b), the value bolt has been deleted from the fourth row, but the description of partNo P1 can still be deduced from the first row.

The image contains two tables labeled (a) and (b), both representing a SupplierPart dataset with three columns.
**Columns:**

**supplierNo**: Represents the unique identifier of the supplier.

**partNo**: Represents the unique identifier of the part.

**partDescription**: Provides a textual description of the part (e.g., bolt, nut).

**Rows (Entries):**
Table (a):
| supplierNo | partNo | partDescription| |------------|--------|----------------| | S2 | P1 | bolt | | S7 | P6 | bolt | | S2 | P4 | nut | | S5 | P1 | bolt |
- Each row represents a supplier (supplierNo) supplying a specific part (partNo) along with its description (partDescription).
Table (b):

| supplierNo | partNo | partDescription| |------------|--------|----------------| | S2 | P1 | bolt | | S7 | P6 | bolt | | S2 | P4 | nut | | S5 | P1 | — |
- The structure is the same as Table (a), but the "partDescription" for supplier S5 and part P1 is missing (represented as a blank or dash `—`).
**Key Differences:**

The **partDescription** value for `S5` supplying `P1` is missing in Table **(b)**, whereas it is properly recorded as "bolt" in Table **(a)**.

Table **(b)** might represent an incomplete dataset or a scenario where certain values were not recorded, deleted, or are null.

Fig. 3.1 b): Redundant duplication, (a) Redundantly duplicated data, (b) Deletion of redundantly duplicated data value causes no loss of information

Normalization and removing Duplication in Data

The term normalization means to make normal in terms of causing something to conform to a standard, or to introduce consistency with respect to style and content. In terms of relational database modeling, that consistency becomes a process of removing duplication in data. Removal of duplication tends to minimize redundancy and minimization of redundancy implies getting rid of unneeded data present in particular tables. In reality, normalization usually manages to divide information into smaller, more manageable parts.The most obvious redundancies can usually be removed without involving math. Commercially speaking, the primary objectives of normalization are usually to save space and organize data for usability and manageability, without sacrificing performance. This process can present a challenge and solved through trial and error. Additionally the demands of 1) intensely busy applications and 2) end-user needs can tend to necessitate breaking the rules of normalization in many ways to meet performance requirements. Rules are usually broken simply by not applying every possible layer of normalization. Normal Forms beyond 3rd Normal Form are often ignored and sometimes even 3rd Normal Form itself is discounted.

Normalization can be described as an introduction of granularity, removal of duplication, or minimizing of redundancy, or simply the introduction of tables, all of which place data into a better organized state.

Database Analysis for Design

Normalize your Schema

Yes, once again, somebody is telling you to normalize your schema. In addition to the benefits of normalization that are glorified elsewhere, a normalized schema is far easier to replicate.
Why? Consider a schema that is in first normal form (1NF) meaning its tables contain redundant data.
For example, a CUSTOMER table might have a column company_name.
If this table contains 1000 records for customers who work for Acme Tire and Rubber, then 1000 records will have to be updated when Acme Tire and Rubber changes its name to Acme Tire and Rubber and Lawn Furniture. Since every update is a potential conflict, updates should be kept to a minimum.
In addition, if a field such as company_name appears in numerous tables, you will have to devote significant effort to devising methods to ensure that an update to the field in one table affects the appropriate updates in the other tables not only locally but globally. A more practical concern with a denormalized schema is that such schemas are typically characterized by tables with many columns. Since replicated (DML) Data Manipulation Language must compare the old and new values of every column of every changed row, performance will suffer.
An unfortunate myth among database designers is that normalization reduces performance. The thought process is that since a denormalization can lead to a performance gain, any steps in the opposite direction must lead to performance losses. This conclusion is not accurate and do not denormalize for performance without the metrics to justify it.