Lesson 11	Primary keys
Objective	Define database primary keys.

What is a Database Primary Key?

A database primary key is a column or combination of columns which uniquely identifies each row in a table. As long as you have unique primary keys, you also have unique rows, which makes it possible to uniquely identify every row in a table.
Uniqueness matters for the same reason it matters for entity identifiers: you want to be able to retrieve every single piece of data you have inserted into a database. As far as a relational database is concerned, you should need only three pieces of information to retrieve any specific bit of data:

the name of the table,
the name of the column, and
the primary key of the row.

If primary keys are unique for every row, then we can be sure that we are retrieving exactly the row we want. If they are not unique, then we are retrieving only some row with the primary key value, which may not be the row that contains the data for which we are searching.
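
The three pieces of information listed above map directly onto a query. The following is a minimal sketch using Python's built-in sqlite3 module; the `customer` table, its columns, and the `fetch_city` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (customer_id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)

def fetch_city(conn, customer_id):
    """Table = customer, column = city, row = the given primary key value."""
    row = conn.execute(
        "SELECT city FROM customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    # Because customer_id is unique, at most one row can match.
    return row[0] if row else None

print(fetch_city(conn, 2))
```

Because the key is unique, the lookup returns either exactly one value or nothing; it can never silently pick one of several matching rows.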

Difference between an Entity Identifier in an ERD and a Primary Key in a Database Table

The "entity identifier" in an Entity Relationship Diagram (ERD) and the "primary key" in a database table are related concepts, but they differ in their context and use within the data modeling and database implementation processes:

Entity Identifier in an ERD
- Conceptual Role: The entity identifier represents a unique attribute (or a combination of attributes) that uniquely identifies each instance of an entity in the data model. It is defined at the conceptual level during the design phase.
- Purpose: It helps to ensure that each entity instance can be uniquely distinguished from all other instances. For example, in an ERD for a Customer entity, the entity identifier might be CustomerID or a combination of FirstName and LastName if a single attribute is not enough for uniqueness.
- Attributes: The entity identifier may be depicted as a unique attribute or composite attribute in the ERD, with an underline or other notation to signify its importance.

Primary Key in a Database Table
- Physical Role: The primary key is a specific implementation of the entity identifier when the database is created. It is a physical constraint defined in the database schema.
- Purpose: The primary key ensures that each row in a database table is unique and enforces the integrity of the data. It is implemented using SQL when defining tables in a relational database.
- Characteristics:
  - Uniqueness: A primary key must uniquely identify each row in the table.
  - Non-null: A primary key cannot contain NULL values.
  - Indexing: Most database systems automatically create an index on the primary key to optimize queries.
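
As one illustration of the indexing point, SQLite builds an index automatically when a table declares a non-integer primary key. This is a sketch only; the `product` table is a hypothetical example, and other database systems expose their indexes through different catalog queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY NOT NULL, title TEXT)")

# PRAGMA index_list reports the indexes on a table; column 1 is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('product')")]
print(index_names)
```

No CREATE INDEX statement was issued, yet the table already has one index, created by SQLite to enforce and speed up lookups on the primary key.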

Key Differences

Level of Use
- Entity Identifier: Used at the conceptual modeling stage of designing a database. It is more abstract and focuses on data modeling and understanding the relationships between entities.
- Primary Key: Used at the physical database implementation stage. It is a concrete part of the database structure.

Implementation
- Entity Identifier: May be composed of one or more attributes that are chosen during the modeling phase.
- Primary Key: Must be explicitly defined in the database schema and adheres to the rules of the specific database system, such as being unique and non-null.

Example

In an ERD, the Customer entity might have CustomerID as the entity identifier.
When implementing this in a relational database, CustomerID would be set as the primary key of the Customer table, ensuring that each customer record is uniquely identified in the table.
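
The step from identifier to implemented constraint can be sketched with Python's built-in sqlite3 module. The `FirstName` and `LastName` columns and the `try_insert` helper are assumptions added for illustration; only `Customer` and `CustomerID` come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer ("
    " CustomerID INTEGER PRIMARY KEY,"
    " FirstName TEXT,"
    " LastName TEXT)"
)
conn.execute("INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace')")

def try_insert(conn, customer_id, first, last):
    """Return True if the row was stored, False if the key constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO Customer VALUES (?, ?, ?)", (customer_id, first, last)
        )
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert(conn, 1, "Alan", "Turing")  # reuses CustomerID 1
new_accepted = try_insert(conn, 2, "Alan", "Turing")        # fresh CustomerID
print(duplicate_accepted, new_accepted)
```

The database itself refuses the second row with `CustomerID` 1, which is what distinguishes an enforced primary key from an identifier that exists only in a diagram.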

In summary, the entity identifier is a modeling concept used to uniquely identify entities, while the primary key is the implemented constraint in the database table that enforces uniqueness and integrity.

What makes a good or bad Primary Key in a Database?

For a data modeler, understanding the characteristics of a good primary key in a relational database is crucial for ensuring data integrity and optimizing database performance. A primary key is a unique identifier for each record in a database table, and its selection can greatly influence the efficiency and reliability of data retrieval and storage. Here are some considerations for what makes a good or bad primary key:

Characteristics of a Good Primary Key:

Uniqueness: The most fundamental characteristic of a good primary key is that it must uniquely identify each record in the table. There should be no two records with the same primary key value.
Non-Nullability: A primary key must always have a value; it cannot be NULL. This ensures that every record can be uniquely identified.
Stability and Invariance: Ideally, the value of a primary key should not change over time. Changing primary key values can lead to complex updates and potential data integrity issues.
Simplicity: A simple and concise primary key is generally preferable. Single-column keys are often more efficient than composite keys (keys composed of multiple columns).
Sequential: If possible, using a sequential primary key (like an auto-increment integer) can be beneficial for indexing and performance, particularly in large databases.
Minimal: The primary key should be as small as possible in terms of data type size. Smaller keys take up less space, improving indexing and search performance.

Characteristics of a Bad Primary Key:

Non-unique Values: A primary key that fails to uniquely identify records is fundamentally flawed and can lead to significant data integrity problems.
Use of Personal or Sensitive Information: Using sensitive data like Social Security Numbers or email addresses can pose security risks and might lead to privacy concerns.
Complexity: Overly complex primary keys, such as composite keys with many columns, can complicate queries and degrade performance.
Mutable Fields: Using fields that are likely to change (e.g., a person's name) can lead to the need for cascading updates, which can be error-prone and inefficient.
Large or Unwieldy Data Types: Large data types, such as lengthy strings or binary data, can lead to inefficiencies in indexing and slow down query performance.
Meaningful Data: While sometimes tempting, using meaningful data (like an employee ID composed of department codes and job titles) can create issues if the inherent meaning changes over time.

In summary, a good primary key is unique, stable, and simple, and ideally sequential and minimal in size. Primary keys that are non-unique, complex, mutable, or large, or that carry meaningful or sensitive information, are generally not advisable. The choice of a primary key should be guided by these principles, alongside the specific requirements and context of the database system.
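
The cost of a mutable key can be made concrete with a small comparison. The sketch below uses Python's built-in sqlite3 module and two hypothetical designs: in design A an email address is the primary key, and in design B a generated integer is the key while the email is an ordinary column. All table and column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Design A: a mutable, meaningful value (email) is the primary key.
    CREATE TABLE member_a (email TEXT PRIMARY KEY NOT NULL);
    CREATE TABLE loan_a (
        loan_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL REFERENCES member_a(email) ON UPDATE CASCADE
    );
    INSERT INTO member_a VALUES ('ada@example.com');
    INSERT INTO loan_a (email) VALUES ('ada@example.com'), ('ada@example.com');

    -- Design B: a generated integer is the key; email is an ordinary column.
    CREATE TABLE member_b (member_id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
    CREATE TABLE loan_b (
        loan_id   INTEGER PRIMARY KEY,
        member_id INTEGER NOT NULL REFERENCES member_b(member_id)
    );
    INSERT INTO member_b (email) VALUES ('ada@example.com');
    INSERT INTO loan_b (member_id) VALUES (1), (1);
""")

new_email = "ada@example.org"
conn.execute("UPDATE member_a SET email = ?", (new_email,))
conn.execute("UPDATE member_b SET email = ?", (new_email,))

# Design A: the key change had to be propagated into every referencing row.
rewritten_loans_a = conn.execute(
    "SELECT COUNT(*) FROM loan_a WHERE email = ?", (new_email,)
).fetchone()[0]
# Design B: the loans still point at member_id 1 and were never touched.
loan_keys_b = [r[0] for r in conn.execute("SELECT member_id FROM loan_b ORDER BY loan_id")]
print(rewritten_loans_a, loan_keys_b)
```

In design A a single change of address rewrites every related loan row; in design B only the one member row changes, because the key itself never moves.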

Rules to observe when designating a Primary Key

Along with being unique, a primary key must not contain the value null. Null is a special database value meaning "unknown". It is not the same as a zero or a blank. If only one row has a null primary key, the key values in the table are still distinct. However, the minute you introduce a second one, you have lost the property of uniqueness. We therefore forbid the presence of nulls in any primary key columns. This constraint, known as entity integrity, will be enforced by a DBMS whenever data are entered or modified.
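
A short sketch of entity integrity in practice, using Python's built-in sqlite3 module and a hypothetical `part` table. One caveat specific to this sketch: SQLite, for historical reasons, accepts nulls in a non-integer primary key column unless the column is also declared NOT NULL, so the declaration is spelled out here. Most other database systems reject the null without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Comparing null with null yields null ("unknown"), not true,
# which is why null cannot serve as an identifying value.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

conn.execute("CREATE TABLE part (part_no TEXT PRIMARY KEY NOT NULL, descr TEXT)")
try:
    conn.execute("INSERT INTO part VALUES (NULL, 'widget')")
    null_key_rejected = False
except sqlite3.IntegrityError:
    null_key_rejected = True

print(null_equals_null, null_key_rejected)
```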
Identification of primary keys is an important part of entity type identification. Ideally, primary keys should be

stable and
single purpose.

Stable means that a primary key should never change after it has been assigned to an entity.
Single purpose means that a primary key attribute should have no purpose other than entity identification. Typically, good choices for primary keys are integer values automatically generated by a DBMS. For example, Microsoft Access provides the AutoNumber data type, and Oracle provides sequences for generating key values.

As noted in the preceding lesson, tables consist of

key columns and
descriptor columns.

Descriptor columns contain information about the person, place, or thing described by a particular row in the table.
Key columns contain information that distinguishes each row from every other row in the table. A key column (or set of key columns) that distinguishes a record from every other record in a table is referred to as a primary key. Most database designers mark primary-key fields in an ER diagram with an asterisk (*). Some relational database management systems underline the key fields or write the field's name in bold-face type.

Additional Guidelines for Primary Keys: A primary key should never be a value that is also used outside the database, such as a
1. Social Security number (Social Insurance Number in Canada) or
2. license-plate number.
Social Security numbers are especially bad choices because they can be incorrectly assigned or changed, and not everyone has one. Furthermore, U.S. law restricts when Social Security numbers can be collected and used. A better practice is to create a unique identifier to represent each row in a table and not try to store meaningful data in the primary-key column.
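
A generated identifier of this kind can be sketched with Python's built-in sqlite3 module, where an INTEGER PRIMARY KEY column is filled in automatically when no value is supplied. The `employee` table and its columns are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)

ids = []
# Two employees share a name; the generated key still tells them apart.
for name in ("Ada Lovelace", "Grace Hopper", "Ada Lovelace"):
    cur = conn.execute("INSERT INTO employee (full_name) VALUES (?)", (name,))
    ids.append(cur.lastrowid)  # key value the database assigned to this row

print(ids)
```

The key carries no meaning outside the database, so nothing in the real world can force it to change.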


Description of Candidate Key

A candidate key (also called a potential or permissible key) is a column or combination of columns that can act as the primary key for a table because it uniquely identifies each record in the table. Figure 2-11 shows five variations of one table, each with a valid primary key; some of the keys consist of one field and others of more than one field.

Figure 2-11 is an entity-relationship diagram (ERD) representing variations of a `Customer` table in a relational database system. The diagram shows different candidate keys for the `Customer` table, highlighting different ways of organizing attributes.
**Analysis of the `Customer` Table Structures**
Each `Customer` table in the diagram contains the following attributes:

**customer_id**: A unique identifier for the customer.

**customer**: Likely representing the customer's name or identifier.

**currency_code**: A code representing the currency the customer uses.

**currency**: The full name of the currency.

**exchange_rate**: The exchange rate associated with the customer's currency.

**address**: The customer's address.

Variations in the Candidate Keys:
The diagram shows different possible configurations of candidate keys (potential primary keys). Here are the main differences:

**Candidate Key: `customer_id` (Top-left)**

`customer_id` serves as the unique key.

All attributes remain in one table without redundancy.

**Candidate Key: `address` (Top-center)**

`address` is the primary key.

The other attributes (`customer_id`, `currency_code`, etc.) are included as dependent fields.

**Candidate Key: `customer` (Top-right)**

The `customer` field serves as the primary key.

The remaining attributes rely on `customer`.

**Candidate Key: `currency` (Bottom-left)**

`currency` is the primary key.

`customer_id` and `address` are treated as related attributes.

**Candidate Key: `currency_code` (Bottom-right)**

`currency_code` is the primary key.

Other attributes (`customer_id`, `currency`, `exchange_rate`, `address`) depend on this key.

**Conclusion**

The diagram showcases various normalization strategies for defining unique constraints on the `Customer` table.

Different candidate keys impact how data redundancy and integrity are managed.

Some variations (such as using `currency_code` or `address` as the primary key) may introduce redundancy and are likely not optimal for normalization.

Figure 2-11: A table with five and possibly more candidate keys
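
One practical way to screen candidate keys is to test whether a set of columns is unique over the rows you already have. The sketch below is plain Python; the `is_unique_over` helper and the sample rows are invented for illustration, with column names borrowed from the `Customer` table described above. Passing the test on sample data does not prove that a column set is a candidate key, since future rows may collide, but failing it rules the column set out.

```python
def is_unique_over(rows, columns):
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical sample data, not taken from Figure 2-11.
rows = [
    {"customer_id": 1, "customer": "Acme", "currency_code": "USD", "address": "1 Main St"},
    {"customer_id": 2, "customer": "Globex", "currency_code": "USD", "address": "9 High St"},
    {"customer_id": 3, "customer": "Initech", "currency_code": "EUR", "address": "1 Main St"},
]

for cols in (["customer_id"], ["currency_code"], ["address"], ["currency_code", "address"]):
    print(cols, is_unique_over(rows, cols))
```

In this made-up sample, `customer_id` alone is unique, `currency_code` and `address` alone are not, and the two together happen to be unique, which shows how a combination of fields can qualify where the individual fields do not.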

Candidate Key Example: A car's license-plate number is one example of a candidate key.
1. Within a given state, only one car can have a particular license number, so a table describing CAR might contain a column labeled PlateNo.
2. In a nationwide database of license-plate numbers, where the same license plate number could be assigned to cars in different states, the 1) state and 2) PlateNo key columns would form a candidate key.
The problem with using a license-plate number to identify a car is that
1. the car's owner can request new plates with a chosen message or
2. the state may change its numbering system to accommodate more cars.
In either case, the license-plate number associated with the car would change. Changing key values in a table can cause a great deal of confusion, especially if an update is done incorrectly and the wrong value is assigned to an item.
When you design a table, look for candidate keys with the following two properties:
1. The value in the key column never changes.
2. The key column cannot contain a null value.
If the candidate key meets these two additional requirements, it can be considered a primary key.
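
The license-plate example can be sketched with Python's built-in sqlite3 module. Here the state and plate number together are declared unique, as a candidate key would be, while a generated `CarID` serves as the primary key. `CarID` and `State` are hypothetical column names added for this sketch; the design choice of a generated key is one option, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Car (
        CarID   INTEGER PRIMARY KEY,   -- generated, never changes
        State   TEXT NOT NULL,
        PlateNo TEXT NOT NULL,
        UNIQUE (State, PlateNo)        -- candidate key: unique but changeable
    )
""")
conn.execute("INSERT INTO Car (State, PlateNo) VALUES ('CA', 'ABC123')")
# The same plate number in another state is a different car and is accepted.
conn.execute("INSERT INTO Car (State, PlateNo) VALUES ('NV', 'ABC123')")

try:
    conn.execute("INSERT INTO Car (State, PlateNo) VALUES ('CA', 'ABC123')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The owner orders new plates: the plate changes, the primary key does not.
conn.execute("UPDATE Car SET PlateNo = 'MYCAR' WHERE CarID = 1")
car_after = conn.execute(
    "SELECT CarID, State, PlateNo FROM Car WHERE CarID = 1"
).fetchone()
print(duplicate_rejected, car_after)
```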
The next lesson defines concatenated primary keys[1].

[1] concatenated primary key: A concatenated primary key is a database table key made up of more than one field.