A database primary key is a column or combination of columns with a value that uniquely identifies each row in a table.
As long as you have unique primary keys, you also have unique rows, which makes it possible to uniquely identify every row in a table. Question: Why is this so important?
The issue is the same as with entity identifiers, which is you want to be able to retrieve every single piece of data you have inserted into a database. As far as a relational database is concerned, you should need only
three pieces of information to retrieve any specific bit of data:
the name of the table,
the name of the column, and
the primary key of the row.
If primary keys are unique for every row, then we can be sure that we are retrieving exactly the row we want.
If they are not unique, then we are retrieving only some row with the primary key value, which may not be the row that contains the data for which we are searching.
What makes a good or bad primary key in a Database?
A primary key in a database is a unique identifier for each record in a table. It helps to ensure data consistency and integrity. The choice of a primary key can have a significant impact on database performance and ease of use. Here are some characteristics of a good primary key and some pitfalls to avoid when choosing a primary key:
Good primary key characteristics:
Uniqueness: Each value in the primary key column should be unique to ensure that each record can be uniquely identified.
Non-null: Primary key values should not be null or empty, as this would hinder their ability to uniquely identify records.
Stability: A good primary key should have stable values that do not change over time. Changing primary key values can lead to data integrity issues and require updates to foreign key references in related tables.
Simplicity: Ideally, a primary key should be as simple as possible. Single-column primary keys are often preferred, but sometimes a combination of columns (composite key) is necessary to ensure uniqueness.
Minimal size: Smaller primary key values, such as integers or short strings, are generally more efficient for indexing and querying. Large or complex primary key values can slow down database performance.
Bad primary key characteristics:
Meaningful data: Using a column with meaningful data (e.g., Social Security numbers or email addresses) as the primary key can be problematic. This data might change over time or may not be unique, which can lead to integrity issues.
Large or complex data types: Using large or complex data types (e.g., long strings, large numbers, or binary objects) as primary keys can negatively impact database performance due to increased storage requirements and slower indexing.
Sequentially generated values with predictable patterns: While auto-incrementing integers can be convenient primary keys, they may not be suitable for all scenarios, especially when security or privacy is a concern, or when distributed databases are involved.
Composite keys with too many columns: While composite keys can help ensure uniqueness, using too many columns can complicate queries and negatively impact performance.
Duplicates or null values: A bad primary key would allow duplicate or null values, making it difficult to uniquely identify records and maintain data integrity.
When designing a database schema, carefully consider the choice of primary keys for each table, keeping in mind the characteristics of a good primary key and the pitfalls to avoid. In some cases, using a surrogate key (an artificial, system-generated identifier) may be the best option, as it ensures uniqueness, stability, and simplicity while separating the primary key from any meaningful data.
Rules to observe when designating a Primary Key
Along with being unique, a primary key must not contain the value null. Null is a special database value meaning "unknown". It is not the same as a zero or a blank. If you have one row with a null primary key, then you that particular row will not contain any data. However, the minute you introduce a second one, you have lost the property of uniqueness. We therefore forbid the presence of nulls in any primary key columns. This constraint, known as entity integrity, will be enforced by a DBMS whenever data are entered or modified.
Identification of primary keys is an important part of entity type identification.
Ideally, primary keys should be
stable and
single purpose.
Stable means that a primary key should never change after it has been assigned to an entity.
Single purpose means that a primary key attribute should have no purpose other than entity identification. Typically, good choices for primary keys are integer values automatically generated by a DBMS. Access has the AutoNumber data type for primary keys and Oracle/PLSQL uses the Sequences (Autonumber) for primary keys. As noted in the preceding lesson, tables consist of
key columns and
descriptor columns.
Descriptor columns contain information about the person, place, or thing described by a particular row in the table. Key columns contain information that distinguishes elements from every other item represented in the table. A key column (or set of key columns) that distinguishes a record from every other record in a table is referred to as a primary key.
Most database designers mark primary-key fields in an ER diagram with an asterisk (*). Some relational database management systems underline the key fields or write the field's name in bold-face type.
Additional Guidelines for Primary Keys
A primary key should never be a value that is also used outside the database, such as a
Social Security number (Social Insurance Number in Canada) or
license-plate number.
Social Security numbers are especially bad choices because they can be recycled, incorrectly assigned, or changed. Furthermore, the U.S. government strictly limits when Social Security numbers can be collected and used. A better practice is to create a unique identifier to represent each row in a table and not try to store meaningful data in the primary-key column.
Description of Candidate Key
A candidate key (potential or permissible key) is a field or combination of fields that can act as a primary key field for a table,
thus uniquely identifying each record in the table. A column or (set of columns) that can be used as a primary key for a table is known as a candidate key.
Figure 2-11 shows five different variations of one table, all of which have valid primary keys, both of one field and more than one field.
Figure 2-11: A table with five and possibly more candidate keys
Candidate Key Example
A car's license-plate number is one example of a candidate key.
Within a given state, only one car can have a particular license number, so a table describing CAR might contain a column labeled PlateNo.
In a nationwide database of license-plate numbers, where the same license plate number could be assigned to cars in different states,
the 1) state and 2) PlateNo key columns would form a candidate key.
The problem with using a license-plate number to identify a car is that
the car's owner can request new plates with a chosen message or
the state may change their numbering system to accommodate more cars.
In either case, the license-plate number associated with the car would change.
Changing key values in a table can cause a great deal of confusion, especially if an update is done incorrectly and the wrong value is assigned to an item. When you design a table, look for candidate keys with the following two properties:
The value in the key column can not change.
The key column cannot contain a null value.
If the candidate key meets these two additional requirements, it can be considered a primary key.
The next lesson defines concatenated primary keys[1] .
[1]concatenated primary key: A concatenated primary key is a database table key made up of more than one field.