| Lesson 7 | Indexes and engines |
| Objective | Describe how Indexes are used by the Database Engine. |
A common assumption among developers new to database performance is that creating an index guarantees it will be used. This is incorrect. Whether an index is used for a given query depends on three conditions, all of which must be true: current statistics must be available, a suitable index must exist for the query's access pattern, and the optimizer must estimate that using the index costs less than a full table scan.
If any of these conditions is not met, the optimizer chooses a full table scan. The index may exist, may be correctly defined, and may be perfectly suited to the query — but if statistics are stale or the query returns a large fraction of the table's rows, the optimizer still chooses the scan.
Modern relational databases use a cost-based optimizer — a component of the query engine that estimates the execution cost of candidate query plans and selects the lowest-cost plan. Cost is measured in units of I/O operations, CPU cycles, and memory usage, weighted according to the engine's configuration. For each candidate plan, the optimizer estimates how many rows will be processed, how many pages will be read from storage, and how much memory the operation requires.
The optimizer does not execute candidate plans to measure their cost — it estimates cost using statistics stored in the database's system catalog. These statistics include the number of rows in each table, the number of distinct values in each column (cardinality), and histogram data showing how values are distributed. The quality of the optimizer's decisions is directly proportional to the accuracy of these statistics.
Selectivity is the fraction of rows that match a given predicate. A highly selective predicate matches a small fraction of rows — WHERE email = 'gauss@example.com' on a users table with one million rows might match exactly one row, a selectivity of 0.0001%. A low-selectivity predicate matches a large fraction — WHERE status = 'ACTIVE' on the same table might match 80% of rows.
The optimizer strongly favors index access for highly selective queries and strongly favors full table scans for low-selectivity queries. The reason is physical I/O: an index lookup that finds one matching row reads only a handful of pages — the index pages traversed during B-tree navigation plus one data page for the matching row. A low-selectivity query that matches 80% of rows must read 80% of the table's data pages regardless — at that point, a sequential scan of all pages is often faster than the overhead of consulting the index for each of the 800,000 matching rows.
There is a threshold — typically somewhere between 5% and 30% of the table's rows, depending on the engine, the storage system, and the table's physical organization — below which the optimizer prefers index access and above which it prefers a full scan. This threshold is not fixed; it is recalculated for each query based on the current statistics.
A practical implication: an index on a column with low cardinality — a boolean column, a status column with three possible values, a gender column — will rarely be used by the optimizer, because even a query for one specific value is likely to match enough rows to push the selectivity above the tipping point. High-cardinality columns — email addresses, unique identifiers, timestamps — are the best index candidates because their predicates are naturally selective.
-- Triggers an index unique scan on ContactID (primary key)
SELECT Firstname, Lastname
FROM Contacts
WHERE ContactID = 4;
An index range scan occurs when the predicate specifies a range of values — using BETWEEN, greater-than, less-than, or LIKE with a leading non-wildcard prefix. The engine traverses the B-tree to the first matching entry and then reads forward through the sorted index entries until the range is exhausted.
-- Triggers an index range scan on idx_contacts_lastname
SELECT Firstname, Lastname
FROM Contacts
WHERE Lastname BETWEEN 'G' AND 'K';
-- Also a range scan — LIKE with leading literal
SELECT Firstname, Lastname
FROM Contacts
WHERE Lastname LIKE 'H%';
Note that WHERE Lastname LIKE '%ilbert' — a leading wildcard — cannot use the index, because the index is sorted by the first character of Lastname. A leading wildcard makes the predicate non-sargable (not Search ARGument ABLE), forcing a full table scan.
A full index scan reads all entries in the index in sorted order. It is used when the query needs all or most rows from the table but requires them in the index's sort order — for example, an ORDER BY on an indexed column with no selective WHERE clause. A full index scan is faster than a full table scan plus sort when the table's data pages are not in memory, because the index pages are smaller and may be cached more efficiently.
A full table scan reads every data page in the table sequentially. The optimizer chooses a full table scan when: no suitable index exists; the query is not selective enough for index access to be faster; the table is small enough that the index overhead exceeds the scan cost; or the statistics indicate that the index is not selective for the specific values in the query predicate.
A full table scan is not always bad. On a small reference table with a few hundred rows, a sequential scan is faster than B-tree traversal. On a query that genuinely needs most of the table's rows, sequential I/O through the data pages is more efficient than the random I/O pattern of an index lookup followed by individual row fetches. The optimizer chooses full table scans when they are the correct choice — understanding when and why this happens is more useful than assuming every full scan is a performance problem.
SELECT Lastname
FROM BasicTable;
When you add ORDER BY, you are asking the engine to return rows in a defined sequence. If an index exists on the ORDER BY column, the optimizer may be able to satisfy the sort requirement by reading the index entries in key order rather than reading rows in physical storage order and sorting them afterward. Avoiding the sort operation reduces CPU and memory usage — particularly significant for large result sets.
-- With an index on Lastname, the engine may use ordered index access
-- to avoid an explicit sort — the result arrives in sorted order
-- directly from the index traversal
SELECT Lastname
FROM BasicTable
ORDER BY Lastname;
The SQL:2023 standard continues the rule that ordering is only defined when ORDER BY is present. Even with an index available, the optimizer may still choose a full scan plus in-memory sort if it estimates that most rows will be returned, or if other aspects of the execution plan make the ordered index access path less efficient. ORDER BY allows the optimizer to consider ordered index access — it does not guarantee it.
SELECT Firstname, Lastname
FROM Contacts
WHERE Lastname = 'Hilbert';
If an index exists on (Lastname, Firstname), this index covers the query: Lastname is the filter column (in the WHERE clause) and Firstname is the output column (in the SELECT clause). Both columns are in the index. The engine finds the matching Lastname entries in the index, reads the Firstname values from the same index entries, and returns the result without touching the Contacts base table data pages.
To design a covering index, identify all columns referenced in the query's WHERE, SELECT, ORDER BY, and JOIN clauses. Include the most selective filter columns first (they are the leading index columns), followed by the additional columns needed to cover the SELECT list. For the query above, the index (Lastname, Firstname) covers the query efficiently — Lastname filters, Firstname is returned.
Covering indexes trade index size for query speed. A covering index that includes many columns is larger than a single-column index, requiring more storage and more memory in the buffer pool. The trade-off is worthwhile for queries that run frequently and return small result sets from large tables.
Statistics reflect the state of the table at the time they were last collected. After large data loads, bulk deletes, or significant updates to indexed column values, the statistics may no longer accurately represent the current data distribution. Stale statistics cause the optimizer to make decisions based on the old data state — it may underestimate or overestimate the selectivity of a predicate, leading to wrong access path choices: using a full scan when an index would be faster, or using an index when a full scan would be faster.
Most database systems update statistics automatically in the background — PostgreSQL's autovacuum, MySQL InnoDB's background statistics collection, SQL Server's auto update statistics. After major data changes, manually triggering a statistics refresh ensures the optimizer has current information:
-- PostgreSQL
ANALYZE Contacts;
-- MySQL
ANALYZE TABLE Contacts;
-- SQL Server
UPDATE STATISTICS Contacts;
Index Scan means the optimizer used B-tree index navigation followed by row fetches; Seq Scan means a full sequential table scan; Index Only Scan means a covering index was used and the base table was never accessed.
-- PostgreSQL EXPLAIN output for a selective query on an indexed column
EXPLAIN SELECT Firstname, Lastname
FROM Contacts
WHERE Lastname = 'Hilbert';
-- Expected output with idx_contacts_lastname in place:
-- Index Scan using idx_contacts_lastname on contacts
-- Index Cond: ((lastname)::text = 'Hilbert'::text)
-- Expected output WITHOUT the index:
-- Seq Scan on contacts
-- Filter: ((lastname)::text = 'Hilbert'::text)
-- EXPLAIN ANALYZE adds actual execution time and row counts
EXPLAIN ANALYZE SELECT Firstname, Lastname
FROM Contacts
WHERE Lastname = 'Hilbert';
Running EXPLAIN before and after creating an index confirms whether the optimizer is using it. If EXPLAIN shows Seq Scan on a column you have indexed, the optimizer has determined that the full scan is cheaper — most likely because the query is not selective enough, the statistics are stale, or the table is small enough that the index overhead is not worthwhile.
WHERE Lastname = 'Hilbert' or WHERE ContactID = 4 returns a small fraction of the table's rows — exactly the condition under which index access is faster than a full scan. The more selective the predicate, the greater the performance benefit.
ON Orders.ContactID = Contacts.ContactID, an index on Contacts.ContactID allows the engine to look up each ContactID from the Orders table in the Contacts index rather than scanning the entire Contacts table for each Orders row. Without the index, the join requires O(n×m) comparisons; with it, O(n log m).