When you cluster data, you store frequently combined records from different tables as
close together on the computer's hard drive as possible.
The downside to clustering records from different tables is that records from the
same table are no longer stored close to each other on the disk, which affects performance:
- Clustering slows down operations that require the RDBMS to search an entire table.
- Clustering may slow down data insertion because the RDBMS must search the disk for the last record in the table.
- Clustering may increase the time needed to update data in the fields used to define the cluster because the RDBMS must determine again which records to include in the cluster.
The slowest part of a DBMS's actions is retrieving data from or writing data to a disk. If you can cut down on the number of times the DBMS must read from or write to a disk,
you can speed up overall database performance. The trick to doing this is understanding that a database must retrieve an entire disk page of data at one time. The size of a page varies from one computing platform to another and it can be anywhere from 1024 bytes to 8 K, with 1 K being typical on a PC. Data always travel to and from disk in page-sized units. Therefore, if you store data that are often accessed together on the same disk page
(or pages that are physically close together), you can speed up data access. This process is known as
clustering[1] and is available with many large DBMSs (for example, Oracle).
In practice, a cluster is designed to keep together rows related by matching primary and foreign keys.
To define the cluster, you specify a column or columns on which the DBMS should form the cluster and the tables that should be included. Then, all of the rows that share the same value of the column or columns on which the cluster is based are stored as physically close together as possible. As a result, the rows in a table may be scattered across several disk pages, but matching primary and foreign keys are usually on the same disk page. Clustering can significantly speed up join performance. However, just as with indexes, there are some trade-offs to consider when contemplating creating clusters:
[1]
clustering: The term clustering has another meaning in the SQL standard. It refers to groups of catalogs (which in turn are groups of schemas) manipulated by the same DBMS. The use of the term in this section, however, is totally distinct from the SQL meaning.