Physical Design   «Prev  Next»
Lesson 5 Clustering
ObjectiveExplain the purpose of Clustering

Database Clustering

Earlier in this module, you learned that a computer stores data in blocks. When all of the data from a single file is stored in contiguous blocks, the computer can retrieve the information without moving around the disk, thus saving time and speeding retrieval.
Reading data from and writing data to a disk are the slowest operations involved in using a database.
Improve Database Performance using Clustering
If the data is in RAM, access is significantly faster.
Storing related data in contiguous disk blocks is called clustering. Databases might require different clustering strategies from other types of files. Remember that a document created with a word processor, for example, is a single indivisible entity retrieved most efficiently when it is stored in contiguous blocks.

A document created with a word processor is a single indivisible entity retrieved most efficiently when it is stored in contiguous blocks
The same would be true of database tables if you always used tables as stand-alone items and never combined them. Since you join database tables frequently, clustering records based on the values in the columns used to create joins may improve database performance.

An example of clustering

Because joins are usually based on fields that are primary keys in one table and foreign keys in another table, you can often speed up joins by storing those fields in contiguous disk blocks.
Consider the following two tables 1) CD and 2) Distributor tables:

Tables 1) CD and 2) Distributor for clustering
You determined during Requirements Analysis that your users will need to join these tables frequently, with those joins based on the DistID field (the primary key field for the Distributor table and a foreign key field in the CD table). You do not have room to store every row in both tables in a single data block, so you decide to cluster the entire CD table and the first two rows of the Distributor table (for the distributors with DistID's 101 and 102) into a data block.
The third distributor (DistID 103) does not occur in the table, so it is excluded from the cluster.
The next lesson discusses the performance trade-off inherent to clustering.

Exclude the third distributor (DistID 103) from the cluster