Lesson 6	Downside to Clustering
Objective	Describe the trade-offs which occur when clustering Data

Data Clustering Trade Off

When you cluster data, you store frequently combined records from different tables as close together on the computer's hard drive as possible. The downside to clustering records from different tables is that records from the same table are no longer stored close to each other on the disk, which affects performance:

Clustering slows down operations that require the RDBMS to search an entire table.
Clustering may slow down data insertion because the RDBMS must search the disk for the last record in the table.
Clustering may increase the time needed to update data in the fields used to define the cluster because the RDBMS must determine again which records to include in the cluster.

Several Trade-offs that need to be considered

Clustering data, which involves storing frequently combined records from different tables as close together as possible on a computer's hard drive, brings several trade-offs that need to be carefully considered:

Improved Performance for Specific Queries: When data is clustered, related records are placed physically close to each other. This results in faster data retrieval for certain types of queries, particularly those that access these related records frequently. The reduced need to move the disk's read/write heads significantly speeds up data access.
Increased Maintenance Overhead: Clustering alters the physical layout of the data. As a result, there is an increased overhead in maintaining this structure, especially when inserting, updating, or deleting records. The database may need to frequently reorganize the data to maintain the clustering, which can be resource-intensive.
Optimized for Specific Access Patterns: Clustering is typically designed around specific access patterns. This means that while it significantly improves performance for some queries, it may not offer benefits for others. In some cases, it might even degrade performance for queries that do not align with the clustering strategy.
Increased Complexity in Database Design: Implementing a clustering strategy adds complexity to the database design. Decisions about which tables and columns to cluster require a deep understanding of the data access patterns and can significantly impact the database's performance.
Potential for Uneven Data Distribution: Clustering can lead to uneven distribution of data across the hard drive. This can result in some areas of the disk being heavily used while others are underutilized. Over time, this might lead to issues like disk fragmentation.
Limited Flexibility: Once a clustering strategy is implemented, it can be challenging to modify or change it without significant reorganization of the data. This can make it difficult to adapt to changing data access patterns over time.
Storage Space Considerations: Clustering can sometimes require additional storage space. This is because it may lead to duplication of data or the need for additional data structures to maintain the clustered organization.

In conclusion, while clustering can offer significant performance benefits for certain queries, it comes with trade-offs in terms of maintenance overhead, complexity, and flexibility. These factors must be carefully weighed against the potential performance gains when deciding whether to implement a clustering strategy in a database system.

Clustering Purpose

The slowest part of a DBMS's actions is retrieving data from or writing data to a disk. If you can cut down on the number of times the DBMS must read from or write to a disk, you can speed up overall database performance. The trick to doing this is understanding that a database must retrieve an entire disk page of data at one time. The size of a page varies from one computing platform to another and it can be anywhere from 1024 bytes to 8 K, with 1 K being typical on a PC. Data always travel to and from disk in page-sized units. Therefore, if you store data that are often accessed together on the same disk page (or pages that are physically close together), you can speed up data access. This process is known as clustering^[1] and is available with many large DBMSs (for example, Oracle).
In practice, a cluster is designed to keep together rows related by matching primary and foreign keys. To define the cluster, you specify a column or columns on which the DBMS should form the cluster and the tables that should be included. Then, all of the rows that share the same value of the column or columns on which the cluster is based are stored as physically close together as possible. As a result, the rows in a table may be scattered across several disk pages, but matching primary and foreign keys are usually on the same disk page. Clustering can significantly speed up join performance. However, just as with indexes, there are some trade-offs to consider when contemplating creating clusters:

Because clustering involves physical placement of data in a file, a table can be clustered on only one column or combination of columns.
Clustering can slow down performance of operations that require a scan of the entire table because clustering may mean that the rows of any given table are scattered throughout many disk pages.
Clustering can slow down insertion of data.
Clustering can slow down modifying data in the columns on which the clustering is based.

The next lesson introduces partitioning.

Clustering Disadvantages - Quiz

Before you move on to the next lesson, click the Quiz link below to reinforce your understanding of clustering.
Clustering Disadvantages - Quiz

[1] clustering: The term clustering has another meaning in the SQL standard. It refers to groups of catalogs (which in turn are groups of schemas) manipulated by the same DBMS. The use of the term in this section, however, is totally distinct from the SQL meaning.