Assess index design for performance

In this we will learn how to identify and implement index design for performance.

Index Design Basics

Think about a regular book: at the end of the book there is an index which helps to quickly locate information within the book. The index is a sorted list of keywords and next to each keyword is a set of page numbers pointing to the pages where each keyword can be found. A SQL Server index is no different: it is an ordered list of values and for each value there are pointers to the data pages where these values are located. The index itself is stored on pages, making up the Index Pages in SQL Server. In a regular book, if the index spans multiple pages and you have to find pointers to all the pages that contain the word “SQL”. For example, you would have to leaf through until you locate the index page that contains the keyword “SQL”.

However, a SQL Server index is an on-disk or in-memory structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. For on-disk indexes, these keys are stored in a tree structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.

Index Design Tasks

The follow tasks make up our recommended strategy for designing indexes:

Firstly, understand the characteristics of the database itself.
- For example, is it an online transaction processing (OLTP) database with frequent data modifications that must sustain a high throughput. Starting with SQL Server 2014 (12.x), memory-optimized tables and indexes are especially appropriate for this scenario, by providing a latch-free design.
Secondly, understand the characteristics of the columns used in the queries. For example, an index is ideal for columns that have an integer data type and are also unique or nonnull columns. For columns that have well-defined subsets of data, you can use a filtered index in SQL Server 2008 and higher versions.
Thirdly, determine which index options might enhance performance when the index is created or maintained. For example, creating a clustered index on an existing large table would benefit from the ONLINE index option.
Lastly, determine the optimal storage location for the index. A nonclustered index can be stored in the same filegroup as the underlying table, or on a different filegroup. The storage location of indexes can improve query performance by increasing disk I/O performance. For example, storing a nonclustered index on a filegroup that is on a different disk than the table filegroup can improve performance because multiple disks can be read at the same time.

General Index Design Guidelines

Experienced database administrators can design a good set of indexes, but this task is very complex, time-consuming, and error-prone even for moderately complex databases and workloads. Understanding the characteristics of your database, queries, and data columns can help you design optimal indexes.

Database Considerations

When you design an index, consider the following database guidelines:

Firstly, large numbers of indexes on a table affect the performance of INSERT, UPDATE, DELETE, and MERGE statements because all indexes must be adjusted appropriately as data in the table changes.
Secondly, indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
Thirdly, indexes on views can provide significant performance gains when the view contains aggregations, table joins, or a combination of aggregations and joins.
Lastly, use the Database Engine Tuning Advisor to analyze your database and make index recommendations. For more information, see Database Engine Tuning Advisor.

Query Considerations

When you design an index, consider the following query guidelines:

Firstly, create nonclustered indexes on the columns that are frequently used in predicates and join conditions in queries. These are your SARGable1 columns.
Secondly, covering indexes can improve query performance because all the data needed to meet the requirements of the query exists within the index itself. That is, only the index pages, and not the data pages of the table or clustered index, are required to retrieve the requested data; therefore, reducing overall disk I/O.
Thirdly, write queries that insert or modify as many rows as possible in a single statement, instead of using multiple queries to update the same rows. By using only one statement, optimized index maintenance could be exploited.
Lastly, evaluate the query type and how columns are used in the query. For example, a column used in an exact-match query type would be a good candidate for a nonclustered or clustered index.

Clustered Index Design Guidelines

Clustered indexes sort and store the data rows in the table based on their key values. There can only be one clustered index per table, because the data rows themselves can only be sorted in one order. With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:

Firstly, can be used for frequently used queries.
Secondly, provide a high degree of uniqueness.
Lastly, can be used in range queries.

If the clustered index is not created with the UNIQUE property, the Database Engine automatically adds a 4-byte uniqueifier column to the table. When it is required, the Database Engine automatically adds a uniqueifier value to a row to make each key unique. This column and its values are used internally and cannot be seen or accessed by users.

Filtered Index Design Guidelines

A filtered index is an optimized nonclustered index, especially suited to cover queries that select from a well-defined subset of data. It uses a filter predicate to index a portion of rows in the table. A well-designed filtered index can improve query performance, reduce index maintenance costs, and reduce index storage costs compared with full-table indexes.

Applies to: SQL Server 2008 through SQL Server 2019 (15.x).

Filtered indexes can provide the following advantages over full-table indexes:

Firstly, improved query performance and plan quality. Here, a well-designed filtered index improves query performance and execution plan quality because it is smaller than a full-table nonclustered index and has filtered statistics.
Secondly, reduced index maintenance costs. In this, an index is maintained only when data manipulation language (DML) statements affect the data in the index. A filtered index reduces index maintenance costs compared with a full-table nonclustered index because it is smaller and is only maintained when the data in the index is affected. However, it is possible to have a large number of filtered indexes, especially when they contain data that is affected infrequently.
Lastly, reduced index storage costs. Creating a filtered index can reduce disk storage for nonclustered indexes when a full-table index is not necessary. However, you can replace a full-table nonclustered index with multiple filtered indexes without significantly increasing the storage requirements.

Filtered indexes are useful when columns contain well-defined subsets of data that queries reference in SELECT statements. Examples are:

Firstly, sparse columns that contain only a few non-NULL values.
Secondly, heterogeneous columns that contain categories of data.
Thirdly, columns that contain ranges of values such as dollar amounts, time, and dates.
Lastly, table partitions that are defined by simple comparison logic for column values.

Design Considerations

In order to design effective filtered indexes, it is important to understand what queries your application uses and how they relate to subsets of your data. Some examples of data that have well-defined subsets are columns with mostly NULL values, columns with heterogeneous categories of values and columns with distinct ranges of values. However, the following design considerations give a variety of scenarios for when a filtered index can provide advantages over full-table indexes.

Filtered Indexes for subsets of data

When a column only has a small number of relevant values for queries, you can create a filtered index on the subset of values. For example, the AdventureWorks2012 database has a Production.BillOfMaterials table with 2679 rows. The EndDate column has only 199 rows that contain a non-NULL value and the other 2480 rows contain NULL. Howver, the following filtered index would cover queries that return the columns defined in the index and that select only rows with a non-NULL value for EndDate.

SQL
CREATE NONCLUSTERED INDEX FIBillOfMaterialsWithEndDate
ON Production.BillOfMaterials (ComponentID, StartDate)
WHERE EndDate IS NOT NULL ;
GO

The filtered index FIBillOfMaterialsWithEndDate is valid for the following query. You can display the query execution plan to determine if the query optimizer used the filtered index.

SQL
SELECT ProductAssemblyID, ComponentID, StartDate
FROM Production.BillOfMaterials
WHERE EndDate IS NOT NULL
AND ComponentID = 5
AND StartDate > ‘20080101’ ;

Filtered Indexes for heterogeneous data

When a table has heterogeneous data rows, you can create a filtered index for one or more categories of data.

For example, the products listed in the Production.Product table are each assigned to a ProductSubcategoryID, which are in turn associated with the product categories Bikes, Components, Clothing, or Accessories. These categories are heterogeneous because their column values in the Production.Product table are not closely correlated. For example, the columns Color, ReorderPoint, ListPrice, Weight, Class, and Style have unique characteristics for each product category. Suppose that there are frequent queries for accessories which have subcategories between 27 and 36 inclusive. Further, you can improve the performance of queries for accessories by creating a filtered index on the accessories subcategories as shown in the following example.

SQL
CREATE NONCLUSTERED INDEX FIProductAccessories
ON Production.Product (ProductSubcategoryID, ListPrice)
Include (Name)
WHERE ProductSubcategoryID >= 27 AND ProductSubcategoryID <= 36;

The filtered index FIProductAccessories covers the following query because the query results are contained in the index and the query plan does not include a base table lookup. For example, the query predicate expression ProductSubcategoryID = 33 is a subset of the filtered index predicate ProductSubcategoryID >= 27 and ProductSubcategoryID <= 36, the ProductSubcategoryID and ListPrice columns in the query predicate are both key columns in the index, and name is stored in the leaf level of the index as an included column.

SQL
SELECT Name, ProductSubcategoryID, ListPrice
FROM Production.Product
WHERE ProductSubcategoryID = 33 AND ListPrice > 25.00 ;

Hash Index Design Guidelines

All memory-optimized tables must have at least one index, because it is the indexes that connect the rows together. On a memory-optimized table, every index is also memory-optimized. Hash indexes are one of the possible index types in a memory-optimized table.

Hash Index Architecture

A hash index consists of an array of pointers, and each element of the array is called a hash bucket.

Firstly, each bucket is 8 bytes, which are used to store the memory address of a link list of key entries.
Secondly, each entry is a value for an index key, plus the address of its corresponding row in the underlying memory-optimized table.
Lastly, each entry points to the next entry in a link list of entries, all chained to the current bucket.

The number of buckets must be specified at index definition time:

Firstly, the lower the ratio of buckets to table rows or to distinct values, the longer the average bucket link list will be.
Secondly, short link lists perform faster than long link lists.
lastly, the maximum number of buckets in hash indexes is 1,073,741,824.

The hashing function used for hash indexes has the following characteristics:

Firstly, SQL Server has one hash function that is used for all hash indexes.
Secondly, the hash function is deterministic. The same input key value is always mapped to the same bucket in the hash index.
Thirdly, multiple index keys may be mapped to the same hash bucket.
Then, the hash function is balanced, meaning that the distribution of index key values over hash buckets typically follows a Poisson or bell curve distribution, not a flat linear distribution.
Next, poisson distribution is not an even distribution. Index key values are not evenly distributed in the hash buckets.
Lastly, if two index keys are mapped to the same hash bucket, there is a hash collision. A large number of hash collisions can have a performance impact on read operations. A realistic goal is for 30% of the buckets contain two different key values.

hekaton_tables_23d — Image Source: Microsoft

Memory-Optimized Nonclustered Index Design Guidelines

Nonclustered indexes are one of the possible index types in a memory-optimized table.

In-memory Nonclustered Index Architecture

In-memory nonclustered indexes are implemented using a data structure called a Bw-Tree, originally envisioned and described by Microsoft Research in 2011. A Bw-Tree is a lock and latch-free variation of a B-Tree.

However, at a very high level the Bw-Tree can be understood as a map of pages organized by page ID (PidMap), a facility to allocate and reuse page IDs (PidAlloc) and a set of pages linked in the page map and to each other. These three high level sub-components make up the basic internal structure of a Bw-Tree.

Just like hash indexes, multiple data rows can be linked together (versions). The page pointers between the levels are logical page IDs, which are offsets into a page mapping table, that in turn has the physical address for each page. Further, there are no in-place updates of index pages. New delta pages are introduced for this purpose.

Firstly, no latching or locking is required for page updates.
Secondly, index pages are not a fixed size.

Delta Consolidation

A long chain of delta records can eventually degrade search performance as it could mean we are traversing long chains when searching through an index. However, i a new delta record is added to a chain that already has 16 elements, the changes in the delta records will be consolidated into the referenced index page, and the page will then be rebuilt, including the changes indicated by the new delta record that triggered the consolidation. The newly rebuilt page will have the same page ID but a new memory address.

Split page

An index page in Bw-Tree grows on as-needed basis starting from storing a single row to storing a maximum of 8 KB. Once the index page grows to 8 KB. Then, a new insert of a single row will cause the index page to split. For an internal page, this means when there is no more room to add another key value and pointer. And for a leaf page, it means that the row would be too big to fit on the page once all the delta records are incorporated.

However, a Split operation is done in two atomic steps. In the picture below, assume a Leaf-page forces a split because a key with value 5 is being inserted. And a non-leaf page exists pointing to the end of the current Leaf-level page (key value 4).

Split (Identify and implement index changes for queries) — Image Source: Microsoft

Step 1:

Allocate two new pages P1 and P2, and split the rows from old P1 page onto these new pages. This also include the newly inserted row. A new slot in Page Mapping Table is used to store the physical address of page P2. These pages, P1 and P2 are not accessible to any concurrent operations yet. In addition, the logical pointer from P1 to P2 is set. Then, in one atomic step update the Page Mapping Table to change the pointer from old P1 to new P1.

Step 2:

The non-leaf page points to P1 but there is no direct pointer from a non-leaf page to P2. P2 is only reachable via P1. To create a pointer from a non-leaf page to P2, allocate a new non-leaf page (internal index page), copy all the rows from old non-leaf page, and add a new row to point to P2. Once this is done, in one atomic step. Then, update the Page Mapping Table to change the pointer from old non-leaf page to new non-leaf page.

Merge page

When a DELETE operation results in a page having less than 10% of the maximum page size (currently 8 KB), or with a single row on it, that page will be merged with a contiguous page.

When a row is deleted from a page, a delta record for the delete is added. Additionally, a check is made to determine if the index page (non-leaf page) qualifies for Merge. This check verifies if the remaining space after deleting the row will be less than 10% of maximum page size. If it does qualify, the Merge is performed in three atomic steps.

Identify and implement index changes for queries DP-300 online course

Reference: Microsoft Documentation

Go back to DP-300 Tutorials