DBMS: Indexes By Abdul Malik
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
In a relational database, an index is a copy of one part of a table. Some databases extend the power of indexing by allowing indexes to be created on functions or expressions. For example, an index could be created on upper(last_name), which would only store the upper case versions of the last_name field in the index. Another option sometimes supported is the use of "filtered" indexes, where index entries are created only for those records that satisfy some conditional expression. A further aspect of flexibility is to permit indexing on user-defined functions, as well as expressions formed from an assortment of built-in functions.
Indexes may be defined as unique or non-unique. A unique index acts as a constraint on the table by preventing duplicate entries in the index and thus, the backing table. Microsoft SQL Server allows a unique index to contain a single NULL value.
Index Architecture
Index architectures can be classified clustered or unclustered.
Unclustered
An unclustered index is structured in a way that doesn't correspond to the order of the actual data records. It resembles the words index at the back of a book. For this reason, it will perform worse than clustered indexes on ranged searches where the result set is large, since each result could cost an additional I/O-operation to get the actual data record. One can have any number of these kinds of indexes—all that is required is the space to store the index itself; one does not copy the actual data to create a new index.
Clustered
Clustering alters the data block into a certain distinct order to match the index, hence it is also an operation on the data storage blocks as well as on the index. An address book ordered by first name resembles a clustered index in its structure and purpose. The exact operation of database systems vary, but because storing data is very redundant the row data can only be stored in one order. Therefore, only one clustered index can be created on a given database table. Clustered indexes can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items are selected.
Since the physical records are in this sort order on disk[AM1] , the next row item in the sequence is immediately before or after the last one, and so fewer data block reads are required. The primary feature of a clustered index is therefore the ordering of the physical data rows in accordance with the index blocks that point to them. Some databases separate the data and index blocks into separate files, others put two completely different data blocks within the same physical file(s).
Index implementations
Indexes can be implemented using a variety of data structures. Popular indexes include balanced trees, B+ trees and hashes.
In Microsoft SQL Server, the leaf node of the clustered index corresponds to the actual data, not simply a pointer to data that resides elsewhere, as is the case with a non-clustered index. Each relation can have a single clustered index and many unclustered indexes.
Applications and limitations
Indexes are useful for many applications but come with some limitations. Consider the following SQL statement:
SELECT first_name FROM people WHERE last_name = 'Smith';.
To process this statement without an index the database software must look at the last_name column on every row in the table (this is known as a full table scan). With an index the database simply follows the b-tree data structure [AM2] until the Smith entry has been found; this is much less computationally expensive than a full table scan.
Consider this SQL statement:
SELECT email_address FROM customers WHERE email_address LIKE '%@yahoo.com';.
This query would yield an email address for every customer whose email address ends with "@yahoo.com", but even if the email_address column has been indexed the database still must perform a full table scan. This is because the index is built with the assumption that words go from left to right. With a wildcard at the beginning of the search-term, the database software is unable to use the underlying b-tree data structure (in other words, the WHERE-clause is not sargable). This problem can be solved through the addition of another index created on reverse(email_address) and a SQL query like this:
SELECT email_address FROM customers WHERE reverse(email_address) LIKE reverse('%@yahoo.com');. This puts the wild-card at the right-most part of the query (now moc.oohay@%) which the index on reverse(email_address) can satisfy.
Types
Bitmap index
A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (bitmaps) and answers most queries by performing bitwise logical operations on these bitmaps. The most commonly used index, such as B+trees, are most efficient if the values it indexes do not repeat or repeat a smaller number of times. In contrast, the bitmap index is designed for cases where the values of a variable repeat very frequently. For example, the gender field in a customer database usually contains two distinct values: male or female. For such variables, the bitmap index can have a significant performance advantage over the commonly used trees.
Dense index
A dense index in databases is a file with pairs of keys and pointers for every record in the data file. Every key in this file is associated with a particular pointer to a record in the sorted data file. In clustered indexes with duplicate keys, the dense index points to the first record with that key.[4]
Sparse index
A sparse index in databases is a file with pairs of keys and pointers for every block in the data file. Every key in this file is associated with a particular pointer to the block in the sorted data file. In clustered indexes with duplicate keys, the sparse index points to the lowest search key in each block.
Reverse index
A reverse key index reverses the key value before entering it in the index. E.g., the value 24538 becomes 83542 in the index. Reversing the key value is particularly useful for indexing data such as sequence numbers, where new key values monotonically increase.
Covering Index
In most cases, an index is used to quickly locate the data record(s) from which the required data is read. In other words, the index is only used to locate data records in the table and not to return data.
A covering index is a special case where the index itself contains the required data field(s) and can return the data.
Consider the following table (other fields omitted):
ID | Name | Other Fields |
12 | Plug | ... |
13 | Lamp | ... |
14 | Fuse | ... |
To find the Name for ID 13, an index on (ID) will be useful, but the record must still be read to get the Name. However, an index on (ID, Name) contains the required data field and eliminates the need to look up the record.
A covering index can dramatically speed up data retrieval but may itself be large due to the additional keys, which slows down data insertion & update. To reduce such index size, some systems allow non-key fields to be included in the index. Non-key fields are not themselves part of the index ordering but only included at the leaf level, allowing for a covering index with less overall index size.
No comments:
Post a Comment