Categories
database database-indexes indexing performance sql

How does database indexing work? [closed]

2791

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?

For information on queries to index a field, check out How do I index a database column.

0

    4036

    Why is it needed?

    When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

    Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires (N+1)/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire tablespace must be searched at N block accesses.

    Whereas with a sorted field, a Binary Search may be used, which has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.

    What is indexing?

    Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

    The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

    How does it work?

    Firstly, let’s outline a sample database table schema;

    Field name       Data type      Size on disk
    id (Primary key) Unsigned INT   4 bytes
    firstName        Char(50)       50 bytes
    lastName         Char(50)       50 bytes
    emailAddress     Char(100)      100 bytes
    

    Note: char was used in place of varchar to allow for an accurate size on disk value.
    This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

    Example 1sorted vs unsorted fields

    Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

    A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

    Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

    Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

    Field name       Data type      Size on disk
    firstName        Char(50)       50 bytes
    (record pointer) Special        4 bytes
    

    Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

    Example 2indexing

    Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

    Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.

    When should it be used?

    Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.

    Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.

    34

    • 9

      binary search can be done when the data is unique, am i right? although you mentioned that minimum cardinality is important, the algorithm wouldn’t be a simple binary search, how would this approximation (~log2 n) affect the process time?

      – shampoo

      Feb 3, 2013 at 10:20


    • 13

      @AbhishekShivkumar:Great question!I think the index table will have as many rows as there are in the data table. And as this field will have only 2 values(boolean with true/false) & say you want a record with value true,then you can only halve the result set in first pass, in second pass all your records have value true so there is no basis to differentiate,now you have to search the data table in linear fashion-hence he said cardinality should be considered while deciding the indexed column. In this case,it’s worthless to index on such a column. Hope I’m correct 🙂

      Jul 8, 2013 at 4:20

    • 7

      shouldn’t the number of block accesses in the average case be (N+1)/2. If we sum the number of block accesses for all possible cases, and divide it by the number of cases, then we have N*(N+1)/(2*n) which comes out to be (N+1)/2.

      – ajay

      Jan 30, 2014 at 12:11


    • 37

      I think there are a few typos in this answer, for example, in the sentence: “a far cry from the 277,778 block accesses required by the non-indexed table.” doesn’t the author mean 1,000,000 block accesses? 277,778 is the number of blocks required by the index itself. There seems to be a couple of other inaccuracies too 🙁

      – jcm

      Aug 24, 2014 at 4:02

    • 6

      @jcm He explained it in the “What is indexing section” – “Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.”

      – grinch

      Nov 12, 2014 at 20:32

    585

    Classic example “Index in Books”

    Consider a “Book” of 1000 pages, divided by 10 Chapters, each section with 100 pages.

    Simple, huh?

    Now, imagine you want to find a particular Chapter that contains a word “Alchemist“. Without an index page, you have no other option than scanning through the entire book/Chapters. i.e: 1000 pages.

    This analogy is known as “Full Table Scan” in database world.

    enter image description here

    But with an index page, you know where to go! And more, to lookup any particular Chapter that matters, you just need to look over the index page, again and again, every time. After finding the matching index you can efficiently jump to that chapter by skipping the rest.

    But then, in addition to actual 1000 pages, you will need another ~10 pages to show the indices, so totally 1010 pages.

    Thus, the index is a separate section that stores values of indexed
    column + pointer to the indexed row in a sorted order for efficient
    look-ups.

    Things are simple in schools, isn’t it? 😛

    6

    • 84

      really nice analogy! funny i didn’t make the connection between a book index and a db index

      – Yolo Voe

      Jul 11, 2018 at 22:22

    • 6

      This makes me think Library or Grocery Store Could you image not having an index at a grocery store? Where's The Beef?!? Oh its next to the Restrooms, a mop, and makeup

      – JayRizzo

      Sep 4, 2018 at 7:00

    • 5

      “But with an index page at the beginning, you are there.” What does “you are there” mean?

      Sep 13, 2018 at 10:48

    • 3

      Indices usually go at the back of books, while a table of contents goes in the front. But, that makes the analogy even better, since column order shouldn’t matter.

      Jul 9, 2019 at 3:19

    • I still do not exactly understand, so if there are n unique words how would index help me? it creates pointer for each word? If so it takes a lot of time to find that pointer maybe even same time then just scroll everything and find it in a default way

      – D0mm

      Sep 17, 2021 at 14:39

    328

    An index is just a data structure that makes the searching faster for a specific column in a database. This structure is usually a b-tree or a hash table but it can be any other logic structure.

    3

    • 54

      +1 times a million for this answer, as I found this listing while trying to find a simple explanation what indexing essentially is.

      Jun 22, 2015 at 23:06

    • 5

      Let’s note that “just a data structure” doesn’t mean “additional to the data”. Some times it is (e.g. “non-clustered index”), some times it determines the layout of the data (e.g. “clustered index”).

      – Pablo H

      Aug 28, 2019 at 13:24

    • 3

      This is the best answer, an Index is basically like a Hashmap in which a get has O(1) complexity, whereas searching in a List is O(N)

      Mar 26, 2021 at 5:01