Data Structures: Hash Tables

What is a hash table?

A hash table is a relatively complex non-sequential data structure that came as a clever solution to a mathematical problem: let us imagine we have a library with a large number of books. How do we find a book (by name) in least amount of time? Solution is to solve this problem in a series of steps:

  1. make a projection how many ITEMS ("books") will be in your store ("library"). This number will henceforth be known as CAPACITY!
  2. build a CAPACITY number of slots to store ITEMS into, aka "buckets". Each BUCKET will be a List holding one or more ITEMS ("books")!
  3. create a List of CAPACITY size holding BUCKET items. This will be your store ("library")!
  4. create a hash function able to calculate for each ITEM searched an INDEX greater or equal to zero, smaller than CAPACITY
  5. run hash function and get from List @ point #3 the BUCKET at position INDEX.
  6. create a comparator function able (at least) to compare two ITEMS for equality.
  7. iterate List @ BUCKET found @ step #5 and apply comparator function for each ITEM inside until a match is found.

What are the advantages of this solution?

A good hash function would provide an even spread of ITEMS through BUCKETS so that each bucket will store one item only. When that happens, any search will take constant time O(1) complexity. If, however, hash function fails to spread items evenly, we will have collisions, meaning more than one ITEM / BUCKET. When that happens, search will take O(items_count@BUCKET) complexity.

What kind of list implementation a hash table should use?

Regarding List implementation chosen @ inside buckets (point #2 above):

Conclusion: because a bucket in theory should store no more than a few entries thanks to a good hash function, Dynamic Array with its enormous speed/memory advantages should be chosen!

Regarding List implementation chosen to store buckets (point #3 above):

Conclusion: because practice shows us rate of inserts/lookups is almost always far higher than that of deletes, Dynamic Array should generally be solution of choice!

What happens when CAPACITY is exceeded?

When SIZE (number of items) exceeds CAPACITY, hash table needs to grow by doubling its CAPACITY (so doubling amount of BUCKETS), same as a dynamic array. However, unlike latter, growth is much more costly since it will have to: allocate the new buckets, get each ITEM inside old buckets, apply hash function to get INDEX, store items in BUCKET found at POSITION index, deallocate old buckets.

If Dynamic Array is chosen as List implementation (as advised above), growth is quite costly. When size of map is already known, it's MUCH more optimal to use a reserved size so no space/speed is wasted.

How does it work?

As a high performance data structure, hash table can be used to implement many abstract data types. Let us take its Associative Array implementation as an example to understand how it works:

Operation Description
clear Deallocates BUCKETS and BUCKETS LIST, resets SIZE to zero.
containsKey Applies hash function to KEY to get an INDEX, gets BUCKET @ INDEX, loops through all ENTRIES inside and applies comparator function to see if any ENTRY key matches
containsValue Loops through all BUCKETS, loops through all ENTRIES inside each BUCKET and applies comparator function to see if any ENTRY value matches
get Applies hash function to KEY to get an INDEX, gets BUCKET @ INDEX, loops through all ENTRIES inside, applies comparator function to see if ENTRY key matches then returns matching ENTRY value
set Applies hash function to KEY to get an INDEX, gets BUCKET @ INDEX, loops through all ENTRIES inside, applies comparator function to see if ENTRY key matches then changes matching ENTRY value (if found) or adds a new ENTRY (if not), also incrementing SIZE
removeKey Applies hash function to KEY to get an INDEX, gets BUCKET @ INDEX, loops through all ENTRIES inside, applies comparator function to see if ENTRY key matches then deletes it, also decrementing SIZE
removeValue Loops through all BUCKETS, loops through all ENTRIES inside each BUCKET, applies comparator function to see if any ENTRY value matches then deletes it, also decrementing SIZE
isEmpty Checks if SIZE is zero
size Gets value of SIZE (number of elements in structure)

Overall, hash table presents itself with a very favorable speed and memory efficiency profile, which is why it should be solution of choice for Associative Array and Set implementation. It can also be used for many other implementations (storing nodes/vertexes/edges in a tree/graph, etc). That being said, it also comes with disadvantages along with advantages:


Share