
A Comprehensive Guide to Hash Tables in C

Introduction


So far, we've taken a look at arrays, which give us random access to the element at any given index, and linked lists, which let us insert, delete, and resize efficiently.


Hash tables are a new data structure that combines the random access feature of arrays with the dynamism of linked lists.


Let's take a closer look at hash tables--what they are, when they're used, and how to use them.


What Exactly Are Hash Tables?


Hash tables are data structures that pair a value with a key using a special hash function.


Hash functions take in a value and spit out a non-negative integer called a hash code, which maps that value to a certain key (or index) within an array.


In Python, dictionaries are an example of a hash table with the hash function abstracted away!


With a good hash table, you should:

  • have a good idea of where a given piece of data will be placed in relation to the others.


Similarly, a good hash function should:

  • use all of the data being hashed

  • use only the data being hashed

  • output the same hash code every time the same input is passed in

  • distribute data evenly

    • ex: data can be placed across a wide range of indices

  • generate very different hash codes for similar data


You can find many good hash functions across the internet, but as an example of a simple (and not so great) one:

// length of our hash table's array
#define HASH_LEN 20

// maps an int to an index from 0 to HASH_LEN - 1
// (casting to unsigned first keeps a negative input from producing a negative index)
unsigned int hash(int x)
{
    return (unsigned int) x % HASH_LEN;
}

And similarly, for a string...

// length of our hash table's array
#define HASH_LEN 20

// adds up the ASCII values of each char in the string
unsigned int hash(char *str)
{
    int sum = 0;
    for (int i = 0; str[i] != '\0'; i++)
    {
        sum += str[i];
    }
    return sum % HASH_LEN;
}

Let's take an example to visualize this better:
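Here's a rough sketch of how that string hash decides where each value lands in a fixed-size array of strings. The specific names are just placeholders picked for illustration:

#include <stdio.h>

// length of our hash table's array
#define HASH_LEN 20

// a bare-bones hash table: one string slot per index, NULL meaning "empty"
char *table[HASH_LEN];

// the same ASCII-sum hash from above
unsigned int hash(char *str)
{
    int sum = 0;
    for (int i = 0; str[i] != '\0'; i++)
    {
        sum += str[i];
    }
    return sum % HASH_LEN;
}

int main(void)
{
    // names chosen purely for illustration
    char *names[] = {"Cat", "Robert", "Lily"};
    for (int i = 0; i < 3; i++)
    {
        unsigned int index = hash(names[i]);
        table[index] = names[i];
        printf("%s -> index %u\n", names[i], index);
    }
}

Each name is dropped into the slot its hash code points at, and that same hash code is what makes lookup fast later: to find a name, we just hash it again and jump straight to that index.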

Collisions and Linear Probing


One problem we might run into: what happens if two pieces of data yield the same hash code? This is called a collision.


Well, we can't just throw away one of the two elements, so a quick fix is linear probing: placing the colliding data in a nearby open slot so that lookup and insertion stay reasonably efficient.


For instance, if "Robert" happened to yield the same hash code as "Cat" in the example above, we could look one or two (or more) slots above or below that index, so that when we later go searching for "Robert", we don't have to travel too far.
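As a minimal sketch of the idea (reusing the table, hash(), and HASH_LEN from the example above, and assuming the table never fills up completely), an insert with linear probing might look like this:

#include <stddef.h>

// place a value at its hash code, or at the next open slot after it
void insert(char *value)
{
    unsigned int index = hash(value);

    // step forward (wrapping around the end) until we find an empty slot
    while (table[index] != NULL)
    {
        index = (index + 1) % HASH_LEN;
    }
    table[index] = value;
}

Lookup then follows the same walk: hash the value, start at that index, and step forward until you either find it or hit an empty slot.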


Clustering and Chaining


But linear probing is like holding a falling building together with duct tape--every element that lands away from its intended index makes it more likely that the next insertion will have to probe too, and occupied slots pile up into long, inefficient runs. This is called clustering.


Luckily, computer scientists have found a solution to clustering: chaining.


Here's where linked lists come into play. Instead of storing just one int or string value within our array, why don't we store a pointer to the head of a linked list?


You can visualize this as each index in our array storing its own linked list that can now hold more than one value if the same hash code is ever generated!
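Here's one way that could be sketched out in C, assuming the same string hash() from earlier and a fresh table for the chained version (error handling kept to a minimum):

#include <stdbool.h>
#include <stdlib.h>

// length of our hash table's array
#define HASH_LEN 20

// the ASCII-sum hash from earlier
unsigned int hash(char *str);

// one link in a chain
typedef struct node
{
    char *value;
    struct node *next;
}
node;

// each slot now holds the head of a linked list (NULL if the chain is empty)
node *table[HASH_LEN];

// insert a value by prepending a new node to the chain at its hash code
bool insert(char *value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->value = value;

    unsigned int index = hash(value);
    n->next = table[index];
    table[index] = n;
    return true;
}

Because each new node is prepended at the head of its chain, insertion stays constant-time no matter how long that chain grows.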


Advantages of Array and Linked List Hash Tables


We know that linked lists are really good at insertion, as you just tack on a node at the head of the list.


At the same time, linked lists aren't amazing at lookup, since we have to rely on linear search... but now we have HASH_LEN separate linked lists, meaning we've broken up one long search into many much shorter ones.
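A lookup sketch under the same assumptions (the chained table, node type, and hash() from above) shows why: only one chain ever needs to be searched.

#include <stdbool.h>
#include <string.h>

// returns true if value is somewhere in the table
bool contains(char *value)
{
    // only the chain at hash(value) needs to be walked
    for (node *n = table[hash(value)]; n != NULL; n = n->next)
    {
        if (strcmp(n->value, value) == 0)
        {
            return true;
        }
    }
    return false;
}

With a good hash function spreading values evenly, each chain stays short, so lookup and insertion both stay fast in practice.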


Final Thoughts


While hash tables are usually good across the board, one area they struggle with is sorting: the data isn't kept in any particular order. Keep that in mind when choosing which data structure is best for a task.


Thanks for reading!


