
Harvard's CS50x Week 5: Data Structures

On this roller coaster ride of a course, as C crests its climax after Week 4, here in Week 5 we become a little more familiar with data structures – the ways in which data is organized in memory.



Based on the diagram, you might have noticed the steep descent from C into Python that awaits us in Week 6.


That aside, let’s review my personal notes and observations on Week 5’s lecture.


Lecture Notes


As it happens, data structures aren’t as foreign a concept to us as we might think – in fact, we’re already (fairly) well-versed in a specific type of data structure: arrays.


Recall that arrays store their values back to back in memory – each element sits immediately after the previous one, a fixed number of bytes away. This will become significant as we start to explore other structures that store data at arbitrary addresses scattered across memory.


So what exactly are data structures?


Data structures can be classified in two ways: concrete data structures – the “how” – and abstract data structures – the “what”.


The difference?


  • Concrete data structures describe precisely how data is organized in memory and how operations are carried out.


  • Abstract data structures are simply the concepts of data structures – the basic idea of what needs to be done, not how it’s done.


With this in mind, let’s break down Week 5.


Abstract Data Structures: Queues and Stacks


Both queues and stacks are apt examples of abstract data structures, as they only describe what the structure should do, not how it is built. We, as the programmers, have to actually implement them, for instance using structs.


Let’s start with queues:


  • FIFO “first in first out” – you can imagine this key characteristic of queues as something similar to a line at an amusement park. The first people in line are also the first to get on the ride.

  • So data first entered into a queue will be the first to be retrieved or accessed from it

  • Data can be enqueued or dequeued – that is, data can join the end of the queue or leave the queue once it reaches the front (see the sketch just below)
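
In the codespace, a bare-bones queue might be declared something like this – a sketch of my own rather than the lecture’s exact code, and it assumes (as with the stack below) that a person struct has been defined elsewhere:

const int CAPACITY = 50;

typedef struct
{
    person people[CAPACITY];  // the people currently waiting in line
    int size;                 // how many of the CAPACITY slots are in use
}
queue;

A fuller implementation would also track which index is currently the front of the line, so that dequeuing doesn’t require shifting everyone forward by one slot.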


Now onto stacks:


  • LIFO “last in first out” – this is similar to how cafeteria trays are stacked. The last ones stacked (at the top of the pile) are the first ones to be picked up

  • Data can be pushed – placed at the top of the stack – or popped – removed from the top of the stack


In the codespace, a stack might look something like this:


const int CAPACITY = 50;
typedef struct
{
    person people[CAPACITY];
    int size;
}
stack;

While CAPACITY is the maximum number of person elements the people array can hold, size tracks how many of those slots are currently in use.


Now let’s say just one person occupies our array. The other 49 slots allocated for the people array go to waste, making this stack inefficient.


If only our array could grow as we added more items to it…


Dynamic Memory Allocation: Resizing Arrays


Recall that memory is littered with junk or garbage values – leftovers from whatever the computer stored there before – sitting in regions that are now free for us to use.


At the same time, there are also values being stored by other programs, variables, functions, etc. 



Assuming that we have an array containing the values 1, 2, and 3, how could we store a fourth value within it?


Well, this would involve allocating a new area of memory by creating a new array of size four, copying our old array into our new one, and then finally adding a fourth value to our new array.



Firstly, a pointer called list was given 12 bytes of freshly allocated memory, since we wanted to store three ints of four bytes each. The same was done for the tmp pointer, but with 16 bytes instead, as we hoped to store four ints within it.


As a side note, we use the sizeof operator instead of hard-coding the number of bytes for each data type, as sizes can vary depending on whether your computer runs on a 32-bit or 64-bit system.


Using a loop, the contents of list were copied into our tmp array, and, indexing into the final slot of the new array, we added our fourth value. Finally, we make our original list pointer point to tmp. Note how list is freed twice – once before it is repointed (releasing the old three-int block) and once at the end (releasing the new four-int block) – because it pointed to a different place each time we called free().
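
Since the original snippet isn’t reproduced here, a minimal sketch of that malloc-and-copy approach might look as follows (the names list and tmp mirror the description above; the lecture’s exact code may differ):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Allocate space for three ints
    int *list = malloc(3 * sizeof(int));
    if (list == NULL)
    {
        return 1;
    }

    list[0] = 1;
    list[1] = 2;
    list[2] = 3;

    // Allocate a bigger block, enough for four ints
    int *tmp = malloc(4 * sizeof(int));
    if (tmp == NULL)
    {
        free(list);
        return 1;
    }

    // Copy the old values over, then add the fourth
    for (int i = 0; i < 3; i++)
    {
        tmp[i] = list[i];
    }
    tmp[3] = 4;

    // Free the old block and repoint list at the new one
    free(list);
    list = tmp;

    // Print the resized array
    for (int i = 0; i < 4; i++)
    {
        printf("%i\n", list[i]);
    }

    // Free the new block as well
    free(list);
    return 0;
}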


Conveniently, within C comes the ‘realloc()’ function, which essentially does the dirty work of copying data from one place to the other. Using realloc() as follows, we can reallocate data from list to tmp, at the same time increasing the size of the array.
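
Roughly speaking – again a sketch rather than the lecture’s exact snippet, picking up from the list pointer above – the resizing step then collapses to:

// Ask realloc for a block large enough for four ints; it copies
// the existing three values over if the block has to be moved
int *tmp = realloc(list, 4 * sizeof(int));
if (tmp == NULL)
{
    free(list);
    return 1;
}
list = tmp;
list[3] = 4;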



You can imagine how realloc() might be used to grow or shrink the size of a queue or stack dynamically!


Linked Lists


To review some powerful memory-related primitives in C:


  • struct – a user-defined data type

  • . – an operator that accesses fields within structs (via dot notation)

  • * – an operator that helps declare a pointer (storing an address) or dereference a pointer (going to the value stored at that address in memory)


This week acquaints us with the -> operator, which essentially follows a pointer to an address and accesses a field inside the structure stored there.


I like to think of this arrow operator as a close cousin of the dot operator, except it works on pointers to structs rather than on the structs themselves.


We will be using the -> operator in order to implement another abstract data structure known as a linked list. Where traditional arrays have items that are all adjacent to one another in memory, linked lists are especially powerful in that they allow you to include values located at differing locations in memory.


Try to envision data scattered randomly in memory. How would you go about stringing them together?


We could use a little extra memory alongside each item to store the address of the subsequent item…



Note that the first pointer is there by convention, keeping track of where the list begins. Similarly, NULL acts as our terminator, signaling that there is nothing next in the list.


With our knowledge of pointers, you can imagine the first item, a pointer storing the address 0x123, pointing to the node that lives at that address. Similarly, the pointer containing 0x456 will point to its respective node, and so on…


In computer science terms, each of those little boxes containing an item and a pointer to the next item is called a node.


typedef struct node
{
    int number;
    struct node *next;
}
node;

Within it are an integer called number and a pointer to a node called next. Looking at this, my immediate thought was: why do we use struct node to declare our pointer next instead of just node?


As it happens, struct node (which appears after typedef) is synonymous with the node (the name of the new data type) that appears on the last line. To better understand this, we can rewrite the code as follows:


struct node
{
    int number;
    struct node *next;
};

typedef struct node node;

Here, we’ve defined our struct node, almost like a function or a variable, and then used typedef at the end to equate struct node to node.


Anyway, while linked lists do let us add data dynamically, there are some downsides…


  • More memory is required for a linked list, as each node has to store not only the item but also a pointer to the next node.

  • Because linked lists don’t exist contiguously in memory, they cannot be indexed into (indexing relies on adding a fixed offset – the index times the element size – to the starting address)

  • Thus, to find an element in a linked list, we must search linearly, following up to n pointers one node at a time (see the sketch just after this list) – keep in mind that binary search does not work on linked lists
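
To make that last point concrete, here is a small sketch (mine, not the lecture’s) of what searching a linked list linearly might look like, assuming the node struct defined above and #include <stdbool.h> for the bool type:

// Returns true if value appears somewhere in the list starting at head
bool search(node *head, int value)
{
    for (node *ptr = head; ptr != NULL; ptr = ptr->next)
    {
        if (ptr->number == value)
        {
            return true;
        }
    }
    return false;
}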


Let’s walk through how to go about making a linked list within C.


  1. node *list;


Here, a pointer to a node called list is declared, but for now it stores a garbage value.


  2. node *n = malloc(sizeof(node));


Another pointer, n, is given enough freshly allocated memory to store one node – one int and one struct node pointer. The contents of that node, too, start out as garbage values.


  3. n->number = 1;


At this point, our node pointer n (which stores an address) follows that address into the node’s number field (which, recall, stores an int) and sets it to 1. To better understand this arrow operator, you can visualize that, in essence, the code above is equivalent to the following:


(*n).number = 1;


We’ve dereferenced n, going to the node stored at its address, and then used dot notation to access that node’s number field.


  4. n->next = NULL;


This simply sets the next field of our node n to NULL, signaling that the list terminates there.


  5. list = n;


Now, we make list and n point to the same location.


At this stage, our linked list (visually) stands as follows:
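
(The original diagram isn’t reproduced here, but roughly in text form it is just a single node:)

list --> [ number: 1 | next: NULL ]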



  6. node *n = malloc(sizeof(node));


A new node is created, filled with garbage values. Because we reuse the pointer n, it now points to the new node.


  7. n->number = 2;


The number field of n’s current node is updated to store the value 2.


  8. n->next = NULL;


Similarly, the next field is set to NULL.


  9. n->next = list;


Here’s where it gets conceptually tricky. Because we do not want to sever any connections to these nodes, we cannot simply have list point to n quite yet (otherwise our previous node storing 1 will be lost in memory as nothing points to it).


Instead, we must first have the next field of our node n point to the same address that list holds – the node containing 1.


  10. list = n;


Now, we can have list point to n.


Here is the fruit of our labor, visually:
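
(Roughly in text form again – the new node now sits in front of the old one:)

list --> [ number: 2 | next ] --> [ number: 1 | next: NULL ]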



Now, to turn theory into practice, an implementation of a linked list in the codespace might look something similar to this…
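
(The original screenshot isn’t reproduced here, so below is a minimal sketch of such a program, reconstructed from the description that follows – the lecture’s exact code may differ in its details.)

#include <stdio.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

int main(int argc, char *argv[])
{
    node *list = NULL;

    // Build the list from the command-line arguments,
    // prepending each new node to the front of the list
    for (int i = 1; i < argc; i++)
    {
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            return 1;  // (for brevity, earlier nodes are not freed here)
        }
        n->number = atoi(argv[i]);
        n->next = list;
        list = n;
    }

    // Print the numbers, walking the list with an extra pointer
    for (node *ptr = list; ptr != NULL; ptr = ptr->next)
    {
        printf("%i\n", ptr->number);
    }

    // Free every node, remembering the next one before freeing
    node *ptr = list;
    while (ptr != NULL)
    {
        node *next = ptr->next;
        free(ptr);
        ptr = next;
    }
    return 0;
}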



Most everything stayed consistent with our conceptual walkthrough. Some key differences, however, included the use of command line arguments.


The program looped through each command-line argument, converting each string to an int. Once the linked list was formed, an additional pointer variable, ptr, was required in order to print the numbers from the list.


Imagine ptr as a pointer somewhere in memory that starts out pointing at the same node as list. On the first iteration, ptr accesses and prints that node’s number field (which was 2 in our example). Then ptr is updated to that node’s next field – that is, to the node it points to. Recall that within next is stored the address of the subsequent element.


Thus, as long as the value of ptr is not NULL (which signals that we’ve reached the end of the list), the loop iterates again and again!


Now, having allocated memory for every node across our linked list, we must free it all so as not to leak memory and create problems later on. First, ptr is set to point at the first node in our linked list – the one list points to. Then, until ptr reaches the NULL terminator, the program walks the list, remembering each node’s next pointer before freeing the node itself.


Dictionaries


Another example of a data structure you might have encountered (especially if you’ve had experience coding in Python) is a dictionary.


Dictionaries operate by pairing a certain value with its corresponding key – these are comparable to words and definitions in the dictionaries we’re probably more familiar with.


Because each key maps to exactly one value, a well-implemented dictionary lets us jump (almost) straight to the value we want, ideally approaching a constant time complexity of O(1).


Trees


Binary search trees are similar to linked lists in that both store a sequence of data connected by pointers.


Trees, however, are built such that they come pre-sorted. Imagine a Christmas tree – at the very tip of the tree is the central value (the root). Values smaller than this number are placed to the left, and those larger, to the right.



In this example, four is the central value, and two and six are placed in accordance with that. Every subsequent number is placed under one of those child nodes rather than directly under four.



In order to connect these values into a tree, rather than leaving them as floating values in memory, each node stores pointers to the addresses of its children.

If we were to implement this data structure in C, the biggest difference would lie in the way these nodes are defined in a struct.


typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;

In this case, our definition of a node differs from that of our earlier linked list. Instead of one pointer, there are two, left and right, representing the two possible directions each number added to the tree can take.
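
To see those left and right pointers in action, here is a hedged sketch of how searching such a tree might look – recursion along the lines of what the lecture covers, assuming the node struct above and #include <stdbool.h>:

// Returns true if number exists somewhere in the tree rooted at tree
bool search(node *tree, int number)
{
    if (tree == NULL)
    {
        return false;
    }
    else if (number < tree->number)
    {
        return search(tree->left, number);   // smaller values live to the left
    }
    else if (number > tree->number)
    {
        return search(tree->right, number);  // larger values live to the right
    }
    else
    {
        return true;                         // found it
    }
}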


Hash Tables


Hash tables are a fusion of arrays and linked lists – or, more technically, an array of pointers to nodes – that can be visualized like so:



In this image, each of the 26 boxes is assigned a letter of the alphabet, and within each element of that array is a linked list of items starting with that letter.
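
As a rough sketch of my own (not the exact lecture code), such a table and its hash function might be declared like this – here the node stores a name rather than a number, which is an assumption on my part:

#include <ctype.h>

// A node that stores a name this time, plus a pointer to the next node
typedef struct node
{
    char *name;
    struct node *next;
}
node;

// One bucket per letter of the alphabet
#define N 26

// The hash table itself: an array of pointers to linked lists
node *table[N];

// A simple hash function: pick a bucket based on the first letter
// (assumes the name starts with an alphabetic character)
unsigned int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}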


Collisions occur when multiple values are added to the same location in the hash table. In the above case, when multiple names were added to the same array location, the new value was simply appended to the linked list.


What makes hash tables so dynamic is the sheer freedom the programmer has to manipulate how the table looks. For instance, to reduce collisions, we could change our array like this:



The choice of how big to make the array (within the hash table) depends on the goal of your program and the data that will be stored in the table.


For instance, if there were only exactly one name corresponding to each letter, enlarging our array to hash on the first three letters instead of one would not make sense and would present other problems (e.g. excess memory usage or extra time spent walking over empty buckets).


Tries


The final data structure that this lecture touches on is a trie (pronounced try, not to be confused with a tree). 


While tries take up lots of memory, one advantage is their constant time complexity for lookups – search time depends on the length of the word being looked up, not on how many words are stored.


You can imagine a trie as follows:



Within this trie, both the strings ‘Hagrid’ and ‘Harry’ are stored. Note that a new node is required for each letter in a string, unless that letter has already appeared at the same position along a shared prefix with another string.
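
To give a sense of a trie’s shape in C, a node might be declared along these lines – a common variant of my own choosing (the lecture’s version differs slightly), assuming 26 lowercase letters and a flag marking where a stored word ends:

#include <stdbool.h>

typedef struct node
{
    struct node *children[26];  // one pointer per possible next letter
    bool is_word;               // true if a stored word ends at this node
}
node;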


Final Thoughts


You can finally let out a sigh of relief now that C is done and dusted. Week 6 will start afresh with Python, a relatively higher-level (less complex) language that will allow us to build even more complex programs with comparatively less code.


Until then, I hope you’ll give problem set 5 a trie!


See you soon! Meanwhile, stay tuned for updates by following my blog and LinkedIn page.

