
Harvard's CS50x Week 7: Data Manipulation with SQL

Introduction


If you’re an avid social media user, chances are you’ve encountered some form of a survey, be it about the content you tend to view or your preferences on certain products and services.



While (admittedly) annoying, these individual responses are compiled into a database, putting huge amounts of valuable marketing information in just one place.


So how do companies comb through these extremely large sets of data to form business insights?


One such way is through SQL, a language for creating and manipulating databases. This week’s lecture combines Python (which we were introduced to in Week 6) with SQL, allowing us to harness data not only through the terminal, but also from within Python programs.


With that in mind, let’s take a look at some of my personal notes and observations on Week 7’s lecture!


Lecture Notes


Professor David J. Malan begins this week by conducting his own survey. 


Students inputted their favorite programming language and problem set via a Google Form, and the responses were compiled into a Google Sheet:



Once this data was neatly arranged into rows and columns, the spreadsheet was downloaded as a csv, or comma-separated values, file.


After importing this csv into our codespace, we can use Python’s csv library to help us traverse our file of favorites.


With our csv file set up, let’s break down Week 7.


Flat-File Databases: Rows and Columns


Databases in which all of our data is stored in a single table of rows and columns (within a text file of some sort) are called flat-file databases.


Let’s begin by writing a simple program that prints out the language column of favorites.csv:


import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create reader
    reader = csv.reader(file)

    # Skip header row
    next(reader)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        favorite = row[1]
        print(favorite)

Here, open() takes our csv as well as a parameter specifying the mode in which to open the file – in this case, “r” represents reading mode. Next, we create a file reader and then iterate over all rows of the file, selecting the second column in each row (row[1], which stores the language) and printing it.


Alternatively, we could access the “language” column by treating the file as a dictionary and indexing into a specific key as such:


import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        print(row["language"])

This time, each row that reader gives us is a dictionary instead of a list!


What if we wanted to count the occurrence of each individual language?


import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    counts = {}

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        if favorite in counts:
            counts[favorite] += 1
        else:
            counts[favorite] = 1

# Print counts
for favorite in sorted(counts):
    print(f"{favorite}: {counts[favorite]}")

This program initializes an empty dictionary counts. Looping over the file, this program checks if a key-value pair already exists within counts with the current favorite language as the key – if so, one is added to that pair; if not, a new pair is created and set to one.


Finally, the dictionary is printed by alphabetical order of its keys as dictated by sorted(). In order to sort by descending order of values instead of keys, we can modify our code as follows:

def get_value(language):
    return counts[language]

# Print counts
for favorite in sorted(counts, key=get_value, reverse=True):
    print(f"{favorite}: {counts[favorite]}")

In this case, sorted() takes three arguments: the dictionary itself, key set to the get_value function we defined above the print loop, and reverse=True to sort in descending order.


For more concision, we can condense our code via the keyword lambda, which essentially helps us create a one-line, anonymous function.


for favorite in sorted(counts, key=lambda language: counts[language], reverse=True):
    print(f"{favorite}: {counts[favorite]}")

Notice the similarities in the syntax of this new lambda function and our old get_value() function.


Relational Databases: A Web of Tables


Unlike flat-file databases, relational databases are typically used for storing large sets of data. In fact, companies like Google and Meta use relational databases to store information.


SQL, or Structured Query Language, is a programming language that helps us manipulate and harness data within relational databases.


The acronym CRUD helps us to remember the four basic types of operations a database supports (see the example queries just after this list):


Create

Read

Update

Delete


To create a SQL database in your codespace, run the following (the first command in the terminal, the rest at the sqlite prompt):


  1. sqlite3 favorites.db

  2. Upon agreeing to create the database, the prompt will change to sqlite>

  3. .mode csv changes sqlite to csv mode

  4. .import favorites.csv favorites imports the data from favorites.csv into a new table called favorites

  5. .schema allows you to view the structure of your database, including all of its tables

  6. SELECT * FROM favorites; returns every row from the favorites table. Note that * signifies everything (all columns)

  7. SELECT language FROM favorites; returns every value in the language column

  8. SELECT column FROM table; is the general form of the above commands


Some other useful commands in SQL include:


AVG – returns the average value of a numeric column

COUNT – returns the number of rows in a table

DISTINCT – returns only non-duplicate values

LOWER – converts a string to lower-case

MAX – returns the greatest numeric value in a numeric column

MIN – returns the smallest numeric value in a numeric column

UPPER – converts a string to upper-case


I could use SELECT DISTINCT(language) FROM favorites; to find the unique languages in my table – each language is listed once, no matter how many times it appears.


Similarly, SELECT COUNT(DISTINCT(language)) FROM favorites; returns the number of distinct languages.
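
These functions can also be nested. For instance, here’s a quick sketch of counting the unique languages while ignoring capitalization (in case some students typed “python” and others “Python”):

-- Lower-case each language, drop duplicates, then count what's left
SELECT COUNT(DISTINCT(LOWER(language))) FROM favorites;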


Now let’s move into Python!


import csv
from cs50 import SQL

# Open database
db = SQL("sqlite:///favorites.db")

# Prompt user for favorite
favorite = input("Favorite: ")

# Search for title
rows = db.execute("SELECT COUNT(*) FROM favorites WHERE problem LIKE ?", "%" + favorite + "%")

# Get first (and only) row
row = rows[0]

# Print popularity
print(row["COUNT(*)"])

Here, after importing the required libraries and opening the favorites database, db.execute() simply replicates typing a SQL command in the terminal.


In this case, rows stores a list of dictionaries containing the number of rows in the favorites table whose problem is LIKE (loosely matches) the user-inputted favorite problem.


Again, LIKE just compares strings loosely (and case-insensitively). In this case, the program checks whether the user-inputted favorite appears anywhere within a problem’s name, as indicated by the % symbols around it.


Finally, much like a format code in C’s printf, the ? acts as a placeholder that is filled in with the value after the comma – a practice that also protects against SQL injection attacks.


Because rows stores a list of dictionaries, we need to access the first and only element within that list, and then print the right value within the dictionary stored in rows[0]. We can use the key “COUNT(*)” which is the name of the column created after executing our command to access the popularity of a specific problem.
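
If indexing in with the literal string "COUNT(*)" feels clunky, one option (my own aside, not from the lecture) is to give the count a friendlier name with AS, since the dictionaries db.execute() returns are keyed by the column names of the result:

-- Alias the count as n; the hard-coded 'Hello' stands in for the user's input here
SELECT COUNT(*) AS n FROM favorites WHERE problem LIKE '%Hello%';

On the Python side, the lookup should then simply become row["n"].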


Here is a list of some other helpful clauses and keywords (combined in a quick example just after the list):


WHERE – adding a Boolean expression to filter our data

LIKE – filtering responses more loosely

ORDER BY – ordering responses

LIMIT – limiting the number of responses

GROUP BY – grouping responses together
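
To sketch how a few of these fit together, here’s a single query that replicates our earlier Python counting program – grouping the rows by language, counting each group, and listing the most popular first:

-- Tally up each language's votes, largest count first
SELECT language, COUNT(*) FROM favorites GROUP BY language ORDER BY COUNT(*) DESC;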


IMDb


One such example of a relational database is IMDb – a collection of people, shows, writers, stars, genres, and ratings.


Once shows.db is imported into your codespace, running .schema will print an admittedly less visually appealing version of this diagram:



Note how there are multiple tables, each connected by their respective ids. Although each table contains its own distinct data, there is something common to almost all of them – people and shows contain id, while the rest contain either show_id or person_id.


This common field connecting tables to one another is called a key.


It turns out, there are multiple types of keys. The tables people and shows contain something called a primary key, which is a table’s unique identifier. The other tables contain foreign keys, keys that are used to reference other tables by pointing to their primary keys.


Note how show_id always points to id under shows and person_id always points to id under people.


As for datatypes in SQLite, there are five:


BLOB – binary large objects that are groups of ones and zeros

INTEGER – an integer

NUMERIC – for numbers that are formatted specially like dates

REAL – like a float

TEXT – for strings and the like


Further, columns can have constraints (see the schema sketch just after this list):


NOT NULL – ensures that a column always has a value (does not accept NULL values)

UNIQUE – ensures that all values within a column are different
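
To see keys, types, and constraints all in one place, here’s roughly what .schema reveals for two of shows.db’s tables (a simplified sketch – the real schema may differ slightly):

-- shows has typed columns, a NOT NULL constraint, and a primary key
CREATE TABLE shows (
    id INTEGER,
    title TEXT NOT NULL,
    year NUMERIC,
    episodes INTEGER,
    PRIMARY KEY(id)
);

-- stars uses foreign keys that point back at other tables' primary keys
CREATE TABLE stars (
    show_id INTEGER,
    person_id INTEGER,
    FOREIGN KEY(show_id) REFERENCES shows(id),
    FOREIGN KEY(person_id) REFERENCES people(id)
);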


Suppose we wanted to find the first ten comedy movies within our database. Our query could potentially look like this:


SELECT * FROM genres WHERE genre = 'Comedy' LIMIT 10;


And then based on the ten show_ids returned…


SELECT * FROM shows WHERE id = (...);


This process would be tedious and poorly designed, as we’d have to write out this line over and over, copying each id meticulously.


A better way to tackle this problem would be:


SELECT title
FROM shows
WHERE id IN (
    SELECT show_id
    FROM genres
    WHERE genre = 'Comedy'
)
ORDER BY title
LIMIT 10;


ORDER BY title simply sorts each show by alphabetical order.


Take another example. What if I wanted to find every show in which Peter Dinklage stars?


SELECT title
FROM shows
WHERE id IN (
    SELECT show_id
    FROM stars
    WHERE person_id IN (
        SELECT id
        FROM people
        WHERE name = 'Peter Dinklage'
    )
);


But there is an even better approach to streamlining these types of queries!


JOIN Command


Recall that tables are connected via their primary and foreign keys. For instance, show_id connects to the id field of shows.



So what if we could temporarily join these tables? Imagine a single table with id, title, year, and episodes fields, where id only contains those values that show_id also contains.
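
As a rough sketch, that imagined table is exactly what JOIN builds for us (the LIMIT 5 just keeps the output short):

-- Temporarily stitch shows and genres together wherever shows.id matches genres.show_id
SELECT * FROM shows JOIN genres ON shows.id = genres.show_id LIMIT 5;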


Let’s take this problem to the codespace using the keyword JOIN.


SELECT title
FROM shows
JOIN genres ON shows.id = genres.show_id
WHERE genre = 'Comedy'
ORDER BY title
LIMIT 10;


We can further rewrite our query involving finding the shows a specific actor has starred in!


SELECT title
FROM people
JOIN stars ON people.id = stars.person_id
JOIN shows ON stars.show_id = shows.id
WHERE name = 'Peter Dinklage';


Note the simplicity and better design of these code segments in comparison to our prior examples!


Perhaps an even better-designed version of the same query uses implicit joins:


SELECT title
FROM people, stars, shows
WHERE shows.id = stars.show_id
AND people.id = stars.person_id
AND name = 'Peter Dinklage';


One more thing to add – if, for instance, I couldn’t recall more than an actor’s first name and last initial, the % operator might come in handy:


SELECT title
FROM people, stars, shows
WHERE shows.id = stars.show_id
AND people.id = stars.person_id
AND name LIKE 'Peter D%';


Here, % acts as a wildcard, matching any sequence of characters and thereby finding all names beginning with 'Peter D'.


Indexes


If you’ve stuck around for the past several weeks, you’ll have sensed a pretty broad theme across programming – decreasing a program’s runtime often comes at the cost of more memory or other system resources (and vice versa)!


In SQL, runtime can be optimized via indexes:


CREATE INDEX title_index ON shows(title); creates an index, named title_index, on the title column of shows.


Using sqlite3’s .timer on setting, which reports each query’s runtime, Professor Malan numerically demonstrates the efficiency of indexing by running a query like SELECT * FROM shows WHERE title = 'The Office'; both with and without an index.
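
A rough sketch of that experiment at the sqlite> prompt (actual timings will vary by machine):

.timer on

-- Without an index, SQLite must scan every row of shows
SELECT * FROM shows WHERE title = 'The Office';

-- Build an index on the title column...
CREATE INDEX title_index ON shows(title);

-- ...and the same query now runs noticeably faster
SELECT * FROM shows WHERE title = 'The Office';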


So why are indexes so quick?


When an index is created, a data structure called a B-tree is built under the hood – similar to a binary tree, but where each node can have many more than two children – allowing SQLite to find rows without scanning the entire table.



A reasonable question to ask would be why indexes aren’t the default way of organizing data if they’re so quick, and that simply boils down to the programmer’s needs: an index takes up extra storage space and must be updated on every insert or update, so it’s a trade-off between storage and speed.


SQL in Python


With this newly gleaned knowledge of SQL in mind, let’s circle back to our favorites.py file…


import csv
from cs50 import SQL

# Open database
db = SQL("sqlite:///favorites.db")

# Prompt user for favorite
favorite = input("Favorite: ")

# Search for title
rows = db.execute("SELECT COUNT(*) FROM favorites WHERE problem LIKE ?", "%" + favorite + "%")

# Get first (and only) row
row = rows[0]

# Print popularity
print(row["COUNT(*)"])

Note that the SQL() function takes a URI ("sqlite:///favorites.db") that simply gives Python the location of our database within the codespace (a database which is, in and of itself, just a file).


Now, instead of having to read our file as a CSV and then transfer all of that data into a dictionary, we can simply SELECT the number of rows in favorites that contain the user-inputted favorite.


Finally, don’t forget that db.execute() always returns a list of dictionaries!


Final Thoughts


Closing off with something more light, this car tried to outsmart the license plate reader with a SQL injection attack.



If successful, the entire table containing all license-plate-related data would have been deleted from the system.


Anyways, if you enjoyed dabbling in SQL as much as I did, don’t worry – it’ll resurface in the next couple of weeks when we start on web design.


As per usual, I hope you’ll Seek-QL out this week’s Problem Set. This was Week 7!


Meanwhile, stay tuned for updates by following my blog and LinkedIn page.
