26. Thinking about Joins

You will make much better choices about which type of join to use (and understand your results better) if you understand the underlying data well. If you feel like there's something about your data that you don't understand fully, it's best to do some exploration before diving into writing your join. You should be able to answer four questions about your data before writing a join:

  1. What type of relationship exists between my tables: one-to-one, one-to-many, or many-to-many?
  2. Do any of the fields I'm using in my on clause contain NULL values?
  3. Is there guaranteed to be at least one matching row in the other table for each row in each of my tables?
  4. Do I need to include unmatched rows in my results?

If you already have a lot of experience with the underlying data and you've been writing SQL for a few months, you'll probably be able to answer these questions without even thinking about it most of the time. But if you're ever unsure, it doesn't hurt to investigate a little bit before writing your code. Here are some tips for where to start.

Types of Relationships

The easiest way to check what kind of relationship exists between your two tables is to check for duplicates on your join key (the column(s) you're using in your on clause). Let's say we don't know the relationship between accounts and transactions. We know that we can connect these two tables using account_id, but is the relationship:

  • One-to-one
  • One-to-many
  • Many-to-many

Well, let's check. First, we will look to see if there are any duplicates on account_id in our accounts table. Write an SQL query to check.

No Results

Ok, so there are no duplicates on account_id in accounts. This makes since because we know that account_id is the primary key of accounts, which guarantees uniqueness. But it's still a good idea to check if you're unsure. This means we're dealing with either a one-to-one or one-to-many relationship. Let's check transactions:

No Results

In transactions, we see plenty of duplicates on account_id. This also makes sense, because we know many donors will give more than one gift in their lifetime. So this means that for every one account_id in accounts, there can be many in transactions. This is a one-to-many relationship. The type of relationship tells you something about how many rows you'll have in your results:

  • One-to-one – No duplicates mean that there will be no duplicated data in your results. There will be one row for every account_id in our example.
  • One-to-many – One row in the first table matches multiple rows in the second. This means some data (usually from the "one" side) will be repeated in your results. For example, one account with 10 transactions will produce 10 rows—one for each transaction that matches the same account_id.
  • Many-to-many – 🚨Danger, Will Robinson! This relationship can create a lot of rows, especially if both sides contain duplicates on the join key. Many-to-many joins are less common, but when they happen, they're prone to generating large, messy result sets with excessive duplication. Proceed with caution and check your results carefully.

NULL Values

Next, we need to determine whether our join key (account_id in this example) is NULL in either of our tables. We can check this with a couple simple queries on accounts and transactions:

No Results
No Results

There are no NULL values in either of our tables for our join keys. Because NULL cannot be compared with typical operators like = and <>, if there are NULL values in one of your join key columns you will need to handle that data differently.

Unmatched Rows

Next, we will check to see if every row in accounts has at least one matching row in transactions and vice versa. One way we can do this is with not exists. Write an SQL query to perform these checks.

No Results

These results tell us that every row in accounts has at least one matching row in transactions and vice versa. When this is the case, you can use either an inner join or a left join and the results will be the same. If one or both of these queries had returned some results, it would indicate that an inner join would lose some rows, so you should consider whether you need all the data (when you can use a left join) or whether you do not (when you can proceed with inner join).

What Do I Need?

There's no simple test here, you just need to turn on your brain and think critically. If the relationships between your tables are such that there will be some data lost if you use inner join, is that desirable or useful? For the rows that would match, is there data from the second table that you need to bring into the results? If so, you will need a left join.