26. Thinking about Joins
You will make much better choices about which type of join
to use (and understand your results better) if you understand the underlying data well. If you feel like there's something about your data that you don't understand fully, it's best to do some exploration before diving into writing your join
. You should be able to answer four questions about your data before writing a join
:
- What type of relationship exists between my tables: one-to-one, one-to-many, or many-to-many?
- Do any of the fields I'm using in my
on
clause containNULL
values? - Is there guaranteed to be at least one matching row in the other table for each row in each of my tables?
- Do I need to include unmatched rows in my results?
If you already have a lot of experience with the underlying data and you've been writing SQL for a few months, you'll probably be able to answer these questions without even thinking about it most of the time. But if you're ever unsure, it doesn't hurt to investigate a little bit before writing your code. Here are some tips for where to start.
Types of Relationships
The easiest way to check what kind of relationship exists between your two tables is to check for duplicates on your join key (the column(s) you're using in your on
clause). Let's say we don't know the relationship between accounts
and transactions
. We know that we can connect these two tables using account_id
, but is the relationship:
- One-to-one
- One-to-many
- Many-to-many
Well, let's check. First, we will look to see if there are any duplicates on account_id
in our accounts
table. Write an SQL query to check.
Ok, so there are no duplicates on account_id
in accounts
. This makes since because we know that account_id
is the primary key of accounts
, which guarantees uniqueness. But it's still a good idea to check if you're unsure. This means we're dealing with either a one-to-one or one-to-many relationship. Let's check transactions
:
In transactions
, we see plenty of duplicates on account_id
. This also makes sense, because we know many donors will give more than one gift in their lifetime. So this means that for every one account_id
in accounts
, there can be many in transactions
. This is a one-to-many relationship. The type of relationship tells you something about how many rows you'll have in your results:
- One-to-one – No duplicates mean that there will be no duplicated data in your results. There will be one row for every account_id in our example.
- One-to-many – One row in the first table matches multiple rows in the second. This means some data (usually from the "one" side) will be repeated in your results. For example, one account with 10 transactions will produce 10 rows—one for each transaction that matches the same
account_id
. - Many-to-many – 🚨Danger, Will Robinson! This relationship can create a lot of rows, especially if both sides contain duplicates on the join key. Many-to-many joins are less common, but when they happen, they're prone to generating large, messy result sets with excessive duplication. Proceed with caution and check your results carefully.
NULL
Values
Next, we need to determine whether our join key (account_id
in this example) is NULL
in either of our tables. We can check this with a couple simple queries on accounts
and transactions
:
There are no NULL
values in either of our tables for our join keys. Because NULL
cannot be compared with typical operators like =
and <>
, if there are NULL
values in one of your join key columns you will need to handle that data differently.
Unmatched Rows
Next, we will check to see if every row in accounts
has at least one matching row in transactions
and vice versa. One way we can do this is with not exists
. Write an SQL query to perform these checks.
These results tell us that every row in accounts
has at least one matching row in transactions
and vice versa. When this is the case, you can use either an inner join
or a left join
and the results will be the same. If one or both of these queries had returned some results, it would indicate that an inner join
would lose some rows, so you should consider whether you need all the data (when you can use a left join
) or whether you do not (when you can proceed with inner join
).
What Do I Need?
There's no simple test here, you just need to turn on your brain and think critically. If the relationships between your tables are such that there will be some data lost if you use inner join
, is that desirable or useful? For the rows that would match, is there data from the second table that you need to bring into the results? If so, you will need a left join
.
Tip
When in doubt, start with a LEFT JOIN and see whether unmatched rows appear. Are they useful? Do we need them? This can teach you something new about your data and help understand which join type is appropriate.