29. Multiple Joins
So far, all of our join examples have featured exactly one join that combines two tables. But SQL is very flexible and allows you to join as many tables as you'd like in a single query. In practice, many queries you write will need to combine data from several tables to build a single resultset. This chapter will walk you through the subtleties of using multiple joins.
Working in Harmony
First, remember that the purpose of a join is to combine multiple tables into a single logical dataset from which you can select
, filter with where
, etc. No matter how many joins are in a single query, when your database is done resolving your from
clause and all the joins, it will be left with one big dataset. That dataset will be comprised of all the columns from all the tables you have listed and some subset of the rows from those tables according to your on
clauses.
I have found that the easiest way to manage multiple joins is to imagine that the database builds this big dataset one table at a time in the order in which you've written your joins. This is not strictly true, but if you write code with this assumption in mind, you cannot go wrong. Let's look at an example to see why I believe this assumption makes writing code easier. Here we have a simple inner join
:
This query joins accounts and transactions together, returning all columns from both tables for transactions in 2024. What happens if we add another inner join
to this query:
Let's apply the principle outlined above: imagine that the database builds this big dataset one table at a time in the order in which you've written your joins. If we think about it this way, the database first joins accounts
and transactions
. This results in some intermediate table that is identical to the output of our first query. Then to the output of that first join, it now joins the source_codes
table. This results in the final output, which is all columns from all three of these tables where the transaction_date
is in 2024. Let's add another one:
Applying our principle again, we can imagine the database builds this output as:
- Join
accounts
andtransactions
together and produce an intermediate result. - Take that intermediate result and join
source_codes
to it and produce another intermediate result. - Take that second intermediate result and join
emails
to it, producing our final result. - From that final result, apply the
where
clause, returning only rows wheretransaction_date
is in 2024.
In this chain of explanation, we've seen three queries. The first two each returned 668 pages of results. But the third one only returned 333 pages of results. What is different about the third query that reduced the number of results? Only half of the donors in our database have an email address on file, so when we inner join
the emails
table to the result of our first two inner joins
, we are only getting results back where the join conditions of that third join are true: where the donor has an entry in the emails
table.
If we consistently think of the joins being completed in stepwise fashion, with each join generating an intermediate resultset that is used as part of the next join, it is more clear how this result set is built.
Easier Troubleshooting
If you're not getting the results you expect from your series of joins, it can be helpful to remove them one by one and examine the results (basically working the process above but in reverse). Work your way back to a place where you have a set of joins that provides the output you expect, and then add joins in one by one, looking at the resulting data each time to verify that you understand:
- Which new data has become available as a result of the newly added join? Can I explain why?
- Which data that was previously available has now become unavailable as a result of the newly added join? Can I explain why?
- Did the row count in my results change, if so, how (increase or decrease) and by what magnitude? Can I explain why?
- Do I need to make any changes to the newly-added join to keep/restore/add/remove data that should/shouldn't be available?
Working stepwise on your from
clause in this way can solve a lot of problems. You cannot effectively troubleshoot your where
, group by
, select
, having
and order by
clauses if you do not first understand how the database is resolving your from
clause and all its joins.
Exercise 1: Multiple inner joins
Membership wants to send thank you messages to donors who gave in August 2023. They'd like a list of all August 2023 transactions with the following information appended:
- Account ID
- First Name
- Last Name
- Street Address
- City
- State
- Zip Code
- Transaction Date
- Transaction Amount
Write an SQL query using multiple joins to produce this data.
Exercise 2: Multiple inner joins
Development wants a list of all gifts of $5,000 or more, all-time, that includes:
- Account ID
- First Name
- Last Name
- Transaction Date
- Transaction Amount
- Source Code
- Source Code Channel
Write an SQL query to generate this list.
Mixing Join Types
The examples above used only inner join
, but many times you will need to mix inner join
and left join
in a single query. This happens most often when you are working with data that applies to only some of your donors or some of your transactions. In the data we are using for this course, the only table that contains this type of data is emails
, where not every donor in our database has an email on file.
Using multiple join types is straightforward, just include them like you would any other join. When you're getting started, I recommend including any inner joins
first, followed by left joins
. This is because inner join
can result in dropping/filtering out data (when there is no matching record in one of the tables), while left join
carries no such risk. By ordering your joins this way, you can make troubleshooting and logically thinking about your dataset a bit easier.
Exercise 3: Mixed Joins
The Events team wants to send both email and physical mail invitations to an upcoming fundraising event to all donors living in the state of Virginia. Output should include:
- Donor ID
- First Name
- Last Name
- Street Address
- City
- State
- Zip Code
- Email address
They want everyone from Virginia to be included in this list, even if they don't have an email. Write an SQL query to generate the list.