33. Common Table Expressions
A common table expression (CTE) is a special type of subquery. Like subqueries in the from
clause, they generate a temporary table that can be used in a larger SQL query. Unlike other subqueries, CTEs are defined at the start of a query using a special keyword: with
. Here's an example of a CTE:
with averages as (
select
account_id
,avg(amount) as average_gift
from transactions
group by account_id
)
select
a.account_id
,a.first_name
,a.last_name
,averages.average_gift
from accounts as a inner join averages on a.account_id = averages.account_id
In this query, we've built a CTE called averages
that computes the lifetime average gift for every donor in the database. Then in our main query, we have referenced the averages
table directly in our from
clause, just like we would any other table. CTEs must be:
- Set off and named with the
with <name> as
structure - Before the main query itself
- Enclosed in parenthesis
- Separated by commas (in the case of multiple CTEs in one query)
On the surface, CTEs function exactly like from
clause subqueries that we covered in the last chapter. But there are several reasons to prefer CTEs over standard from
clause subqueries.
- Easier to Read. Because CTEs are pulled to the top of the query rather than embedded in the query itself, they are usually easier to read and understand. This is especially true for a query that needs multiple subqueries, or subqueries that contain dense logic and/or several joins.
- Reusable. There will be times that you need to join a table to itself or otherwise re-use a table in query. CTEs can be directly re-used, while
from
clause subqueries cannot. We will look at an example of this later. - Recursion. CTEs can be used for recursion in SQL, meaning you can write a CTE thant refers to itself. This is not possible with
from
clause subqueries. This use of CTEs is beyond the scope of this course, but I've included this in the interest of completeness.
Generally, when you have the option, prefer using CTEs rather than from
clause subqueries. Because they are more flexible and easier to read, you should try to use CTEs instead of subqueries where possible. Depending on your database and the particulars of your needs, there are sometimes performance-based reasons to prefer from
clause subqueries. from
clause subqueries are also often faster to write (none of the with
boilerplate), so for quick work and one-off queries, using from
clause subqueries is fine.
Exercise 1: Refactor
In the last chapter, I gave the following SQL query as an example of a from
clause subquery. Rewrite this query using a CTE instead.
select
t.account_id
,t.amount
,ag.lifetime_average_gift
from transactions as t
inner join (
select
account_id
,avg(amount) as lifetime_average_gift
from transactions
group by account_id
) as ag on t.account_id = ag.account_id
where year(t.transaction_date) = 2021
Multiple CTEs
Just like with from
clause subqueries, there is (practically) no limit to the number of CTEs you use in a single query. What's more, any CTE can reference all CTEs defined earlier in the same query. This makes chunking the work of a complex query into several discrete parts much easier. Let's look at a complex example.
The sustainer team wants to know how much the average sustainer who made their first ever sustainer gift in calendar year 2023 was worth in their first 6 months on file. To compute this, we will need to build a few tables:
- A table that lists the first sustainer gift date for each
account_id
- A table that computes the total sustainer value from each
account_id
in the 6 months following their first sustainer gift - A table that aggregates this donor-level data into a single number for sustainers acquired in 2023
Let's do this in steps with CTEs:
with sustainer_join_dates as (
select
account_id
,min(transaction_date) as join_date
from transactions
where program = 'Sustainer'
group by account_id
)
,donor_value as (
select
sjd.account_id
,sjd.join_date
,sum(case
when t.transaction_date between sjd.join_date and sjd.join_date + INTERVAL 6 MONTHS
then amount
else 0
end) as six_month_value
from sustainer_join_dates as sjd
inner join transactions as t on t.account_id = sjd.account_id
where t.program = 'Sustainer'
group by sjd.account_id
,sjd.join_date
)
select
avg(six_month_value) as avg_six_month_value
from donor_value
where year(join_date) = 2023
There's a whole lot going on in this query, so let's break it down. First, we have a with
statement that indicates we will have some CTEs in this query. The first CTE is named sustainer_join_dates
and it builds the first logical block we need: a list of account_id
along with the earliest ever sustainer gift for that ID. That CTE is followed by another CTE, separated by a comma. This second CTE is named donor_value
. Note that we do not repeat the with
keyword. We use with
only one time to set off our list of CTEs and just separate them all with commas.
Our second CTE can use the result of our first CTE directly, as we can see. The donor_value
CTE selects directly from our sustainer_join_dates
CTE. This is a powerful design pattern that allows us to logically divvy up the work we need to do in complex query scenarios into smaller, more manageable pieces. Our second CTE takes the list of sustainer_join_date
data and joins it to transactions
to compute the total value contributed by each donor between their first sustainer payment date and the date 6 months later.
Finally, our CTE listing has ended which we know because we see our friend the select
keyword which denotes the start of our main query. This main query can access all of the CTEs we defined above and use them however we see fit. In this case, all we need to do is aggregate the result of our second CTE, donor_value
, for all 2023 sustainer join dates.
I cannot overstate how useful this design pattern is. You will turn to it over and over again to break complex business problems down into easily-solvable chunks. In this example we have in essence built two temporary tables, sustainer_join_dates
and donor_value
, from scratch and then used them to compute something that would be much more difficult to calculate any other way.
Exercise 2: Multiple CTEs
Membership is looking at cross-fund giving. They want to know, for all donors whose made their first ever Birds gift in 2021, how much revenue did they contribute, on average, to the Trees fund in their first 24 months on file. Output should be a single number: the average Trees value for these donors in their first 24 months. Write an SQL query using multiple CTEs to generate this data.