33. Common Table Expressions

A common table expression (CTE) is a special type of subquery. Like subqueries in the from clause, they generate a temporary table that can be used in a larger SQL query. Unlike other subqueries, CTEs are defined at the start of a query using a special keyword: with. Here's an example of a CTE:

with averages as (
    select
        account_id
        ,avg(amount) as average_gift
    from transactions
    group by account_id
)

select
    a.account_id
    ,a.first_name
    ,a.last_name
    ,averages.average_gift
from accounts as a inner join averages on a.account_id = averages.account_id

In this query, we've built a CTE called averages that computes the lifetime average gift for every donor in the database. Then in our main query, we have referenced the averages table directly in our from clause, just like we would any other table. CTEs must be:

  • Set off and named with the with <name> as structure
  • Before the main query itself
  • Enclosed in parenthesis
  • Separated by commas (in the case of multiple CTEs in one query)

On the surface, CTEs function exactly like from clause subqueries that we covered in the last chapter. But there are several reasons to prefer CTEs over standard from clause subqueries.

  • Easier to Read. Because CTEs are pulled to the top of the query rather than embedded in the query itself, they are usually easier to read and understand. This is especially true for a query that needs multiple subqueries, or subqueries that contain dense logic and/or several joins.
  • Reusable. There will be times that you need to join a table to itself or otherwise re-use a table in query. CTEs can be directly re-used, while from clause subqueries cannot. We will look at an example of this later.
  • Recursion. CTEs can be used for recursion in SQL, meaning you can write a CTE thant refers to itself. This is not possible with from clause subqueries. This use of CTEs is beyond the scope of this course, but I've included this in the interest of completeness.

Generally, when you have the option, prefer using CTEs rather than from clause subqueries. Because they are more flexible and easier to read, you should try to use CTEs instead of subqueries where possible. Depending on your database and the particulars of your needs, there are sometimes performance-based reasons to prefer from clause subqueries. from clause subqueries are also often faster to write (none of the with boilerplate), so for quick work and one-off queries, using from clause subqueries is fine.

Exercise 1: Refactor

In the last chapter, I gave the following SQL query as an example of a from clause subquery. Rewrite this query using a CTE instead.

select
    t.account_id
    ,t.amount
    ,ag.lifetime_average_gift
from transactions as t
    inner join (
            select
                account_id
                ,avg(amount) as lifetime_average_gift
            from transactions
            group by account_id
        ) as ag on t.account_id = ag.account_id
where year(t.transaction_date) = 2021
No Results

Multiple CTEs

Just like with from clause subqueries, there is (practically) no limit to the number of CTEs you use in a single query. What's more, any CTE can reference all CTEs defined earlier in the same query. This makes chunking the work of a complex query into several discrete parts much easier. Let's look at a complex example.

The sustainer team wants to know how much the average sustainer who made their first ever sustainer gift in calendar year 2023 was worth in their first 6 months on file. To compute this, we will need to build a few tables:

  • A table that lists the first sustainer gift date for each account_id
  • A table that computes the total sustainer value from each account_id in the 6 months following their first sustainer gift
  • A table that aggregates this donor-level data into a single number for sustainers acquired in 2023

Let's do this in steps with CTEs:

with sustainer_join_dates as (
    select
        account_id
        ,min(transaction_date) as join_date
    from transactions
    where program = 'Sustainer'
    group by account_id
)

,donor_value as (
    select
        sjd.account_id
        ,sjd.join_date
        ,sum(case 
                when t.transaction_date between sjd.join_date and sjd.join_date + INTERVAL 6 MONTHS
                    then amount
                else 0
            end) as six_month_value
    from sustainer_join_dates as sjd
        inner join transactions as t on t.account_id = sjd.account_id
    where t.program = 'Sustainer'
    group by sjd.account_id
        ,sjd.join_date
)

select
    avg(six_month_value) as avg_six_month_value
from donor_value
where year(join_date) = 2023

There's a whole lot going on in this query, so let's break it down. First, we have a with statement that indicates we will have some CTEs in this query. The first CTE is named sustainer_join_dates and it builds the first logical block we need: a list of account_id along with the earliest ever sustainer gift for that ID. That CTE is followed by another CTE, separated by a comma. This second CTE is named donor_value. Note that we do not repeat the with keyword. We use with only one time to set off our list of CTEs and just separate them all with commas.

Our second CTE can use the result of our first CTE directly, as we can see. The donor_value CTE selects directly from our sustainer_join_dates CTE. This is a powerful design pattern that allows us to logically divvy up the work we need to do in complex query scenarios into smaller, more manageable pieces. Our second CTE takes the list of sustainer_join_date data and joins it to transactions to compute the total value contributed by each donor between their first sustainer payment date and the date 6 months later.

Finally, our CTE listing has ended which we know because we see our friend the select keyword which denotes the start of our main query. This main query can access all of the CTEs we defined above and use them however we see fit. In this case, all we need to do is aggregate the result of our second CTE, donor_value, for all 2023 sustainer join dates.

I cannot overstate how useful this design pattern is. You will turn to it over and over again to break complex business problems down into easily-solvable chunks. In this example we have in essence built two temporary tables, sustainer_join_dates and donor_value, from scratch and then used them to compute something that would be much more difficult to calculate any other way.

Exercise 2: Multiple CTEs

Membership is looking at cross-fund giving. They want to know, for all donors whose made their first ever Birds gift in 2021, how much revenue did they contribute, on average, to the Trees fund in their first 24 months on file. Output should be a single number: the average Trees value for these donors in their first 24 months. Write an SQL query using multiple CTEs to generate this data.

No Results