31. Subquery in Where Clause

The main use cases for subqueries in the where clause are in conjunction with the in and exists operators. Each of these operators requires a subquery to follow:

  • in requires a subquery that returns a single column that is interpreted by your database as a list of individual values.
  • exists requires a subquery that selects any value and returns either true or false based on the presence/absence of any value in the subquery.

A Subtle Difference

There is a distinction between the way that these two operators exist that I have not yet made explicit. The in operator requires the subquery to be executed only once. The entire subquery is executed once, generating a list of values that is more-or-less "plugged in" to your query directly. Let's look at this query for example:

select
    account_id
    ,transaction_date
    ,amount
from transactions
where account_id in (
    select 
        account_id
    from addresses
    where state = 'MO'
)

Here, we are getting some transaction data for people who live in the state of Missouri. The subquery is executed exactly one time because the list of account_id values in Missouri is not at all dependent on the outer query. That list will always be the same, regardless of which row in the transactions table the outer query is currently examining. It just builds a list, that's it.

Now let's consider an example with exists:

select 
    a.account_id
    ,a.first_name
    ,a.last_name
from accounts as a
where exists (
    select 'X'
    from transactions as t
    where t.account_id = a.account_id
        and t.amount >= 100
)

This query builds a list of IDs and names for donors who have ever made a gift of $100 or more. However, notice that in our exists subquery, we are referencing a table from our outer query in the where clause: where t.account_id = a.account_id. This feature makes this a correlated subquery, meaning a subquery that depends on the outer query to execute. You could not execute this subquery alone as there is no a table defined in its from clause. As such, this subquery executes one time for every row in the accounts table. It has to check for each record in accounts table if something exists in the transactions table.

There are times when either in or exists can be used. Depending on the particulars of your database and your data structure, it may be preferable to use one approach or the other for performance reasons. Understanding the difference in how these two queries are executed will help you choose the right tool for the job.

Exercise 1: Subquery

Major gifts needs a list of donors who have ever given a gift of $5,000 or more. Output should include:

  • Donor ID
  • First Name
  • Last Name
  • Street Address
  • City
  • State
  • Zip Code
  • Email (if available)

Write an SQL query using a subquery to generate this list.

No Results

Exercise 2: Advanced

Major gifts has come back to you with a revised request. Instead of building a list of donors who have ever given a single gift of at least $5,000, they want a list of all donors who have ever cumulatively given $5,000 or more. Rewrite your query from Exercise 1 to produce this list.

No Results

Scalar Subqueries

The third type of subquery you can use in the where clause is called a scalar subquery. These subqueries can be either correlated or uncorrelated, meaning you can construct them to execute once per row in the outer table (like exists) or only once for the entire query (like in). Because correlated scalar subqueries frequently come with performance downsides, I will only be reviewing uncorrelated scalar subqueries in this course.

Scalar subqueries return exactly one value – a single column and a single row. They are most often used to either retrieve or calculate a value from somewhere else in the database and use that value in a comparison in your query. Let's look at an example. In the query below, we are using a scalar subquery to get the overall average gift amount from our transactions table and using it to find gifts that are above average:

select
    t.account_id
    ,t.amount
from transactions as t
where t.amount > (select avg(amount) from transactions)

Here, our subquery returns exactly one value: the average amount from the transactions table. This is an uncorrelated query since it makes no reference to and does not depend on the outer query – you could execute the query as a standalone query and it would work just fine. It will return a single number which will be "plugged in" to our query, just like we saw with in.

In practice, you probably will not have much use for this type of subquery. But in the few cases where it's useful, it can be very useful.

Exercise 3: Scalar Subquery

Direct Marketing is looking to understand which of their campaigns have high volume. They would like you to generate a list of all source_code values in the database where the quantity is above average. Output should include:

  • Source Code
  • Quantity

Write an SQL query using a scalar subquery to generate this list.

No Results