31. Subquery in Where Clause
The main use cases for subqueries in the where
clause are in conjunction with the in and exists operators. Each of these operators requires a subquery to follow:
in
requires a subquery that returns a single column that is interpreted by your database as a list of individual values.exists
requires a subquery that selects any value and returns either true or false based on the presence/absence of any value in the subquery.
A Subtle Difference
There is a distinction between the way that these two operators exist that I have not yet made explicit. The in
operator requires the subquery to be executed only once. The entire subquery is executed once, generating a list of values that is more-or-less "plugged in" to your query directly. Let's look at this query for example:
select
account_id
,transaction_date
,amount
from transactions
where account_id in (
select
account_id
from addresses
where state = 'MO'
)
Here, we are getting some transaction data for people who live in the state of Missouri. The subquery is executed exactly one time because the list of account_id
values in Missouri is not at all dependent on the outer query. That list will always be the same, regardless of which row in the transactions
table the outer query is currently examining. It just builds a list, that's it.
Now let's consider an example with exists
:
select
a.account_id
,a.first_name
,a.last_name
from accounts as a
where exists (
select 'X'
from transactions as t
where t.account_id = a.account_id
and t.amount >= 100
)
This query builds a list of IDs and names for donors who have ever made a gift of $100 or more. However, notice that in our exists
subquery, we are referencing a table from our outer query in the where
clause: where t.account_id = a.account_id
. This feature makes this a correlated subquery, meaning a subquery that depends on the outer query to execute. You could not execute this subquery alone as there is no a
table defined in its from
clause. As such, this subquery executes one time for every row in the accounts
table. It has to check for each record in accounts
table if something exists
in the transactions
table.
There are times when either in
or exists
can be used. Depending on the particulars of your database and your data structure, it may be preferable to use one approach or the other for performance reasons. Understanding the difference in how these two queries are executed will help you choose the right tool for the job.
Exercise 1: Subquery
Major gifts needs a list of donors who have ever given a gift of $5,000 or more. Output should include:
- Donor ID
- First Name
- Last Name
- Street Address
- City
- State
- Zip Code
- Email (if available)
Write an SQL query using a subquery to generate this list.
Exercise 2: Advanced
Major gifts has come back to you with a revised request. Instead of building a list of donors who have ever given a single gift of at least $5,000, they want a list of all donors who have ever cumulatively given $5,000 or more. Rewrite your query from Exercise 1 to produce this list.
Scalar Subqueries
The third type of subquery you can use in the where
clause is called a scalar subquery. These subqueries can be either correlated or uncorrelated, meaning you can construct them to execute once per row in the outer table (like exists
) or only once for the entire query (like in
). Because correlated scalar subqueries frequently come with performance downsides, I will only be reviewing uncorrelated scalar subqueries in this course.
Scalar subqueries return exactly one value – a single column and a single row. They are most often used to either retrieve or calculate a value from somewhere else in the database and use that value in a comparison in your query. Let's look at an example. In the query below, we are using a scalar subquery to get the overall average gift amount from our transactions
table and using it to find gifts that are above average:
select
t.account_id
,t.amount
from transactions as t
where t.amount > (select avg(amount) from transactions)
Here, our subquery returns exactly one value: the average amount from the transactions table. This is an uncorrelated query since it makes no reference to and does not depend on the outer query – you could execute the query as a standalone query and it would work just fine. It will return a single number which will be "plugged in" to our query, just like we saw with in
.
In practice, you probably will not have much use for this type of subquery. But in the few cases where it's useful, it can be very useful.
Exercise 3: Scalar Subquery
Direct Marketing is looking to understand which of their campaigns have high volume. They would like you to generate a list of all source_code
values in the database where the quantity is above average. Output should include:
- Source Code
- Quantity
Write an SQL query using a scalar subquery to generate this list.