19. Group By
We've done about as much as we can using our trusty friends select
, from
, and where
and now it's time to learn about a new type of clause, group by
. The group by
clause is used in conjunction with aggregate functions to group rows that have the same values in specified columns so that you can apply aggregate functions to each group. Let's look at an example.
Let's say we've been asked to provide total revenue for 2023 broken out by Fund (Birds vs Trees). To do this, we will use sum(amount)
as we saw in the last chapter, and we will use group by
to split the total out by fund, like this:
Notice that if we want to see our aggregate value broken out by Fund, we include fund
in our select
statement, and then we include it again in our group by
statement. This query tells the database, "For all the rows in transactions where the date is in 2023, find each unique value for Fund and add up all the money that came in to that fund." This results in one row per unique fund
with the total amount of revenue for that fund. Some notes about group by
:
- Every column reference in your
select
statement that is not part of an aggregate function must be listed in yourgroup by
. - You can include non-aggregate functions (like
year()
,month()
,substring()
, etc.) in yourselect
andgroup by
clauses. - You can include
case
statements in yourselect
andgroup by
clauses as long as thecase
statements do not themselves contain aggregate functions. - You cannot include aggregate functions in your
group by
clause.group by
tells the database how to group the outputs of your aggregates. - You can only include expressions that contain column references in your
group by
; you cannot include bare literal values like a string'my string'
or a number like42
. - You can (and often will!) group by multiple columns.
Group by multiple columns
Grouping by multiple columns is as simple as adding additional column references to your select
and group by
clauses. For example, if we wanted to summarize 2023 revenue by both Fund and Channel we could modify the query above like so:
The result of this query is one row for every unique combination of fund
and transaction_channel
for transactions dated in 2023, with the sum of all amounts for each unique combination displayed in total_amount
.
Exercise 1: Group by
You need to know the date that each donor in the database joined your organization (the date of their first gift). Write an SQL query using min()
and group by
to list the first gift date for each account ID.
Exercise 2: Advanced group by
Your gift processing team is looking to move to a new credit card processing platform. As part of their work, they've requested a report that shows the total amount of money contributed via credit card, broken down by channel and the year of the transaction. Write an SQL query to generate this list.
Aggregates are Optional
Finally, I want to note that group by
can be used without any aggregate functions whatsoever. In practice, there are not many reasons you'd employ this design pattern, but I mention it in the interest of completeness. group by
without any aggregate functions does the same thing that it does when aggregate functions are present – it groups rows by unique values.
For example, we know that the query below will sum up the amount of money contributed to each fund and display the results by fund:
select
fund
,sum(amount) as total_amount
from transactions
group by fund
In this query, group by
functions as a way to tell the database, "I only want one row in my results for each unique value of fund." If we then remove the aggregate function altogether from our query, group by
performs exactly the same function:
We end up with one row for each unique value of fund in the transactions
table. When used in the absence of aggregate functions, you can think of group by
as a funhouse mirror version of select distinct
– it removes duplicate values. Because distinct
exists and is easier to read and understand (among a few other reasons beyond the scope of this course), if this is the kind of output you wanted, you'd probably just write this query as:
select
distinct fund
from transactions
It's shorter, more clear, and will produce identical results. Later in the course we will cover one scenario where you will sometimes use group by
without an aggregate function in your select
clause. For now, I just want you to know this is possible.