19. Group By
We've done about as much as we can using our trusty friends select, from, and where and now it's time to learn about a new type of clause, group by. The group by clause is used in conjunction with aggregate functions to group rows that have the same values in specified columns so that you can apply aggregate functions to each group. Let's look at an example.
Let's say we've been asked to provide total revenue for 2023 broken out by Fund (Birds vs Trees). To do this, we will use sum(amount) as we saw in the last chapter, and we will use group by to split the total out by fund, like this:
Notice that if we want to see our aggregate value broken out by Fund, we include fund in our select statement, and then we include it again in our group by statement. This query tells the database, "For all the rows in transactions where the date is in 2023, find each unique value for Fund and add up all the money that came in to that fund." This results in one row per unique fund with the total amount of revenue for that fund. Some notes about group by:
- Every column reference in your
selectstatement that is not part of an aggregate function must be listed in yourgroup by. - You can include non-aggregate functions (like
year(),month(),substring(), etc.) in yourselectandgroup byclauses. - You can include
casestatements in yourselectandgroup byclauses as long as thecasestatements do not themselves contain aggregate functions. - You cannot include aggregate functions in your
group byclause.group bytells the database how to group the outputs of your aggregates. - You can only include expressions that contain column references in your
group by; you cannot include bare literal values like a string'my string'or a number like42. - You can (and often will!) group by multiple columns.
Group by multiple columns
Grouping by multiple columns is as simple as adding additional column references to your select and group by clauses. For example, if we wanted to summarize 2023 revenue by both Fund and Channel we could modify the query above like so:
The result of this query is one row for every unique combination of fund and transaction_channel for transactions dated in 2023, with the sum of all amounts for each unique combination displayed in total_amount.
Exercise 1: Group by
You need to know the date that each donor in the database joined your organization (the date of their first gift). Write an SQL query using min() and group by to list the first gift date for each account ID.
Exercise 2: Advanced group by
Your gift processing team is looking to move to a new credit card processing platform. As part of their work, they've requested a report that shows the total amount of money contributed via credit card, broken down by channel and the year of the transaction. Write an SQL query to generate this list.
Aggregates are Optional
Finally, I want to note that group by can be used without any aggregate functions whatsoever. In practice, there are not many reasons you'd employ this design pattern, but I mention it in the interest of completeness. group by without any aggregate functions does the same thing that it does when aggregate functions are present – it groups rows by unique values.
For example, we know that the query below will sum up the amount of money contributed to each fund and display the results by fund:
select
fund
,sum(amount) as total_amount
from transactions
group by fundIn this query, group by functions as a way to tell the database, "I only want one row in my results for each unique value of fund." If we then remove the aggregate function altogether from our query, group by performs exactly the same function:
We end up with one row for each unique value of fund in the transactions table. When used in the absence of aggregate functions, you can think of group by as a funhouse mirror version of select distinct – it removes duplicate values. Because distinct exists and is easier to read and understand (among a few other reasons beyond the scope of this course), if this is the kind of output you wanted, you'd probably just write this query as:
select
distinct fund
from transactionsIt's shorter, more clear, and will produce identical results. Later in the course we will cover one scenario where you will sometimes use group by without an aggregate function in your select clause. For now, I just want you to know this is possible.