19. Group By

We've done about as much as we can using our trusty friends select, from, and where and now it's time to learn about a new type of clause, group by. The group by clause is used in conjunction with aggregate functions to group rows that have the same values in specified columns so that you can apply aggregate functions to each group. Let's look at an example.

Let's say we've been asked to provide total revenue for 2023 broken out by Fund (Birds vs Trees). To do this, we will use sum(amount) as we saw in the last chapter, and we will use group by to split the total out by fund, like this:

No Results

Notice that if we want to see our aggregate value broken out by Fund, we include fund in our select statement, and then we include it again in our group by statement. This query tells the database, "For all the rows in transactions where the date is in 2023, find each unique value for Fund and add up all the money that came in to that fund." This results in one row per unique fund with the total amount of revenue for that fund. Some notes about group by:

  • Every column reference in your select statement that is not part of an aggregate function must be listed in your group by.
  • You can include non-aggregate functions (like year(), month(), substring(), etc.) in your select and group by clauses.
  • You can include case statements in your select and group by clauses as long as the case statements do not themselves contain aggregate functions.
  • You cannot include aggregate functions in your group by clause. group by tells the database how to group the outputs of your aggregates.
  • You can only include expressions that contain column references in your group by; you cannot include bare literal values like a string 'my string' or a number like 42.
  • You can (and often will!) group by multiple columns.

Group by multiple columns

Grouping by multiple columns is as simple as adding additional column references to your select and group by clauses. For example, if we wanted to summarize 2023 revenue by both Fund and Channel we could modify the query above like so:

No Results

The result of this query is one row for every unique combination of fund and transaction_channel for transactions dated in 2023, with the sum of all amounts for each unique combination displayed in total_amount.

Exercise 1: Group by

You need to know the date that each donor in the database joined your organization (the date of their first gift). Write an SQL query using min() and group by to list the first gift date for each account ID.

No Results

Exercise 2: Advanced group by

Your gift processing team is looking to move to a new credit card processing platform. As part of their work, they've requested a report that shows the total amount of money contributed via credit card, broken down by channel and the year of the transaction. Write an SQL query to generate this list.

No Results

Aggregates are Optional

Finally, I want to note that group by can be used without any aggregate functions whatsoever. In practice, there are not many reasons you'd employ this design pattern, but I mention it in the interest of completeness. group by without any aggregate functions does the same thing that it does when aggregate functions are present – it groups rows by unique values.

For example, we know that the query below will sum up the amount of money contributed to each fund and display the results by fund:

select
    fund
    ,sum(amount) as total_amount
from transactions
group by fund

In this query, group by functions as a way to tell the database, "I only want one row in my results for each unique value of fund." If we then remove the aggregate function altogether from our query, group by performs exactly the same function:

No Results

We end up with one row for each unique value of fund in the transactions table. When used in the absence of aggregate functions, you can think of group by as a funhouse mirror version of select distinct – it removes duplicate values. Because distinct exists and is easier to read and understand (among a few other reasons beyond the scope of this course), if this is the kind of output you wanted, you'd probably just write this query as:

select 
    distinct fund
from transactions

It's shorter, more clear, and will produce identical results. Later in the course we will cover one scenario where you will sometimes use group by without an aggregate function in your select clause. For now, I just want you to know this is possible.