39. row_number()
We've already seen row_number()
in action in the intro chapter. In this section, we'll unpack more details about row_number()
specifically and use this function as a platform for exploring what's possible with the over()
clause more generally. The row_number()
window function has several common uses in fundraising, some of which are:
- Identifying the first gift from a donor/to a campaign
- Identifying the most recent gift made by a donor/to a campaign
- Identifying the largest/smallest gift made by a donor
- De-duplicating data
Any time you need to order rows and select the 1st or the Nth row in a set, you'll want to use row_number()
. In the previous chapter, we walked through an example using row_number()
to find the first gift details for all donors in the database. We can use exactly the same design pattern to find each donor's most recent contribution (MRC):
If you compare this query to the one we used in the last chapter, you'll see that everything is exactly the same except in our over()
clause, we specify order by transaction_date desc
rather than asc
. Our query is still considering each account_id
value to be its own group (because we are still using partition by account_id
), but we've changed the way our row numbers are ordered. By ordering by desc
, we tell the database to make the last (i.e., most recent) transaction row number 1, with the next most recent transaction listed as number 2, and on like that. To drive this home a little:
- The column(s) you specify in the
over()
clause'sorder by
statement tells the database which column(s) in the row should be used to determine the order. - By specifying
asc
ordesc
, you tell the database whether to assign row number 1 to either the lowest or highest value of the column you specified (transaction_date
in this case).
Exercise 1: Row Numbering
The sustainer team is trying to get some more information about how donors are acquired into the program. They'd like a list of the amount and channel of every sustainer's first ever sustainer gift. Output should include:
- Donor ID
- First sustainer gift date
- First sustainer gift channel
- First sustainer gift amount
Write an SQL query using row_number()
to generate this list.
Multiple Columns in order by
Just like the order by
clause in a query, the order by
statement in an over()
clause allows you to sort data by multiple columns. In our work, this is most helpful when you need to "break ties". Consider a donor who makes two gifts on the same day but they are of different amounts. We want to use their MRC to build an ask string for an upcoming campaign, but which of these two gifts (on the same day) should "count" as their MRC? Generally, we'd want the largest of the gifts to be listed as the MRC. We can do that in SQL as follows:
In this query, we've updated the row_number()
function in our qualify
clause to specify that we want to order by transaction_date desc, amount desc
. This tells the database, "First, sort the rows in each group by transaction date descending. If there are multiple gifts on the same date, sort those such that the largest amount comes first, then the second largest amount, and so on."
Exercise 2: Order By
Write an SQL query to find each donor's first ever gift to the Trees fund. In the event that multiple gifts came in on the same day, take the one with the highest amount. Output should include:
- Donor ID
- Transaction Channel
- Amount
- Transaction Date
Multiple Columns in partition by
Finally, note that you can use multiple columns or expressions in the partition by
statement as well, separated by commas. This allows you to define the groups that your window functions operate over with precision. As an example of how this may be used, consider how we would find each donor's first gift to each of the Birds and Trees funds. We want to group rows by two values:
- The
account_id
on the row - The
fund
on the row
We can do this as follows:
Most of the donors in our database have only given to one of the two funds. So for example, there is only one row for donor 437 here, because all of this donor's gifts are to the Trees fund. But look at account ID 1039. This ID appears twice in our results, one row each for their first gift to each of the Birds and Trees funds.
Exercise 3: Partition By
Membership is looking to do some analysis of how donors give to the one-time and sustainer programs. They'd like a list of transactions that includes the first ever one-time program gift and first ever sustainer gift for each donor in the database. Output should include:
- Donor ID
- Program
- Transaction Date
- Amount
In the event that a donor has given two gifts to the same program on the same day, include only the one with the higher amount. Write an SQL query using row_number()
to generate this list.
Troubleshooting
When you're getting started with window functions, you'll often receive results that aren't what you expect. Unfortunately, that's just part of the process. In these cases, I find it helpful to break my query down to the barest elements and focus on a single account_id
or transaction_id
(a single instance of whatever you are partitioning by). Here's an example of what I mean.
Thinking about Exercise 3, let's say we used the following code:
select
account_id
,program
,transaction_date
,amount
from transactions
qualify row_number() over(partition by account_id order by transaction_date asc, amount desc) = 1
We notice during our quality control checking before delivering the data to Membership that each account_id
only appears once in the results. This shouldn't be the case, since we know there are many donors in our database who have given both one-time and sustainer gifts, and each of those donors should be in these results twice (one row for their first one-time gift, one row for their first sustainer gift).
How can we unpack this query a bit to see what we're doing wrong? I'll break this down into a few steps, which we will work through as exercises.
Exercise 4: Find an example donor ID
We are partitioning by account_id
here, so we want to find a good example ID to use in our troubleshooting. We need to find someone who has given at least one sustainer gift and one one-time gift. Write an SQL query to find a list of donors who meet this criterion.
Exercise 5: Break Down Your Query
We've got at list of sample IDs that will work. I'm going to use ID number 1089846 in my code, but you can use any ID returned by the results of Exercise 4. We want to examine what all of the rows in the transaction table look like (including our row_number()
column) for our example ID. Modify the "broken" query above to select all columns and rows from transactions
for our sample ID, and include the "broken" row_number()
column as well.
Exercise 6: Diagnose and Correct
Because we are partitioning by account_id
, the results of the solution to Exercise 5 give us a look at one single partition: a list of all the rows that are grouped under account_id
1089846 (or whichever ID you chose). Let's scroll over to the right of these results and inspect our rnum
column (or whatever you named the column created by your row_number()
).
Think critically about what you would expect. We are trying to arrive at a situation where:
- The donor's first ever one-time gift has
rnum
= 1 - All other one-time gifts have
rnum
greater than 1 - The donor's first ever sustainer gift has
rnum
= 1 - All other sustainer gifts have
rnum
greater than 1
When we inspect the rnum
values, however, we notice that rnum
starts at 1 for the donor's first ever gift (of any kind) and then just starts going up and does not reset once we have moved from their one-time to sustainer gifts. If the row_number()
does not reset when we expect, that means we are not partitioning by the right column(s), since partition by
tells the database which rows to consider part of a group. Correct the partition by
statement and run the results for this same single ID again to verify that your fix has worked.
Now that you've worked through the problem with a manageable number of rows and a single account_id
, you can remove the where
clause and make the needed changes to the select
clause (and possibly add qualify
), and you've arrived at the correct results for Exercise 3.
This troubleshooting pattern, looking at a small set of example records and inferring problems in your code from there, is incredibly helpful. It works well when using window functions, but works equally well when troubleshooting all kinds of queries gone awry.