39. row_number()

We've already seen row_number() in action in the intro chapter. In this section, we'll unpack more details about row_number() specifically and use this function as a platform for exploring what's possible with the over() clause more generally. The row_number() window function has several common uses in fundraising, some of which are:

Identifying the first gift from a donor/to a campaign
Identifying the most recent gift made by a donor/to a campaign
Identifying the largest/smallest gift made by a donor
De-duplicating data

Any time you need to order rows and select the 1st or the Nth row in a set, you'll want to use row_number(). In the previous chapter, we walked through an example using row_number() to find the first gift details for all donors in the database. We can use exactly the same design pattern to find each donor's most recent contribution (MRC):

No Results

Page / 1,000 10 of 10,000 records

If you compare this query to the one we used in the last chapter, you'll see that everything is exactly the same except in our over() clause, we specify order by transaction_date desc rather than asc. Our query is still considering each account_id value to be its own group (because we are still using partition by account_id), but we've changed the way our row numbers are ordered. By ordering by desc, we tell the database to make the last (i.e., most recent) transaction row number 1, with the next most recent transaction listed as number 2, and on like that. To drive this home a little:

The column(s) you specify in the over() clause's order by statement tells the database which column(s) in the row should be used to determine the order.
By specifying asc or desc, you tell the database whether to assign row number 1 to either the lowest or highest value of the column you specified (transaction_date in this case).

Exercise 1: Row Numbering

The sustainer team is trying to get some more information about how donors are acquired into the program. They'd like a list of the amount and channel of every sustainer's first ever sustainer gift. Output should include:

Donor ID
First sustainer gift date
First sustainer gift channel
First sustainer gift amount

Write an SQL query using row_number() to generate this list.

No Results

Multiple Columns in `order by`

Just like the order by clause in a query, the order by statement in an over() clause allows you to sort data by multiple columns. In our work, this is most helpful when you need to "break ties". Consider a donor who makes two gifts on the same day but they are of different amounts. We want to use their MRC to build an ask string for an upcoming campaign, but which of these two gifts (on the same day) should "count" as their MRC? Generally, we'd want the largest of the gifts to be listed as the MRC. We can do that in SQL as follows:

No Results

Page / 1,000 10 of 10,000 records

In this query, we've updated the row_number() function in our qualify clause to specify that we want to order by transaction_date desc, amount desc. This tells the database, "First, sort the rows in each group by transaction date descending. If there are multiple gifts on the same date, sort those such that the largest amount comes first, then the second largest amount, and so on."

Exercise 2: Order By

Write an SQL query to find each donor's first ever gift to the Trees fund. In the event that multiple gifts came in on the same day, take the one with the highest amount. Output should include:

Donor ID
Transaction Channel
Amount
Transaction Date

No Results

Multiple Columns in `partition by`

Finally, note that you can use multiple columns or expressions in the partition by statement as well, separated by commas. This allows you to define the groups that your window functions operate over with precision. As an example of how this may be used, consider how we would find each donor's first gift to each of the Birds and Trees funds. We want to group rows by two values:

The account_id on the row
The fund on the row

We can do this as follows:

No Results

Page / 1,072 10 of 10,714 records

Most of the donors in our database have only given to one of the two funds. So for example, there is only one row for donor 437 here, because all of this donor's gifts are to the Trees fund. But look at account ID 1039. This ID appears twice in our results, one row each for their first gift to each of the Birds and Trees funds.

Exercise 3: Partition By

Membership is looking to do some analysis of how donors give to the one-time and sustainer programs. They'd like a list of transactions that includes the first ever one-time program gift and first ever sustainer gift for each donor in the database. Output should include:

Donor ID
Program
Transaction Date
Amount

In the event that a donor has given two gifts to the same program on the same day, include only the one with the higher amount. Write an SQL query using row_number() to generate this list.

No Results

Troubleshooting

When you're getting started with window functions, you'll often receive results that aren't what you expect. Unfortunately, that's just part of the process. In these cases, I find it helpful to break my query down to the barest elements and focus on a single account_id or transaction_id (a single instance of whatever you are partitioning by). Here's an example of what I mean.

Thinking about Exercise 3, let's say we used the following code:

select
    account_id
    ,program
    ,transaction_date
    ,amount
from transactions
qualify row_number() over(partition by account_id order by transaction_date asc, amount desc) = 1

We notice during our quality control checking before delivering the data to Membership that each account_id only appears once in the results. This shouldn't be the case, since we know there are many donors in our database who have given both one-time and sustainer gifts, and each of those donors should be in these results twice (one row for their first one-time gift, one row for their first sustainer gift).

How can we unpack this query a bit to see what we're doing wrong? I'll break this down into a few steps, which we will work through as exercises.

Exercise 4: Find an example donor ID

We are partitioning by account_id here, so we want to find a good example ID to use in our troubleshooting. We need to find someone who has given at least one sustainer gift and one one-time gift. Write an SQL query to find a list of donors who meet this criterion.

No Results

Exercise 5: Break Down Your Query

We've got at list of sample IDs that will work. I'm going to use ID number 1089846 in my code, but you can use any ID returned by the results of Exercise 4. We want to examine what all of the rows in the transaction table look like (including our row_number() column) for our example ID. Modify the "broken" query above to select all columns and rows from transactions for our sample ID, and include the "broken" row_number() column as well.

No Results

Exercise 6: Diagnose and Correct

Because we are partitioning by account_id, the results of the solution to Exercise 5 give us a look at one single partition: a list of all the rows that are grouped under account_id 1089846 (or whichever ID you chose). Let's scroll over to the right of these results and inspect our rnum column (or whatever you named the column created by your row_number()).

Think critically about what you would expect. We are trying to arrive at a situation where:

The donor's first ever one-time gift has rnum = 1
All other one-time gifts have rnum greater than 1
The donor's first ever sustainer gift has rnum = 1
All other sustainer gifts have rnum greater than 1

When we inspect the rnum values, however, we notice that rnum starts at 1 for the donor's first ever gift (of any kind) and then just starts going up and does not reset once we have moved from their one-time to sustainer gifts. If the row_number() does not reset when we expect, that means we are not partitioning by the right column(s), since partition by tells the database which rows to consider part of a group. Correct the partition by statement and run the results for this same single ID again to verify that your fix has worked.

No Results

Now that you've worked through the problem with a manageable number of rows and a single account_id, you can remove the where clause and make the needed changes to the select clause (and possibly add qualify), and you've arrived at the correct results for Exercise 3.

This troubleshooting pattern, looking at a small set of example records and inferring problems in your code from there, is incredibly helpful. It works well when using window functions, but works equally well when troubleshooting all kinds of queries gone awry.