38. What are Window Functions?

You've made it to the last section of instructional material in our course. This section covers window functions, also called analytical functions. Window functions can be very difficult to grasp for beginners and experienced SQL practitioners alike, so don't worry if you feel like you're on unsteady ground in this section. With time and practice, much of this will become second nature.

Window functions are a special type of function that perform calculations across a set of rows that have some type of relationship to the current row, without collapsing them into a single result like aggregate functions do. These functions will allow you to compute things like:

  • Running totals
  • Rank orders
  • Moving Averages
  • Percentiles

Crucially, you can perform these operations while still keeping disaggregated rows in the result. Ok, that's really abstract so let's look at an example to get on firmer ground. One thing we need to do often in fundraising is identify a person's first gift – the gift that established a donor's relationship with the organization. We already know how to find the first gift date using aggregation, like this:

select
    account_id
    ,min(transaction_date) as first_gift_date
from transactions
group by account_id

But what if we want to know the channel of the first gift? Or the amount of the first gift? Things are trickier now, aren't they? This is where window functions come in. They will help us figure out the order of a donor's transactions (based on their transaction_date) while keeping the entire row including other columns like amount and transaction_channel in the query. We're going to walk through an example of finding a donor's first gift date, amount, and channel, using window functions.

Ordering with row_number()

row_number() is a window function that assigns a unique, sequential number to each row within a set based on an order you specify. I'll first give an example of how it's used and then explain what's going on:

select
    account_id
    ,amount
    ,transaction_channel
    ,transaction_date
    ,row_number() over(partition by account_id order by transaction_date asc) as gift_number
from transactions
order by account_id

Let's break apart the individual elements of our row_number() function:

  • row_number() is the name of the function. It will return a sequential number starting with 1. There are other window functions which we will examine later.
  • over() is a special clause used with window functions. Inside its parenthesis, you can specify how you want to group rows and how you want to order them.
  • partition by tells the database which column(s) should be used to divide the rows up into groups. In this case we are partitioning by account_id, which means the database will put all rows with the same account_id into a group.
  • order by tells the database how to sort the rows within a group. In this case, our rows are grouped by account_id and then within each group, they are ordered by the transaction_date ascending, so the oldest transaction comes first, then the next oldest, etc.

Putting this all together, the row_number() function in this example tells the database, "For every row in the results of this query, consider rows that share an account ID to be part of a group. Within each group, assign row number 1 to the oldest transaction, then row number 2 to the next oldest transaction, and continue that way. When we've finished with one account ID, move to the next one, restarting the count at 1." Here is what these results look like:

No Results

Let's look at those results for a moment. The first account_id value, 311, only has one gift in the database. Rows with account_id of 311 are all in their own group in the row_number() function, so the database starts by numbering the oldest gift from this account_id as number 1, which you can see in the gift_number column. Now the database moves on to account_id 437. This account has many gifts, but as you can see, the oldest one (transaction_date of 1992-12-31) carries gift_number = 1, the next oldest is gift_number = 2, etc. The database continues like that, giving a number to each gift from account 437 all the way up to it's most recent gift (gift_number = 29). Then we reach a new account_id and this process starts over again.

We now have a table that tells us, for every gift in the database, whether it's the first, second, third … Nth gift ever made by that donor. If we need to know the transaction_channel of the donor's first gift, it's as simple as finding the row with gift_number 1 and looking at its transaction_channel. Let's do that next.

Qualifying Results

Most modern databases provide the qualify clause. This clause works for window functions the same way that having works for aggregates. It is a post-execution filter that limits the results we get from the database. We saw in our first step that we can build a list of all gifts in the database numbered by whether they are the first, second, Nth gift the donor ever made. In our example, we only care about the first gift, though. And we can limit our results to only those gifts using qualify:

No Results

Notice that the structure of our qualify clause is virtually identical to how we would structure a having clause. We just put our same row_number() function in and tell the database, "only show me results where this value is equal to 1", which we have established is the donor's first gift. And that's it - we now have a list of every donor's first ever gift including the amount, date, and channel of that gift.

Also, note that just like with having with group by, if we use qualify, we do not need to actually include our row_number() statement as part of our select clause. The query below does the same thing as the query above, but omits the gift_number column from the output:

No Results

There will be times when you want to include the numbers generated by row_number() in your output, and times when you don't. SQL is flexible and allows for either approach.

Alternative to qualify

Not all databases (notably some versions of SQL Server, used by Raiser's Edge and EveryAction as well as Oracle used by ROI CRM) do not support the qualify clause. In this case, we will need to use a subquery or CTE to limit our results. For example, we can do something like this:

with ordered_gifts as (
    select
        account_id
        ,amount
        ,transaction_channel
        ,transaction_date
        ,row_number() over(partition by account_id order by transaction_date asc) as gift_number
    from transactions
)

select
    *
from ordered_gifts
where gift_number = 1

The principle here is the same: use row_number() to generate a resultset that orders the rows for each account_id by transaction_date. We just have to insert this extra step of using our query to generate an interim result and then select from that result where gift_number = 1. Throughout this course, we will be using qualify for the sake of brevity and clarity.

Summary

We're going to dive into more details in the coming sections. For now, keep in mind:

  • Window functions don't remove rows. Unlike aggregation with group by, they return a value for every row in the original resultset.
  • They operate across rows that are related to the current row. You define this relationship using the parameters in your over() clause.
  • You can combine window functions with regular columns in your select, you do not need to group by.