38. What are Window Functions?
You've made it to the last section of instructional material in our course. This section covers window functions, also called analytical functions. Window functions can be very difficult to grasp for beginners and experienced SQL practitioners alike, so don't worry if you feel like you're on unsteady ground in this section. With time and practice, much of this will become second nature.
Window functions are a special type of function that perform calculations across a set of rows that have some type of relationship to the current row, without collapsing them into a single result like aggregate functions do. These functions will allow you to compute things like:
- Running totals
- Rank orders
- Moving Averages
- Percentiles
Crucially, you can perform these operations while still keeping disaggregated rows in the result. Ok, that's really abstract so let's look at an example to get on firmer ground. One thing we need to do often in fundraising is identify a person's first gift – the gift that established a donor's relationship with the organization. We already know how to find the first gift date using aggregation, like this:
select
account_id
,min(transaction_date) as first_gift_date
from transactions
group by account_id
But what if we want to know the channel of the first gift? Or the amount of the first gift? Things are trickier now, aren't they? This is where window functions come in. They will help us figure out the order of a donor's transactions (based on their transaction_date
) while keeping the entire row including other columns like amount
and transaction_channel
in the query. We're going to walk through an example of finding a donor's first gift date, amount, and channel, using window functions.
Ordering with row_number()
row_number()
is a window function that assigns a unique, sequential number to each row within a set based on an order you specify. I'll first give an example of how it's used and then explain what's going on:
select
account_id
,amount
,transaction_channel
,transaction_date
,row_number() over(partition by account_id order by transaction_date asc) as gift_number
from transactions
order by account_id
Let's break apart the individual elements of our row_number()
function:
row_number()
is the name of the function. It will return a sequential number starting with 1. There are other window functions which we will examine later.over()
is a special clause used with window functions. Inside its parenthesis, you can specify how you want to group rows and how you want to order them.partition by
tells the database which column(s) should be used to divide the rows up into groups. In this case we are partitioning byaccount_id
, which means the database will put all rows with the sameaccount_id
into a group.order by
tells the database how to sort the rows within a group. In this case, our rows are grouped byaccount_id
and then within each group, they are ordered by thetransaction_date
ascending, so the oldest transaction comes first, then the next oldest, etc.
Putting this all together, the row_number()
function in this example tells the database, "For every row in the results of this query, consider rows that share an account ID to be part of a group. Within each group, assign row number 1 to the oldest transaction, then row number 2 to the next oldest transaction, and continue that way. When we've finished with one account ID, move to the next one, restarting the count at 1." Here is what these results look like:
Let's look at those results for a moment. The first account_id
value, 311, only has one gift in the database. Rows with account_id
of 311 are all in their own group in the row_number()
function, so the database starts by numbering the oldest gift from this account_id
as number 1, which you can see in the gift_number
column. Now the database moves on to account_id
437. This account has many gifts, but as you can see, the oldest one (transaction_date
of 1992-12-31
) carries gift_number
= 1, the next oldest is gift_number
= 2, etc. The database continues like that, giving a number to each gift from account 437 all the way up to it's most recent gift (gift_number
= 29). Then we reach a new account_id
and this process starts over again.
We now have a table that tells us, for every gift in the database, whether it's the first, second, third … Nth gift ever made by that donor. If we need to know the transaction_channel
of the donor's first gift, it's as simple as finding the row with gift_number
1 and looking at its transaction_channel
. Let's do that next.
Qualifying Results
Most modern databases provide the qualify
clause. This clause works for window functions the same way that having
works for aggregates. It is a post-execution filter that limits the results we get from the database. We saw in our first step that we can build a list of all gifts in the database numbered by whether they are the first, second, Nth gift the donor ever made. In our example, we only care about the first gift, though. And we can limit our results to only those gifts using qualify
:
Notice that the structure of our qualify
clause is virtually identical to how we would structure a having
clause. We just put our same row_number()
function in and tell the database, "only show me results where this value is equal to 1", which we have established is the donor's first gift. And that's it - we now have a list of every donor's first ever gift including the amount, date, and channel of that gift.
Also, note that just like with having
with group by
, if we use qualify
, we do not need to actually include our row_number()
statement as part of our select
clause. The query below does the same thing as the query above, but omits the gift_number
column from the output:
There will be times when you want to include the numbers generated by row_number()
in your output, and times when you don't. SQL is flexible and allows for either approach.
Alternative to qualify
Not all databases (notably some versions of SQL Server, used by Raiser's Edge and EveryAction as well as Oracle used by ROI CRM) do not support the qualify
clause. In this case, we will need to use a subquery or CTE to limit our results. For example, we can do something like this:
with ordered_gifts as (
select
account_id
,amount
,transaction_channel
,transaction_date
,row_number() over(partition by account_id order by transaction_date asc) as gift_number
from transactions
)
select
*
from ordered_gifts
where gift_number = 1
The principle here is the same: use row_number()
to generate a resultset that orders the rows for each account_id
by transaction_date
. We just have to insert this extra step of using our query to generate an interim result and then select
from that result where gift_number = 1
. Throughout this course, we will be using qualify
for the sake of brevity and clarity.
Summary
We're going to dive into more details in the coming sections. For now, keep in mind:
- Window functions don't remove rows. Unlike aggregation with
group by
, they return a value for every row in the original resultset. - They operate across rows that are related to the current row. You define this relationship using the parameters in your
over()
clause. - You can combine window functions with regular columns in your
select
, you do not need togroup by
.