42. Know Your Data
To put it bluntly, there is no substitute for detailed knowledge of the data you are querying. This knowledge is the foundation on which you will build your entire coding practice. I take a maximalist view of what this means. If you are routinely using SQL to answer questions, solve problems, or make decisions, you have to know:
- The relationships present in your data – How the tables relate to one another, how the fields within those tables relate to each other and to other tables.
- The flow of your data – How does data enter your database/system? When, where, and why is it updated, transformed, merged, and deleted? When you hand the data off to others, what are they using it for and which systems does it become a part of?
- How others use the data – Which decisions, programs, campaigns, and efforts rely on this data? Which pieces of the data are most important to the various people and teams that rely on it?
- Which real-world entities are modeled by my data – Transactions, accounts, donors, source codes, contact information: all of these are real-world entities that are in some sense tangible. What do they mean? How do they relate (e.g. donors have accounts, accounts have transactions, transactions are monetary, etc.)
To the extent that I have been successful in my career, it is because I have worked hard to answer these questions and understand these relationships. My number 1 piece of advice for anyone looking to build a career around writing SQL for nonprofit fundraising is: SQL is easy, fundraising is hard. Learn fundraising first and use SQL to explore it and deepen your understanding.
How to Learn
Learning about the nuts and bolts of your data can be done with SQL itself. A great way to explore your data is just to select
a bunch of it and have a look. I'm going to introduce two things we haven't covered yet that will be helpful: limit
and order by random()
.
Look at a few rows
The limit
clause comes at the end of an SQL query and limits the number of rows returned by the database when your query is executed. We will cover some cases where it's useful later on, but right now its purpose is simple: we want to have a look at some of the data in our tables but we don't want to look at all of the rows.
For example, let's say we want to have a look at the data in our transactions
table. It's got thousands of rows in it though, and we don't want to waste database resources selecting them all, because there's no way we are going to look at that many manually. We only want to look at 5 rows to get a feel for the data that's stored in this table. We can use limit
to do this:
The limit
clause always comes at the very end of your query, after everything else. The database will execute your entire query including all of its logic, but will only stream the number of rows to your results that you specified in your limit
clause.
Note: In SQL Server (used by Raiser's Edge and EveryAction Pipeline), the limit
keyword does not exist. Instead, it uses select top N
syntax, like this:
select
top 5 *
from transactions
Look at random rows
Each database makes different choices about the order in which to return data. There is no guarantee that if you execute the limit
example above that you will always get the same 5 rows (but you probably will), and sometimes data is returned in some kind of specific order (e.g., the first 5 rows with the lowest value for the primary key).
When we're trying to explore data, this behavior can be undesirable. For example, the 5 rows returned may all be very old or very recent transactions that are not representative of the data in the table more generally. In order to ensure we're getting a sampling of data from the database, we can ask the database to sort the data randomly and then use limit
to get only a small number of rows. Here's what that looks like:
If you refresh the page, you'll repeatedly see 5 different random transactions in this listing (clicking "Submit" over and over won't do it for technical reasons related to this site's design). This ensures that we're looking at a random sampling of the data in our transactions
table, rather than transactions returned in some arbitrary order known only to the query engine.
The syntax for randomly ordering results differs by database dialect. The table below gives the syntax for several popular databases:
Additional tips for exploring the relationships present in your data can be found in Chapter 26.
Build a Broad Base
In this chapter, I've strongly recommended learning as much as possible about fundraising. But how do you do that? Maybe you're already a fundraiser and you know plenty about your area of expertise (major gifts, direct response, events, etc.) but you don't know much about other areas. Maybe you're an analyst or you're new to the industry and you know your way around Excel but phrases like "RFM" and "Mid-level portfolio" mean nothing to you. Regardless of your existing knowledgebase or career stage, I suggest you approach every data-based interaction with curiosity and the belief that you're about to learn something new.
One way to do this is to always be asking questions. When I engage on a new project, here are some of the things I ask:
- The data I'm querying, where does it come from? Is it manually entered, entered by some automation, or a combination?
- The data I'm delivering, who is going to use it? Is the user the same as the person making the request? Or are they passing it off to a third party?
- Once delivered, which decisions will be made using this data? Are they big-picture and strategic? Tactical and short term?
- Is this a new question, or has this been done before? If it's been done before, what (if anything) is different now? If it's a new question, why is it being asked now?
You will probably feel silly asking questions like these when all you're doing is pulling a list of donors who have done X, Y, and Z and delivering it to someone for the umpteenth time. But asking these types of questions is essential to drawing out from other people information that they take for granted – information that may be useful to you. You'll learn, perhaps, why pulling this list in November is different from pulling it in June. Or that there's some kind of big decision being made by the Gala events team this month. Over time, these answers compound to build a broad base of knowledge.
The better you understand fundraising (and the specific implementations of fundraising at your organization/company), the more effective the SQL you write will be. By asking these questions, you will save time and most importantly you will put yourself in the position to anticipate the needs and use cases of the people you are delivering data to.