This story is a sequel to my previous one, Complex Deduplication in BigQuery.
Survivorship requires a dataset with some kind of group_id field. My previous story demonstrates how to deduplicate a dataset, so we’ll skip that step for now and start with a pre-grouped source table.
2. Grouping candidates into arrays
The most efficient method I’ve found for chunking records for survivorship is to roll them up with ARRAY_AGG. For this to work, you must have fewer than 1 million candidate records per group, and ideally you’d have tens of them; BigQuery handles the rollup very efficiently when the candidate count per group is low.
You must wrap multiple fields in a STRUCT to group entire records:
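Here’s a sketch of that rollup, assuming a pre-grouped source table called deduped_people with group_id, name, email, phone, source_system, and created_at columns (adjust the names to your schema):

```sql
-- Collapse each group's candidate rows into one array of structs.
SELECT
  group_id,
  ARRAY_AGG(STRUCT(name, email, phone, source_system, created_at)) AS candidates
FROM deduped_people
GROUP BY group_id
```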
The result of this query will be 2 rows, each containing all survivorship candidate records in a single, queryable field value. A good way to conceptualize this operation is that the ARRAY_AGG outputs a JSON array and STRUCT outputs a JSON struct, like this (values illustrative):
"name": "Albert Einstein",
In this step, we will be UNNESTing the candidate arrays in subqueries to handle various complex scenarios. The UNNEST pattern allows us to design our survivorship queries as if an entire table contained only our candidate rows, making reasoning about complex survivorship scenarios much easier.
Most recent non-null
Probably the simplest and most common survivorship strategy is to take the most recent non-null value.
Here’s what that would look like if our table contained only a single group:
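A minimal sketch, assuming a candidates table that holds just one group’s rows:

```sql
-- The newest row that actually has a name wins.
SELECT name, created_at
FROM candidates
WHERE name IS NOT NULL
ORDER BY created_at DESC
LIMIT 1
```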
Here it is in the main survivorship script:
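Assuming the rollup from step 2 is available as grouped_candidates, it might look like this:

```sql
SELECT
  group_id,
  -- Correlated subquery over this row's candidate array.
  (SELECT AS STRUCT
     c.name,
     c.created_at AS name_recent_created
   FROM UNNEST(candidates) AS c
   WHERE c.name IS NOT NULL
   ORDER BY c.created_at DESC
   LIMIT 1).*
FROM grouped_candidates
```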
A few things to note:
- Our survivorship query is now a subquery in the SELECT list of the main query.
- We use SELECT AS STRUCT and .*, which produces one output column for each column or top-level field of the subquery. This prevents the annoying f0_.name you’d get without it. If you only query a single field, this isn’t necessary.
- UNNEST(candidates) takes the candidate array and returns it as a table we can query.
- name and name_recent_created are from the same row. If we used the same query for a different field, we’d likely get a different date.
Most recent valid
Another common survivorship scenario is most recent valid. This is used for emails and phone numbers quite often. We’ll start with the same query as the previous example and add one more condition to the WHERE clause:
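For a single group, that might look like this (the regex is a deliberately simple placeholder, not production-grade email validation):

```sql
-- The newest email that is present and plausibly valid.
SELECT email, created_at
FROM candidates
WHERE email IS NOT NULL
  AND REGEXP_CONTAINS(email, r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
ORDER BY created_at DESC
LIMIT 1
```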
Here’s the new query with this example appended:
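Something like this, with the email subquery added alongside the name one:

```sql
SELECT
  group_id,
  (SELECT AS STRUCT
     c.name,
     c.created_at AS name_recent_created
   FROM UNNEST(candidates) AS c
   WHERE c.name IS NOT NULL
   ORDER BY c.created_at DESC
   LIMIT 1).*,
  (SELECT AS STRUCT
     c.email,
     c.created_at AS email_recent_created
   FROM UNNEST(candidates) AS c
   WHERE c.email IS NOT NULL
     AND REGEXP_CONTAINS(c.email, r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
   ORDER BY c.created_at DESC
   LIMIT 1).*
FROM grouped_candidates
```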
Source system prioritization
In this example, we’ll prioritize certain source systems over others and choose the most recent record from within the winning priority group. We could also require that the value be no more than a year old so this prioritization doesn’t leave us with stale data, but I’ll keep it simple for now.
I’m going to stop adding these to the main query now that you’ve got the idea.
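Sticking to the single-group form, a sketch might look like this (the source system names and priorities are made up):

```sql
-- Rank sources explicitly, then break ties with recency.
SELECT phone, source_system, created_at
FROM candidates
WHERE phone IS NOT NULL
ORDER BY
  CASE source_system
    WHEN 'crm' THEN 1  -- most trusted source
    WHEN 'erp' THEN 2
    ELSE 3             -- everything else
  END,
  created_at DESC
LIMIT 1
```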
Completeness
Scoring candidates by completeness is very common for names and addresses. For names, it’s often total character length; for addresses, it’s usually a score for each non-null field. Fields like state will usually have high scores, and address_line2 will have a lower one. Addresses often have validation logic for fields like state as well. This is generally the most complex survivorship use case for firms that don’t use a third-party address validation service. If they do, address survivorship turns into source system prioritization.
Even though I don’t have address data, here’s an example of what I’m talking about:
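Since my sample data has no addresses, the field names and weights below are hypothetical; the point is the shape of the scoring expression:

```sql
-- Score each candidate's address completeness; the highest score survives.
SELECT address_line1, address_line2, city, state, postal_code
FROM candidates
ORDER BY
  CASE WHEN address_line1 IS NOT NULL THEN 4 ELSE 0 END
    + CASE WHEN address_line2 IS NOT NULL THEN 1 ELSE 0 END
    + CASE WHEN city IS NOT NULL THEN 3 ELSE 0 END
    + CASE WHEN state IS NOT NULL THEN 4 ELSE 0 END
    + CASE WHEN postal_code IS NOT NULL THEN 4 ELSE 0 END DESC,
  created_at DESC
LIMIT 1
```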
You’re probably thinking, “wow, this query is getting gnarly!” and I’d agree. You can DRY it up significantly if you’re using DBT, however.
Here’s what a DBT macro would look like for the most recent valid use-case:
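A sketch of such a macro; the name and signature are my own invention, so adapt them to your project:

```sql
-- macros/most_recent_valid.sql
{% macro most_recent_valid(field, valid_condition='true') %}
  (SELECT AS STRUCT
     c.{{ field }},
     c.created_at AS {{ field }}_recent_created
   FROM UNNEST(candidates) AS c
   WHERE c.{{ field }} IS NOT NULL
     AND {{ valid_condition }}
   ORDER BY c.created_at DESC
   LIMIT 1).*
{% endmacro %}
```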
And here’s what the model query would look like after:
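Assuming the grouped table is exposed via ref('grouped_candidates'):

```sql
-- models/survivors.sql
SELECT
  group_id,
  {{ most_recent_valid('name') }},
  {{ most_recent_valid('email', "c.email LIKE '%@%.%'") }},
  {{ most_recent_valid('phone') }}
FROM {{ ref('grouped_candidates') }}
```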
This is so much cleaner and I can reuse my logic! This is one of the many reasons I love DBT.
- While I love BigQuery, my main data warehouse tool is currently Snowflake. I don’t have access to a personal Snowflake instance for writing my blog posts, so BigQuery is an excellent alternative. I’d love to have Snowflake versions of these posts as well.
All code in my blog is meant for instructional purposes only. I make no guarantees of accuracy, and if you choose to implement any of this, you do so at your own risk.