Complex Survivorship in BigQuery
This story is a follow-up to my previous story, Complex Deduplication in BigQuery.
Survivorship is the process of choosing canonical values from deduplicated data. The results are often called master data or golden records.
1. Preparation
Survivorship requires a dataset with some kind of group_id field. My previous story demonstrates how to deduplicate a dataset, so we'll skip that step for now and start with a pre-grouped source table.
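To make the examples concrete, here's a sketch of what such a pre-grouped table might look like. The table name demo.grouped_contacts and the columns (name, email, source_system, created) are my own illustrative assumptions, not a prescribed schema:

```sql
-- Hypothetical pre-grouped source: one row per candidate record, with group_id
-- identifying which records were matched as duplicates during deduplication.
CREATE OR REPLACE TABLE demo.grouped_contacts AS
SELECT 1 AS group_id, 'Albert Einstein' AS name, 'al@gmail.com' AS email,
       'crm' AS source_system, TIMESTAMP '2021-03-01' AS created
UNION ALL SELECT 1, 'A. Einstein', NULL, 'web', TIMESTAMP '2021-04-15'
UNION ALL SELECT 2, 'Marie Curie', 'marie@gmail.com', 'crm', TIMESTAMP '2021-02-10'
UNION ALL SELECT 2, 'Marie Sklodowska Curie', 'mcurie@web.com', 'web', TIMESTAMP '2021-01-05';
```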
2. Grouping candidates into arrays
The most efficient method I've found for chunking records for survivorship is to roll them up with ARRAY_AGG. For this to work, you must have fewer than 1 million candidate records per group, and ideally you'd have tens of them; BigQuery handles this very efficiently as long as the candidate count per group stays low. You must wrap multiple fields in a STRUCT to group entire records.
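Here's a minimal sketch of that roll-up, assuming the hypothetical demo.grouped_contacts table above:

```sql
-- Roll every group's candidate records up into one array of structs per group.
SELECT
  group_id,
  ARRAY_AGG(STRUCT(name, email, source_system, created)) AS candidates
FROM demo.grouped_contacts
GROUP BY group_id;
```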
The result of this query will be 2 rows, each containing all survivorship candidate records in a single, queryable field value. A good way to conceptualize this operation is that ARRAY_AGG outputs a JSON array and STRUCT outputs a JSON object, like this:
[
  {
    "name": "Albert Einstein",
    "email": "al@gmail.com",
    ...
  }
]
3. Survivorship
In this step, we will be UNNESTing the candidate arrays in subqueries to handle various complex scenarios. The ARRAY_AGG + UNNEST pattern allows us to design each survivorship query as if an entire table contained only our candidate rows, which makes reasoning about complex survivorship scenarios much easier.
Most recent non-null
Probably the simplest and most common survivorship strategy is to take the most recent non-null value.
Here’s what that would look like if our table contained only a single group:
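Something along these lines, again assuming the hypothetical name and created columns from the sample table:

```sql
-- Most recent non-null name, pretending the whole table is one group's candidates.
SELECT name, created
FROM demo.grouped_contacts
WHERE name IS NOT NULL
ORDER BY created DESC
LIMIT 1;
```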
Here it is in the main survivorship script:
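A sketch of how that looks per group, combining the roll-up from step 2 with a correlated subquery over the candidate array (the alias names are my assumptions):

```sql
-- Main survivorship script (sketch): roll up candidates per group, then pick
-- the most recent non-null name with a subquery over UNNEST(candidates).
WITH grouped AS (
  SELECT
    group_id,
    ARRAY_AGG(STRUCT(name, email, source_system, created)) AS candidates
  FROM demo.grouped_contacts
  GROUP BY group_id
)
SELECT
  group_id,
  (
    SELECT AS STRUCT
      c.name AS name_recent,
      c.created AS name_recent_created
    FROM UNNEST(candidates) AS c
    WHERE c.name IS NOT NULL
    ORDER BY c.created DESC
    LIMIT 1
  ).*
FROM grouped;
```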
A few things to note:
- Our survivorship query is now a subquery in the SELECT clause
- We use SELECT AS STRUCT and (<query>).*. The .* produces one output column for each column or top-level field of the subquery, which prevents the annoying f0_.name prefix you'd get without it. If you only query a single field, this isn't necessary.
- UNNEST(candidates) takes the candidate array and returns a table
- name_recent and name_recent_created are from the same row. If we used the same query for a different field, we'd likely get a different date.
Validation
Another common survivorship scenario is most recent valid, used quite often for emails and phone numbers. We'll start with the same query as the previous example and add one more WHERE predicate. The regex I'm using is the absolute minimum validation you'd want to do, but it gets the point across for this example. If you wanted a more robust validation function, now would be the perfect time to use a JavaScript UDF.
Here’s the new query with this example appended:
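A sketch with the email survivor appended to the main query; the regex here is my own minimal stand-in (roughly something@something.something):

```sql
-- Main script with a second survivor appended: the most recent *valid* email.
WITH grouped AS (
  SELECT
    group_id,
    ARRAY_AGG(STRUCT(name, email, source_system, created)) AS candidates
  FROM demo.grouped_contacts
  GROUP BY group_id
)
SELECT
  group_id,
  (
    SELECT AS STRUCT
      c.name AS name_recent,
      c.created AS name_recent_created
    FROM UNNEST(candidates) AS c
    WHERE c.name IS NOT NULL
    ORDER BY c.created DESC
    LIMIT 1
  ).*,
  (
    SELECT AS STRUCT
      c.email AS email_recent_valid,
      c.created AS email_recent_valid_created
    FROM UNNEST(candidates) AS c
    WHERE c.email IS NOT NULL
      -- Absolute-minimum email check: something@something.something
      AND REGEXP_CONTAINS(c.email, r'^[^@]+@[^@]+[.][^@]+$')
    ORDER BY c.created DESC
    LIMIT 1
  ).*
FROM grouped;
```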
Source system prioritization
In this example, we’ll prioritize certain source systems over others and choose the most recent records from within a priority group. We could also combine this with a within_last_month flag so it only applies to recent values, but I’ll keep it simple for now.
I’m going to stop adding these to the main query now that you’ve got the idea.
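Here's a standalone sketch instead, assuming the rolled-up candidates from step 2 are available as grouped; the source_system values and their priority order are purely illustrative:

```sql
-- Prefer 'crm' records over 'web', and anything else last; within the winning
-- priority tier, take the most recent record.
SELECT
  group_id,
  (
    SELECT AS STRUCT
      c.email AS email_survivor,
      c.source_system AS email_survivor_source
    FROM UNNEST(candidates) AS c
    WHERE c.email IS NOT NULL
    ORDER BY
      CASE c.source_system
        WHEN 'crm' THEN 1
        WHEN 'web' THEN 2
        ELSE 3
      END,
      c.created DESC
    LIMIT 1
  ).*
FROM grouped;
```

Because the CASE expression sorts before the timestamp, the recency tiebreak only ever applies within a single priority tier.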
Most complete
This is very common for names and addresses. For names, it's often total character length, but for addresses, it's usually a score for each non-null field: city and state will usually carry high scores, and address_line2 a lower one. Addresses often have validation logic for fields like city and state as well. This is generally the most complex survivorship use case for firms that don't use a third-party address validation service; if they do, address survivorship turns into source system prioritization.
Even though I don’t have address data, here’s an example of what I’m talking about:
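This is a sketch of a completeness score over a hypothetical address struct; the field names and per-field weights are made up for illustration, and they assume the candidate structs carry these address columns:

```sql
-- "Most complete" survivor: each non-null field adds to a score, with city,
-- state, and address_line1 weighted higher than address_line2; ties go to the
-- most recent record.
SELECT
  group_id,
  (
    SELECT AS STRUCT
      c.address_line1,
      c.address_line2,
      c.city,
      c.state,
      c.zip
    FROM UNNEST(candidates) AS c
    ORDER BY
      IF(c.address_line1 IS NOT NULL, 3, 0)
        + IF(c.address_line2 IS NOT NULL, 1, 0)
        + IF(c.city IS NOT NULL, 3, 0)
        + IF(c.state IS NOT NULL, 3, 0)
        + IF(c.zip IS NOT NULL, 2, 0) DESC,
      c.created DESC
    LIMIT 1
  ).*
FROM grouped;
```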
Using DBT
You're probably thinking, "wow, this query is getting gnarly!" and I'd agree. However, you can DRY it up significantly if you're using DBT.
Survivorship macro
Here’s what a DBT macro would look like for the most recent valid use-case:
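Roughly like this; the macro name, its arguments, and the assumption that the model exposes a candidates array with a created timestamp are all mine:

```sql
{# Hypothetical macro (e.g. macros/most_recent_valid.sql): emits a
   "most recent valid" survivorship subquery for the given field. #}
{% macro most_recent_valid(field, validation_regex) %}
  (
    SELECT AS STRUCT
      c.{{ field }} AS {{ field }}_recent_valid,
      c.created AS {{ field }}_recent_valid_created
    FROM UNNEST(candidates) AS c
    WHERE c.{{ field }} IS NOT NULL
      AND REGEXP_CONTAINS(c.{{ field }}, r'{{ validation_regex }}')
    ORDER BY c.created DESC
    LIMIT 1
  ).*
{% endmacro %}
```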
And here’s what the model query would look like after:
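Something like this, assuming the pre-grouped data is available as a grouped_contacts model:

```sql
-- Hypothetical dbt model: the survivorship logic collapses into macro calls.
WITH grouped AS (
  SELECT
    group_id,
    ARRAY_AGG(STRUCT(name, email, source_system, created)) AS candidates
  FROM {{ ref('grouped_contacts') }}
  GROUP BY group_id
)
SELECT
  group_id,
  {{ most_recent_valid('email', '^[^@]+@[^@]+[.][^@]+$') }}
FROM grouped
```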
This is so much cleaner and I can reuse my logic! This is one of the many reasons I love DBT.
Next steps
- While I love BigQuery, my main data warehouse tool is currently Snowflake. I don't have access to a personal Snowflake instance for writing my blog posts, so BigQuery is an excellent alternative. I'd love to have Snowflake versions of these posts as well.
- Add JavaScript UDFs to the mix.
Disclaimer
All code in my blog is meant for instructional purposes only. I make no guarantees of accuracy and if you choose to implement any of this, you do so at your own risk.