Complex Survivorship in BigQuery

4 min readApr 17, 2021

This story is a followup to my previous story Complex Deduplication in BigQuery.

Survivorship is the process of choosing canonical values from deduplicated data. The results are often called master data or golden records.

1. Preparation

Survivorship requires a dataset with some kind of group_id field. My previous story demonstrates how to deduplicate a dataset, so we’ll skip that step for now and start with a pre-grouped source table.

2. Grouping candidates into arrays

The most efficient method I’ve found for chunking records for survivorship is to roll them up with ARRAY_AGG. For this to work, you must have fewer than 1 million candidate records per group, and ideally, you’d have tens of them. This will be very efficient in BigQuery if the candidate count per group is low.

You must wrap multiple fields in a struct to group entire records.

The result of this query will be 2 rows, each containing all survivorship candidate records in a single, queryable field value. A good way to conceptualize this operation is that the ARRAY_AGG outputs a JSON array and STRUCT outputs a JSON struct like this:

[
  {
    "name": "Albert Einstein",
    "email": "al@gmail.com", 
    ...
  }
]

3. Survivorship

In this step, we will be UNNESTing the candidate arrays in subqueries to handle various complex scenarios. The ARRAY_AGG + UNNEST pattern allows us to design our survivorship queries as if an entire table contains only our candidate rows, making reasoning about complex survivorship scenarios much easier.

Most recent non-null

Probably the simplest and most common survivorship strategy is to take the most recent non-null value.

Here’s what that would look like if our table contained only a single group:

Here it is in the main survivorship script:

A few things to note:

Our survivorship query is now a subquery in the SELECT clause
We use SELECT AS STRUCT and (<query>).*. The .* produces one output column for each column or top-level field of the subquery. This prevents the annoying f0_.name that you’d get without it. If you only query a single field this isn’t necessary.
UNNEST(candidates) takes the candidate array and returns a table
name_recent and name_recent_created are from the same row. If we used the same query for a different field we’d likely get a different date

Validation

Another common survivorship scenario is most recent valid. This is used for emails and phone numbers quite often. We’ll start with the same query as the previous example and add one more where predicate. The regex I’m using is the absolute minimum validation you’d want to do but gets the point across for this example. If you wanted a more robust validation function, now would be the perfect time to use a JavaScript stored procedure.

Here’s the new query with this example appended:

Source system prioritization

In this example, we’ll prioritize certain source systems over others and choose the most recent records from within a priority group. We could also combine this with a within_last_month flag so it only applies to recent values, but I’ll keep it simple for now.

I’m going to stop adding these to the main query now that you’ve got the idea.

Most complete

This is very common for names and addresses. For names, it’s often total character length, but for addresses, it’s usually a score for each non-null field. city and state will usually have high scores and address_line2 will have a lower one. Addresses often have validation logic of fields like city and state as well. This is generally the most complex survivorship use-case for firms that don’t use a third-party address validation service. If they do, address survivorship turns into source system prioritization.

Even though I don’t have address data, here’s an example of what I’m talking about:

Using DBT

You’re probably thinking, “wow, this query is getting gnarly!” and I’d agree. You can DRY it up significantly if you’re using DBT however.

Survivorship macro

Here’s what a DBT macro would look like for the most recent valid use-case:

And here’s what the model query would look like after:

This is so much cleaner and I can reuse my logic! This is one of the many reasons I love DBT.

Next steps

While I love BigQuery, my main data warehouse tool is currently Snowflake. I don’t have access to a personal Snowflake instance to use to write my blogs so BigQuery is an excellent alternative. I’d love to have Snowflake versions of these posts as well.
Add JavaScript stored procedures to the mix.

Disclaimer

All code in my blog is meant for instructional purposes only. I make no guarantees of accuracy and if you choose to implement any of this, you do so at your own risk.