Complex Survivorship in BigQuery

This story is a followup to my previous story Complex Deduplication in BigQuery.

Survivorship is the process of choosing canonical values from deduplicated data. The results are often called master data or golden records.

1. Preparation

2. Grouping candidates into arrays

You must wrap multiple fields in a struct to group entire records.

The result of this query will be 2 rows, each containing all survivorship candidate records in a single, queryable field value. A good way to conceptualize this operation is that the ARRAY_AGG outputs a JSON array and STRUCT outputs a JSON struct like this:

[
{
"name": "Albert Einstein",
"email": "al@gmail.com",
...
}
]

3. Survivorship

Most recent non-null

Here’s what that would look like if our table contained only a single group:

Here it is in the main survivorship script:

A few things to note:

Validation

Here’s the new query with this example appended:

Source system prioritization

I’m going to stop adding these to the main query now that you’ve got the idea.

Most complete

Even though I don’t have address data, here’s an example of what I’m talking about:

Using DBT

Survivorship macro

And here’s what the model query would look like after:

This is so much cleaner and I can reuse my logic! This is one of the many reasons I love DBT.

Next steps

  • Add JavaScript stored procedures to the mix.

Disclaimer

I'm a software architect, data engineer, surfer, and musician who loves problem solving and interesting tech.