Complex Deduplication in BigQuery

This story covers the first step in the deduplication and survivorship process used in master data management. If you’re interested, check out the follow-up story Complex Survivorship in BigQuery.

Deduplication is challenging with tabular data. You may have customers signing up for promotions with different email addresses, enrichment data that only joins on phone number, multiple emails or addresses for a given customer, missing values, and a multitude of other challenges. On top of all of that, deduplication is inherently a graph problem: specifically, connected-component extraction (also known as union-find). In this story I’ll show you how to do all of this in BigQuery³.

1. Preparation

We’re going to deduplicate on any non-key field that could uniquely identify an individual. In this example, those fields are email, phone_number¹, and salesforce_id. Each field value will be a node, and the edges will form the complete graph connecting a record’s nodes, which just means every node is connected to every other node. The number of edges determines the number of queries in the Create edge table section, so it’s best to keep the number of deduplication fields low: the number of edge queries is field_count(field_count − 1)/2. With our 3 fields that’s 3 × 2 / 2 = 3 edge queries; with 5 fields it would already be 10.

Here’s the complete graph for our example:

Here’s what it would look like with 2 more fields:

2. Create node tables

We need one query per field that will be connected by an edge. You’ll notice that salesforce_id shows up in two places: we want records that reference a Salesforce record to get nodes both for their own data and for the data from the referenced Salesforce row.
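
Here’s roughly what those per-field queries could look like, stacked with UNION ALL into a single bridge table. The table names (workspace.node_bridge, workspace.raw_data) and the exact hash input format are assumptions I’m using for illustration, not the article’s exact code; the salesforce_id nodes pulled from the referenced Salesforce rows would be one more SELECT in the same shape.

```sql
-- Sketch: one SELECT per identifying field, stacked into a bridge table.
-- Table names and the hash input format are assumptions.
CREATE OR REPLACE TABLE workspace.node_bridge AS
SELECT
  business_key,
  source_system,
  'email' AS node_type,
  CAST(email AS STRING) AS node_value,
  FARM_FINGERPRINT(CONCAT('email', '|', CAST(email AS STRING))) AS node_id
FROM workspace.raw_data
WHERE email IS NOT NULL

UNION ALL

SELECT
  business_key,
  source_system,
  'phone_number',
  CAST(phone_number AS STRING),
  FARM_FINGERPRINT(CONCAT('phone_number', '|', CAST(phone_number AS STRING)))
FROM workspace.raw_data
WHERE phone_number IS NOT NULL

UNION ALL

SELECT
  business_key,
  source_system,
  'salesforce_id',
  CAST(salesforce_id AS STRING),
  FARM_FINGERPRINT(CONCAT('salesforce_id', '|', CAST(salesforce_id AS STRING)))
FROM workspace.raw_data
WHERE salesforce_id IS NOT NULL;
```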

Schema

  • business_key and source_system are used to join back to raw_data
  • node_id⁴ is a hash of node_value and node_type

Now we can create our node table from this bridge table:
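
A minimal sketch of that step, assuming the hypothetical workspace.node_bridge table from above:

```sql
-- Sketch: collapse the bridge rows into one row per distinct node.
CREATE OR REPLACE TABLE workspace.nodes AS
SELECT DISTINCT
  node_id,
  node_type,
  node_value
FROM workspace.node_bridge;
```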

3. Create edge table

Schema

  • source_query is for debugging to identify the source of the edge
  • group_id is a nullable INT64 that is populated by the seed step and then updated by the deduplication query in a loop
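
For reference, here’s one possible shape for the edge table. Only source_query and group_id are named in the schema notes above; the endpoint column names (node_id, connected_node_id) and the table name are my assumptions.

```sql
-- Sketch of the edge table; names other than source_query and group_id are assumed.
CREATE OR REPLACE TABLE workspace.edges (
  node_id           INT64,
  connected_node_id INT64,
  source_query      STRING,  -- which pair-query produced the edge (debugging)
  group_id          INT64    -- NULL until seeded, then converges during the walk
);
```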

Full Edge Query
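
Something along these lines, again using the assumed workspace.node_bridge and workspace.edges tables: one pair-query per edge in the complete graph, unioned together.

```sql
-- Sketch: 3 fields -> 3 pair-queries. Each connects two node types that belong
-- to the same source record (same business_key and source_system).
INSERT INTO workspace.edges (node_id, connected_node_id, source_query, group_id)
SELECT DISTINCT e.node_id, p.node_id, 'email__phone_number', CAST(NULL AS INT64)
FROM workspace.node_bridge e
JOIN workspace.node_bridge p
  ON  e.business_key  = p.business_key
  AND e.source_system = p.source_system
WHERE e.node_type = 'email' AND p.node_type = 'phone_number'

UNION ALL

SELECT DISTINCT e.node_id, s.node_id, 'email__salesforce_id', CAST(NULL AS INT64)
FROM workspace.node_bridge e
JOIN workspace.node_bridge s
  ON  e.business_key  = s.business_key
  AND e.source_system = s.source_system
WHERE e.node_type = 'email' AND s.node_type = 'salesforce_id'

UNION ALL

SELECT DISTINCT p.node_id, s.node_id, 'phone_number__salesforce_id', CAST(NULL AS INT64)
FROM workspace.node_bridge p
JOIN workspace.node_bridge s
  ON  p.business_key  = s.business_key
  AND p.source_system = s.source_system
WHERE p.node_type = 'phone_number' AND s.node_type = 'salesforce_id';
```

The edges themselves only connect fields within a record, but because identical values hash to the same node_id, two records that share an email (or phone number, or salesforce_id) share a node, and that shared node is what pulls them into the same connected component.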

4. Seed group_id
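
The seed just needs to give every node a starting group_id. Here’s a sketch using the assumed tables above and row_number(), per footnote [2]; run it inside a script or session so the TEMP table is allowed.

```sql
-- Sketch: assign each distinct node a small, positive starting group_id.
CREATE OR REPLACE TEMP TABLE seed_groups AS
SELECT
  node_id,
  ROW_NUMBER() OVER (ORDER BY node_id) AS seed_group_id
FROM workspace.nodes;

UPDATE workspace.edges e
SET group_id = s.seed_group_id
FROM seed_groups s
WHERE e.node_id = s.node_id;
```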

5. Convert to undirected graph
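
The graph needs to be undirected so the walk can move in both directions across an edge. One way to sketch it is to insert the reverse of every edge that doesn’t already have one. The reversed rows start with a NULL group_id here, so re-run the seed afterwards (or simply seed after this step) so every row has a group_id before the walk.

```sql
-- Sketch: add the missing reverse direction for every edge.
INSERT INTO workspace.edges (node_id, connected_node_id, source_query, group_id)
SELECT
  e.connected_node_id                  AS node_id,
  e.node_id                            AS connected_node_id,
  CONCAT(e.source_query, '__reversed') AS source_query,
  CAST(NULL AS INT64)                  AS group_id
FROM workspace.edges e
LEFT JOIN workspace.edges r
  ON  r.node_id = e.connected_node_id
  AND r.connected_node_id = e.node_id
WHERE r.node_id IS NULL;  -- anti-join: skip edges that already exist in reverse
```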

6. Deduplication

Note the group_id <> min_group_id predicate in the update statement. That’s how we’ll terminate the loop: the first time we run the update and @@row_count comes back as 0, we know we’re done.
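
Here’s a sketch of one iteration of that update, using a working table and the assumed column names from earlier. Each node takes the smallest group_id found among itself and its neighbours.

```sql
-- Sketch of a single iteration (run inside a BigQuery script).
CREATE OR REPLACE TEMP TABLE iteration_groups AS
SELECT
  e1.node_id,
  MIN(LEAST(e1.group_id, e2.group_id)) AS min_group_id  -- smallest label among the node and its neighbours
FROM workspace.edges e1
JOIN workspace.edges e2
  ON e2.node_id = e1.connected_node_id
GROUP BY e1.node_id;

UPDATE workspace.edges e
SET group_id = g.min_group_id
FROM iteration_groups g
WHERE e.node_id = g.node_id
  AND e.group_id <> g.min_group_id;  -- only touch rows that change, so @@row_count can reach 0
```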

Here’s a step by step example of how it works…

Notice that Step 3 didn’t update any records, so we know we’re done.

There are 2 groups in this example:

Walk in a loop

You can limit the iterations per run by changing the WHILE predicate to (updates > 0 AND i < max_iterations) OR i = 0. If you stop and restart, it will pick up where it left off.
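
Putting it together, the walk could look roughly like this as a BigQuery script, using the capped predicate described above (table and variable names are the same assumptions as before):

```sql
DECLARE updates INT64 DEFAULT 0;
DECLARE i INT64 DEFAULT 0;
DECLARE max_iterations INT64 DEFAULT 50;

-- Keep walking until an iteration updates no rows (or we hit the cap).
WHILE (updates > 0 AND i < max_iterations) OR i = 0 DO

  CREATE OR REPLACE TEMP TABLE iteration_groups AS
  SELECT
    e1.node_id,
    MIN(LEAST(e1.group_id, e2.group_id)) AS min_group_id
  FROM workspace.edges e1
  JOIN workspace.edges e2
    ON e2.node_id = e1.connected_node_id
  GROUP BY e1.node_id;

  UPDATE workspace.edges e
  SET group_id = g.min_group_id
  FROM iteration_groups g
  WHERE e.node_id = g.node_id
    AND e.group_id <> g.min_group_id;

  SET updates = @@row_count;  -- rows touched by the UPDATE above
  SET i = i + 1;

END WHILE;
```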

I’ve run this on ~800 million records⁵ without issue. The method can handle truly massive datasets.

Results
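
To see the results per source record, you could join the edge table back through the bridge, something like this (again with the assumed table names):

```sql
-- Sketch: one row per source record with its resolved group.
SELECT
  b.source_system,
  b.business_key,
  MIN(e.group_id) AS group_id  -- after convergence, all of a record's nodes share one group
FROM workspace.node_bridge b
LEFT JOIN workspace.edges e
  ON e.node_id = b.node_id
GROUP BY b.source_system, b.business_key
ORDER BY group_id;
```

In this sketch, a record whose only identifying value is shared with nobody has no edges and comes back with a NULL group_id; those can be treated as singleton groups downstream.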

If you’d like to squash the group_id values down to a dense sequence, use the following query. If you are doing multiple runs and need stable group_id values, only squash group_ids greater than the max from the previous run.
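
A sketch of the squash using dense_rank(), as mentioned in footnote [2] (run inside a script or session so the TEMP table is allowed):

```sql
-- Sketch: renumber groups to 1..N without changing which rows belong together.
CREATE OR REPLACE TEMP TABLE squashed_groups AS
SELECT
  group_id,
  DENSE_RANK() OVER (ORDER BY group_id) AS squashed_group_id
FROM (SELECT DISTINCT group_id FROM workspace.edges) AS d;

UPDATE workspace.edges e
SET group_id = s.squashed_group_id
FROM squashed_groups s
WHERE e.group_id = s.group_id
  AND e.group_id <> s.squashed_group_id;
```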

Next Steps

  • We could perform fuzzy matching via the join criteria in our edge query (see the sketch after this list). We could also add edge weights, and during deduplication sum the weights between two nodes to decide whether it’s a true match.
  • The walk query can be optimized: you only need to update the group_id for node_ids that changed in the previous round. If you add an iteration_id to the edge table you can filter on it, or you can recreate a working table each round limited to the records you must consider. It’s more complex and error prone (you must consider every node_id affected by a change in the previous iteration), but worth it if you have a huge dataset.
  • Build this into a DBT macro.
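
For the fuzzy-matching idea, a hypothetical extra pair-query might look like the following. The EDIT_DISTANCE threshold, the blocking on the email domain, and the edge_weight column are all illustrative assumptions, not part of the process above.

```sql
-- Sketch: connect email nodes whose values are within one edit of each other,
-- blocking on the email domain to avoid a full self-join, and recording a weight.
SELECT
  a.node_id,
  b.node_id AS connected_node_id,
  'email_fuzzy_email' AS source_query,
  0.8 AS edge_weight,              -- would require adding a weight column to the edge table
  CAST(NULL AS INT64) AS group_id
FROM workspace.nodes a
JOIN workspace.nodes b
  ON SPLIT(a.node_value, '@')[SAFE_OFFSET(1)] = SPLIT(b.node_value, '@')[SAFE_OFFSET(1)]
WHERE a.node_type = 'email'
  AND b.node_type = 'email'
  AND a.node_id < b.node_id
  AND EDIT_DISTANCE(a.node_value, b.node_value) <= 1;
```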

Appendix

Seeding groups between multiple runs

Using a graph database

Footnotes

[2]: We could just use the edge_id if we want, but I prefer to use row_number() to avoid massive negative numbers. We can always squash the results with a dense_rank() after we’re done.

[3]: You could adapt this process to work on any tabular, SQL-based DBMS (like Snowflake). You could also use a graph database.

[4]: farm_fingerprint() is used here for terse output. Hash collision likelihood is an example of the birthday problem, so you’ll want to use a beefier hash function for large datasets. Check this out for examples of hash size and collision likelihoods.

[5]: For 800 million records, FARM_FINGERPRINT would be insufficient. See [4] above.
