This story is a followup to my previous story Complex Deduplication in BigQuery.

Survivorship is the process of choosing canonical values from deduplicated data. The results are often called master data or golden records.

1. Preparation

Survivorship requires a dataset with some kind of group_id field. My previous story demonstrates how to deduplicate a dataset, so we’ll skip that step for now and start with a pre-grouped source table.


This story covers the first step in the deduplication and survivorship process used in master data management. If you’re interested, check out the followup story Complex Survivorship in BigQuery.

Deduplication is challenging with tabular data. You may have customers signing up for promotions with different email addresses, enrichment data that only joins on phone number, multiple emails or addresses for a given customer, missing values, and a multitude of other challenges. On top of all of that, deduplication is inherently a graph problem — specifically connected component subgraph extraction or union find. …


How to develop a better understanding of ASTs by building our own Python linter from scratch

The goal of this story will be to give you a solid understanding of ASTs, a working example of a custom linter in Python, and demystify how code is reasoned about programmatically.

ASTs

One of the first steps in compiling code is converting it into an AST or Abstract Syntax Tree. ASTs turn code into a tree/graph of functions. In the code example below you’ll see how to generate them for Python, in Python.

The next few sections deal with AST basics. If you have a solid understanding of them and would like to get right to the code, click here


A self-contained, private blockchain implemented within a basic messaging app in SQL Server.

Intro

In this post I’m going to demonstrate how a blockchain works using nothing but SQL Server. I’m not suggesting this as an ideal implementation. I decided to make this demo self-contained since I’m focusing mostly on the schema and data, my target audience is database professionals, and SQL Server has all the required cryptographic functions.

Probably the most well known use of a blockchain is Bitcoin and there are tons of resources for learning how bitcoin works and for learning how blockchains work. In this example I’ll be starting with the basics and adding features incrementally.

Definitions

When people use the…


Lookup data never really felt like data to me

Definitions

Lookup Table: A table with 2 fields at a minimum, a meaningless primary key and a text value used in normalization to avoid repeating textual values and referenced by foreign keys in other tables. I’ve also heard them called Reference Tables, and Domain Tables. These tables are not changed during the normal course of business (unlike something like ProductCategory which looks the same but I’d argue isn’t a lookup table). This is my definition so don’t take this as gospel.

Examples

State table could go either way. That could be maintained by your application.

SQL Server Data Tools (SSDT): Functionality for developing, testing, deploying, and refactoring SQL Server databases in Visual Studio that…

Benjamin Campbell

I'm a software architect, data engineer, surfer, and musician who loves problem solving and interesting tech.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store