This story is a sequel to my previous story Complex Deduplication in BigQuery.
Survivorship requires a dataset with some kind of
group_id field. My previous story demonstrates how to deduplicate a dataset, so we’ll skip that step for now and start with a pre-grouped source table.
Deduplication is challenging with tabular data. You may have customers signing up for promotions with different email addresses, enrichment data that only joins on phone number, multiple emails or addresses for a given customer, missing values, and a multitude of other challenges. On top of all of that, deduplication is inherently a graph problem — specifically connected component subgraph extraction or union find. …
The goal of this story will be to give you a solid understanding of ASTs, a working example of a custom linter in Python, and demystify how code is reasoned about programmatically.
One of the first steps in compiling code is converting it into an AST or Abstract Syntax Tree. ASTs turn code into a tree/graph of functions. In the code example below you’ll see how to generate them for Python, in Python.
The next few sections deal with AST basics. If you have a solid understanding of them and would like to get right to the code, click here…
In this post I’m going to demonstrate how a blockchain works using nothing but SQL Server. I’m not suggesting this as an ideal implementation. I decided to make this demo self-contained since I’m focusing mostly on the schema and data, my target audience is database professionals, and SQL Server has all the required cryptographic functions.
Probably the most well known use of a blockchain is Bitcoin and there are tons of resources for learning how bitcoin works and for learning how blockchains work. In this example I’ll be starting with the basics and adding features incrementally.
When people use the…
Lookup Table: A table with 2 fields at a minimum, a meaningless primary key and a text value used in normalization to avoid repeating textual values and referenced by foreign keys in other tables. I’ve also heard them called Reference Tables, and Domain Tables. These tables are not changed during the normal course of business (unlike something like ProductCategory which looks the same but I’d argue isn’t a lookup table). This is my definition so don’t take this as gospel.
SQL Server Data Tools (SSDT): Functionality for developing, testing, deploying, and refactoring SQL Server databases in Visual Studio that…
I'm a software architect, data engineer, surfer, and musician who loves problem solving and interesting tech.