Blockchain by Example in SQL Server
A self-contained, private blockchain implemented within a basic messaging app in SQL Server.
Intro
In this post I’m going to demonstrate how a blockchain works using nothing but SQL Server. I’m not suggesting this as an ideal implementation. I decided to make this demo self-contained since I’m focusing mostly on the schema and data, my target audience is database professionals, and SQL Server has all the required cryptographic functions.
Probably the most well known use of a blockchain is Bitcoin and there are tons of resources for learning how bitcoin works and for learning how blockchains work. In this example I’ll be starting with the basics and adding features incrementally.
Definitions
When people use the term blockchain they are generally talking about the Bitcoin blockchain. At its most basic, a blockchain is just rows of data that reference previous rows (a linked list) with a hash that contains the previous row’s hash. Bitcoin has additional features allowing it to be updated publicly, making it “trustless”. I’m going to start with the most simple implementation and build (almost) up to that.
First there are a few cryptography concepts you’ll want to be familiar with.
- Cryptographic Hash (specifically SHA256): A function that for a given input of any size outputs a deterministic, fixed length “signature”. It must be very sensitive to changes in the input, outputting a totally different, random looking signature for even a single bit change (the avalanche effect). It also must be fast to compute, free of collisions (2 different inputs giving the same output), and non-reversible (can’t determine the input form the output). Try it out here.
SHA256("Hello") = 0x185f8db32271…
SHA256("Hello!") = 0x334d016f755c…
- Public or Asymmetric Key Cryptography: An encryption method using 2 keys where the key used to encrypt a message (the public key) is different from the key used to decrypt it (the private key). Think of the public key as a lock and the private key as a key to that lock. I can put a huge pile of these locks out in the middle of a public place and you grab one, write me a private message, put it in a box, and lock it with my lock. Now you can leave the locked box out in public or have anyone deliver it to me, even someone you don’t trust. No one can open it but me, not even you. These key pairs can also be used to sign messages. I sign it with my private key and you can confirm that I signed it using my public key but are unable to forge my signature since you lack my private key. Bitcoin uses public key crypto to secure the coins so only the person with the right private key can spend them. I’ll use asymmetric key encryption in my example as an alternative to mining and distributed trust by signing blocks instead of hashing them.
- Nonce: A meaningless piece of information for one-time use. In the case of a blockchain the nonce is used to mine blocks. The nonce gives the miner control over part of the data being hashed allowing them to change the hash in an attempt to create one with special properties, in this case a value below a certain threshold. Mining is really just randomly picking integer values for the nonce and rehashing everything until you “win the lottery” getting a hash low enough (starts with a bunch of zeroes). Check it out below. In the first example I computed the hash of my data without the nonce and then after trying 2,081,747 values I won the hash lottery and got a low enough hash value.
SHA256("Hello!") = 0x334d016f755c…
SHA256("Hello!|Nonce=2081747") = 0x0000048817E2…
- Merkle Tree: A binary tree of hashes, the leaves being data blocks, allowing you to check quickly if a certain leaf is valid even if there are many leaf nodes. This is very similar to an index seek but for validation rather than retrieval. Another good explanation here.
- Block/Blockchain: This isn’t technically a cryptography concept but it may help to explicitly define what a block and a blockchain actually is. A block in a blockchain is a row of data that contains a hash of some underlying data (like a group of ledger transactions), a reference to the prior block and its hash(except for the first or “genesis” block), and a hash of all the data in the row. In any implementation the block has more features but at its core these fields give a block its ability to guarantee the trustworthiness of the content and order of some underlying data. In my example and in Bitcoin the MerkleRoot field is the UnderlyingDataHash and we also have a TransactionCount, Version, and CreatedDateTime. In a public blockchain you’ll need Nonce and Difficulty fields to do the mining (and the BlockHash will start with zeroes). In a private blockchain BlockHash can be replaced with a BlockSignature using public key encryption instead of a mined SHA256 hash since there is only one party writing to the table. For what in my opinion is the best diagram check out “Merkle Proofs in Bitcoin” here.
Getting Started
Blockchains can be used to protect the order and integrity of any type of transaction. While the obvious use-case is cryptocurrencies I’ve decided to create a demo using a messaging app. With each iteration of the code a demo.sql
file will change allowing you to try out certain scenarios.
Existing App (Code Here)
This is the most basic messaging app I could think up. We have users that can send messages with a subject and a body to one another. We also track when the message was read. MessageID is sequential.
Upgrade 1 — The Most Basic Blockchain (Code Here)
All we did is add 4 columns to the Message table. Each message is also a block.
- MessageHash: A SHA256 hash of all the fields in the table with the exception of ReadDateTime since it can be updated.
- HashVersion: An INT telling us the version of the logic used to concatenate field values into a delimited string. If this field wasn’t there, we could never add fields to this table or fix bugs in the hash logic. This allows backward compatibility when verifying old rows.
- PrevMessageID: References the message prior to this one. Note that I’m assuming the MessageID is sequential. There is a unique index on this field so there can only be one NULL (the first block, also called the genesis block) and one child per parent. This creates the “chain” with messages as blocks.
- PrevMessageHash: This is the MessageHash from the row referenced by PrevMessageID. This value is included in the current message’s hash. This means that all prior data is rolled up into this single hash. If someone were to alter anything in the past they’d have to redo all hashes from that message forward.
So what did that get us? When a user creates a message in our system we can return the MessageHash as a receipt. The user can use this to ensure that all message data and the order of messages in the system haven’t been tampered with up to this point. They’d need a trusted third party to verify this like an auditor. A third party could rehash all the messages and ensure all hashes match all receipts if needed. If any data corruption or manipulation occurs they’d be able to catch it.
The big win here is that we have a single BINARY(32) value that is a fingerprint of all the data in our system including its order. The hash will change if anything it protects is changed.
What are we lacking? We can’t protect the ReadDateTime or any other future updateable field since you can’t change the hash once a new message is written without changing all the subsequent hashes.
Also, users need to keep those hash receipts or there’s no way to verify anything. A hacker or a bad admin at our firm could update all the hashes and it’d be your word against ours. For this to be robust we’d need some trusted third party to hold all our message hashes and verify messages if needed which would get tedious.
Upgrade 2 — Transaction Table (Code Here)
The Message table is back to its original schema, we’ve moved the chaining and hashing fields (TransactionHash, HashVersion, PrevTransactionID, PrevTransactionHash) to the Transaction table, and added a TransactionType. Finally MessageID references the message we’re protecting whether it’s the entire message, a mark read, or mark unread event. TransactionHash for message sends is still computed from data in the Message table (in addition to necessary Transaction fields) but read and unread transactions are just hashes of the MessageID, the TransactionDateTime, and the PrevTransactionHash.
Now what do we have?
We can now protect any updates and can even protect things other than messages if we add another table and FK field in Transaction.
What are we lacking?
We still have the same problems requiring third party verification. If we wanted to eliminate the need for third party verification we’ll need to either mine the blocks or sign them. We also have a tight 1:1 coupling between blocks and transactions. This will cause major scaling problems once we introduce signatures and/or mining.
Upgrade 3 — Block Table (Code Here)
Chaining is moved to the Block table but Transactions still store their own hashes. BlockID has been added to Transaction allowing many transactions to be rolled into the same block. BlockID and PrevBlockID now form our chain. TransactionCount is just an aggregate from the Transaction table. While this may look like bad normalization, we want to protect the count in the hash so we explicitly store this to speed up verification. Instead of hashes we use signatures on the blocks (BlockSignature, PrevBlockSignature, and SignatureVersion replace Hash, PrevHash, and HashVersion fields). There are a few new concepts added to this table.
- Signatures on blocks instead of hashes
- Nonce — for mining even though we technically don’t need to mine when using signatures
- MerkleRoot — the root of the merkle tree of transaction hashes defined above
Signatures: Using signatures is a departure from Bitcoin. In the Bitcoin block table they still use hashes. Earlier, in my definition of public key crypto, I talked about how I’d use signatures as an alternative to mining. This is where that becomes important. That signature is created using the private key portion of an X.509 certificate. Trustworthiness of the block comes from your ability to protect your private key instead of the proof of work from the hash value itself. Verification is simple. Either you could register your cert with a certificate authority or just distribute the public key yourself. Anyone can safely be granted access to the signatures and public key. This allows confirmation that you signed a given block. As long as you keep the key safe (and out of SQL Server) you’re protected against hackers and rouge admins. In this example the entire cert (public and private keys) lives in SQL Server for simplicity. In real life you’d use a key vault and do the signing with whatever client is writing the block. Neither devs nor DBAs would have access to the private key or the vault. However, you’d still want to have the public key in SQL Server to allow you to quickly verify blocks at rest.
Nonce: The nonce is just an int that we can change in an attempt to get a special signature value. Changing this field and redoing the hash (or for our example, the signature) is what mining is. This is also referred to as a “proof of work” since the only way to discover this special value is through brute force. You’ll see in my example that I chose random values for the Nonce but in real life you’d just want to increment it. This is because incrementing is faster, dupe free, and no two miners will get the same hash since they include a transaction paying themselves the reward. In Bitcoin the target time between additions of blocks is 10 minutes. If miners regularly beat that time (using a 2 week moving average) the difficulty goes up (the number of zeroes required at the start of the hash/signature goes up) and if they lag behind the difficulty is reduced. When you add the next block you get a reward which decreases over time but is currently around 13 bitcoins in addition to a small transaction fee added to each transaction to entice miners to add it to their current block. At $10,000 per Bitcoin that’s some good money! There’s a reason Bitcoin mining uses more electricity than many small countries. If someone wanted to update data they’d need to mine all their own blocks in order to get them accepted. Anything altered in the past would require re-mining all blocks that followed. That’s why old block are more trustworthy. There is a ton of proven work following them that increases with each new block.
Here’s some super basic mining code:
MerkleRoot: This is the root of a binary tree of all the underlying TransactionHashes from the Transaction table. The reason we don’t just hash all the individual transaction hashes in a straight line is to allow us to verify a single transaction quickly (log(N) hash computations, N being the transaction count). In this version of the code, intermediate merkle nodes are added as additional transactions with a special TransactionType and their order is very important since they actually form a tree. This is how Bitcoin works but in the next upgrade, I’ll split the intermediate nodes and add a Depth field out to speed up verification and better demonstrate their function.
Branching: I left this concept out of my block table but it’s worth discussing. Bitcoin has a BranchID field which allows for cases where 2 people mine the next block at the same time. This is rare but does happen. When that happens, 2 blocks with the same parent are created. If there’s ever a branch all clients are instructed to follow the path with the most work (the lowest hash value). Note that the hashes will never be the same since miners also add a transaction paying themselves the mining reward before they mine. The chances that the next block after a branch will be mined simultaneously is even more rare and once it is added, more and more people will start mining against that branch and the loser will be abandoned. Anyway, with a private blockchain, mining is an optional security enhancement and you don’t need branching.
Upgrade 4 — Intermediate Merkle Node Table and Verification Proc (Code Here)
Here I’ve added a table to store the merkle nodes. These are no longer added to the Transaction table. I’ve also added a proc to verify a given transaction by climbing down the merkle tree. Each NodeHash is eitherSHA256(TransactionID1 + TransactionID2)
for Depth == 1
or SHA256(MerkleTreeIntermediateNodeID1 + MerkleTreeIntermediateNodeID2)
. Depth tells you how many hops you are from a a leaf with leaves being Depth == 0
. This doesn’t add any features we didn’t have before but it’s perhaps better normalization and it makes verification much easier. Check out demo.sql
and the inner workings of [dbo].[TransactionVerify]
.
Conclusion
That’s it, a self-contained, private blockchain implemented on a basic messaging app. You could use it on a ledger, state machine, contract, on-boarding process, or anything else where you want to provide independent verifyablility of data integrity and order.
Let me know if there’s anything else you’d like to see in this example and feel free to ask any questions you may have.