George Samman is a blockchain and cryptocurrency consultant and advisor who recently co-authored a seminal report on blockchain architecture with KPMG.
Here, Samman explains how the consensus algorithm achieved by Raft was finally fixed by its distant relative, Kadena.
This article covers Kadena's blockchain. It uses ScalableBFT to offer high-performance (8,000-12,000 transactions per second) with full replication and distribution at previously impossible scales (the capacity for more than 500 participating nodes).
This, along with the multi-layered security model and incremental hashing allow for a truly robust blockchain. Based on Raft and Juno, Kadena embeds a full smart contract language (Pact) into its blockchain that can be run as either public (plain text) or private (double-ratchet encrypted) transactions.
It is a huge step forward in the blockchain space, possibly representing a new-generation of blockchain technology entirely by its introduction of the idea of "pervasive determinism".
Similar to bitcoin, Kadena's blockchain is tightly integrated, and understanding what it is capable of, and what these capabilities imply, requires covering a considerable amount of ground. As such, I've broken the article into three parts: 1) Introduction & Raft, 2) Kadena's Predecessors - Tangaroa & Juno, and 3) Kadena's Blockchain - ScalableBFT, Pact and Pervasive Determinism.
Part 1: Introduction and the Raft Consensus Algorithm
The history behind Kadena is an interesting case study in the new field of blockchain consensus algorithms and distributed computing.
Kadena is a 'distant relative' of the Raft consensus algorithm. The Raft consensus mechanism was followed by Tangaroa (a Byzantine Fault Tolerant (BFT) Raft) and the JP Morgan project Juno (a fork of Tangaroa), neither of which are longer under active development.
JP Morgan's new blockchain Quorum is very different from Juno and uses a fusion of ideas from sidechains and ethereum - public smart contracts are allowed on the blockchain in addition to private contracts, which are represented as encrypted hashes and replicated via side-channels.
Kadena is the "next generation Juno". It uses a new, but related, protocol called ScalableBFT that was spawned from the open-source code of the Juno project and was built by the two key developers who built Juno. Before diving deep into Kadena, a brief history and description of Raft and the predecessors to Kadena need to be discussed.
The Raft consensus algorithm is a single leader-based system for managing a replicated log. It uses a replicated state machine architecture and produces a result equivalent to Paxos, but is structurally different.
Keeping the replicated log consistent is the job of the consensus algorithm. In this model, the leader does most of the work because it is issuing all log updates, validating transactions, and generally managing the cluster. Raft consensus guarantees a strict ordering and replication of messages. It does not care what the messages contain.
A new leader is elected using randomized timeouts, which are triggered if a follower receives no communication from the leader before the timeout fires. These are called "heartbeats".
If the follower receives no communication over this time period, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster (nodes in the network) becomes the new leader. Leaders typically operate until they fail. The heartbeats are sent out to make sure the leader is still there; if nothing is received a new election takes place.
The following stages are how Raft comes to consensus:
- A cluster of Raft node servers is started with every node launching as a "Follower". Eventually, one node will timeout, become a candidate, gain a majority of the votes and become the leader.
- Each node stores a log containing commands. It is the Leader's job to accept new commands, strictly order the commands in its log, replicate its log to its followers, and finally inform followers when to commit logs that they have replicated. The consensus algorithm thus ensures that each server's logs are the same order.
- Logs are "committed" when they have been replicated to a majority of nodes. The leader gathers the replication count and, upon a majority being seen, commits its own new log entries and informs its followers to do the same.
- Upon "commit" the command in each log entry is evaluated by a State machines. Because Raft is indifferent to the body of the command, any state machine can process committed entries. Moreover, consensus assures that command execution always takes place in the same order as the commands come from the Log which is strictly ordered.
- State machines will remain consistent as long as command executions are deterministic.
- When a client sends a command to one of the servers, that server will either forward the command to the leader or is the leader. The leader collects the new command, assigns it a Log Index, encapsulates it in a Log Entry, and adds the command to the uncommitted portion of its log.
- Whenever the leader has uncommitted entries, it replicates this portion of the log to its followers. When the leader is informed of successful replication by a majority of the cluster, it commits the new entries and orders its followers to do the same.
- Whenever a new log entry is committed consensus about this entry has been reached. It is then evaluated by the state machine on each server.
- From this point on, Raft is finished and implementers can decide how to handle responses; replying to the client or waiting for the client to query for the result.
Responses to the client are generally asynchronous.
The Raft consensus protocol is just that - a consensus algorithm. It does not have a notion of and is, by default, fully open to any client issuing commands. The only participation restriction it makes is on what nodes exist at a given time.
Moreover, the leader has absolute authority over the cluster and orders the followers to replicate and commit. It does not assume Byzantine attacks, it needs to only handle crash faults, because nodes are assumed altruistic.
Part 2: Kadena's Predecessors - Tangaroa and Juno
Tangaroa: The first step towards a BFT Raft
Tangaroa is Byzantine Fault Tolerant (BFT) variant of the Raft consensus algorithm inspired by the original Raft algorithm and the Practical Byzantine Fault Tolerance (PBFT) algorithm.
Byzantine fault tolerance refers to a class failures caused by malicious nodes attacking the network. If some of the nodes go down it is imperative for the network to continue running without stopping.
In standard Raft, you need to replicate a log entry to a majority of nodes in the cluster before committing it. For BFT consensus algorithms, including Tangaroa, the required cluster size is at least 2f + 1, where f is the number of failures you want to tolerate (including both crashed nodes and compromised nodes). Consensus is achieved by a majority vote of the cluster; if f <= 3 then cluster size = 7 and non-byzantine nodes = 4. Some BFT protocols can even require 3f+1.
A Byzantine Leader can decide to arbitrarily increase the commit index of other nodes before log entries have been sufficiently replicated, thus causing safety violations when nodes fail later on. Tangaroa shifts the commit responsibility away from the leader, and every node can verify for itself that a log entry has been safely replicated to a quorum of nodes and that this quorum agrees on an ordering.
Tangaroa allows clients that interrupt the current leadership if it fails to make progress, in the same way that other BFT Consensus algorithms allow to client to behave as a trusted oracle to depose certain nodes. This allows Tangaroa to prevent Byzantine leaders from starving the system but is very trusting of the client.
Leader election and stages
Tangaroa uses Raft as the foundation for consensus; thus there is a single leader. In Tangaroa, as in Raft, each node is in one of the three states: leader, follower, or candidate.
Similar to Raft, every node starts as a follower, one of which will eventually timeout and call an election. The winner of the election serves as the leader for the rest of the term; terms end when a new leader is elected. Sometimes, an election will result in a split vote, and the term will end with no leader. In this case, a follower will again time out (timeouts are reset when a vote is cast or an election is called) and start the voting process again.
To begin an election, a follower increments its current term and sends RequestVote (RV) Remote Procedure Call (RPCs) in parallel to each of the other nodes in the cluster asking for their vote. The RPCs Tangaroa uses are similar to Raft's RPCs with the exception that every RPC is signed and validated via PPK signatures.
RPCs allow for a data exchange between different computers residing on a network and the signatures allow for receiving nodes to verify which node sent the RPC in addition to allowing any node to forward any other node's RPC at any time.
When a Tangaroa node receives a RV RPC with a valid signature, it grants a vote immediately only if it does not currently have a leader (only occurs at startup). Otherwise, it begins the process that Tangaroa calls a "LazyVote."
The purpose of a LazyVote is to protect non-Byzantine Followers from electing a new leader when the leader is not faulty; without lazy voting, a byzantine node could trigger repeated elections at any time and starve the system. When a new RV is received by a follower, it saves the RV and waits for all of the following conditions to be met:
a) The follower's election timeout triggers fires before it handles a heartbeat from its current leader. If a heartbeat is received, the LazyVote is cleared.
b) The RV's new term is greater than its current term.
c) The request sender is an eligible candidate (valid PPK signature and the client hasn't banned the node).
d) The node receiving the RV has not voted for another leader for the proposed term.
e) The candidate shares a log prefix with the node that contains all committed entries. A node always rejects the request if it is still receiving heartbeat messages from the current leader, and it ignores the RequestVote RPC if the proposed term has already begun.
If a RequestVote is valid and for a new term, and the candidate has a sufficiently up-to-date log, but the recipient is still receiving heartbeats from the current leader, it will record its vote locally, and then send a vote response if the node itself undergoes an election timeout or hears from a client that the current leader is unresponsive.
Under lazy voting, a node does not grant a vote to a candidate unless it believes the current leader is faulty. This prevents nodes that start unnecessary elections from gaining the requisite votes to become leader and starve the system.
Nodes wait until they believe an election needs to occur before ever casting a vote. Once a vote is sent, the node will update its term number. It does not assume that the node it voted for won the election however, and it will still reject AppendEntries (AE) RPCs from the candidate if none of them contain a set of votes proving the candidate won the election. AE's serve the dual purpose of heartbeats and carriers of new log entries that need replication. The candidate continues in the candidate state until one of the three things happens:
a) It wins the election by receiving a majority vote from the cluster. A candidate must save these votes - RequestVoteResponse (RVR) RPCs - for future distribution.
b) Another node establishes itself as a leader
c) A period of time goes by with no winner (ie: it experiences another election timeout)
A candidate that wins the election then promotes itself to the leader state and sends an AE heartbeat messages that contains the votes that elected it and the updated term number to establish its authority and prevent new elections. The signed votes effectively prevents a byzantine node from arbitrarily promoting itself as the leader of a higher term. Moreover, each follower performs a recount on the aforementioned majority vote, validating and counting each vote the new leader transmitted to independently verify the validity of the election.
Like Raft, Tangaroa uses randomized timeouts to trigger leader elections. The leader of each term periodically sends heartbeat messages (empty AE RPCs) to maintain its authority. If a follower receives no communication from a leader over a randomly chosen period of time, the election timeout, then it becomes a candidate and initiates a new election.
In addition to the spontaneous follower-triggered elections, Tangaroa also allows client intervention: when a client observes no progress with a leader for a period of time called the progress timeout, it broadcasts UpdateLeader RPCs to all nodes, telling them to ignore future heartbeats from what the client believes to be the current leader in the current term. These followers will ignore heartbeat messages in the current term and time out as though the current leader had failed, starting a new election.
The data (new commands) come from clients of the Raft cluster, which send requests to the leader. The leader replicates these requests to the cluster, and responds to the client when a quorum is reached in the cluster on that request.
What constitutes a "request" is system-dependent. How data is stored is system-dependent. It's important for state to persist to disk, so that nodes can recover and remember information that they have committed to (which nodes they voted for, what log entries they have committed, etc). Without this, the protocol will not work.
Tangaroa adds BFT to Raft evolution
The JP Morgan project Juno is fork of Tangoroa and was a proof of concept that was able to scale Tangaroa to include up to 50 nodes and raise transaction speed up to 5,000 transactions per second.
The JPM team behind Juno saw the potential that a Tangaroa-like approach represented - a high performance private blockchain. They iterated on the idea for a year and open sourced the project in February 2016. They added a smart contract language, fixed some design mistakes and succeeded in achieving a 10x performance increase which allowed for the number of nodes to vote to change while the system was running. Juno allowed for the adding and removing of nodes, and was a permissioned distributed system in which all of the nodes in the network were known.
The stages of the mechanism and the leader election process are the same as Tangaroa (see above). Similarly, a transaction is considered live once it is fully replicated and committed to the log.
The leader decides the order of the commands and every node validates. Each node independently decides when to commit a log entry based on evidence it receives from other nodes. Every log entry is individually committed and incrementally hashed against the previous entry. It takes approximately 5ms for a single log entry to go from leader receiving the entry to full consensus being reached and network latency.
Part 3: Kadena's Blockchain - ScalableBFT, Pact, and Pervasive Determinism
Different to Raft, each replica in a BFT Raft system (a family of algorithms that include Tangaroa, Juno, and Kadean's ScalableBFT) computes a cryptographic hash every time it appends a new entry to its log. The hash is computed over the previous hash and the newly appended log entry.
A node can sign its last hash to prove that it has replicated the entirety of a log, and other servers can verify this quickly using the signature and the hash. BFT Raft nodes and clients always sign before sending messages and reject messages that do not include a valid signature.
BFT Rafts use Incremental Hashing enabling nodes to be certain that both the contents and ordering of other node's logs match their own. Using this knowledge, nodes can independently commit log entries safely because both the contents and ordering of other node's logs attested to via matching incremental hashes.
BFT Rafts uses digital signatures extensively to authenticate messages and verify their integrity. This prevents a Byzantine leader from modifying the message contents or forging messages and protects the cluster generally from a large number of Byzantine attacks.
In Raft, a Leader is elected via randomized timeouts that trigger a Follower to propose itself as a Candidate and request votes. ScalableBFT also does this, but in a cryptographically secured way. For instance, if a Leader becomes unreachable, a timeout would trigger a new election but the election process is robust against Byzantine nodes declaring elections. ScalableBFT fixes the issues that Juno and Tangaroa encountered regarding lazy voting.
The Leader's only unique capabilities are: 1) ordering of new transactions prior to replication and 2) replicating new transactions to Follower nodes. From that point on, all nodes independently prove both consensus validity and individual transaction integrity.
The removal of anonymous participation is a design requirement for private blockchains, and this allowed for a high performance BFT Consensus mechanism to replace mining. ScalableBFT's primary addition to the family of BFT Rafts is the ability to scale into the 1000's of nodes without decreasing the system's throughput.
Every transaction is replicated to every node. When a majority of nodes have replicated the transaction, the transaction is committed. Nodes collect and distributed information (incremental hash) about what they have replicated and use this information to independently decide when to commit (>50% of other nodes send them incremental hashes for uncommitted transactions that they agree with).
It basically works by doing a majority vote on what to commit. Committing a transaction doesn't mean it will be executed, just that it has been permanently replicated by a majority of the cluster. Bad transactions, ones that error or have bad signatures, are replicated as well as consensus' job is to provide perfect ordered replication.
Committing a transaction allows each node to then independently evaluate (parse/decrypt/validate crypto/execute/etc...) each transaction in an identical way. Every transaction gets paired with an output, this can range from "bad crypto" to the output of the smart contract layer (which can also be an error).
Finally, besides the leader replicating new transactions to every node the nodes are more or less independent. Instead of "syncing" they broadcast "I have replicated up to log index N and it has an incremental hash of H" to the cluster and collect this information from other nodes - based on the results from other nodes each node can independently decided if the cluster has replicated past the bar needed to commit (a majority replication for some as of yet uncommitted log index N).
Here's the subtle part: the incremental hash implies replication of all that came before it. If the leader replicates 8,000 new transactions (which is what it currently does), each node need only distribute and gather evidence for the last transaction of that batch as it implies correct replication of the ones that came before it. Instead of sending 8,000 messages (one for each transaction) that attest to proper replication nodes only discuss the latest transaction.
This is why Kadena needed so much pipelining, because the team figured out how to commit 8,000 transactions at the same speed as committing a single transaction.
ScalableBFT represents a breakthrough in field of BFT consensus as it is the first and only deterministic BFT consensus mechanism that can scale past hundreds of nodes with full replication and encryption. ScalableBFT also provides a unique security model known as pervasive determinism which provides security not just at the transaction level but at the consensus level as well while encrypting each and every transaction using the Noise Protocol (see below).
Kadena uses deterministic consensus
The consensus mechanism is deterministic if the consensus process is fully specified in the protocol and this process does not employ randomness. As was stated above, Raft, for example, uses randomized timeouts to trigger elections when a leader goes down (since the leader can't communicate "I'm about to crash", there's a timeout that trips to prompt a node to check if the leader is down) but the election isn't part of consensus at the transaction level, it is instead a means to finding a node to orchestrate consensus.
ScalableBFT is deterministic and hardened such that:
- Nodes will commit only when a majority of the cluster agrees with them
- The evidence of agreement must be fully auditable at any time
- When lacking evidence of agreement, do nothing.
Kadena is specifically designed for permissioned networks, and as such it assumes that certain attacks (like a DoS) are unlikely and are out of its control. If one were to occur, the system would either lock (all nodes timeout eventually with but an election would never succeed) or sit idle.
Once such an event ends, the nodes will come back into consensus and things will return to normal. However, in a permissioned network, administrators would have full control and kill the connection causing the issue.
Leader election is very similar to Raft in that any node can be elected leader, every node gets one vote per term, and elections are called when the randomized timeout one of the nodes fires (the timer is reset every time a node hears from the leader).
The biggest difference is that in Raft a node that gets enough votes assumes leadership, whereas in ScalableBFT a node that gets a majority of votes distributes those votes to every other node to demonstrate (in a BFT way) that it has been elected the leader by the cluster.
ScalableBFT's mechanism fixes issues seen in Juno and Tangaroa, like a "runaway candidate" where a non-Byzantine node has timed out due to a network partition but, because its term has been incremented, it can't come back into consensus and instead continues timeout then increments its term ("Runaway".)
Raft consensus guarantees a strict ordering and replication of messages; it doesn't matter what's in each message and can range from random numbers to ciphertext to plain-text smart contracts. Kadena leverages the log layer as a messaging service when running in an encrypted context; much like Signal can run Noise protocol encryption over SMS. ScalableBFT runs Noise over a blockchain.
ScalableBFT adds consensus robustness, which the layer that deals with interpreting the messages assumes as a guarantee, but also incremental hashes that assure perfect replication of messages. Noise protocol slots between consensus and smart contract execution, encrypting/decrypting messages as needed; because the messages are ciphertext only some of the normal tricks for avoiding a Cartesian blowup of live tests are needed to run per message without leaking information.
Security model/pervasive determinism
Kadena uses the term "pervasive determinism" to describe "the idea of a blockchain that uses PPK-Sig based cryptography for authorship guarantees (like bitcoin) and is composed of a fully deterministic consensus layer in addition to a Turing-incomplete, single-assignment smart contract layer.
The implications of a 'pervasively deterministic' blockchain are rather profound, as it allows for a bitcoin-ledger class of auditability to be extended deep into the consensus layer by chaining together multiple layers of cryptographic trust.
Take as an example a transaction that loads a new smart contract module called "loans". Say "loans" imports another module called "payments" that is already present in the chain. The successful import of "payments" alone implies the following (with each being fully auditable by cryptographic means):
- Who signed the transaction that loaded "payments"
- What consensus nodes were in the cluster at the time of loading
- What consensus nodes agreed that the transaction was valid
- What nodes voted for the current leader at the time of loading
- Who the leader was
- Who the previous leader was
A pervasively deterministic system allows new transactions to leverage not only the cryptographic trust that naturally occurs as transactions are chained together in a blockchain, but also the trust of how those transactions entered the ledge in the first place. In so doing, you can create a system more secure than bitcoin because the consensus process becomes as cryptographically trusted, auditable, and entangled as well, with transaction level executions implying that specific consensus level events occurred and with each implication being cryptographically verifiable.
This provides BFT not just for the consensus layer but for the transaction layer (bitcoin already does this) as well. This is different from, say, PBFT which assumes that transactions sent from the client's server are valid which leaves them with an ability to be compromised. Moreover, non-Raft BFTs generally entrust the client with the ability to depose/ban nodes. Pervasive Determinism takes an alternative viewpoint: trust nothing, audit everything.
Allowing ScalableBFT to incorporate pervasive determinism creates a completely paranoid system that is robust at each and every layer via permanent security (ie: a form of cryptographic security that can be saved to disk). It has bitcoin's security model for transactions, extends this model to the consensus level, and adds smart contracts without the need for mining or the tradeoffs that most in the industry have become accustomed to. It's a real blockchain that's fast and scalable.
I asked Will Martino (co-founder of Kadena) for the specifics of how this worked for each layer:
What is your consensus-level security model?
For replication, Kadena uses an incrementally hashed log of transactions that is identically replicated by each node. These agree on the contents of the log via the distributed signed messages containing the incremental hash of a given log index, which are then collected by other nodes and used to individually reason about when a commit is warranted. No duplicates are allowed in the log and replication messages from the leader containing any duplicates are rejected immediately.
We use blake2 hashes and Term number to define uniqueness, allowing clients of the system to not worry about sending duplicates by accident or about a malicious node/man-in-the-middle (MITM) resubmitting commands. We employ permanent security, a PPK-sig-based approach to authorship verification (or any type of approach that can be saved to disk) that is very similar to how bitcoin verifies transactions but at the consensus level (in addition to the transaction level).
This is opposed to ephemeral security which uses secured channels (TLS) for authorship validation - a vastly inferior approach where the question "who sent the transaction X?" is answered not via PPK cryptography but via a consensus-level query because any individual node is incapable of providing a BFT answer.
What is your transaction-level security model?
The ideas of ephemeral and permanent security span both the consensus and transaction level, as it is consensus that hands the smart contract execution layer individual transactions. At the smart contract/transaction level we also use permanent security as well, supporting row level public key authorization natively in Pact.
This is important because ephemeral implies that an attacker is one server away from impersonating an entity; secured channels work by point to point distribution of new transactions by the client/submitter to the cluster nodes over TLS and consensus secures that a given transaction should be committed and replicated. However, if an attacker hacks the client server holding the other end of the TLS connection, they can transact as if they were the client without the cluster being the wiser.
Permanent security, on the other hand, has many keys for individual roles in a given entity thus requiring an attacker to gain access to the individual keys; further, with permanent security the CEO's transactions are signed with a different key than the Mail Clerk's transactions vs ephemeral where the "who is sending this transaction" is determined by a "from: X" field.
If the same TLS connection is used to submit both the CEO's and the Clerk's transactions, then the authorship and authorization logic is a "because I said so/trust me" model vs a PPK-sig approach where you verify against the appropriate key before execution. Kadena's blockchain is designed to trust as little as possible; if we knew of a more paranoid or fine-grained approach than row-level PPK signatures we'd use that instead.
What is your confidential transaction model?
We use Double-Ratchet protocol (what Signal, WhatsApp, etc... use for encrypted communications) embedded into the blockchain (encrypted transaction bodies) for multi-party privacy preserving use cases. We work with the notion of disjoint databases via the 'pact' primitive in Pact - they describe a multiphase commit workflow over disjoint databases via encrypted messages.
Pact is a full smart-contract language, the interpreter of which is built in Haskell. In Kadena, every transaction is a smart contract and the Pact smart contract language is open sourced. Pact is database-focused, transactional, Turing-incomplete, single-assignment (variables cannot be changed in their lifetime), and thus highly amenable to static verification.
Pact is also interpreted - the code you write is what executes on-chain - whereas Solidity is compiled, making it difficult to verify code, and also impossible to rectify security issues in old language versions, once compiled. Pact ships with its own interpreter, but can run in any deterministic-input blockchain, and can support different backends, including commercial RDBMS. In the ScalableBFT blockchain, it runs with a fast SQLite storage layer.
Characteristics of the Kadena Blockchain
The Kadena blockchain contains all these features:
In conclusion, Kadena has developed a fully replicating, scalable and deterministic consensus algorithm for private blockchains with high performance. This blockchain solution can be a giant leap forward for financial services companies looking to employ a real private solution that remains true to many of the key bitcoin blockchain features without mining (proof of work), anonymity and censorship resistance while catering to the key design features that financial services are craving particularly scalability and confidentiality.
This article was previously published on the Sammantics blog and has been reproduced here with permission. Some edits have been made for style and brevity.
Cogs image via Shutterstock