“Practical Byzantine Fault Tolerance (PBFT)” Explained

This article is mostly a paraphrase of “Practical Byzantine Fault Tolerance” by Castro and Liskov.

Intro

By reviewing the paper “Impossibility of Distributed Consensus with One Faulty Process” [1], we went through the proof that no distributed consensus protocol can satisfy both safety and liveness in an asynchronous network setting. That also means there is no ‘perfect’ distributed consensus, so a practical implementation has to choose which of safety and liveness to compromise.

We define safety and liveness as follows (I rephrased the definitions mainly from the paper “Practical Byzantine Fault Tolerance” [2]):

  • Safety

    All non-faulty nodes agree on the sequence of requests, like a centralized implementation that executes operations atomically one at a time. In other words, once the non-faulty nodes have reached consensus, every one of them observes the identical value.

  • Liveness

    Clients that send requests must eventually receive replies. This implies that the distributed system eventually reaches consensus, in other words, agrees on the same state.

PBFT is a practical consensus algorithm that compromises on liveness. By ‘compromising’ I mean introducing some synchrony, because what we proved impossible is a perfect consensus algorithm in a fully asynchronous network environment. In plain language, PBFT introduces time-outs to ensure liveness.

This is quite interesting when compared with Nakamoto consensus, which chooses to compromise safety rather than liveness. Safety corresponds to the blockchain term ‘finality’, and we know that forks occur in Bitcoin, generated mainly by network delay. Miners building blocks on fork A and fork B respectively do not observe the identical value. Yet block production in Bitcoin never stops, even in an asynchronous network environment.

“So, which is better?”, you might ask, but that is quite difficult to answer. Let’s figure it out step by step.

Definitions

Firstly, we need to demystify a few fuzzy words in this paper before proceeding to the actual PBFT operation.

  • Replica: A node in the distributed network.
  • Fail-stop: A replica may stop operating because of an error. We call this failure mode ‘fail-stop’.
  • Byzantine Fault: A replica may deviate arbitrarily from the protocol (for example, by running modified code) and behave maliciously toward the whole network. This goes beyond a simple fail-stop, so we call such a replica a faulty node or Byzantine node.
  • Asynchronous Distributed System: Replicas are connected by a network that may fail to deliver messages or delay them arbitrarily, i.e., there are no time-outs.

The PBFT algorithm provides both safety and (compromised) liveness assuming no more than ⌊(n - 1)/3⌋ replicas are faulty. In other words, if there are f Byzantine nodes, the PBFT algorithm works only if there are at least 3f + 1 nodes in the network (and thus at least 2f + 1 non-faulty replicas). But why 3f + 1?

  1. Faulty replicas might not respond at all. So the protocol must be able to make progress after hearing from only n - f replicas, where n is the total number of replicas in the network.
  2. However, the f replicas that did not respond might merely be slow and non-faulty, while up to f of the replicas that did respond might be faulty (normal replicas cannot tell whether a responder is honest). For progress to be safe, the non-faulty responders, at least n - 2f of them, must outnumber the faulty ones: n - 2f > f, i.e., n > 3f.
  3. Therefore n > 3f.

Since n is an integer, we need at least 3f + 1 replicas in the network.
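To make the arithmetic concrete, here is a minimal sketch in Python (the function names are my own, chosen for illustration) that computes how many faults a given network size tolerates and the quorum size used later in the protocol:

```python
# A minimal sketch of the fault-tolerance arithmetic described above.
# Function names are illustrative, not part of the PBFT paper.

def max_faulty(n: int) -> int:
    """Maximum number of Byzantine replicas tolerated with n replicas: floor((n - 1) / 3)."""
    return (n - 1) // 3

def quorum_size(n: int) -> int:
    """Size of the quorum used in the prepare/commit phases: 2f + 1."""
    return 2 * max_faulty(n) + 1

for n in (4, 7, 10, 100):
    print(f"n={n}: tolerates f={max_faulty(n)} faults, quorum={quorum_size(n)}")
# n=4: f=1, quorum=3; n=7: f=2, quorum=5; n=100: f=33, quorum=67
```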

Normal-case Operation

So, finally we are here. Let’s now dig into the actual operation of the PBFT consensus algorithm. Here is an illustration of the lifecycle from request to reply:

PBFT Operation. Source: Practical Byzantine Fault Tolerance.

‘C’ denotes the client that sends a request and expects to receive the result of its operation (or computation) in the form of a ‘reply’. Replica ‘0’ is special in that it leads this stage, while 1, 2, and 3 are ordinary replicas. We call replica 0 the ‘primary’ and the others ‘backups’.

We need to keep in mind that every replica (including the primary) maintains the service state, a message log of accepted messages, and an integer denoting the replica’s current view, which is a stage of the protocol (or, in a blockchain context, something like a block number). A single view consists of roughly three phases: pre-prepare, prepare, and commit.
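As a rough mental model of that per-replica state (these are illustrative names, not the paper’s actual data structures), it might look like this:

```python
from dataclasses import dataclass, field

# A rough mental model of per-replica state as described above.
# Field and class names are my own, for illustration only.
@dataclass
class ReplicaState:
    replica_id: int
    view: int = 0                                        # current view number
    service_state: dict = field(default_factory=dict)    # the replicated service's state
    message_log: list = field(default_factory=list)      # accepted PRE-PREPARE/PREPARE/COMMIT messages

    def is_primary(self, n: int) -> bool:
        # In PBFT the primary of a view is the replica with id (view mod n).
        return self.view % n == self.replica_id
```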

As an overview of a view, a request from the client is handled in four steps: (1) the client sends a request to the primary, (2) the primary multicasts the request to the backups, (3) the replicas execute the request and send a reply to the client, and (4) the client waits for f + 1 replies from different replicas with the same result. The more detailed steps are listed below (a simplified sketch of the replicas’ quorum checks follows the list):

  1. Firstly, client C requests the execution of a state-machine (replica) operation by sending a ‘REQUEST’ message to the primary.
  2. The primary receives the request from the client. This is the ‘request’ phase of the illustration above.
  3. The pre-prepare phase now starts. The primary multicasts a ‘PRE-PREPARE’ message to all backups, containing the view and the sequence number (the order in which the request will be executed within the view) assigned to the request. This ‘PRE-PREPARE’ message is important because the information it carries is used as proof later if the primary turns out to be faulty.
  4. The backups accept the ‘PRE-PREPARE’ message from the primary after validation: they check that the message signature is valid, that it belongs to the correct view, and so on. Once the message is accepted, the pre-prepare phase ends.
  5. The prepare phase starts. Each backup multicasts a ‘PREPARE’ message to all other replicas and adds both the ‘PRE-PREPARE’ and the ‘PREPARE’ message to its local log.
  6. A replica (including the primary) accepts ‘PREPARE’ messages and adds them to its log provided they are valid: it checks that their signatures are valid, that their view number equals the replica’s current view, and so on. The replica is ‘prepared’ once it has logged the request, the ‘PRE-PREPARE’, and 2f ‘PREPARE’ messages from different backups that match the ‘PRE-PREPARE’. At that point the prepare phase ends.
  7. The commit phase now starts. Every replica multicasts a ‘COMMIT’ message to the other replicas; replicas accept commit messages and insert them into their logs after the same kind of validation as in the prepare phase.
  8. Each replica checks that the prepare phase succeeded and that it has accepted 2f + 1 ‘COMMIT’ messages from different replicas that match the pre-prepare for the request. If everything so far has succeeded, the replica executes (or computes) the request. The commit phase ends.
  9. Each replica sends a ‘REPLY’ message to the client. The message contains (1) the view number (similar to block height), (2) the timestamp of the request, (3) the result of executing the requested operation, and (4) the identifier of the replica sending the message.
  10. The client waits for f + 1 valid replies with the same result from different replicas before finally accepting the result of the operation.
  • The pre-prepare and prepare phases ensure that requests are totally ordered within a single view.
  • The prepare and commit phases ensure that requests that commit are totally ordered across views.
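To make the quorum checks in steps 6 and 8 concrete, here is a simplified sketch of the ‘prepared’ and ‘committed-local’ checks a replica might perform. The message format and helper names are simplifying assumptions of mine; signatures, watermarks, and checkpoints from the paper are omitted.

```python
# Simplified quorum checks for one (view, sequence number, request digest) slot.
# Message handling, signatures, and watermarks are omitted; names are illustrative.

def prepared(log, view, seq, digest, f):
    """True once the replica has logged a matching PRE-PREPARE
    and 2f PREPARE messages from different backups that match it."""
    has_pre_prepare = any(
        m["type"] == "PRE-PREPARE" and m["view"] == view
        and m["seq"] == seq and m["digest"] == digest
        for m in log
    )
    prepare_senders = {
        m["sender"] for m in log
        if m["type"] == "PREPARE" and m["view"] == view
        and m["seq"] == seq and m["digest"] == digest
    }
    return has_pre_prepare and len(prepare_senders) >= 2 * f

def committed_local(log, view, seq, digest, f):
    """True once 'prepared' holds and the replica has accepted 2f + 1 matching
    COMMIT messages from different replicas; the request can then be executed."""
    commit_senders = {
        m["sender"] for m in log
        if m["type"] == "COMMIT" and m["view"] == view
        and m["seq"] == seq and m["digest"] == digest
    }
    return prepared(log, view, seq, digest, f) and len(commit_senders) >= 2 * f + 1
```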

Back in the illustration above, you may notice that backup replica 3 does not respond to the messages the others have sent. We cannot tell whether it has merely fail-stopped or is in fact a Byzantine node, but either way the client will still receive a reply to its request, because there will be more than f (in this case, 1) matching replies from non-faulty replicas.

Even if replica 3 turns out to be a Byzantine node that responds maliciously, the client will receive enough correct replies to confirm the valid result.
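Here is a small sketch of that client-side rule (step 10), again with illustrative names and a simplified reply format: the client accepts a result only once f + 1 different replicas have reported the same one.

```python
from collections import defaultdict

# Client-side acceptance rule: wait for f + 1 matching replies from different replicas.
# The reply format is simplified for illustration.
def accept_result(replies, f):
    """replies: iterable of dicts like {"replica": id, "result": value}.
    Returns the result once f + 1 distinct replicas agree on it, else None."""
    voters_per_result = defaultdict(set)
    for r in replies:
        voters_per_result[r["result"]].add(r["replica"])
        if len(voters_per_result[r["result"]]) >= f + 1:
            return r["result"]
    return None

# Example with f = 1: three replicas agree, one (possibly Byzantine) disagrees.
replies = [
    {"replica": 0, "result": "ok"},
    {"replica": 1, "result": "ok"},
    {"replica": 3, "result": "bogus"},
    {"replica": 2, "result": "ok"},
]
print(accept_result(replies, f=1))  # prints "ok" once two replicas have matched
```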

View Changes

But what if the primary replica is a Byzantine node? Reaching consensus would take forever if the primary simply stopped multicasting messages, which is entirely possible in an asynchronous network setting where unbounded message delay is acceptable behavior. As we already know, PBFT introduces time-outs, i.e., synchrony, to address this critical threat to the liveness property: the distributed system must eventually reach consensus.

Each replica has its own timer that starts counting once it receives a request. When a backup’s timer expires in a given view (let us call that view ‘v’), the backup stops accepting ordinary messages and multicasts a ‘VIEW-CHANGE’ message to all replicas.

The new primary of view ‘v + 1’ waits until it has received 2f valid ‘VIEW-CHANGE’ messages from other replicas (together with its own, that makes 2f + 1) and then starts the new view ‘v + 1’ by multicasting a ‘NEW-VIEW’ message to all other replicas.
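A highly simplified sketch of this trigger follows. Real ‘VIEW-CHANGE’ and ‘NEW-VIEW’ messages also carry checkpoint proofs and prepared certificates, which are omitted here, and all names are my own illustration.

```python
import time

# A highly simplified view-change trigger. Checkpoint proofs and prepared
# certificates from the paper are omitted; names are illustrative only.
class ViewChangeTimer:
    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.started_at = None

    def start(self):
        # Started when a backup receives a request and the timer is not already running.
        self.started_at = time.monotonic()

    def expired(self) -> bool:
        return self.started_at is not None and time.monotonic() - self.started_at > self.timeout

def on_timer_expired(replica, broadcast):
    # The backup stops normal processing and asks to move to view v + 1.
    broadcast({"type": "VIEW-CHANGE", "new_view": replica.view + 1, "sender": replica.replica_id})

def on_view_change(new_primary, view_change_msgs, f, broadcast):
    # The primary of view v + 1 starts the new view after receiving 2f valid
    # VIEW-CHANGE messages from other replicas.
    senders = {m["sender"] for m in view_change_msgs if m["new_view"] == new_primary.view + 1}
    if len(senders) >= 2 * f:
        new_primary.view += 1
        broadcast({"type": "NEW-VIEW", "view": new_primary.view, "sender": new_primary.replica_id})
```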

This view-change mechanism is the most interesting part of PBFT, and it is why the consensus mechanism has the word ‘practical’ in its name. Under PBFT, all nodes process requests atomically in a serialized order with almost instant finality, a meaningful property in the context of blockchain-specific operations like P2P payments. There is always a risk of chain reorganization (reorg) in implementations that use the longest-chain rule (e.g., Nakamoto consensus), but not in PBFT, thanks to its strong safety property. This comes at a cost, however: asynchrony is compromised, as we saw in the view-change mechanism introduced to preserve liveness.

Outro

There is one more weakness in the PBFT algorithm: heavy network consumption. To finalize a single view, the system runs roughly three phases, each of which requires message multicasts. Because every replica multicasts to every other replica, the number of messages grows quadratically with the number of participating nodes, even though the node count itself grows only linearly. This is why PBFT is often combined with DPoS, which caps the number of validator nodes below a manageable threshold, and this in turn places an innate limit on the degree of decentralization in the context of blockchain.
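As a rough back-of-the-envelope illustration of that quadratic growth (the exact counts depend on implementation details; this ignores retransmissions, checkpoints, and view changes):

```python
# Back-of-the-envelope message counts per request, ignoring retransmissions,
# checkpoints, and view changes. Exact counts depend on the implementation.
def approx_messages_per_request(n: int) -> int:
    pre_prepare = n - 1            # primary multicasts to the backups
    prepare = (n - 1) * (n - 1)    # each backup multicasts to all other replicas
    commit = n * (n - 1)           # every replica multicasts to all other replicas
    reply = n                      # every replica replies to the client
    return pre_prepare + prepare + commit + reply

for n in (4, 10, 40, 100):
    print(f"n={n}: ~{approx_messages_per_request(n)} messages per request")
# The total grows roughly as n squared, even though n itself grows linearly.
```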

It is not a matter of which mechanism takes an absolute advantage over the other when choosing between BFT-family consensus and Nakamoto consensus. Each offers its own advantages and trade-offs; what is truly important is to start by defining the general purpose of the system (or the blockchain) and then choose the mechanism that ‘fits’ that purpose.

References

[1] M. Fischer, N. Lynch, and M. Paterson. Impossibility of Distributed Consensus with One Faulty Process. In Journal of the ACM, 1985.

[2] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, 1999.
