In 2008, Satoshi Nakamoto introduced Bitcoin, marking the beginning of the cryptocurrency era. However, Bitcoin's low TPS (transactions per second) cannot meet the needs of modern large-scale commercial applications: Bitcoin processes an average of 7 transactions per second, while Visa processes between 1,500 and 2,000. Consequently, enhancing the TPS of blockchain technology has become a crucial issue in the Web3 domain. Over the past decade, numerous researchers have attempted to improve blockchain performance at the consensus-algorithm level. However, improvements at this level are ultimately limited by the blockchain trilemma of decentralization, security, and scalability: enhancing scalability (i.e., TPS) often means compromising either decentralization or security. Moreover, after years of exploration, consensus-layer TPS gains appear to have reached a bottleneck.
As a result, much attention has shifted towards optimizing the execution layer to further tap into the potential of blockchain performance and break through the public chain throughput bottleneck. The most significant direction in execution layer optimization is the parallelization of transaction execution, which has led to the emergence of numerous parallel VM projects, such as Aptos, Sui, Monad, and Solana.
However, in blockchains with smart contract capabilities, transactions range from simple native token transfers to calls into smart contracts with extensive computational logic. Although each counts as a single transaction, the computational requirements can vary greatly, so TPS as a performance metric has its limitations. For example, on the Ethereum mainnet, a simple ETH transfer requires only 21,000 Gas, whereas a Swap might need up to 420,000 Gas, a 20-fold difference.
Gas Per Second (GPS) could be a more accurate metric, defined as the target Gas usage per block divided by the block time; it reflects the computational workload a blockchain network can handle per second. Since EVM-compatible chains share the Gas pricing defined by the EVM, GPS is a suitable metric for comparing performance across them. For non-EVM chains like Aptos, Sui, and Solana, different Gas pricing standards make horizontal comparisons inappropriate, though efforts are underway to establish GPS as a unified standard across all chains.
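As a rough worked example (the figures are illustrative assumptions, since gas targets change with protocol upgrades): with a target of 15,000,000 Gas per block and a 12-second block time, GPS = 15,000,000 / 12 ≈ 1.25 million Gas per second.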
At its core, a blockchain is a distributed ledger database whose essence is a replicated state machine: all nodes in the network must maintain consensus on the global state. This is achieved by periodically publishing blocks containing an ordered set of transactions, which each node executes to advance to a new state. To guarantee that every node reaches the same state after executing a block, traditional blockchains execute transactions strictly sequentially. As CPU core counts continue to grow, this approach leaves most of modern hardware idle.
Transaction execution is a CPU-intensive process, and parallelizing it makes far better use of modern multicore CPUs; this is supported by empirical evidence from prior research. In essence, parallelization distributes work across multiple computational units for simultaneous execution, which can yield a performance improvement many times greater than sequential processing, much like expanding a single-lane road into a multi-lane highway. In a blockchain, parallelization means executing a block's transactions across multiple threads while ensuring the final state is identical to sequential execution.
The core issue in parallel execution is managing dependencies between states: execution threads must not read from and write to storage unrestrainedly, or the results will be incorrect. For example, suppose that before a block is executed, Alice, Bob, and Charlie each hold 100 ETH. If Tx1 is Bob transferring 1 ETH to Alice and Tx2 is Charlie transferring 1 ETH to Alice, processing Tx1 and Tx2 in parallel could result in both transactions writing Alice's balance as 101 ETH, inconsistent with the 102 ETH produced by sequential execution.
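To make the hazard concrete, here is a minimal Rust sketch (not any chain's actual code; the account map, the deliberate sleep, and the omission of the debit side are assumptions made purely for illustration) in which two "transactions" credit Alice concurrently and one update is lost:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Ledger = Arc<Mutex<HashMap<&'static str, u64>>>;

/// Credit Alice by 1. The read and the write take the lock separately, so the
/// read-modify-write is NOT atomic -- exactly the hazard described above.
/// (The debit side of each transfer is omitted for brevity.)
fn credit_alice(balances: Ledger) {
    let old = *balances.lock().unwrap().get("Alice").unwrap(); // both threads read 100
    thread::sleep(Duration::from_millis(10));                  // widen the race window
    balances.lock().unwrap().insert("Alice", old + 1);         // both write 101
}

fn main() {
    let balances: Ledger = Arc::new(Mutex::new(HashMap::from([
        ("Alice", 100u64),
        ("Bob", 100u64),
        ("Charlie", 100u64),
    ])));

    // Tx1 (Bob -> Alice) and Tx2 (Charlie -> Alice) executed "in parallel".
    let (b1, b2) = (balances.clone(), balances.clone());
    let t1 = thread::spawn(move || credit_alice(b1));
    let t2 = thread::spawn(move || credit_alice(b2));
    t1.join().unwrap();
    t2.join().unwrap();

    // Likely prints 101: one credit was lost. Sequential execution gives 102.
    println!("Alice = {}", balances.lock().unwrap()["Alice"]);
}
```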
At the database level, sharding and DAG structures are two commonly adopted solutions to this problem, with typical examples including TON, Near, and Fantom. At the level of the blockchain execution engine, the approaches fall into two categories: State Access and Optimistic Parallelization.
In this article, we will focus on the VM layer.
The State Access method proactively identifies the specific parts of the blockchain state that transactions interact with, thereby determining the independence of transactions before execution. Sui, as a representative of this approach, adopts an object-centric hybrid model that facilitates the static capture of dependencies between transactions, enhancing parallel efficiency. Below, we will use Sui as an example to explain the State Access method.
In the design of Sui, there are two types of objects: Owned Objects and Shared Objects. Owned Objects are similar to the UTXO model, where each Owned Object can only be possessed by a single address and does not involve other states on the chain. For example, if Alice wants to transfer 1 USDT to Bob, in the account model, it would require modifying the account balances of both Alice and Bob. However, in the Owned Object model, it only requires changing the owner of the Object.
Therefore, transactions interacting with Owned Objects can be processed in complete parallel, as all modifications are confined to the Object itself.
Compared to UTXO, Owned Objects include an additional version field, allowing the state of Owned Objects to be changed multiple times within the same block. Shared Objects, on the other hand, are similar to the account model, where any transaction can potentially alter the state of a Shared Object. Hence, transactions within the same block could conflict, requiring consensus (ordering) among nodes before operations on Shared Objects can be performed.
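A conceptual Rust sketch of what an owned-object transfer looks like, compared to an account-balance update (the struct and field names below are illustrative assumptions, not Sui's actual Move definitions):

```rust
/// Illustrative owned-object model: the object carries its own owner and a
/// version counter, so transferring it touches no other on-chain state.
struct OwnedCoin {
    id: u64,       // globally unique object id
    version: u64,  // bumped on every mutation (unlike a UTXO, the object is reusable)
    owner: String, // the single address that controls this object
    amount: u64,   // e.g. 1 USDT
}

/// "Alice sends this coin object to Bob": only the object itself is mutated,
/// so two such transfers over different objects can run fully in parallel.
fn transfer(coin: &mut OwnedCoin, new_owner: &str) {
    coin.owner = new_owner.to_string();
    coin.version += 1;
}

fn main() {
    let mut coin = OwnedCoin { id: 1, version: 1, owner: "Alice".into(), amount: 1 };
    transfer(&mut coin, "Bob");
    println!("object {} v{} now owned by {}", coin.id, coin.version, coin.owner);
}
```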
Overall, the State Access method provides a more flexible architecture at the foundational level, allowing developers to explicitly define transaction dependencies in advance. To reap the benefits of parallelization, developers typically need to construct more abstract models, ensuring that the results of each transaction are predictable to avoid the high latency associated with retrospective verification.
Optimistic parallel execution is a paradigm widely used in distributed database systems. It starts from the assumption that transactions do not conflict; when a conflict does occur, the conflicting transactions are reverted and re-executed. Because it avoids the up-front overhead of synchronizing across threads, optimistic parallel execution lets transactions run in parallel to their maximal potential when conflicts are infrequent. Aptos's BlockSTM is the best-known Optimistic Parallelization algorithm in blockchain.
BlockSTM is an innovative algorithm for executing blockchain transactions based on Software Transactional Memory (STM), detecting and managing conflicts by monitoring memory access without needing to declare dependencies in advance. Leveraging blockchain features such as batch transaction submission, virtual machine execution, and pre-determined transaction order, BlockSTM enhances performance beyond traditional STM.
BlockSTM uses optimistic concurrency control and dynamic dependency estimation, employing multi-version data structures (MVMemory), cooperative schedulers, and other technologies to significantly improve transaction execution performance. Simply put, the BlockSTM execution engine optimistically executes transactions in parallel according to the set transaction order of a block, records each transaction's read-write set, and subsequently verifies to detect conflicts. If verification fails, the transaction is aborted and re-executed.
To improve the efficiency of re-execution, a multi-version concurrency control mechanism is used, maintaining multiple versions of the data written by each transaction and allowing reads from earlier versions. It uses the write-set of aborted transactions to estimate dependencies for subsequent executions, avoiding costly dependency pre-calculation. A cooperative scheduler coordinates execution and verification tasks among multiple threads, prioritizing verification and re-execution of earlier transactions to resolve dependencies of subsequent transactions and prevent cascading aborts and re-executions.
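For intuition, the following is a heavily simplified, single-threaded Rust simulation of this execute-then-validate loop over a multi-version memory. It is not Aptos's implementation: the MvMemory, Transfer, and execute names are assumptions, there is no cooperative scheduler or incarnation counter, and the "parallel" first pass is only emulated.

```rust
use std::collections::{BTreeMap, HashMap};

type Key = &'static str;
type Value = i64;
type TxIdx = usize;

/// Multi-version memory: the pre-block state plus, per key, the value written
/// by each transaction index.
#[derive(Default)]
struct MvMemory {
    base: HashMap<Key, Value>,
    writes: HashMap<Key, BTreeMap<TxIdx, Value>>,
}

impl MvMemory {
    /// Value visible to tx `idx`: the highest write from a lower-index tx,
    /// falling back to the pre-block state. Also returns which tx (if any)
    /// the value came from, which the read set uses for validation.
    fn read(&self, key: Key, idx: TxIdx) -> (Value, Option<TxIdx>) {
        if let Some(per_tx) = self.writes.get(key) {
            if let Some((&writer, &value)) = per_tx.range(..idx).next_back() {
                return (value, Some(writer));
            }
        }
        (*self.base.get(key).unwrap_or(&0), None)
    }

    fn record(&mut self, idx: TxIdx, write_set: &[(Key, Value)]) {
        for &(k, v) in write_set {
            self.writes.entry(k).or_default().insert(idx, v);
        }
    }
}

/// A toy transaction: transfer `amount` from `from` to `to`.
struct Transfer { from: Key, to: Key, amount: Value }

/// Execute one transaction, returning its read set (key -> version observed)
/// and its write set.
fn execute(tx: &Transfer, idx: TxIdx, mem: &MvMemory)
    -> (Vec<(Key, Option<TxIdx>)>, Vec<(Key, Value)>)
{
    let (from_bal, v_from) = mem.read(tx.from, idx);
    let (to_bal, v_to) = mem.read(tx.to, idx);
    (
        vec![(tx.from, v_from), (tx.to, v_to)],
        vec![(tx.from, from_bal - tx.amount), (tx.to, to_bal + tx.amount)],
    )
}

fn main() {
    let mut mem = MvMemory::default();
    for who in ["Alice", "Bob", "Charlie"] {
        mem.base.insert(who, 100);
    }
    let block = vec![
        Transfer { from: "Bob", to: "Alice", amount: 1 },     // Tx0
        Transfer { from: "Charlie", to: "Alice", amount: 1 }, // Tx1, conflicts on Alice
    ];

    // Phase 1: optimistic "parallel" pass -- every tx runs against the
    // pre-block state only, as if executing on separate threads at once.
    let mut results: Vec<_> = block.iter().map(|tx| execute(tx, 0, &mem)).collect();

    // Phase 2: validate in block order; if a read no longer matches what the
    // multi-version memory would serve, abort and re-execute the transaction.
    for (idx, tx) in block.iter().enumerate() {
        let stale = results[idx].0.iter().any(|&(k, seen)| mem.read(k, idx).1 != seen);
        if stale {
            results[idx] = execute(tx, idx, &mem); // re-execution (a new incarnation)
        }
        mem.record(idx, &results[idx].1);
    }

    // Alice ends at 102, matching sequential execution despite the conflict.
    println!("Alice = {}", mem.read("Alice", block.len()).0);
}
```

In this sketch, Tx1's read of Alice's balance is found stale during validation, so it is re-executed against Tx0's write, yielding the same final state as sequential execution.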
With the official announcement of substantial fundraising for parallel EVM projects like Monad, parallelization has gradually become a hot topic in the market. Many projects have also announced upgrades to parallel VMs, each employing different methods.
Projects such as Solana, Sui, Fuel, and Aptos have specifically designed monolithic blockchain architectures suitable for parallelization, adopting virtual machine designs different from the EVM to attempt improvements at the performance level. They rely on more robust node performance to achieve higher throughput. Among these, Solana, Sui, and Fuel use the State Access method, while Aptos employs Optimistic Parallelization.
To strengthen the connection with the Ethereum ecosystem, some projects have attempted to integrate the EVM into their existing parallelized public blockchains, such as Sei and Neon. Sei V2 has incorporated Geth into Sei nodes as part of the core Sei binary to handle Ethereum transactions; Neon has introduced the EVM onto Solana, running an EVM emulator that converts Ethereum transactions into Solana transactions for execution.
Additionally, some projects have positioned high-performance VMs as the execution layer for ETH L2, thereby establishing a connection with Ethereum. For example, Eclipse uses SVM as its L2 execution layer, and Lumio has introduced Aptos's MoveVM as the execution engine. Interestingly, Eclipse has achieved EVM compatibility through its collaboration with Neon using SVM.
Ethereum maximalists have instead parallelized the EVM directly. Early EVM-compatible Layer 1s such as Polygon and BNB Chain adopted Optimistic Parallelization and, referencing the BlockSTM design, modified the original EVM for parallel execution, achieving a modest performance improvement. Monad, Reth, and MegaETH aim not only at the virtual machine level but at a comprehensive overhaul of the entire client implementation, optimizing Optimistic Parallelization, state management, execution scheduling, database reads and writes, and network communication. In practice, this approach converges with high-performance Layer 1 public blockchains in its goals.
Substrate is a modular blockchain development framework built using the Rust programming language, offering developers a high degree of customizability to tailor modules such as account management, consensus mechanisms, and transaction processing logic according to their specific needs. The Starknet Appchain Stack Madara is developed based on Substrate. Compared to the State Access method, the BlockSTM design is completely decoupled from the VM format, making it more suitable for building modular blockchains. Therefore, we have introduced BlockSTM into Substrate and combined various optimization methods to break through throughput bottlenecks.
The operational process of Substrate BlockSTM is illustrated in the following diagram, detailing the entire sequence from block construction to execution and finally to submission.
The block production consensus first selects a Block Producer.
The Block Producer calls the Proposer module to build the block.
The Proposer module retrieves transactions in batches from the Transaction Pool.
The Proposer sends a batch of transactions to the ParallelBlockBuilder.
The ParallelBlockBuilder sends the transaction batch to the ParallelExecutor.
The ParallelExecutor initiates BlockSTM for parallel processing, with each transaction executed by an independent StateMachine.
The ParallelFrameExecutive records the execution results into the Substrate FRAME system.
After parallel transaction execution, the BlockBuilder calculates fields like StateRoot and TransactionRoot and assembles the block.
The BlockBuilder passes the BlockImportParams to the Client for block importation.
Through analysis of Substrate's code structure, we chose the StateMachine as the smallest unit of concurrency. Each StateMachine can be regarded as a stateless virtual machine, with the state needed for execution passed in as parameters through the StateMachine context at creation time. Each StateMachine receives an executor for internal execution; in simpler terms, the executor is a chain-specific VM such as the Cairo VM or the EVM. Considering Chakra's extensibility, each executor is transparent to the StateMachine. StateMachines operate in isolated environments, ensuring that different transactions do not affect each other.
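In rough Rust terms, the relationship looks something like the sketch below (the Executor trait, NoopVm, and the struct shapes are illustrative assumptions, not the actual Substrate or Chakra interfaces):

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// A chain-specific executor (EVM, Cairo VM, ...). The StateMachine drives it
/// without caring which VM it is.
trait Executor {
    /// Execute one encoded transaction against a read-only state view and
    /// return the writes it wants to make.
    fn execute(&self, tx: &[u8], state: &HashMap<Key, Value>) -> Vec<(Key, Value)>;
}

/// The smallest unit of concurrency: a stateless machine handed everything it
/// needs through its context, producing an isolated change set.
struct StateMachine<E: Executor> {
    executor: E,
    context: HashMap<Key, Value>, // snapshot of the state this tx may read
}

impl<E: Executor> StateMachine<E> {
    fn run(&self, tx: &[u8]) -> Vec<(Key, Value)> {
        // Reads come only from the per-transaction context, so StateMachines
        // running on different threads never touch each other's memory.
        self.executor.execute(tx, &self.context)
    }
}

/// A trivial executor used only to make the sketch runnable.
struct NoopVm;
impl Executor for NoopVm {
    fn execute(&self, tx: &[u8], _state: &HashMap<Key, Value>) -> Vec<(Key, Value)> {
        vec![(b"last_tx".to_vec(), tx.to_vec())]
    }
}

fn main() {
    let sm = StateMachine { executor: NoopVm, context: HashMap::new() };
    println!("{} change(s)", sm.run(b"transfer 1 unit").len());
}
```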
To detect dependencies between transactions within the same block, each transaction's read and write set must be recorded. Substrate does not provide an interface for this, so we customized the interface between the Wasm virtual machine and the host, allowing each transaction's read and write set to be recorded with minimal memory overhead. When these sets conflict, the dependency is detected and the conflicting transactions are re-executed to obtain correct results.
The Overlay Change Set can be considered a cache in front of the KV database that gives parallel execution fast access to state. When the Substrate Runtime reads or writes data, it first goes through the Overlay Change Set, which temporarily holds the transaction's read and write operations during execution. Thus, parallel transaction execution does not contend for database write locks, avoiding read-write conflicts and reducing I/O latency. After all transactions are validated, the final state is batch-written to the database, achieving asynchronous I/O.
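A minimal sketch of this idea, under assumed names (it is not Substrate's actual OverlayedChanges type): reads check the in-memory overlay before falling through to the backing store, writes are buffered, the read set is recorded as described above, and only the validated change set is flushed to the database in one batch.

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// Buffers a transaction's reads and writes on top of a read-only backend,
/// so parallel executions never contend for database write locks.
struct OverlayChangeSet<'a> {
    backend: &'a HashMap<Key, Value>, // committed state (read-only here)
    changes: HashMap<Key, Value>,     // writes made during execution
    read_set: Vec<Key>,               // keys read, for later conflict detection
}

impl<'a> OverlayChangeSet<'a> {
    fn new(backend: &'a HashMap<Key, Value>) -> Self {
        Self { backend, changes: HashMap::new(), read_set: Vec::new() }
    }

    /// Reads check the in-memory overlay first, then the backend.
    fn get(&mut self, key: &[u8]) -> Option<Value> {
        self.read_set.push(key.to_vec());
        self.changes.get(key).cloned().or_else(|| self.backend.get(key).cloned())
    }

    /// Writes stay in memory until the whole block has been validated.
    fn set(&mut self, key: Key, value: Value) {
        self.changes.insert(key, value);
    }

    /// After validation, the buffered changes are flushed in one batch
    /// (asynchronous I/O in the real system).
    fn into_change_set(self) -> HashMap<Key, Value> {
        self.changes
    }
}

fn main() {
    let mut db = HashMap::from([(b"alice".to_vec(), vec![100u8])]);
    let mut overlay = OverlayChangeSet::new(&db);
    let bal = overlay.get(b"alice").unwrap()[0];
    overlay.set(b"alice".to_vec(), vec![bal + 1]);
    let committed = overlay.into_change_set();
    db.extend(committed); // batch write after validation
    println!("alice = {}", db[&b"alice".to_vec()][0]);
}
```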
In the native Substrate proposing logic, the proposer, builder, and executor pass transactions one at a time; if the time consumed exceeds the preset limit, block packaging is terminated.
We use the batch commit method for optimization. This approach offers three distinct advantages:
a) It significantly reduces the invocation latency between different modules, as the overhead of N transaction calls is compressed into a single one.
b) It allows a block to contain more transactions.
c) It substantially increases the throughput.
For transactions that fail to be packaged within the allotted time, the executor notifies the proposer, which then returns the unpacked transactions to the mempool.
A key difference between the Substrate Runtime and Aptos's design is that contracts are isolated from one another; as a result, during transaction execution in Substrate, certain global storage, which we refer to as global Keys, may be accessed. Using BlockSTM's MVMemory, we resolved the read and write issues around Substrate's global Keys, achieving efficient State Merging.
For global keys, the write content of each transaction is considered non-conflicting. To enhance parallelism, we optimize this case by partitioning the global keys; after all transactions are completed, we assemble the write sets of these Keys in version order. During BlockSTM incarnation execution, no data is written to MVMemory; instead, changes are applied to MVMemory only after a transaction succeeds. Therefore, unlike Substrate's Overlay Change Set, no complex rollback mechanism is needed, significantly enhancing the efficiency of state management.
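As a small illustration of the version-ordered merge (the key's content, here an event-log-style list of strings, and the simple append-merge policy are assumptions made for the example):

```rust
use std::collections::BTreeMap;

type TxIdx = usize;

/// Per-transaction writes to one global key (e.g. the block's event log).
/// Writes to a global key are treated as non-conflicting, so they never
/// trigger re-execution; they are only merged once the block is done.
fn merge_global_key(writes: BTreeMap<TxIdx, Vec<String>>) -> Vec<String> {
    // BTreeMap iterates in ascending key order, i.e. in transaction/version
    // order, so the merged result matches sequential execution.
    writes.into_values().flatten().collect()
}

fn main() {
    let mut writes: BTreeMap<TxIdx, Vec<String>> = BTreeMap::new();
    // Transactions finish in an arbitrary (parallel) order...
    writes.insert(2, vec!["event: tx2 transfer".into()]);
    writes.insert(0, vec!["event: tx0 transfer".into()]);
    writes.insert(1, vec!["event: tx1 swap".into()]);
    // ...but the merge is deterministic and ordered by version.
    for e in merge_global_key(writes) {
        println!("{e}");
    }
}
```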
In our implemented MVP, a 4-thread Substrate BlockSTM already achieves a TPS of over 5,000, a 5.6x improvement over the original Substrate engine, exceeding the speedup ratios reported in Aptos's BlockSTM benchmarks. If the thread count is increased to 64, comparable to current high-performance public blockchains, our solution could potentially reach a TPS of 100,000. Optimization of the parallel VM is ongoing, and standards for more representative performance metrics such as Gas Per Second are being developed.
Chakra Chain will become one of the first beneficiaries, serving as the first 100,000 TPS high-performance chain in the Bitcoin ecosystem, contributing to the advent of the Bitcoin DeFi Summer.
We have also contributed the BlockSTM solution to the Starknet Appchain Stack Madara, which is expected to elevate Madara's sequencing performance to new heights. Starknet will officially introduce BlockSTM in its next network upgrade, enhancing mainnet performance to 300-1,000 TPS.
Learn more about Chakra: https://linktr.ee/ChakraChain