In the field of smart contracts, the Ethereum Virtual Machine (EVM) and its algorithms and data structures are first principles.
This article starts from why contracts need to be classified, looks at the malicious attacks each scenario may face, and finally presents a relatively secure algorithm for contract classification analysis.
Although the technical content is fairly dense, it can also be read casually as a look at the dark forest of games played between decentralized systems.
Contract classification is so important that it can be described as the cornerstone of DApps such as exchanges, wallets, blockchain explorers, data analysis platforms, and so on!
A transaction counts as an ERC20 transfer because it meets the ERC20 criteria, at a minimum:
The status of the transaction is successful
The To address is an ERC20-compliant contract
The transfer function is called, which shows up as the first 4 bytes of the transaction's CallData being 0xa9059cbb
After execution, the To address emitted a Transfer event
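The four criteria above can be sketched as a single check. This is only a minimal illustration, not the article's implementation: the dictionary shapes are assumed to resemble eth_getTransactionByHash and eth_getTransactionReceipt results, the function name is hypothetical, and the to_is_erc20 flag stands in for the contract classification that the rest of this article is about.

```python
# Well-known constants, hard-coded because keccak256 is not in the
# Python standard library.
TRANSFER_SELECTOR = "a9059cbb"  # transfer(address,uint256)
TRANSFER_EVENT_TOPIC = (
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
)  # Transfer(address,address,uint256)


def strip_0x(value: str) -> str:
    """Lower-case a hex string and drop a leading 0x if present."""
    value = value.lower()
    return value[2:] if value.startswith("0x") else value


def is_erc20_transfer(tx: dict, receipt: dict, to_is_erc20: bool) -> bool:
    """Apply the four criteria to one transaction (hypothetical helper)."""
    to_addr = (tx.get("to") or "").lower()
    # 1. The transaction succeeded.
    succeeded = receipt["status"] in ("0x1", 1)
    # 3. The first 4 bytes (8 hex characters) of CallData select transfer().
    calls_transfer = strip_0x(tx["input"])[:8] == TRANSFER_SELECTOR
    # 4. The To address itself emitted a Transfer event.
    emitted_transfer = any(
        log["address"].lower() == to_addr
        and len(log["topics"]) > 0
        and log["topics"][0].lower() == TRANSFER_EVENT_TOPIC
        for log in receipt["logs"]
    )
    # 2. to_is_erc20 is the classification result for the To address.
    return succeeded and to_is_erc20 and calls_transfer and emitted_transfer
```

Note that criterion 2 is the only one that cannot be read off the transaction itself, which is why the classification step matters so much.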
If the classification is wrong, the trading behavior will be misjudged
With transaction behavior as the cornerstone, whether an address is classified accurately leads to completely different conclusions about its CallData. For DApps, on-chain communication depends heavily on monitoring transaction events, and the same event encoding is only credible if it is emitted by a contract that meets the standard.
Wrong classification leads to transactions going into black holes
If a user transfers a Token into a contract that has no preset function for transferring Tokens out, the funds are locked and uncontrollable, just as if they had been burned.
And now that many projects have begun adding built-in wallet support, they inevitably manage wallets for users, so they need to classify newly deployed contracts from the chain in real time and determine whether they meet the asset standards.
Paragraph 1: CryptoPunk is the world's first decentralized NFT trading market
The chain is a place with no identity and no rule of law, and you cannot stop a valid transaction, even a malicious one.
A contract can be the Wolf posing as Grandma: it does most of the things you would expect of Grandma, but its purpose is to rob the house.
A contract can declare a standard without actually meeting it
The common classification method is to use the EIP-165 standard directly and ask the address whether it supports a given token standard. This is efficient, of course, but the contract is controlled by the other party, so the declaration can always be forged.
The 165 query is only the lowest-cost way, within the limited opcodes available on-chain, to keep funds from being transferred into a black hole.
This is why our earlier NFT analysis specifically mentioned the safeTransferFrom family of methods in the standard: the "Safe" refers to using a 165-style declaration check to confirm that the receiving party has declared it can handle NFT transfers.
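To make the 165 query concrete, here is a minimal sketch of the CallData for supportsInterface(bytes4) and of how the reply is decoded. The helper names are hypothetical; the selector 0x01ffc9a7 and the ERC721 interface id 0x80ac58cd are the published values. The point is that the answer comes entirely from the queried contract, so a forged "true" costs the attacker nothing.

```python
SUPPORTS_INTERFACE_SELECTOR = "01ffc9a7"  # supportsInterface(bytes4)
ERC721_INTERFACE_ID = "80ac58cd"          # interface id published in ERC721


def build_supports_interface_call(interface_id: str) -> str:
    """Build eth_call CallData asking whether an interface is declared."""
    iid = interface_id.lower()
    if iid.startswith("0x"):
        iid = iid[2:]
    if len(iid) != 8:
        raise ValueError("interface id must be exactly 4 bytes")
    # A bytes4 argument is left-aligned and zero-padded to a 32-byte word.
    return "0x" + SUPPORTS_INTERFACE_SELECTOR + iid.ljust(64, "0")


def decode_bool(result_hex: str) -> bool:
    """Decode the 32-byte boolean word returned by the call."""
    body = result_hex[2:] if result_hex.startswith("0x") else result_hex
    return len(body) == 64 and int(body, 16) == 1


calldata = build_supports_interface_call(ERC721_INTERFACE_ID)
```

A contract that simply returns true for every interface id passes this check, which is the forged declaration described above.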
Paragraph 2.2: What is the NFT you bought?
Only by starting from the contract bytecode, doing static analysis at the source, and working from the contract's expected behavior can classification become more accurate.
Next, we will systematically analyze the overall scheme. Note that our goal is two core indicators: "precision" and "efficiency".
Even when the direction is right, the road to the far shore is not obvious. The first stop in bytecode analysis is getting the code.
Looking at contracts already on the chain, there is getCode (eth_getCode), an RPC method that retrieves the bytecode at a specified address. It is very fast on its own, because codeHash sits at the top level of the EVM account structure.
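As a small sketch of that call, the JSON-RPC request body for eth_getCode looks like the following. The helper names are hypothetical, and sending the request to a node (over HTTP or IPC) is deliberately left out.

```python
import json


def get_code_request(address: str, block: str = "latest", request_id: int = 1) -> str:
    """Build the JSON-RPC body for eth_getCode (sketch; not sent anywhere)."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "method": "eth_getCode",
            "params": [address, block],
            "id": request_id,
        }
    )


def has_code(result_hex: str) -> bool:
    """An address with no deployed code answers with the empty string '0x'."""
    return result_hex not in ("", "0x")
```

Each request covers exactly one address, which is the limitation discussed next.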
But this method fetches one address at a time. What if you want to push accuracy and efficiency further?
If the transaction deploys a contract, how do you get the deployed code right after it executes, or even while it is still in the mempool?
If the transaction uses the contract factory pattern, is the source code present in the transaction's CallData?
My final approach is to sort transactions in a sieve-like pattern:
1. For transactions that do not deploy a contract, use getCode directly on the addresses involved and classify them.
2. For transactions in the latest mempool, filter out those with an empty to address; their CallData is the source code together with the constructor.
3. For contract factory transactions, since a contract that deploys contracts may in turn call other contracts to perform the deployment, analyze the sub-calls of the transaction recursively and record every call of type CREATE or CREATE2.
When I built a demo, I was glad to find that current RPC versions are fairly capable. The hardest part of the whole process is step 3: how to recursively find calls of a specified type. The lowest-level way is to reconstruct the call context from the opcodes, which surprised me!
Fortunately, current versions of geth provide the debug_traceTransaction method, which solves this by reconstructing the context of each call from the opcodes and extracting the core fields.
Eventually, raw bytecode for multiple deployment modes (direct deployment, factory mode single deployment, factory mode batch deployment) can be obtained.
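The recursive search for CREATE and CREATE2 calls can be sketched as a walk over a nested call trace. This assumes the trace has the shape produced by geth's callTracer (frames with type, to, output, an optional error, and a nested calls list) and that for a creation frame the to field is the new address and output is the deployed code; treat those field meanings as assumptions to verify against your node. The function name is hypothetical.

```python
def collect_deployments(frame: dict) -> list:
    """Recursively collect successful CREATE/CREATE2 frames from a call trace."""
    found = []
    if frame.get("type") in ("CREATE", "CREATE2") and "error" not in frame:
        found.append(
            {
                "kind": frame["type"],
                "address": frame.get("to"),
                # Assumed: for creation frames, output holds the deployed code.
                "runtime_code": frame.get("output", "0x"),
            }
        )
    # Factories may call other contracts that deploy, so descend into every child.
    for child in frame.get("calls", []):
        found.extend(collect_deployments(child))
    return found


# Hand-written example trace: a factory that deploys once directly and
# once through another contract, plus one failed deployment.
example_trace = {
    "type": "CALL",
    "to": "0xfactory",
    "calls": [
        {"type": "CREATE2", "to": "0xchild1", "output": "0x6001"},
        {
            "type": "CALL",
            "to": "0xhelper",
            "calls": [{"type": "CREATE", "to": "0xchild2", "output": "0x6002"}],
        },
        {"type": "CREATE", "to": "0xfailed", "error": "out of gas"},
    ],
}
deployments = collect_deployments(example_trace)
```

The same walk covers direct deployment, single factory deployment, and batch factory deployment, because it does not care how deep the creation frame sits.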
The simplest but unsafe way is to match the code directly as a string. Take ERC20 as an example.
Each function has a function signature derived from its name and parameter types. As mentioned in the previous analysis, a transaction finds its target function by matching the first 4 bytes of the CallData.
So the signatures of the six required ERC20 functions must be stored in the contract bytecode.
This method finds all 6 very quickly, of course, but it is unsafe: if I write a Solidity contract with a separate variable that stores the value 0x18160ddd, the matcher will conclude that I have this function.
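A minimal sketch of the string-matching approach, and of how it gets fooled, is below. The six selectors are the published ERC20 values; the function name and the decoy bytecode are made up for illustration.

```python
# Selectors of the six required ERC20 functions.
ERC20_REQUIRED = {
    "18160ddd": "totalSupply()",
    "70a08231": "balanceOf(address)",
    "a9059cbb": "transfer(address,uint256)",
    "23b872dd": "transferFrom(address,address,uint256)",
    "095ea7b3": "approve(address,uint256)",
    "dd62ed3e": "allowance(address,address)",
}


def naive_is_erc20(bytecode_hex: str) -> bool:
    """Unsafe check: every selector merely appears somewhere in the hex."""
    code = bytecode_hex.lower()
    if code.startswith("0x"):
        code = code[2:]
    return all(selector in code for selector in ERC20_REQUIRED)


# Decoy: a single PUSH32 (0x7f) whose 32-byte operand is just the six
# selectors stored as data, padded with zeros. It has no functions at all.
decoy = "0x7f" + "".join(ERC20_REQUIRED) + "00" * 8
fooled = naive_is_erc20(decoy)  # True, although nothing is callable
```

The decoy passes because the matcher never asks whether the bytes are data or part of the function selector.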
A more accurate way is to decompile to opcodes! Decompilation here means converting the obtained bytecode into opcodes (strictly speaking, disassembly). More advanced decompilation converts it into pseudo-code that is easier for people to read, but we will not use that this time. The decompilation method is listed in the appendix at the end of this article.
Solidity (high-level language) -> bytecode -> opcode
One feature stands out clearly: function signatures are pushed by the PUSH4 opcode. So a better method is to extract the operand that follows every PUSH4 in the full code and match those against the functions in the standard.
I also ran a simple performance experiment, and I have to say Go is very efficient: 10,000 decompilations take only 220 ms.
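The PUSH4 extraction can be sketched with a tiny hand-rolled disassembler. This is an illustration in Python rather than the Go libraries mentioned in the appendix, and the function names are hypothetical. The only EVM facts it relies on are that PUSH1 to PUSH32 are opcodes 0x60 to 0x7f, each followed by 1 to 32 operand bytes, and that PUSH4 is 0x63.

```python
PUSH1, PUSH32, PUSH4 = 0x60, 0x7F, 0x63


def disassemble(bytecode_hex: str) -> list:
    """Return (offset, opcode, operand_bytes) for every instruction."""
    text = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(text)
    instructions = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        # PUSHn carries n operand bytes; every other opcode carries none.
        size = op - PUSH1 + 1 if PUSH1 <= op <= PUSH32 else 0
        instructions.append((pc, op, code[pc + 1 : pc + 1 + size]))
        pc += 1 + size
    return instructions


def push4_operands(bytecode_hex: str) -> list:
    """Collect the 4-byte operand of every PUSH4 in the whole bytecode."""
    return [
        operand.hex()
        for _, op, operand in disassemble(bytecode_hex)
        if op == PUSH4 and len(operand) == 4
    ]


selectors = push4_operands(
    "0x60003560e01c80632e64cec11461003b5780636057361d1461005957"
)
# selectors == ["2e64cec1", "6057361d"]
```

Because operands are skipped as data, a selector hidden inside a PUSH32 no longer matches, which is the improvement over plain string search.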
The next part is going to be a little bit difficult
The accuracy above is better but still not enough, because it is a full-text search for PUSH4: we can still declare a variable of type bytes4, which also produces a PUSH4 instruction.
When I was stuck, I thought of how some open source projects do it. Ethereum ETL is a tool that reads on-chain data for analysis and parses ERC20 and ERC721 transfers into separate tables, so it must be able to classify contracts.
On analysis, it turns out that it classifies by code blocks and only processes the PUSH4 instructions inside the first block, basic_blocks[0].
So the question is: how do you accurately identify a code block?
The concept of a code block comes from REVERT + JUMPDEST, two consecutive opcodes. It has to be the two in a row: within the opcode range of the function selector, if there are too many functions there is paging logic, and that also produces a JUMPDEST instruction on its own.
The function selector reads the first 4 bytes of the transaction's CallData and matches them against the function signatures preset in the contract code, so that execution can jump to the position in the code where the function is stored.
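Restricting the search to the first code block can be sketched as follows, using the REVERT + JUMPDEST boundary described above. This is a simplified illustration, not the Ethereum ETL code: the function names and the sample bytecode are made up, and real compiler output may need a more careful block definition.

```python
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def disassemble(bytecode_hex: str) -> list:
    """Return (opcode, operand_bytes) pairs; PUSH1..PUSH32 are 0x60..0x7f."""
    text = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(text)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1 : pc + 1 + size]))
        pc += 1 + size
    return out


def first_block_selectors(bytecode_hex: str) -> list:
    """PUSH4 operands that occur before the first REVERT + JUMPDEST pair."""
    instructions = disassemble(bytecode_hex)
    end = len(instructions)
    for i in range(len(instructions) - 1):
        if instructions[i][0] == REVERT and instructions[i + 1][0] == JUMPDEST:
            end = i + 1  # keep the REVERT, stop before the JUMPDEST
            break
    return [operand.hex() for op, operand in instructions[:end] if op == PUSH4]


# Made-up sample: a selector with one real entry (2e64cec1), then
# REVERT + JUMPDEST, then a function body holding a bytes4 constant
# (18160ddd) that a full-text PUSH4 search would wrongly report.
SAMPLE = (
    "60003560e01c"            # PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR
    "80632e64cec11461001657"  # DUP1; PUSH4 2e64cec1; EQ; PUSH2 0016; JUMPI
    "60006000fd"              # PUSH1 00; PUSH1 00; REVERT
    "5b6318160ddd00"          # JUMPDEST; PUSH4 18160ddd; STOP
)
dispatcher_selectors = first_block_selectors(SAMPLE)
```

Here the bytes4 constant in the function body is ignored, because it sits after the first block boundary.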
Let's try a minimal simulated execution.
This bytecode is the selector for two functions, store(uint256) and retrieve(), whose signatures work out to 6057361d and 2e64cec1 respectively:
60003560e01c80632e64cec11461003b5780636057361d1461005957
After decompiling, the resulting opcode sequence can be divided into two parts.
Part I:
In compiled output, only the function selector part of the contract reads the CallData in this way, that is, to extract the function call signature from the CallData (PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR).
We can see the effect by simulating the state changes inside the EVM.
Part II:
Determining whether the value matches a selector entry:
1. PUSH4 pushes the 4-byte function signature for retrieve() (0x2e64cec1) onto the stack.
2. The EQ opcode pops two values from the stack, here 0x2e64cec1 and 0x6057361d (the latter being the signature taken from the CallData in this example), and checks whether they are equal.
3. PUSH2 pushes 2 bytes of data (0x003b here, 59 in decimal) onto the stack. The EVM keeps a program counter, which specifies the position in the bytecode of the next instruction to execute. The value is 59 because that is where the retrieve() bytecode starts.
4. JUMPI stands for "if ..., then jump to ...". It pops 2 values from the stack as input, and if the condition is true, the program counter is updated to 59.
This is how the EVM determines the location of the function bytecode it needs to execute based on the function call in the contract.
In reality, this is just a set of simple "if statements" for each function in the contract and where they jump.
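The walk-through above can be reproduced with a toy interpreter that supports only the opcodes appearing in the sample selector (PUSHn, CALLDATALOAD, SHR, DUP1, EQ, JUMPI). It is a sketch for illustration, not a full EVM, and the function name is hypothetical; it returns the jump destination that the selector picks for a given CallData.

```python
SELECTOR_CODE = "60003560e01c80632e64cec11461003b5780636057361d1461005957"


def dispatch(bytecode_hex: str, calldata_hex: str):
    """Run the selector and return the chosen jump destination, or None."""
    code = bytes.fromhex(bytecode_hex)
    calldata = bytes.fromhex(calldata_hex)
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: push the operand bytes
            size = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1 : pc + 1 + size], "big"))
            pc += 1 + size
            continue
        if op == 0x35:  # CALLDATALOAD: read a 32-byte word, zero-padded
            offset = stack.pop()
            word = calldata[offset : offset + 32].ljust(32, b"\x00")
            stack.append(int.from_bytes(word, "big"))
        elif op == 0x1C:  # SHR: top of stack is the shift amount
            shift, value = stack.pop(), stack.pop()
            stack.append(value >> shift if shift < 256 else 0)
        elif op == 0x80:  # DUP1: duplicate the top of the stack
            stack.append(stack[-1])
        elif op == 0x14:  # EQ: 1 if the two popped values are equal
            stack.append(int(stack.pop() == stack.pop()))
        elif op == 0x57:  # JUMPI: pop destination, then condition
            destination, condition = stack.pop(), stack.pop()
            if condition:
                return destination
        else:
            raise ValueError(f"opcode {op:#04x} is outside this sketch")
        pc += 1
    return None  # no entry matched; real code would fall through to a revert


retrieve_pc = dispatch(SELECTOR_CODE, "2e64cec1")           # 59 (0x3b)
store_pc = dispatch(SELECTOR_CODE, "6057361d" + "00" * 32)  # 89 (0x59)
```

Shifting the 32-byte word right by 0xe0 (224) bits leaves exactly the top 4 bytes, which is why the comparison against each PUSH4 operand works.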
The overall outline is as follows
For each contract address, the deployed bytecode can be obtained through the RPC methods getCode or debug_traceTransaction, and the opcodes are then obtained by decompiling it with the vm and asm libraries in Go
By the way contracts operate on the EVM, they have the following characteristics
The two consecutive opcodes REVERT + JUMPDEST mark the boundary between code blocks
A contract must have a function selector, and the selector must be in the first code block
In the function selector, every function signature is pushed with the PUSH4 opcode
The selector contains the consecutive opcodes PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1, whose core job is to load the CallData and shift it; other syntax in a contract function does not produce this sequence
The corresponding function signatures are defined in the EIPs, which explicitly describe which functions are required and which are optional
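Putting the characteristics above together, a classifier can be sketched as: take the first code block, require the CallData-loading sequence, collect the PUSH4 operands that follow it, and compare them with the required functions of the standard. This is a minimal sketch under those stated rules, with hypothetical names and ERC20 as the example; it is not the article's Go implementation.

```python
ERC20_REQUIRED = {"18160ddd", "70a08231", "a9059cbb", "23b872dd", "095ea7b3", "dd62ed3e"}
# PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1 as (opcode, operand hex) pairs.
PROLOGUE = [(0x60, "00"), (0x35, ""), (0x60, "e0"), (0x1C, ""), (0x80, "")]
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def disassemble(bytecode_hex: str) -> list:
    """Return (opcode, operand_hex) pairs; PUSH1..PUSH32 are 0x60..0x7f."""
    text = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(text)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1 : pc + 1 + size].hex()))
        pc += 1 + size
    return out


def dispatcher_selectors(bytecode_hex: str) -> set:
    """PUSH4 operands after the prologue, within the first code block only."""
    block = disassemble(bytecode_hex)
    for i in range(len(block) - 1):
        if block[i][0] == REVERT and block[i + 1][0] == JUMPDEST:
            block = block[: i + 1]
            break
    for i in range(len(block) - len(PROLOGUE) + 1):
        if block[i : i + len(PROLOGUE)] == PROLOGUE:
            return {operand for op, operand in block[i:] if op == PUSH4}
    return set()  # no selector prologue found in the first block


def implements_erc20(bytecode_hex: str) -> bool:
    """True when every required ERC20 selector is a selector entry."""
    return ERC20_REQUIRED <= dispatcher_selectors(bytecode_hex)
```

Optional functions of a standard can be handled the same way by checking them separately from the required set, so that their absence does not fail the classification.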
We can say that we have basically achieved an efficient, high-accuracy contract analysis method. Of course, having been rigorous for this long, we might as well be more rigorous still. In the scheme above, we distinguish code blocks by REVERT + JUMPDEST, combined with the unavoidable CallData loading and shifting, to make a unique judgment. Can similar opcode sequences be produced from a Solidity contract?
I ran a controlled experiment at the Solidity syntax level: although there are ways to access CallData such as msg.sig, the opcodes they compile to are implemented differently.
After this analysis, three days have passed.
Although this is very detailed, a contract you meet day to day that uses bytes4 values to maliciously fake compliance with a standard may be a drop in the ocean.
So the ROI of these 3 days of analysis is actually very low. But in the endless river of time, small-probability events will eventually happen.
Only with the principle of distrust can we go further in the world of web3.
Today a good friend asked me a very thought-provoking question: what is the endgame for an industry KOL? What is your business model?
I know technical articles always get poor traffic, but traffic is not the goal. I hope to always stay true to the original vision:
💡 From a technical perspective, gain insight into the key changes in the industry's development, share past experience and future opportunities, and provide differentiated help for builders in the new era.
If you need the opcode samples and the Go decompilation sample code, reply "contract classification" in the public account's backend.
Dependent libraries:
https://pkg.go.dev/github.com/ethereum/go-ethereum@v1.11.6/core/asm#Compiler.Feed
https://pkg.go.dev/github.com/ethereum/go-ethereum@v1.11.6/core/vm
Principle analysis:
https://whileydave.com/2023/01/04/disassembling-evm-bytecode-the-basics/
https://yanniss.github.io/elipmoc-oopsla22.pdf
https://yanniss.github.io/gigahorse-icse19.pdf
https://www.evm.codes/?fork=shanghai
https://learnblockchain.cn/article/4800
https://learnblockchain.cn/article/3647
Open source project source code reference:
https://github.com/blockchain-etl/ethereum-etl/blob/2da9d050f4ae4fa4e818bbfb22d5cfb5234b2e29/ethereumetl/service/eth_contract_service.py#L29-L43