Delve into the risks behind the small matter of EVM-contract classification

In the field of smart contracts, the Ethereum Virtual Machine (EVM), together with its algorithms and data structures, is the first principle.

This article starts with why contracts need to be classified, looks at the malicious attacks each scenario may face, and finally presents a relatively secure algorithm for classifying contracts.

Although it is fairly technical, it can also be read casually as a look at the dark forest of games played between decentralized systems.

1. Why are contracts classified?

Because classification is so important that it can be described as the cornerstone of DApps such as exchanges, wallets, block explorers, and data analysis platforms.

A transaction counts as an ERC20 transfer because it meets the ERC20 criteria, which include at least:

  1. The status of the transaction is successful

  2. The To address is an ERC20-compliant contract

  3. The transfer function is called, which is identified by the first 4 bytes of the transaction's CallData being 0xa9059cbb

  4. After execution, a Transfer event is emitted from the To address
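The four criteria above can be sketched as a single check. This is a minimal sketch, not the author's implementation: `is_erc20_transfer` is a hypothetical helper, the `tx` and `receipt` dictionaries are assumed to follow the usual JSON-RPC field names, and `erc20_addresses` stands in for whatever classification result is available for criterion 2.

```python
# First 4 bytes of CallData for transfer(address,uint256), as given in the text.
TRANSFER_SELECTOR = "a9059cbb"
# topics[0] of the ERC20 event Transfer(address,address,uint256).
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def is_erc20_transfer(tx, receipt, erc20_addresses):
    """Return True only if all four criteria hold.

    tx / receipt: dicts shaped like JSON-RPC responses (an assumption).
    erc20_addresses: set of lowercase addresses already classified as ERC20.
    """
    to = (tx.get("to") or "").lower()
    # 1. The transaction succeeded.
    if receipt.get("status") != "0x1":
        return False
    # 2. The To address is a contract classified as ERC20.
    if to not in erc20_addresses:
        return False
    # 3. CallData starts with the transfer selector (8 hex characters = 4 bytes).
    if tx.get("input", "0x")[2:10].lower() != TRANSFER_SELECTOR:
        return False
    # 4. A Transfer event was emitted by that same address.
    for log in receipt.get("logs", []):
        topics = log.get("topics", [])
        if log.get("address", "").lower() == to and topics and topics[0].lower() == TRANSFER_TOPIC:
            return True
    return False
```

Criterion 2 is the one that depends on contract classification, which is why the rest of this article focuses on it.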

Wrong classification leads to misjudged transaction behavior

Since transaction behavior is the foundation, whether an address is classified accurately leads to completely different conclusions about its CallData. For DApps, on-chain communication depends heavily on monitoring transaction events, and an event with a given encoding is only credible if it is emitted by a contract that meets the standard.

Wrong classification leads to transactions going into black holes

If a user transfers tokens into a contract that has no function for transferring tokens out, the funds are locked and unrecoverable, just as if they had been burned.

Now that many projects have begun adding built-in wallet support, they inevitably manage wallets on behalf of users, so they need to classify newly deployed contracts from the chain in real time and determine whether they meet the asset standards.

Related article (part 1): CryptoPunk is the world's first decentralized NFT trading market

2. What are the risks of classification?

The chain is a place with no identity and no rule of law: you cannot stop a valid transaction, even a malicious one.

The counterparty could be a wolf posing as Grandma, behaving like a grandmother in most of the ways you would expect, but with the aim of robbing the house.

A contract may declare a standard without actually meeting it

The common classification method is to use the EIP-165 standard directly to query whether the address supports ERC20 and so on. This is efficient, but the contract is controlled by the other party, so the declaration can be forged.

The EIP-165 query is only the lowest-cost way, within the limited opcodes available on-chain, to keep funds from being transferred into a black hole.
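To make the declaration check concrete, here is a minimal sketch of how an EIP-165 query is encoded and how its answer is read. The helper names are hypothetical; the selector 0x01ffc9a7 for supportsInterface(bytes4) comes from EIP-165 and the interface id 0x80ac58cd from ERC-721. Actually sending the call (for example through eth_call) is left out.

```python
SUPPORTS_INTERFACE_SELECTOR = "01ffc9a7"  # supportsInterface(bytes4), per EIP-165
ERC721_INTERFACE_ID = "80ac58cd"          # interface id defined by ERC-721


def supports_interface_calldata(interface_id):
    """Build the CallData for supportsInterface(interface_id)."""
    iid = interface_id.lower()
    if iid.startswith("0x"):
        iid = iid[2:]
    if len(iid) != 8:
        raise ValueError("interface id must be exactly 4 bytes")
    bytes.fromhex(iid)  # raises ValueError if the id is not valid hex
    # A bytes4 argument is left-aligned and zero-padded to 32 bytes.
    return "0x" + SUPPORTS_INTERFACE_SELECTOR + iid + "00" * 28


def declares_support(return_data):
    """Interpret the call result: a 32-byte ABI-encoded bool."""
    raw = return_data[2:] if return_data.startswith("0x") else return_data
    return len(raw) == 64 and int(raw, 16) == 1
```

Note that `declares_support` only reports what the contract says about itself. A malicious contract can return true for any interface id, which is exactly the forgery risk described above.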

This is why, when we analyzed NFTs before, we specifically mentioned that the standard includes a safeTransferFrom method. The "safe" refers to checking that the receiving contract has declared that it can handle NFT transfers (in ERC-721 this declaration is made through the onERC721Received callback, in the same spirit as an EIP-165 query).

Related article (part 2.2): What is the NFT you bought?

Only by starting from the contract bytecode, performing static analysis at the code level, and working from the contract's expected behavior can classification become more accurate.

3. Design of contract classification scheme

Next, we will systematically analyze the overall scheme. Note that our purpose is to serve two core indicators: "precision" and "efficiency".

Knowing that the direction is right does not make the road across the ocean clear. The first stop in bytecode analysis is to get the code.

3.1. How to get the code?

For contracts that are already on-chain, there is getCode, an RPC method that retrieves the bytecode at a specified address. It is very fast for a single read because codeHash sits at the top level of the EVM account structure.

But this method fetches one address at a time. What if you want to improve accuracy and efficiency further?

If a transaction deploys a contract, how do you get the deployed code right after it executes, or even while it is still in the mempool?

If the transaction uses the contract factory pattern, does the code exist in the transaction's CallData?

My final approach is to sort transactions through a sieve-like pattern:

  1. For transactions that do not deploy a contract, use getCode directly on the addresses involved and classify them.

  2. For the latest transactions in the mempool, pick out those with an empty To address; their CallData is the creation code, including the constructor.

  3. For contract factory transactions, since a contract deployed by a contract may in turn call other contracts to perform the deployment, analyze the transaction's internal calls recursively and record every call of type CREATE or CREATE2.

When I built a demo implementation, I found that current RPC versions are fortunately quite capable. The hardest part of the whole process is step 3: how to recursively find calls of a specified type. The lowest-level way is to reconstruct the context from the opcodes, which gave me quite a scare!

Fortunately, current versions of geth provide the debug_traceTransaction method, which works through the context of each call at the opcode level and sorts out the core fields.

In the end, the raw bytecode can be obtained for multiple deployment modes (direct deployment, single deployment through a factory, and batch deployment through a factory).
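Step 3 of the sieve can be sketched as a recursive walk over a call trace. This is a sketch under assumptions: it expects a nested dictionary shaped like the output of geth's built-in call tracer (fields `type`, `from`, `to`, `output`, `calls`), where for a CREATE or CREATE2 frame `to` is taken to be the new contract address and `output` the deployed code. Check the field names against your own node before relying on them; `collect_creations` is a hypothetical helper name.

```python
def collect_creations(call, found=None):
    """Recursively collect every CREATE / CREATE2 frame in a call trace.

    call: one frame of a call-tracer style trace (field names are assumed).
    Returns a list of dicts, one per contract created anywhere in the tree.
    """
    if found is None:
        found = []
    if call.get("type") in ("CREATE", "CREATE2"):
        found.append({
            "kind": call["type"],
            "creator": call.get("from"),
            "address": call.get("to"),           # assumed: the new contract address
            "runtime_code": call.get("output"),  # assumed: the deployed bytecode
        })
    # A factory may call other contracts that perform the deployment,
    # so every child frame has to be visited as well.
    for child in call.get("calls", []):
        collect_creations(child, found)
    return found
```

A single factory transaction that deploys several contracts in a batch produces several entries, one per CREATE or CREATE2 frame, regardless of how deep in the call tree each one sits.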

3.2. How to classify from code

The simplest but unsafe way is to match strings directly against the code. Taking ERC20 as an example:

Each function name is followed by its function signature. As mentioned in the earlier analysis, a transaction finds its target function by matching the first 4 bytes of CallData.

So the signatures of these six functions must be stored in the contract bytecode.

Of course, this method finds all six very quickly, but the unsafe part is that if I write a Solidity contract with a separate variable that stores the value 0x18160ddd, the matcher will conclude that I have this function.
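The string-matching approach and its weakness fit in a few lines. This is a minimal sketch: the six functions are assumed to be the six required ERC-20 functions, since the original listing is not reproduced here, and `naive_is_erc20` is a hypothetical helper name.

```python
# Selectors of the six required ERC-20 functions (assumed to be the six meant above).
ERC20_SELECTORS = {
    "totalSupply()": "18160ddd",
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
    "transferFrom(address,address,uint256)": "23b872dd",
    "approve(address,uint256)": "095ea7b3",
    "allowance(address,address)": "dd62ed3e",
}


def naive_is_erc20(bytecode_hex):
    """Substring match: True if every selector appears anywhere in the code."""
    code = bytecode_hex.lower()
    if code.startswith("0x"):
        code = code[2:]
    return all(selector in code for selector in ERC20_SELECTORS.values())


# Not a working token at all: just a blob that embeds the six values as data.
# The naive matcher still accepts it.
FAKE_CODE = "0x" + "".join(ERC20_SELECTORS.values())
```

`naive_is_erc20(FAKE_CODE)` returns True even though the blob has no callable functions, which is the false positive the paragraph above describes.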

3.3. Accuracy improvement 1: Decompile

A more accurate way is to decompile into opcodes! Decompilation here means converting the obtained bytecode into opcodes. More advanced decompilation converts it into pseudo-code that is easier for people to read, which we will not use this time. The decompilation method is listed in the appendix at the end of this article.

Solidity (high-level language) -> bytecode -> opcode

We can clearly see one feature: function signatures are pushed by the PUSH4 opcode. So a further method is to extract the operand after every PUSH4 in the full text and match it against the function standard.
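The PUSH4 extraction can be sketched with a small hand-written disassembler. The author's implementation uses Go libraries; the Python below is only an illustration, with hypothetical helper names. The one fact it relies on is the EVM encoding: opcodes 0x60 to 0x7f are PUSH1 to PUSH32 and are followed by 1 to 32 bytes of immediate data, so PUSH4 is 0x63.

```python
PUSH1, PUSH32, PUSH4 = 0x60, 0x7F, 0x63


def disassemble(bytecode_hex):
    """Split bytecode into (pc, opcode, operand_bytes) tuples."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(raw)
    instructions, pc = [], 0
    while pc < len(code):
        op = code[pc]
        # PUSHn carries n bytes of immediate data that must be skipped,
        # otherwise data bytes would be misread as opcodes.
        size = op - PUSH1 + 1 if PUSH1 <= op <= PUSH32 else 0
        instructions.append((pc, op, code[pc + 1:pc + 1 + size]))
        pc += 1 + size
    return instructions


def push4_operands(bytecode_hex):
    """Return the set of 4-byte values pushed by PUSH4 instructions."""
    return {
        operand.hex()
        for _, op, operand in disassemble(bytecode_hex)
        if op == PUSH4 and len(operand) == 4
    }
```

Unlike a substring match, this ignores selector-like bytes that merely sit inside the data of another instruction, such as a PUSH32 constant.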

I also ran a simple performance experiment. I have to say that Go is very efficient: 10,000 decompilations take only 220 ms.

The next part is going to be a little bit difficult

3.4. Accuracy improvement 2: Find code blocks

The accuracy above is better but still not enough, because it searches the full text for PUSH4: we can still construct a variable of type bytes4, which will also produce a PUSH4 instruction.

While I was stuck, I thought of how some open source projects handle this. Ethereum ETL is a tool that reads on-chain data for analysis and parses ERC20 and ERC721 transfers into separate tables, so it must be able to classify contracts.

Analyzing it shows that it classifies based on code blocks and only processes the PUSH4 instructions inside the first block, basic_blocks[0].

So the question is: how do you accurately delimit a code block?

The concept of a code block comes from REVERT + JUMPDEST. The two opcodes must appear consecutively, because within the function selector's opcode range, if there are too many functions there will be paging logic, which also introduces a JUMPDEST instruction on its own.
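The block rule can be sketched as follows: collect PUSH4 operands only until the first REVERT that is immediately followed by JUMPDEST. This is a simplified illustration of the rule as described here, not a reproduction of the Ethereum ETL code, and the helper names are hypothetical.

```python
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def instructions(bytecode_hex):
    """Return (opcode, operand_bytes) pairs, skipping PUSH immediates."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(raw)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0  # PUSH1..PUSH32
        out.append((op, code[pc + 1:pc + 1 + size]))
        pc += 1 + size
    return out


def first_block_push4(bytecode_hex):
    """PUSH4 operands seen before the first REVERT + JUMPDEST pair."""
    ins = instructions(bytecode_hex)
    selectors = []
    for i, (op, operand) in enumerate(ins):
        if op == PUSH4 and len(operand) == 4:
            selectors.append(operand.hex())
        # A lone JUMPDEST (paging logic) does not end the block;
        # only REVERT directly followed by JUMPDEST does.
        if op == REVERT and i + 1 < len(ins) and ins[i + 1][0] == JUMPDEST:
            break
    return selectors
```

A bytes4 constant pushed later in a function body falls outside the first block and is no longer counted as a selector.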

3.5. Accuracy improvement 3: Find the function selector

The function selector reads the first 4 bytes of the transaction's CallData and matches them against the function signatures preset in the contract code, so that execution jumps to the position in the code where the function is stored.

Let's try a minimal simulated execution.

This part is the selector for two functions, store(uint256) and retrieve(), whose signatures can be computed as 6057361d and 2e64cec1 respectively.

60003560e01c80632e64cec11461003b5780636057361d1461005957

After decompiling, you get an opcode sequence that can be divided into two parts.

Part I:

In the compiled output, only the function selector part of the contract loads the CallData in this way, which means extracting the function call signature from the CallData, as annotated in the figure.

We can see the effect by simulating how the EVM stack changes.

Part Two:

The process of determining whether the value matches that of the selector

1. PUSH4 pushes the 4-byte function signature of retrieve() (0x2e64cec1) onto the stack.

2. The EQ opcode pops two values from the stack, here the pushed 0x2e64cec1 and the 0x6057361d taken from the CallData, and checks whether they are equal.

3. PUSH2 pushes 2 bytes of data (0x003b in this case, 59 in decimal) onto the stack. This value is a program counter position; the program counter specifies the position of the next instruction to execute in the bytecode. We use 59 because that is the starting position of the retrieve() bytecode.

4. JUMPI means "if ..., then jump to ...". It pops 2 values from the stack as input, and if the condition is true, the program counter is updated to 59.

This is how the EVM determines the location of the function bytecode it needs to execute based on the function call in the contract.

In effect, this is just a set of simple "if statements", one for each function in the contract, together with where each one jumps.
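The walk-through above can be reproduced with a tiny interpreter that supports only the opcodes appearing in the selector bytecode. This is a minimal sketch for illustration, not a full EVM: `dispatch` is a hypothetical helper that returns the jump destination chosen for a given CallData, or None if no selector matches.

```python
SELECTOR_CODE = "60003560e01c80632e64cec11461003b5780636057361d1461005957"


def dispatch(code_hex, calldata_hex):
    """Run the selector bytecode and return the taken jump target, if any."""
    code = bytes.fromhex(code_hex)
    calldata = bytes.fromhex(calldata_hex)
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:                # PUSH1..PUSH32
            size = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1:pc + 1 + size], "big"))
            pc += 1 + size
            continue
        if op == 0x35:                        # CALLDATALOAD: 32 bytes from offset
            offset = stack.pop()
            word = calldata[offset:offset + 32].ljust(32, b"\x00")
            stack.append(int.from_bytes(word, "big"))
        elif op == 0x1C:                      # SHR: top of stack is the shift
            shift, value = stack.pop(), stack.pop()
            stack.append(value >> shift if shift < 256 else 0)
        elif op == 0x80:                      # DUP1
            stack.append(stack[-1])
        elif op == 0x14:                      # EQ
            stack.append(1 if stack.pop() == stack.pop() else 0)
        elif op == 0x57:                      # JUMPI: destination, then condition
            destination, condition = stack.pop(), stack.pop()
            if condition:
                return destination
        else:
            raise ValueError("opcode 0x%02x is not part of this sketch" % op)
        pc += 1
    return None                               # no selector matched
```

Calling `dispatch(SELECTOR_CODE, "2e64cec1")` returns 59 (0x3b), the position of retrieve(), while CallData starting with 6057361d is routed to 89 (0x59), the position of store(uint256).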

4. Program summary

The overall outline is as follows

  1. For each contract address, the post-deployment bytecode can be obtained through the getCode RPC or debug_traceTransaction, and the opcodes are obtained by decompiling it with the vm and asm libraries in Go

  2. From the way contracts operate on the EVM, the following characteristics hold

    1. The two consecutive opcodes REVERT+JUMPDEST mark the boundary between code blocks

    2. The contract must have a function selector, and it must be in the first code block

    3. In the function selector, function signatures are pushed with the PUSH4 opcode

    4. The selector contains the consecutive opcodes PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1, whose core job is to load the CallData and shift it; no other syntax in a contract function generates this sequence

  3. The corresponding function signatures are defined in the EIPs, with explicit descriptions of which are required and which are optional
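Put together, the characteristics above suggest a classifier along the following lines. This is a sketch under assumptions rather than the author's Go implementation: it locates the selector by its opcode prologue instead of assuming block index 0, it stops at the first REVERT + JUMPDEST after that, and it treats the six required ERC-20 functions as the standard to match. All helper names are hypothetical.

```python
ERC20_REQUIRED = {"18160ddd", "70a08231", "a9059cbb", "23b872dd", "095ea7b3", "dd62ed3e"}
# PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1
PROLOGUE = [(0x60, b"\x00"), (0x35, b""), (0x60, b"\xe0"), (0x1C, b""), (0x80, b"")]
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def instructions(bytecode_hex):
    """Return (opcode, operand_bytes) pairs, skipping PUSH immediates."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(raw)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0  # PUSH1..PUSH32
        out.append((op, code[pc + 1:pc + 1 + size]))
        pc += 1 + size
    return out


def dispatcher_selectors(bytecode_hex):
    """Selectors pushed between the prologue and the next REVERT + JUMPDEST."""
    ins = instructions(bytecode_hex)
    n = len(PROLOGUE)
    start = next((i for i in range(len(ins) - n + 1) if ins[i:i + n] == PROLOGUE), None)
    if start is None:
        return set()  # no CallData load-and-shift sequence: no selector found
    found = set()
    for i in range(start + n, len(ins)):
        op, operand = ins[i]
        if op == PUSH4 and len(operand) == 4:
            found.add(operand.hex())
        if op == REVERT and i + 1 < len(ins) and ins[i + 1][0] == JUMPDEST:
            break
    return found


def is_erc20(bytecode_hex):
    """True if every required ERC-20 selector is dispatched by the selector."""
    return ERC20_REQUIRED <= dispatcher_selectors(bytecode_hex)
```

Optional functions from the EIP can be handled the same way by checking a second, non-mandatory set against the result of `dispatcher_selectors`.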

4.1. Proof of uniqueness

We can say that we have basically achieved an efficient and highly accurate contract analysis method. Of course, having been rigorous for this long, we may as well be more rigorous still. The scheme above distinguishes code blocks based on REVERT+JUMPDEST, combined with the unavoidable CallData load and shift, to make a unique judgment. Can a similar opcode sequence be produced from a Solidity contract?

I ran a controlled experiment. At the Solidity syntax level there are ways to access CallData, such as msg.sig, but after compilation the opcodes are implemented differently.

5. Summary

After this analysis, three days have passed.

Although this analysis is very detailed, contracts that use bytes4 to maliciously fake compliance with a standard may be a drop in the ocean among the contracts encountered day to day.

So the ROI of these three days of analysis is actually very low. But in the endless river of time, low-probability events will eventually happen.

Only with the principle of distrust can we go further in the world of web3.

Today a good friend asked me a very thought-provoking question: what is the endgame for an industry KOL? What is your business model?

I know that technical articles always get poor traffic, but traffic is not the goal. I hope to stay true to the original vision:

💡 Gain insight into the key changes in the industry's development from a technical perspective, share past experience and future opportunities, and provide differentiated help for builders in the new era.

Appendix

If you need the opcode samples and the Go decompilation sample code, reply "contract classification" to the author's public account.

Dependent libraries:

https://pkg.go.dev/github.com/ethereum/go-ethereum@v1.11.6/core/asm#Compiler.Feed

https://pkg.go.dev/github.com/ethereum/go-ethereum@v1.11.6/core/vm

Principle analysis:

https://whileydave.com/2023/01/04/disassembling-evm-bytecode-the-basics/

https://yanniss.github.io/elipmoc-oopsla22.pdf

https://yanniss.github.io/gigahorse-icse19.pdf

https://www.evm.codes/?fork=shanghai

https://learnblockchain.cn/article/4800

https://learnblockchain.cn/article/3647

Open source project source code reference:

https://github.com/blockchain-etl/ethereum-etl/blob/2da9d050f4ae4fa4e818bbfb22d5cfb5234b2e29/ethereumetl/service/eth_contract_service.py#L29-L43

https://github.com/ethereum/evmdasm

https://github.com/tintinweb/ethereum-dasm/blob/master/ethereum_dasm/evmdasm.py#L342
