You may have heard a lot about Vyper in the past few months, with Vitalik tweeting on its progress and t11s (alongside other major figures in the ecosystem, s/o banteg) being bullish. Vyper is a contract-oriented programming language for the Ethereum Virtual Machine that strives to provide superior auditability by making it easier for developers to produce secure and intelligible code. Examples of projects written in Vyper include Uniswap v1 and the first ETH 2.0 deposit contract.
Note: the predecessor of Vyper is Serpent, which was deprecated (Vitalik tweeted that he considers Serpent to be "outdated tech").
Vyper can be traced back to the fall of 2016, when Vitalik created it as a PoC with this first commit (back then it was known as Viper, but the name changed due to this connection). Over the years, Vyper was picked up by the community, with development led by its main contributors, such as iamdefinitelyahuman, jacqueswww, charles-cooper, fubuloubu, DavidKnott, and others. Some of these contributors have been active since Vyper's beginning, like jacqueswww, fubuloubu, and DavidKnott, whereas others, such as iamdefinitelyahuman and charles-cooper, joined the community in later years (2019).
In Fall of 2019, a preliminary security audit was conducted by the ConsenSys Diligence team, which helped identify areas of improvement for the project. A month later, in January 2020, the Ethereum Foundation (EF) published an R&D blog post referencing Vyper and the preliminary security audit, stating:
We encourage you to read the report, however, there are two main take-aways.
There are multiple serious bugs in the Vyper compiler.
The codebase has a high level of technical debt which will make addressing these issues complex.
This blog post cast serious doubt on Vyper with strong words in an attempt to promote Rust-Vyper (now Fe, which is in alpha) after the EF maintainer assigned to work on Vyper decided to start that project 6 months earlier.
At that time, the codebase was moved out of Ethereum's GitHub organization into its own organization: vyperlang. The post stated that they "were skeptical that the python codebase was likely to deliver on the idea that Vyper promised" and that "we were sufficiently far along with our Rust based Vyper compiler when the Python Vyper audit was released, and were confident in the direction."
Yet, Vyper's contributors and maintainers (s/o to fubuloubu, iamdefinitelyahuman, and charles-cooper) continued to work and were able to steer in the best direction to bring the best out of the project, reaching today's state (circling back to Vitalik's tweet). Vyper today is robust and production-ready, used widely in the Ethereum ecosystem and other EVM chains.
With this history lesson over, the article will now focus on introducing you to the Pythonic environment/context and diving into the technical details of Vyper and its compiler. That is the primary goal of this article. Even if you're already familiar with Python or Vyper, this post can be a good resource to use when exploring the codebase and hopefully contributing to the project.
To best understand how Vyper works and compiles code, it is crucial to have a grasp on how Python manages code. We'll dive into the internals of CPython, Python's most popular implementation. By doing so, we'll see the similarities with Vyper at a deeper level.
Let's begin by stating some well-known facts. CPython is a Python interpreter written in C. It's one of several Python implementations, alongside PyPy, Jython, IronPython, and many others. CPython is distinguished in that it is the original, the most maintained, and the most popular one.
CPython implements Python, but what is Python? One may simply answer: Python is a programming language. The answer becomes much more nuanced when the same question is put properly: what defines Python? Python, unlike languages like C, doesn't have a formal specification. The thing that comes closest to it is the Python Language Reference, which starts with the following words:
While I am trying to be as precise as possible, I chose to use English rather than formal specifications for everything except syntax and lexical analysis. This should make the document more understandable to the average reader, but will leave room for ambiguities. Consequently, if you were coming from Mars and tried to re-implement Python from this document alone, you might have to guess things and in fact you would probably end up implementing quite a different language. On the other hand, if you are using Python and wonder what the precise rules about a particular area of the language are, you should definitely be able to find them here.
Python is not defined by its language reference only and having a full picture gives a deeper understanding of the language. It's much easier to grasp a peculiarity of Python if you're aware of its implementation details.
Execution of a Python program roughly consists of three stages:
Initialization
Compilation
Interpretation
During the initialization stage, CPython initializes the data structures required to run Python. It also prepares such things as built-in types, configures and loads built-in modules, sets up the import system, and does many other things. This is a very important stage that is often overlooked by CPython's explorers because of its service nature.
Next comes the compilation stage. CPython is an interpreter, not a compiler, in the sense that it doesn't produce machine code. Interpreters, however, usually translate source code into some intermediate representation before executing it. So does CPython. This translation phase does the same things a typical compiler does: it parses the source code and builds an AST (Abstract Syntax Tree; keep this in mind), generates bytecode from the AST, and even performs some bytecode optimizations.
Before looking at the next stage, we need to understand what bytecode is. Bytecode is a series of instructions. Each instruction consists of two bytes: one for an opcode and one for an argument. Consider an example:
def g(x):
    return x + 3
CPython translates the body of the function g() to the following sequence of bytes: [124, 0, 100, 1, 23, 0, 83, 0]. If we run the standard dis module to disassemble it, here's what we'll get:
$ python -m dis example1.py
...
2 0 LOAD_FAST 0 (x)
2 LOAD_CONST 1 (3)
4 BINARY_ADD
6 RETURN_VALUE
The LOAD_FAST opcode corresponds to the byte 124 and has the argument 0. The LOAD_CONST opcode corresponds to the byte 100 and has the argument 1. The BINARY_ADD and RETURN_VALUE instructions are always encoded as (23, 0) and (83, 0), respectively, since they don't need an argument.
At the heart of CPython is a virtual machine that executes bytecode. By looking at the previous example, you might guess how it works. CPython's VM is stack-based. It means that it executes instructions using the stack to store and retrieve data. The LOAD_FAST instruction pushes a local variable onto the stack. LOAD_CONST pushes a constant. BINARY_ADD pops two objects from the stack, adds them up, and pushes the result back. Finally, RETURN_VALUE pops whatever is on the stack and returns the result to its caller. The bytecode execution happens in a giant evaluation loop that runs while there are instructions to execute. It stops to yield a value or if an error occurs.
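To make the stack-based model concrete, here is a minimal sketch of such an evaluation loop. It is a toy written for this article, not CPython's real loop: the function name run_toy_bytecode and the (opname, argument) instruction format are assumptions made for illustration, and only the four opcodes from the example are supported.

```python
def run_toy_bytecode(instructions, local_vars, constants):
    """Execute a list of (opname, arg) pairs on a tiny stack machine."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])   # push a local variable
        elif opname == "LOAD_CONST":
            stack.append(constants[arg])    # push a constant
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)      # pop two objects, push their sum
        elif opname == "RETURN_VALUE":
            return stack.pop()              # hand the result back to the caller
        else:
            raise ValueError(f"unsupported instruction: {opname}")
    raise RuntimeError("bytecode ended without RETURN_VALUE")


# The four instructions from the disassembly of g(), executed with x = 4.
G_BODY = [("LOAD_FAST", 0), ("LOAD_CONST", 1), ("BINARY_ADD", 0), ("RETURN_VALUE", 0)]
result = run_toy_bytecode(G_BODY, local_vars=[4], constants=[None, 3])
print(result)  # 7
```

The constants list mirrors the disassembly, where the value 3 sits at index 1. Keep in mind that opcode names and numbers change between CPython releases, so running dis on a recent interpreter may print different instructions than the ones shown above.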
In the previous section, we outlined the three stages of executing a Python program and how they interact with CPython's VM. To recap, CPython:
initializes itself;
compiles the source code to the module's code object; and
executes the bytecode of the code object.
With this architecture in mind, we can now head to Vyper's docs. It is important to have in mind Vyper's fundamental goals:
Security: "It should be possible and natural to build secure smart-contracts in Vyper."
Language and compiler simplicity: "The language and the compiler implementation should strive to be simple."
Auditability: "Vyper code should be maximally human-readable. Furthermore, it should be maximally difficult to write misleading code. Simplicity for the reader is more important than simplicity for the writer, and simplicity for readers with low prior experience with Vyper (and low prior experience with programming in general) is particularly important."
This allows for the following features:
Bounds and overflow checking: On the arithmetic and array level.
Support for signed integers and decimal fixed point numbers
Decidability: Ability to always compute precise upper bound on gas cost
Strong typing: For custom types and built-in.
Small and understandable compiler code
Limited support for pure functions: Anything marked with @constant is not allowed to change the state.
However, Vyper does not provide the following features in comparison to Solidity:
Modifiers (defining parts of functions elsewhere)
Class inheritance
Inline assembly
Function overloading
Operator overloading
Recursive calling
Infinite-length loops
Binary fixed point (decimal fixed point is used for its exactness)
Now here I recommend testing out Vyper and building something for fun (or taking a look at implementations). Some of the best resources to try it out and learn are:
Vyper by Example: Contracts for simple open auctions, blind auctions, safe remote purchases, crowdfunding, voting, or company stock.
Snekmate: Vyper smart contract building blocks
After writing some smart contracts in Vyper, you may wonder how it turns the code into bytecode for deployment, and this is where this article gets technical!
First, I recommend skimming over the project's READMEs to get a sense of the compiler flow and how all the different components interact. Here is the recommended path:
To visualize Vyper's control flow and compiler phases, we are going to see how this very short contract gets compiled:
foo: public(String[100])

@external
def __init__():
    self.foo = "Hello World"
We start with vyper.compiler, the module that contains "the main user-facing functionality used to compile Vyper source code and generate various compiler outputs". Every time a user runs the vyper command, they are essentially interacting only with the vyper.compiler module, which in turn contains the structure used to pass the given input to the functions that execute each compiler phase. From the vyper.compiler README:
__init__.py: Contains the compile_codes function, which is the primary function used for compiling Vyper source code.
phases.py: Pure functions for executing each compiler phase, as well as the CompilerData object that fetches and stores compiler output for each phase.
output.py: Functions that convert compiler data into the final formats to be outputted to the user.
utils.py: Various utility functions related to compilation.
As a general overview, inside the vyper.compiler.__init__ file we can find the principal user-facing function for generating compiler output from any given Vyper source: the vyper.compiler.compile_codes function.
The @evm_wrapper decorator sets the target EVM version to use in vyper.evm.opcodes. Afterwards, a CompilerData object is created for each contract given the source code. As per the README, the @property methods trigger the different compiler phases.
After the source code is compiled, the compiler data is passed to vyper.compiler.output, which generates the requested output. For our testing contract, this is the bytecode generated by running vyper contract.vy:
0x3461010857600b6040527f48656c6c6f20576f726c6400000000000000000000000000000000000000000060605260408051806000556020820180516001555050506100b66100516000396100b66000f36003361161000c5761009e565b60003560e01c346100a45763c2985578811861009c57600436106100a4576020806040528060400160005480825260208201600082601f0160051c600481116100a457801561006e57905b80600101548160051b840152600101818118610057575b505050508051806020830101601f82600003163682375050601f19601f825160200101169050810190506040f35b505b60006000fd5b600080fda165767970657283000307000b005b600080fd
A CompilerData object is generated from vyper.compiler.__init__ with the given Vyper source and other optional arguments:
contract name,
interfaces,
source id (ID number used to identify this contract in the source map)
boolean values (for flexibility/testing):
turning off optimizations (default == false),
showing gas estimates for the ABI (Application Binary Interface) and Intermediate Representation (IR) output modes (default == false),
not adding metadata to bytecode (default == false).
The CompilerData object "acts as a wrapper over the pure compiler functions, triggering compilation phases as needed and providing the data for use when generating the final compiler outputs." For our testing contract, these are the CompilerData arguments:
Contract name: contract.vy
Source code:
foo: public(String[100])

@external
def __init__():
    self.foo = "Hello World"
Interfaces: {}
Source id: 0
No optimize: False
Storage layout override: None
Show gas estimates: False
No bytecode metadata: False
The first step in generating the CompilerData object is to generate a Vyper Abstract Syntax Tree (AST) using the generate_ast function from vyper.compiler.phases inside the compiler module. An AST is a tree data structure that serves as a high-level representation of source code. Here's an example of a piece of code in Python and a dump of the corresponding AST produced by the standard ast module, where each node of the tree denotes a construct occurring in the source code:
x = 123
f(x)
$ python -m ast example1.py
Module(
body=[
Assign(
targets=[
Name(id='x', ctx=Store())],
value=Constant(value=123)),
Expr(
value=Call(
func=Name(id='f', ctx=Load()),
args=[
Name(id='x', ctx=Load())],
keywords=[]))],
type_ignores=[])
An important clarification from Wikipedia (read more there):
The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then statement may be denoted by means of a single node with three branches.
The AST representation is handy to work with because it tells what a source code does, hiding all non-essential information such as indentation, punctuation, and other syntactic features.
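A quick way to see what "abstract" means in practice is to compare the trees of two snippets that differ only in spacing and redundant parentheses. This is a small sketch using Python's standard ast module; the helper name same_tree is made up for this example.

```python
import ast


def same_tree(source_a, source_b):
    """Return True if both snippets parse to structurally identical ASTs."""
    # ast.dump() leaves out line/column attributes by default,
    # so only the structure of the tree is compared.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


# Redundant parentheses and extra spaces do not show up in the tree.
print(same_tree("y = (x + 1)", "y   =   x+1"))        # True

# Parentheses that change the grouping do change the tree.
print(same_tree("y = (a + b) * c", "y = a + b * c"))  # False
```

In the second comparison the parentheses are not stored as nodes either; they only decide which operation ends up as the parent of the other.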
Good! With this in mind, the generate_ast function calls the parse_to_ast function from vyper.ast.utils, which parses a Vyper source string and generates basic Vyper AST nodes. parse_to_ast passes the input to the pre_parse function in vyper.ast.pre_parser, which re-formats the input Vyper source string into a Python source string before validating. Some of the re-formatting includes:
translating the "interface", "struct" and "event" keywords into the Python "class" keyword
validating the "@version" pragma against the current compiler version
preventing direct use of the Python "class" keyword and the semi-colon statement separator.
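The keyword translation and the two "prevent" rules can be sketched with Python's standard tokenize module. This is a simplified illustration and not Vyper's actual pre_parse: the name toy_pre_parse, the returned keyword map, and the exact checks are assumptions, and the version pragma validation is left out entirely.

```python
import io
import tokenize

# Vyper-only keywords that this sketch rewrites to Python's "class".
VYPER_CLASS_KEYWORDS = {"interface", "struct", "event"}


def toy_pre_parse(vyper_source):
    """Return Python-parseable source plus a map of the rewritten keywords."""
    rewritten = []
    original_keywords = {}
    readline = io.StringIO(vyper_source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and tok.string == "class":
            raise SyntaxError("the `class` keyword is not allowed")
        if tok.type == tokenize.OP and tok.string == ";":
            raise SyntaxError("semi-colon statements are not allowed")
        starts_line = tok.start[1] == 0
        if tok.type == tokenize.NAME and starts_line and tok.string in VYPER_CLASS_KEYWORDS:
            # Remember what the keyword was so a later pass could restore it.
            original_keywords[tok.start] = tok.string
            tok = tok._replace(string="class")
        rewritten.append(tok)
    return tokenize.untokenize(rewritten), original_keywords


python_source, keywords = toy_pre_parse("struct Point:\n    x: int\n")
print(python_source)
print(keywords)  # {(1, 0): 'struct'}
```

After this step the string can be handed to ast.parse, because struct Point: has become a regular Python class statement. Keeping the positions of the rewritten keywords is one way to let a later pass tell a struct from an event again.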
For our contract, the reformatted code is identical to the source code. This reformatted code is then passed into Python's parse function by the parse_to_ast function. Python's parser can be found written in C here: Parser/pegen/parse.c.
The symbols of Python's grammar are tokens and not individual characters. A token is represented by its type (such as NUMBER, NAME, or NEWLINE), its value, and its position in the source code. CPython distinguishes 63 types of tokens, all of which are listed in Grammar/Tokens. We can see what a tokenized program looks like using the standard tokenize module:
def x_plus(x):
    if x >= 0:
        return x
    return 0
$ python -m tokenize example2.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,3: NAME 'def'
1,4-1,10: NAME 'x_plus'
1,10-1,11: OP '('
1,11-1,12: NAME 'x'
1,12-1,13: OP ')'
1,13-1,14: OP ':'
1,14-1,15: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,6: NAME 'if'
2,7-2,8: NAME 'x'
2,9-2,11: OP '>='
2,12-2,13: NUMBER '0'
2,13-2,14: OP ':'
2,14-2,15: NEWLINE '\n'
3,0-3,8: INDENT '        '
3,8-3,14: NAME 'return'
3,15-3,16: NAME 'x'
3,16-3,17: NEWLINE '\n'
4,4-4,4: DEDENT ''
4,4-4,10: NAME 'return'
4,11-4,12: NUMBER '0'
4,12-4,13: NEWLINE '\n'
5,0-5,0: DEDENT ''
5,0-5,0: ENDMARKER ''
This is how a program looks to the parser. When the parser needs a token, it requests one from the tokenizer. The tokenizer reads one character at a time from the buffer and tries to match the seen prefix with some type of token. Here is the Python-parsed AST for the contract:
Module(
body=[
AnnAssign(
target=Name(id='foo', ctx=Store()),
annotation=Call(
func=Name(id='public', ctx=Load()),
args=[
Subscript(
value=Name(id='String', ctx=Load()),
slice=Constant(value=100),
ctx=Load())],
keywords=[]),
simple=1),
FunctionDef(
name='__init__',
args=arguments(
posonlyargs=[],
args=[],
kwonlyargs=[],
kw_defaults=[],
defaults=[]),
body=[
Assign(
targets=[
Attribute(
value=Name(id='self', ctx=Load()),
attr='foo',
ctx=Store())],
value=Constant(value='Hello World'))],
decorator_list=[
Name(id='external', ctx=Load())])],
type_ignores=[])
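Since the reformatted contract is also valid Python syntax, a dump like the one above can be reproduced with nothing but the standard ast module. The snippet below is a sketch; the CONTRACT constant is just our test contract pasted into a string, and the exact dump varies between Python versions (for example, ast.dump accepts indent only from Python 3.9 on).

```python
import ast

# The test contract from this article; it parses as plain Python.
CONTRACT = '''foo: public(String[100])

@external
def __init__():
    self.foo = "Hello World"
'''

tree = ast.parse(CONTRACT)
storage_var, constructor = tree.body

print(type(storage_var).__name__)        # AnnAssign
print(storage_var.annotation.func.id)    # public
print(constructor.name)                  # __init__
print(constructor.decorator_list[0].id)  # external
print(ast.dump(tree, indent=2))
```

Python only parses the text here; names such as public, String, and external mean nothing to it, which is exactly why Vyper needs its own node classes and analysis on top of this tree.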
With this Python AST generated from the reformatted source, the parse_to_ast function now annotates the AST to aid the later conversion to a Vyper AST. The annotation is performed by the annotate_python_ast function in vyper.ast.annotation, which contains the AnnotatingVisitor class. Here is our annotated AST:
Module(
body=[
AnnAssign(
target=Name(id='foo', ctx=Store()),
annotation=Call(
func=Name(id='public', ctx=Load()),
args=[
Subscript(
value=Name(id='String', ctx=Load()),
slice=Index(),
ctx=Load())],
keywords=[]),
simple=1),
FunctionDef(
name='__init__',
args=arguments(
posonlyargs=[],
args=[],
kwonlyargs=[],
kw_defaults=[],
defaults=[]),
body=[
Assign(
targets=[
Attribute(
value=Name(id='self', ctx=Load()),
attr='foo',
ctx=Store())],
value=Constant(value='Hello World'))],
decorator_list=[
Name(id='external', ctx=Load())])],
type_ignores=[])
Afterwards, the annotated Python AST structure is converted to a Vyper AST node using the get_node function in vyper.ast.nodes. From the AST README, the conversion between a Python Node and a Vyper Node follows these rules:
The type of Vyper node is determined from the ast_type field of the Python node.
Fields listed in __slots__ (allowed field names for the node) may be included and may have a value.
Fields listed in _translated_fields have their key modified prior to being added. This is used to normalize and handle discrepancies in how nodes are structured between different Python versions.
Fields listed in _only_empty_fields, if present within the Python AST, must be None or a SyntaxException is raised. This attribute is used to exclude syntax that is valid in Python but not in Vyper.
All other fields are ignored.
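The rules above can be sketched in a few lines of Python. Everything in this snippet is hypothetical and only mirrors the rules as listed: the classes ToyNode, ToyName, and ToyAssign, the get_toy_node helper, the targets to target_list rename, and the use of type_comment as an "only empty" field are assumptions for illustration and not Vyper's actual definitions. The sketch also derives the node type from the Python class name and raises Python's built-in SyntaxError rather than Vyper's SyntaxException.

```python
import ast


class ToyNode:
    """Hypothetical stand-in for a Vyper AST node (not Vyper's real class)."""

    __slots__ = ("ast_type",)
    _translated_fields = {}   # Python field name -> name used on the toy node
    _only_empty_fields = ()   # Python fields that must be None

    def __init__(self, **python_fields):
        # Allowed field names come from __slots__ along the class hierarchy.
        allowed = {name for cls in type(self).__mro__
                   for name in getattr(cls, "__slots__", ())}
        for key, value in python_fields.items():
            if key in self._only_empty_fields:
                if value is not None:
                    raise SyntaxError(f"{key} is not supported")
                continue
            key = self._translated_fields.get(key, key)
            if key in allowed:
                setattr(self, key, value)
            # Any other field is silently ignored.


class ToyName(ToyNode):
    __slots__ = ("id",)


class ToyAssign(ToyNode):
    __slots__ = ("target_list", "value")
    _translated_fields = {"targets": "target_list"}
    _only_empty_fields = ("type_comment",)


NODE_CLASSES = {"Name": ToyName, "Assign": ToyAssign}


def get_toy_node(python_node):
    ast_type = type(python_node).__name__  # pick the toy class by node type
    node_class = NODE_CLASSES[ast_type]
    return node_class(ast_type=ast_type, **dict(ast.iter_fields(python_node)))


assign = get_toy_node(ast.parse("x = 1").body[0])
name = get_toy_node(ast.parse("x").body[0].value)
print(assign.ast_type, name.id)  # Assign x
```

The child nodes are left as plain Python nodes here; a fuller version would convert them recursively. Note how ctx from the Python Name node never reaches ToyName, because it is not listed in its __slots__.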
The VyperNode class is the base class for all Vyper AST nodes, which contains the following attributes: __slots__, description (optional), _only_empty_fields (optional), terminus (optional), and _translated_fields (optional).
Now we can head back to the origin of the call to generate the Vyper AST from the source code, which is the generate_ast function in vyper.compiler.phases, located in the compiler module (the same file that contains the CompilerData class). This generate_ast function returns the vyper.ast.nodes.Module, which is the top-level Vyper AST node for our contract.
The next compiler phase is called AST folding, as literal Vyper AST nodes are evaluated and replaced with the resulting values. This phase is mainly run by folding.py, which is found in vyper.ast. From the README:
folding.py contains the fold function, a high-level method called to evaluate and replace literal nodes within the AST. Some examples of literal folding include:
arithmetic operations (3+2 becomes 5)
references to literal arrays (["foo", "bar"][1] becomes "bar")
builtin functions applied to literals (min(1,2) becomes 1)
The process of literal folding includes:
Foldable node classes are evaluated via their evaluate method, which attempts to create a new Constant from the content of the given node.
Replacement nodes are generated using the from_node class method within the new node class.
The modification of the tree is handled by Module.replace_in_tree, which locates the existing node and replaces it with a new one.
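The same idea can be tried out on a Python AST. The sketch below uses Python's ast.NodeTransformer as a stand-in for Vyper's evaluate / from_node / replace_in_tree machinery, so it illustrates the technique rather than Vyper's implementation. The names ToyFolder and fold_source are made up, only integer +, - and * are folded, and ast.unparse needs Python 3.9 or newer.

```python
import ast
import operator

# Operators this sketch knows how to fold.
_OPERATORS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ToyFolder(ast.NodeTransformer):
    """Replace integer arithmetic on literals with the computed literal."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first
        func = _OPERATORS.get(type(node.op))
        left, right = node.left, node.right
        if (func is not None
                and isinstance(left, ast.Constant) and isinstance(left.value, int)
                and isinstance(right, ast.Constant) and isinstance(right.value, int)):
            # "evaluate": build a new Constant from the literal operands;
            # returning it makes NodeTransformer swap it into the tree.
            folded = ast.Constant(value=func(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


def fold_source(source):
    tree = ToyFolder().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(fold_source("x = 3 + 2"))        # x = 5
print(fold_source("y = (1 + 2) * a"))  # y = 3 * a
```

Expressions that involve a name, such as 3 * a, are left alone, since their value is not known at compile time.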
From the CompilerData class, the folding is generated by the generate_folded_ast function with this process flow:
Validating literal nodes: Individually validate Vyper AST nodes (calls the validate method of each node to verify that literal nodes do not contain invalid values).
Folding: Performs literal folding operations on a Vyper AST.
Expanding annotated AST: Performs expansion / simplification operations on an annotated Vyper AST. This pass uses annotated type information to modify the AST, simplifying logic and expanding subtrees to reduce the complexity during codegen.
Here is the symbol table for our contract (a symbol table contains information about code blocks and the symbols used within them; more on this in the last section):
{'storage_layout': {'foo': {'type': 'String[100]', 'slot': 0}}, 'code_layout': {}}
The following step in the compiler process is to generate the GlobalContext object, which analyzes and organizes the nodes prior to Intermediate Representation (IR) generation (term explained in the next section). Within vyper.compiler.phases, the generate_global_context function (called inside the global_ctx function of the CompilerData class) generates a contextualized AST from the Vyper AST, given the previous top-level Vyper AST node (and an optional interface_codes dictionary which represents interfaces imported by contracts). This function calls the get_global_context function from vyper.codegen.global_context.
Before inspecting the GlobalContext object, let's first review what IR nodes are. An intermediate representation (IR) is any representation of a program "between" the source and target languages. An IR node is the IR state for the given Vyper source code.
Intermediate Representation is used for at least four reasons:
Because translation appears to inherently require analysis and synthesis. Word-for-word translation does not work.
To break the difficult problem of translation into two simpler, more manageable pieces.
To build retargetable compilers:
We can build new back ends for an existing front end (making the source language more portable across machines).
We can build a new front-end for an existing back end (so a new machine can quickly get a set of compilers for different source languages).
We only have to write 2n half-compilers instead of n(n-1) full compilers. (Though this might be a bit of an exaggeration in practice!)
To perform machine independent optimizations.
Now, what connects a given GlobalContext and the IR node generation is the _ir_output function in the CompilerData class, which fetches both deployment and runtime IR. This function calls the generate_ir_for_module function in vyper.codegen.module, which takes a GlobalContext object (contextualized node).
generate_ir_for_module first sorts all the functions in the GlobalContext object to ensure each function is placed after all of its callees. Otherwise, errors such as NameError may arise (a local or global reference is not found).
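Placing every function after its callees is a topological sort of the call graph. Below is a minimal sketch using Python's standard graphlib module (Python 3.9+). The call graph and its function names are invented for the example, and Vyper's own sorting code may well look different.

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each function maps to the functions it calls.
CALLS = {
    "withdraw": {"_check_owner", "_send"},
    "_send": {"_check_balance"},
    "_check_owner": set(),
    "_check_balance": set(),
}


def callees_first(calls):
    """Order functions so that each one comes after everything it calls."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so callees are emitted before their callers.
    return list(TopologicalSorter(calls).static_order())


order = callees_first(CALLS)
print(order)
```

A nice side effect of this approach is that a cycle in the graph makes static_order raise graphlib.CycleError, which lines up with Vyper not supporting recursive calls.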
Afterwards, the module generates an ABI signature for all of the functions. On a side note, Vyper also adds a signature to the Vyper bytecode to make it easier to identify on block explorers which Vyper compiler version was used for a certain contract. The signature can be calculated with: vyper (in hex) + major, minor, patch. For example, the Vyper 0.3.7 signature looks like this:
>>> sig = b"\xa1\x65vyper\x83".hex()
>>> version = "00" + "03" + "07"
>>> sig + version
'a165767970657283000307'
Note: thanks to @pcaversaccio for the reference.
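Going the other way, we can look for this signature inside compiled bytecode and read the version back. The helper below is a sketch: find_vyper_version is a made-up name, and it assumes the three version bytes directly follow the marker, as in the 0.3.7 example above.

```python
VYPER_MARKER = b"\xa1\x65vyper\x83"  # same prefix as in the snippet above


def find_vyper_version(bytecode_hex):
    """Return (major, minor, patch) if the Vyper signature is found, else None."""
    code = bytes.fromhex(bytecode_hex.removeprefix("0x"))
    start = code.rfind(VYPER_MARKER)
    if start == -1:
        return None
    version = code[start + len(VYPER_MARKER):start + len(VYPER_MARKER) + 3]
    if len(version) != 3:
        return None
    return tuple(version)  # each byte becomes one integer


# Tail of the bytecode printed earlier for our test contract.
TAIL = "5b600080fda165767970657283000307000b005b600080fd"
print(find_vyper_version(TAIL))  # (0, 3, 7)
```

Passing the full bytecode string (with or without the 0x prefix) works the same way, since the search starts from the end of the code.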
The dictionary that contains all the FunctionSignatures for our testing contract is:
{'__init__': <vyper.ast.signatures.function_signature.FunctionSignature object at 0x000001B0DBA6B340>, 'foo': <vyper.ast.signatures.function_signature.FunctionSignature object at 0x000001B0DBA6B520>}
For the __init__ contract signature object:
name: __init__
args: []
return_type: None
mutability: Nonpayable
internal: False
gas_estimate: None
nonreentrant_key: None
funct_ast_code: vyper.ast.nodes.FunctionDef
is_from_json: False
For the foo contract signature object:
name: foo
args: []
return_type: String[100]
mutability: View
internal: False
gas_estimate: None
nonreentrant_key: None
funct_ast_code: vyper.ast.nodes.FunctionDef
is_from_json: False
Note: it is important to differentiate between runtime and initcode (init). runtime bytecode is the code stored on-chain that describes a smart contract (including the Vyper signature), but it doesn't include the constructor logic or constructor parameters of a contract, as they are not relevant to the code that was used to actually create the contract. On the other hand, init is the code that generates the runtime bytecode; it includes the constructor logic and constructor parameters of a smart contract.
Afterward, generate_ir_for_module separates the runtime and init functions and passes them separately to the _runtime_ir function within the same file (this is done to allow the latter function to organize the runtime; more on this later). This _runtime_ir function runs the code generation for all runtime functions, callvalue/calldata checks and method selector routines. To do so, it separates all the input functions into different lists: internal functions, external functions, default functions, and functions exposed in the selector section: regular functions, payables, and nonpayables. Next up, the function runs generate_ir_for_function for each given function.
Then, the function does a series of checks to ensure validity. For instance, there is a special case for a contract with no functions (most likely a "pure data" contract with immutables). Another check verifies all the nonpayable functions when a contract has a nonpayable default function. The function then creates a runtime list filled with the runtime information of the contract ("checks that calldatasize is at least 4, otherwise calldataload will load zeros", in reference to the Ethereum Yellow Paper).
The _runtime_ir function finally returns the runtime and internal_functions_map variables back to the generate_ir_for_module function in vyper.codegen.module.
Here is the runtime return from the _runtime_ir function:
['seq', ['if', ['lt', 'calldatasize', 4], ['goto', 'fallback']], ['with', '_calldata_method_id', ['shr', 224, ['calldataload', 0]], ['seq', ['assert', ['iszero', 'callvalue']], [seq,
[if,
[eq, _calldata_method_id, 3264763256 <0xc2985578: foo()>],
[seq,
[assert, [ge, calldatasize, 4]],
[goto, external_foo____common],
[seq,
[label,
external_foo____common,
var_list,
[seq,
pass,
[seq,
# Line 1
/* String[100] */
[seq,
pass <fill return buffer external_foo___>,
seq,
[exit_to,
_sym_external_foo____cleanup,
64,
/* abi_encode (String[100]) */
[with,
dyn_ofst,
32,
[seq,
seq,
[mstore, [add, 64, 0], dyn_ofst],
[set,
dyn_ofst,
[add,
dyn_ofst,
[with,
dst,
[add, 64, dyn_ofst],
/* abi_encode String[100] */
[seq,
[with,
len,
[sload, 0],
[seq,
[mstore, dst, len],
[with,
dst,
[add, dst, 32],
/* copy up to 100 bytes from [add, 0, 1] to [add, dst, 32] */
[repeat,
copy_bytes_ix2,
0,
[div, [add, 31, len], 32],
4,
[mstore,
[add, dst, [mul, copy_bytes_ix2, 32]],
[sload, [add, [add, 0, 1], [mul, copy_bytes_ix2, 1]]]]]]]],
/* Zero pad */
[with,
len,
[mload, dst],
[with,
dst,
[add, [add, dst, 32], len],
/* mzero */ [calldatacopy, dst, calldatasize, [mod, [sub, 0, len], 32]]]],
[ceil32, [add, 32, [mload, dst]]]]]]],
dyn_ofst]]]],
pass]]],
[label,
external_foo____cleanup,
[var_list, ret_ofst, ret_len],
[seq, pass, [return, ret_ofst, ret_len]]]]]]]]], ['goto', 'fallback'], ['label', 'fallback', ['var_list'], /* Default function */ [revert, 0, 0]]]
The internal_functions return is empty as the contract doesn't have any internal functions.
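The first few nodes of the runtime IR above form the method selector. As a rough mental model, here is the same control flow written as plain Python. It is a sketch of what the IR expresses and not code from Vyper: dispatch and calldataload are names invented here, and the return values are just labels.

```python
FOO_METHOD_ID = 0xC2985578  # 3264763256, the selector of foo() in the IR


def calldataload(calldata, offset):
    """Read a 32-byte word, zero-padded past the end of the calldata."""
    word = calldata[offset:offset + 32].ljust(32, b"\x00")
    return int.from_bytes(word, "big")


def dispatch(calldata, callvalue=0):
    if len(calldata) < 4:                         # [if, [lt, calldatasize, 4], [goto, fallback]]
        return "fallback"
    method_id = calldataload(calldata, 0) >> 224  # [shr, 224, [calldataload, 0]]
    if callvalue != 0:                            # [assert, [iszero, callvalue]]
        raise RuntimeError("revert: no payable functions in this contract")
    if method_id == FOO_METHOD_ID:                # [eq, _calldata_method_id, 3264763256]
        return "external_foo"
    return "fallback"                             # the default function reverts


print(dispatch(bytes.fromhex("c2985578")))  # external_foo
print(dispatch(b""))                        # fallback
```

Shifting the first 32-byte word right by 224 bits keeps only its top 4 bytes, which is where the method ID lives.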
With the runtime and internal IR functions, the compiler then generates the IR for the init function using the same generate_ir_for_function function used in the _runtime_ir function. Here is the init IR for our contract:
[seq,
[goto, external___init______common],
[seq,
[label,
external___init______common,
var_list,
[seq,
[assert, [iszero, callvalue]],
pass,
[seq,
# Line 5
/* self.foo = "Hello World" */
[with,
src,
/* "Hello World" */
[seq,
[mstore, 64, 11],
[mstore,
[add, 64, 32],
32745724963520459128167607516703083632076522816298193357160756506792738947072],
64],
[with,
len,
[mload, src],
[seq,
[seq, [unique_symbol, sstore_4], [sstore, 0 <self.foo>, len]],
[with,
src,
[add, src, 32],
[seq,
[unique_symbol, sstore_5],
[sstore, [add, 0 <self.foo>, 1], [mload, src]]]]]]],
[seq,
seq <fill return buffer external___init_____>,
seq,
[exit_to, _sym_external___init______cleanup, return_pc]],
pass]]],
[label, external___init______cleanup, var_list, [seq, pass]]]]
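The two mstore operations in the init IR hold the length of the string and the string itself. The large integer looks opaque, but it should simply be "Hello World" left-aligned in a 32-byte word, which we can check with a few lines of Python (a small sketch; the variable names are my own):

```python
text = b"Hello World"
length = len(text)                    # the value stored by [mstore, 64, 11]
word = text.ljust(32, b"\x00")        # left-aligned, zero-padded to 32 bytes
as_int = int.from_bytes(word, "big")  # the same word read as a big-endian integer

IR_CONSTANT = 32745724963520459128167607516703083632076522816298193357160756506792738947072

print(length)                 # 11
print(word.hex())
print(as_int == IR_CONSTANT)  # True
```

The hex form of the word starts with 48656c6c6f20576f726c64, which is the same byte sequence that follows the 7f (PUSH32) opcode near the start of the compiled bytecode shown earlier.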
This IR is then appended to the deploy_code list, which also contains the runtime and internal functions IR nodes. The deploy_code list is then returned to the initial generate_ir_nodes function in vyper.compiler.phases. At this point, the compiler may run optimizations if they are enabled. For the sake of simplicity, we will not cover IR node optimization despite it being a super interesting area (I may write about it in the future ;).
As we saw with the IR node returns, the code now looks much more like the pure assembly code used in the EVM. The compiler only has to convert the IR nodes into assembly instructions, and then into EVM bytecode.
Below the generate_ir_nodes function in vyper.compiler.phases, we can find the generate_assembly function, which generates the assembly instructions from a given IR node. This function mainly uses the compile_to_assembly function from vyper.ir.compile_ir, which in turn uses the _compile_to_assembly function (located further down in the same file).
The latter function translates all of the content of the respective IR into the correct assembly instructions. For example, the continue keyword in Vyper is translated into the combination of the continuation code and the JUMP instruction. Translating and pushing all of the assembly instructions takes more than 450 lines, with different if/elif statements for each code value. If you are curious to know how a specific value is turned into assembly, search here.
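To get a feel for what such a translation does, here is a toy version that handles only integer literals and a handful of opcodes. It is a sketch under heavy simplifications, not the real _compile_to_assembly: toy_compile and the OPCODES table are made up, and labels, jumps, with-variables, and literals wider than one byte are all ignored.

```python
# IR names understood by this sketch and the opcode each one maps to.
OPCODES = {"add": "ADD", "mload": "MLOAD", "mstore": "MSTORE", "sstore": "SSTORE"}


def toy_compile(ir):
    """Translate a nested-list IR expression into a flat assembly list."""
    if isinstance(ir, int):
        if not 0 <= ir < 256:
            raise ValueError("this sketch only handles one-byte literals")
        return ["PUSH1", ir]
    name, *args = ir
    assembly = []
    # Arguments are compiled last-to-first so that the first argument
    # ends up on top of the stack when the opcode runs.
    for arg in reversed(args):
        assembly.extend(toy_compile(arg))
    assembly.append(OPCODES[name])
    return assembly


print(toy_compile(["mstore", 64, 11]))
# ['PUSH1', 11, 'PUSH1', 64, 'MSTORE']
print(toy_compile(["sstore", 0, ["mload", 64]]))
# ['PUSH1', 64, 'MLOAD', 'PUSH1', 0, 'SSTORE']
```

The first output matches the PUSH1 11, PUSH1 64, MSTORE sequence that shows up in the assembly of our contract. The real compiler does considerably more, such as folding [add, 64, 32] into 96 and reusing stack values with DUP and SWAP instead of pushing them again.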
The list of assembly instructions for our contract is the following:
['_sym_external___init______common', 'JUMP', '_sym_external___init______common', 'JUMPDEST', 'CALLVALUE', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', 'PUSH1', 11, 'PUSH1', 64, 'MSTORE', 'PUSH32', 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'PUSH1', 96, 'MSTORE', 'PUSH1',
64, 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 0, 'SSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'DUP1', 'MLOAD', 'PUSH1', 1, 'SSTORE', 'POP', 'POP', 'POP', '_sym_external___init______cleanup', 'JUMP', '_sym_external___init______cleanup', 'JUMPDEST', '_sym_subcode_size', '_sym_runtime_begin2', '_mem_deploy_start', 'CODECOPY', '_OFST', '_sym_subcode_size', 0, '_mem_deploy_start', 'RETURN', '_sym_runtime_begin2', 'BLANK', ['_DEPLOY_MEM_OFST_128', 'PUSH1', 3, 'CALLDATASIZE', 'GT', 'ISZERO', 'ISZERO', '_sym_join3', 'JUMPI', '_sym_fallback', 'JUMP', '_sym_join3', 'JUMPDEST', 'PUSH1', 0, 'CALLDATALOAD', 'PUSH1', 224, 'SHR', 'CALLVALUE', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', 'PUSH4', 194, 152, 85, 120, 'DUP2', 'XOR', 'ISZERO', 'ISZERO', '_sym_join4', 'JUMPI', 'PUSH1', 4, 'CALLDATASIZE', 'LT', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', '_sym_external_foo____common', 'JUMP', '_sym_external_foo____common', 'JUMPDEST', 'PUSH1', 32, 'DUP1', 'PUSH1', 64, 'MSTORE', 'DUP1', 'PUSH1', 64, 'ADD', 'PUSH1', 0, 'SLOAD', 'DUP1', 'DUP3', 'MSTORE',
'PUSH1', 32, 'DUP3', 'ADD', 'PUSH1', 0, 'DUP3', 'PUSH1', 31, 'ADD', 'PUSH1', 5, 'SHR', 'PUSH1', 4, 'DUP2', 'GT', '_sym_revert1', 'JUMPI', 'DUP1', 'ISZERO', '_sym_loop_exit7', 'JUMPI', 'SWAP1', '_sym_loop_start5', 'JUMPDEST', 'DUP1', 'PUSH1', 1, 'ADD', 'SLOAD', 'DUP2', 'PUSH1', 5, 'SHL', 'DUP5', 'ADD', 'MSTORE', '_sym_loop_continue6', 'JUMPDEST', 'PUSH1', 1, 'ADD', 'DUP2', 'DUP2', 'XOR', '_sym_loop_start5', 'JUMPI', '_sym_loop_exit7', 'JUMPDEST', 'POP', 'POP', 'POP', 'POP', 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 32,
'DUP4', 'ADD', 'ADD', 'PUSH1', 31, 'DUP3', 'PUSH1', 0, 'SUB', 'AND', 'CALLDATASIZE', 'DUP3', 'CALLDATACOPY', 'POP', 'POP', 'PUSH1', 31, 'NOT', 'PUSH1', 31, 'DUP3', 'MLOAD', 'PUSH1', 32, 'ADD', 'ADD', 'AND', 'SWAP1', 'POP', 'DUP2', 'ADD', 'SWAP1', 'POP', 'DUP1', 'SWAP1', 'POP', 'PUSH1', 64, '_sym_external_foo____cleanup', 'JUMP', '_sym_external_foo____cleanup', 'JUMPDEST', 'RETURN', '_sym_join4', 'JUMPDEST', 'POP', '_sym_fallback', 'JUMP', '_sym_fallback', 'JUMPDEST', 'PUSH1', 0, 'PUSH1', 0, 'REVERT']]
['CALLVALUE', '_sym_revert1', 'JUMPI', 'PUSH1', 11, 'PUSH1', 64, 'MSTORE', 'PUSH32', 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'PUSH1', 96, 'MSTORE', 'PUSH1', 64, 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 0, 'SSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'DUP1', 'MLOAD', 'PUSH1', 1, 'SSTORE', 'POP', 'POP', 'POP', '_sym_subcode_size', '_sym_runtime_begin2', '_mem_deploy_start', 'CODECOPY', '_OFST', '_sym_subcode_size', 0, '_mem_deploy_start', 'RETURN', '_sym_runtime_begin2', 'BLANK', ['_DEPLOY_MEM_OFST_128', 'PUSH1', 3, 'CALLDATASIZE', 'GT', '_sym_join3', 'JUMPI', '_sym_fallback', 'JUMP', '_sym_join3', 'JUMPDEST', 'PUSH1', 0, 'CALLDATALOAD', 'PUSH1', 224, 'SHR', 'CALLVALUE', '_sym_revert1', 'JUMPI', 'PUSH4', 194, 152, 85, 120, 'DUP2', 'XOR', '_sym_join4', 'JUMPI', 'PUSH1', 4, 'CALLDATASIZE', 'LT', '_sym_revert1', 'JUMPI', 'PUSH1', 32, 'DUP1', 'PUSH1', 64, 'MSTORE', 'DUP1', 'PUSH1', 64, 'ADD', 'PUSH1', 0, 'SLOAD', 'DUP1', 'DUP3', 'MSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'PUSH1', 0, 'DUP3', 'PUSH1', 31, 'ADD', 'PUSH1', 5, 'SHR', 'PUSH1', 4, 'DUP2', 'GT', '_sym_revert1', 'JUMPI', 'DUP1', 'ISZERO', '_sym_loop_exit7', 'JUMPI', 'SWAP1', '_sym_loop_start5', 'JUMPDEST', 'DUP1', 'PUSH1', 1, 'ADD', 'SLOAD', 'DUP2', 'PUSH1', 5, 'SHL', 'DUP5', 'ADD', 'MSTORE', 'PUSH1', 1, 'ADD', 'DUP2', 'DUP2', 'XOR', '_sym_loop_start5', 'JUMPI', '_sym_loop_exit7', 'JUMPDEST', 'POP', 'POP', 'POP', 'POP', 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 32, 'DUP4', 'ADD', 'ADD', 'PUSH1', 31, 'DUP3', 'PUSH1', 0, 'SUB', 'AND', 'CALLDATASIZE', 'DUP3', 'CALLDATACOPY', 'POP', 'POP', 'PUSH1', 31, 'NOT', 'PUSH1', 31, 'DUP3', 'MLOAD', 'PUSH1', 32, 'ADD', 'ADD', 'AND', 'SWAP1', 'POP', 'DUP2',
'ADD', 'SWAP1', 'POP', 'PUSH1', 64, 'RETURN', '_sym_join4', 'JUMPDEST', 'POP', '_sym_fallback', 'JUMPDEST', 'PUSH1', 0, 'PUSH1', 0, 'REVERT', '_sym_revert1', 'JUMPDEST', 'PUSH1', 0, 'DUP1', 'REVERT'], 'STOP', '_sym_revert1', 'JUMPDEST', 'PUSH1', 0, 'DUP1', 'REVERT']
With this, we are extremely close to having our EVM bytecode: the compiler just needs to turn the list of assembly instructions into bytecode using the generate_bytecode function in vyper.compiler.phases. This function uses the assembly_to_evm function from vyper.ir.compile_ir, which receives the following inputs:
assembly: list of assembly (ASM) instructions for the EVM,
pc_ofst: when constructing the source map, the amount to offset all pcs by (no effect until we add deploy code source map),
insert_vyper_signature: whether to append Vyper metadata to the output (should be true for runtime code); we referred to this earlier.
For Vyper to understand the context around certain assembly instructions, it uses a symbol map (table) before generating the EVM bytecode. As defined earlier, a symbol table contains information about code blocks (functions) and the symbols used within them. A symbol table entry contains the properties of a code block, including its name, its type and a dictionary that maps the names of variables used within the block to the flags indicating their scope and usage.
The Vyper compiler first makes a single pass (an iteration over the enumerated list of assembly instructions) to compile any runtime code and uses the result to calculate mem_ofst_size, which lets it use the smallest PUSH instructions that can support all memory symbols. For our contract, the runtime code starts at 0 and ends at 182, which gives a length of 182 and a memory offset size of 1 (with a maximum memory offset of 128).
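To make the "smallest PUSH" idea concrete, here is a minimal sketch of how a byte width can be derived from the largest value that has to be encoded. The helper name min_push_width is hypothetical and the formula is an assumption for illustration, not necessarily the exact calculation Vyper performs; it only reproduces the result quoted above (a width of 1 for a value of 182).

```python
def min_push_width(value: int) -> int:
    """Smallest number of bytes (at least 1) that can hold `value`.

    Hypothetical helper for illustration; not part of Vyper.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    # bit_length() is 0 for value == 0, so clamp the width to at least 1 byte
    return max(1, (value.bit_length() + 7) // 8)


runtime_code_length = 182  # figure quoted in the article
width = min_push_width(runtime_code_length)
print(width)  # 182 fits in one byte, so a PUSH1 is enough
```

Anything up to 255 fits in a single byte, while 256 would already need two bytes (a PUSH2).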
With the offsets in hand, the compiler iterates over the enumerated list of assembly instructions again to resolve symbolic locations to actual code locations. One example of this task is mapping each JUMPDEST label to a concrete position that JUMP or JUMPI can target. Now, our symbol map looks like this:
{'_sym_join3': 12, '_sym_loop_start5': 87, '_sym_loop_exit7': 110, '_sym_join4': 156, '_sym_fallback': 158, '_sym_revert1': 164, '_sym_code_end': 182, '_mem_deploy_start': None, '_mem_deploy_end': None}
It contains the offsets that tell the compiler where the code ends, where loops start, and more.
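The position-counting pass can be sketched with a toy resolver. This is a simplified illustration, not Vyper's assembly_to_evm: the function names are hypothetical, and it assumes that every symbol reference is later emitted as a PUSH2 followed by a 2-byte address (3 bytes in total), while opcodes and literal integers take 1 byte each. Fed the first runtime instructions of our contract (without the _DEPLOY_MEM_OFST_128 marker), it arrives at the same position for _sym_join3 as the symbol map above.

```python
SYMBOL_REF_SIZE = 3  # assumption: PUSH2 opcode + 2-byte address


def is_symbol(item) -> bool:
    return isinstance(item, str) and item.startswith("_sym_")


def resolve_symbols(assembly):
    """Return {symbol: byte position} for every `_sym_x` followed by JUMPDEST."""
    symbol_map = {}
    pc = 0
    for i, item in enumerate(assembly):
        next_item = assembly[i + 1] if i + 1 < len(assembly) else None
        if is_symbol(item) and next_item == "JUMPDEST":
            # A label definition takes no space itself; the JUMPDEST that
            # follows is counted on the next iteration.
            symbol_map[item] = pc
        elif is_symbol(item):
            pc += SYMBOL_REF_SIZE
        else:
            pc += 1  # an opcode or a single literal byte
    return symbol_map


# Excerpt from the start of the runtime assembly shown earlier
excerpt = [
    "PUSH1", 3, "CALLDATASIZE", "GT", "_sym_join3", "JUMPI",
    "_sym_fallback", "JUMP", "_sym_join3", "JUMPDEST",
    "PUSH1", 0, "CALLDATALOAD",
]
symbol_map = resolve_symbols(excerpt)
print(symbol_map)  # {'_sym_join3': 12}
```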
Now that all the symbols have been resolved, the compiler generates the bytecode using the symbol map by iterating once more over the enumerated list of assembly instructions. For our smart contract, we can see how this for loop adds the contract's self.foo value "Hello World" to the EVM bytecode:
...
b''
b'4'
b''
b'a'
b'a\x01'
b'4a\x01\x08'
b'4a\x01\x08W'
b'4a\x01\x08W`'
b'4a\x01\x08W`\x0b'
b'4a\x01\x08W`\x0b`'
b'4a\x01\x08W`\x0b`@'
b'4a\x01\x08W`\x0b`@R'
b'4a\x01\x08W`\x0b`@R\x7f'
b'4a\x01\x08W`\x0b`@R\x7fH'
b'4a\x01\x08W`\x0b`@R\x7fHe'
b'4a\x01\x08W`\x0b`@R\x7fHel'
b'4a\x01\x08W`\x0b`@R\x7fHell'
b'4a\x01\x08W`\x0b`@R\x7fHello'
b'4a\x01\x08W`\x0b`@R\x7fHello '
b'4a\x01\x08W`\x0b`@R\x7fHello W'
b'4a\x01\x08W`\x0b`@R\x7fHello Wo'
b'4a\x01\x08W`\x0b`@R\x7fHello Wor'
b'4a\x01\x08W`\x0b`@R\x7fHello Worl'
b'4a\x01\x08W`\x0b`@R\x7fHello World'
b'4a\x01\x08W`\x0b`@R\x7fHello World\x00'
...
Inside this loop, the function relies on helpers such as get_opcodes() to find the correct EVM opcode for each assembly instruction. Finally, it returns the EVM bytes and the line_number_map, which contains the contract's breakpoints.
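The byte-by-byte output above can be reproduced with a small emitter. This is only a sketch of the idea, not Vyper's implementation: the opcode table is a hand-written subset of the EVM opcodes, the emit function is hypothetical, and the deploy-code address of _sym_revert1 (0x0108) is simply read off the printed bytes rather than computed.

```python
# Hand-written subset of EVM opcodes needed for this excerpt
OPCODES = {
    "CALLVALUE": 0x34,
    "MSTORE": 0x52,
    "JUMPI": 0x57,
    "PUSH1": 0x60,
    "PUSH2": 0x61,
    "PUSH32": 0x7F,
}


def emit(assembly, symbol_map) -> bytes:
    """Turn a flat list of assembly items into bytes (toy version)."""
    out = bytearray()
    for item in assembly:
        if isinstance(item, int):
            out.append(item)  # literal byte following a PUSH
        elif item in symbol_map:
            # assumption: a symbol reference becomes PUSH2 + 2-byte address
            out.append(OPCODES["PUSH2"])
            out += symbol_map[item].to_bytes(2, "big")
        else:
            out.append(OPCODES[item])
    return bytes(out)


# Start of the deploy assembly, up to the end of the "Hello World" payload
deploy_excerpt = [
    "CALLVALUE", "_sym_revert1", "JUMPI",
    "PUSH1", 11, "PUSH1", 64, "MSTORE",
    "PUSH32", 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100,
]
emitted = emit(deploy_excerpt, {"_sym_revert1": 0x0108})
print(emitted)  # b'4a\x01\x08W`\x0b`@R\x7fHello World'
```

The printable characters in the output are just the ASCII rendering of the opcodes: 0x34 (CALLVALUE) shows up as "4", 0x61 (PUSH2) as "a", and 0x57 (JUMPI) as "W".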
The hex-encoded bytecode is then built from the bytes using:
f"0x{compiler_data.bytecode.hex()}"
returning our final EVM contract bytecode:
0x3461010857600b6040527f48656c6c6f20576f726c6400000000000000000000000000000000000000000060605260408051806000556020820180516001555050506100b66100516000396100b66000f36003361161000c5761009e565b60003560e01c346100a45763c2985578811861009c57600436106100a4576020806040528060400160005480825260208201600082601f0160051c600481116100a457801561006e57905b80600101548160051b840152600101818118610057575b505050508051806020830101601f82600003163682375050601f19601f825160200101169050810190506040f35b505b60006000fd5b600080fda165767970657283000307000b005b600080fd
Woah, that's it! We now have the bytecode to deploy our contract written in Vyper. There are some variations of the compilation process (other than the common one) that result in different types of output, which this article doesn't cover. For example, the user may want to return the runtime bytecode instead of the deployment bytecode, or turn gas optimizations or metadata on or off.
Some very specific processes, like IR optimization, were skipped for the sake of the article's length. I do recommend checking out Vyper's READMEs and trying to follow the compiler flow alongside this article; there are some very interesting tricks and techniques to optimize code and have faster compilation times.
If you are interested in contributing, check out some of the GitHub issues (there is a lot of cool stuff to do).
I hope this was an insightful read! Follow me (@cairoeth) on Twitter to get notified of future interesting posts and articles. This work is licensed under CC BY-SA. Thanks to @big_tech_sux, @_tserg, @fubuloubu and @pcaversaccio for their feedback and comments.