Vyper: the definitive 0 to 1 guide 🐍

You may have heard a lot about Vyper in the past few months, with Vitalik tweeting about its progress and t11s (alongside other major figures in the ecosystem, s/o banteg) being bullish. Vyper is a contract-oriented programming language for the Ethereum Virtual Machine that strives to provide superior auditability by making it easier for developers to produce secure and intelligible code. Examples of projects written in Vyper include Uniswap v1 and the first ETH 2.0 deposit contract.

Note: the predecessor of Vyper is Serpent, which was deprecated (Vitalik tweeted that he considers Serpent to be "outdated tech").

Vyper can be traced back to the fall of 2016, when Vitalik created it as a PoC with this first commit. (It was known as Viper at the time, but the name was changed due to this connection.) Over the years, Vyper was picked up by the community, with development led by its main contributors, such as iamdefinitelyahuman, jacqueswww, charles-cooper, fubuloubu, DavidKnott, and others. Some of these contributors, like jacqueswww, fubuloubu, and DavidKnott, have been active since Vyper's beginning, whereas others, such as iamdefinitelyahuman and charles-cooper, joined the community later (2019).

In the fall of 2019, a preliminary security audit was conducted by the ConsenSys Diligence team, which helped identify areas of improvement for the project. A month later, in January 2020, the Ethereum Foundation (EF) published an R&D blog post referencing Vyper and the preliminary security audit, stating:

We encourage you to read the report, however, there are two main take-aways.

  1. There are multiple serious bugs in the Vyper compiler.

  2. The codebase has a high level of technical debt which will make addressing these issues complex.

This blog post cast serious doubt on Vyper with strong words in an attempt to promote Rust-Vyper (now Fe, which is in alpha), a project that the EF maintainer assigned to work on Vyper had decided to start 6 months earlier.

At that time, the codebase was moved out of Ethereum's Github organization into its own organization: vyperlang. The post stated that they "were skeptical that the python codebase was likely to deliver on the idea that Vyper promised" and that "we were sufficiently far along with our Rust based Vyper compiler when the Python Vyper audit was released, and were confident in the direction."

Yet, Vyper's contributors and maintainers (s/o to fubuloubu, iamdefinitelyahuman, and charles-cooper) kept working and steered the project in a direction that brought out its best, leading to today's state (circling back to Vitalik's tweet). Vyper today is robust and production-ready, used widely in the Ethereum ecosystem and on other EVM chains.

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

With this history lesson over, the article will now focus on introducing you to the Pythonic environment/context and diving into the technical details of Vyper and its compiler. That is the primary goal of this article. Even if you're already familiar with Python or Vyper, this post can be a good resource to use when exploring the codebase and hopefully contributing to the project.

I. Introduction to Python: Compiling source code into bytecode.

To best understand how Vyper works and compiles code, it is crucial to have a grasp on how Python manages code. We'll dive into the internals of CPython, Python's most popular implementation. By doing so, we'll see the similarities with Vyper at a deeper level.

Let's begin by stating some well-known facts. CPython is a Python interpreter written in C. It's one of the Python implementations, alongside PyPy, Jython, IronPython, and many others. CPython is distinguished in that it is the original, the most-maintained, and the most popular.

CPython implements Python, but what is Python? One may simply answer: Python is a programming language. The answer becomes much more nuanced when the same question is put properly: what defines Python? Python, unlike languages like C, doesn't have a formal specification. The thing that comes closest to it is the Python Language Reference, which starts with the following words:

While I am trying to be as precise as possible, I chose to use English rather than formal specifications for everything except syntax and lexical analysis. This should make the document more understandable to the average reader, but will leave room for ambiguities. Consequently, if you were coming from Mars and tried to re-implement Python from this document alone, you might have to guess things and in fact you would probably end up implementing quite a different language. On the other hand, if you are using Python and wonder what the precise rules about a particular area of the language are, you should definitely be able to find them here.

Python is not defined by its language reference only and having a full picture gives a deeper understanding of the language. It's much easier to grasp a peculiarity of Python if you're aware of its implementation details.

The big picture

Execution of a Python program roughly consists of three stages:

  1. Initialization

  2. Compilation

  3. Interpretation

During the initialization stage, CPython initializes the data structures required to run Python. It also prepares such things as built-in types, configures and loads built-in modules, sets up the import system, and does many other things. This is a very important stage that is often overlooked by CPython's explorers because of its service nature.

Next comes the compilation stage. CPython is an interpreter, not a compiler, in the sense that it doesn't produce machine code. Interpreters, however, usually translate source code into some intermediate representation before executing it. So does CPython. This translation phase does the same things a typical compiler does: it parses the source code and builds an AST (Abstract Syntax Tree; keep this in mind), generates bytecode from the AST, and even performs some bytecode optimizations.

Before looking at the next stage, we need to understand what bytecode is. Bytecode is a series of instructions. Each instruction consists of two bytes: one for an opcode and one for an argument. Consider an example:

def g(x):
    return x + 3

CPython translates the body of the function g() to the following sequence of bytes: [124, 0, 100, 1, 23, 0, 83, 0]. If we run the standard dis module to disassemble it, here's what we'll get:

$ python -m dis example1.py
...
2           0 LOAD_FAST            0 (x)
            2 LOAD_CONST           1 (3)
            4 BINARY_ADD
            6 RETURN_VALUE

The LOAD_FAST opcode corresponds to the byte 124 and has the argument 0. The LOAD_CONST opcode corresponds to the byte 100 and has the argument 1. The BINARY_ADD and RETURN_VALUE instructions are always encoded as (23, 0) and (83, 0), respectively, since they don't need an argument.
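Opcode names and byte values depend on the CPython version; the listing above appears to come from an older release (newer ones, for instance, no longer have a BINARY_ADD opcode). A minimal sketch for inspecting what your own interpreter produces, using only the standard dis module:

```python
import dis


def g(x):
    return x + 3


# Raw bytecode of g's body as a list of integers.
raw_bytes = list(g.__code__.co_code)
print(raw_bytes)

# Decoded view: offset, opcode name, numeric argument, readable argument.
for instruction in dis.get_instructions(g):
    print(instruction.offset, instruction.opname, instruction.arg, instruction.argrepr)
```

The numbers printed will only match the bytes quoted above if your CPython version uses the same opcode table.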

At the heart of CPython is a virtual machine that executes bytecode. By looking at the previous example, you might guess how it works. CPython's VM is stack-based. It means that it executes instructions using the stack to store and retrieve data. The LOAD_FAST instruction pushes a local variable onto the stack. LOAD_CONST pushes a constant. BINARY_ADD pops two objects from the stack, adds them up, and pushes the result back. Finally, RETURN_VALUE pops whatever is on the stack and returns the result to its caller. The bytecode execution happens in a giant evaluation loop that runs while there are instructions to execute. It stops to yield a value or if an error occurs.
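To make the stack discipline concrete, here is a toy evaluation loop for the four instructions above. It is a sketch, not CPython's implementation: the run function and the tuple-based instruction format are made up for illustration.

```python
def run(instructions, local_vars, constants):
    """Toy stack machine handling only the four opcodes from the example."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])   # push a local variable
        elif opname == "LOAD_CONST":
            stack.append(constants[arg])    # push a constant
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)      # pop two, push the sum
        elif opname == "RETURN_VALUE":
            return stack.pop()              # hand the top of the stack to the caller
        else:
            raise ValueError(f"unknown instruction: {opname}")
    raise RuntimeError("ran out of instructions without returning")


# The body of g(x) from the disassembly above; constant index 1 holds 3.
G_BODY = [("LOAD_FAST", 0), ("LOAD_CONST", 1), ("BINARY_ADD", 0), ("RETURN_VALUE", 0)]
result = run(G_BODY, local_vars=[4], constants=[None, 3])
print(result)  # 7
```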

II. Introduction to Vyper: Features and example smart contracts.

In the previous section, we outlined the three stages of executing a Python program and how they lead up to CPython's VM. To recap, CPython:

  1. initializes itself;

  2. compiles the source code to the module's code object; and

  3. executes the bytecode of the code object.

With this architecture in mind, we can now head to Vyper's docs. It is important to keep in mind Vyper's fundamental goals:

  • Security: "It should be possible and natural to build secure smart-contracts in Vyper."

  • Language and compiler simplicity: "The language and the compiler implementation should strive to be simple."

  • Auditability: "Vyper code should be maximally human-readable. Furthermore, it should be maximally difficult to write misleading code. Simplicity for the reader is more important than simplicity for the writer, and simplicity for readers with low prior experience with Vyper (and low prior experience with programming in general) is particularly important."

This allows for the following features:

  • Bounds and overflow checking: On the arithmetic and array level.

  • Support for signed integers and decimal fixed point numbers

  • Decidability: Ability to always compute precise upper bound on gas cost

  • Strong typing: For custom types and built-in.

  • Small and understandable compiler code

  • Limited support for pure functions: Anything marked with @view or @pure (@constant in older Vyper versions) is not allowed to change the state.

However, Vyper does not provide the following features in comparison to Solidity:

  1. Modifiers (defining parts of functions elsewhere)

  2. Class inheritance

  3. Inline assembly

  4. Function overloading

  5. Operator overloading

  6. Recursive calling

  7. Infinite-length loops

  8. Binary fixed point (decimal fixed point is used for its exactness)

Now here I recommend testing out Vyper and building something for fun (or taking a look at implementations). Some of the best resources to try it out and learn are:

  1. Guide to create a Pokemon game

  2. Try Vyper in a hosted Jupyter environment

  3. Vyper by Example: Contracts for simple open auctions, blind auctions, safe remote purchases, crowdfunding, voting, or company stock.

  4. Snekmate: Vyper smart contract building blocks

III. Diving into the Vyper compiler: AST, IR nodes, and more.

After writing some smart contracts in Vyper, you may wonder how it turns the code into bytecode for deployment, and this is where this article gets technical!

First, I recommend skimming over the project's READMEs to get a sense of the compiler flow and how all the different components interact. Here is the recommended path:

  1. 🐍 vyper.compiler 🐍

  2. 🐍 vyper.ast 🐍

  3. 🐍 vyper.ast.folding 🐍

  4. 🐍 GlobalContext 🐍

  5. 🐍 vyper.codegen.module 🐍

  6. 🐍 vyper.compile_ir 🐍

To visualize Vyper's control flow and compiler phases, we are going to see how this very short contract gets compiled:

foo: public(String[100])

@external
def __init__():
	self.foo = "Hello World"

We start with vyper.compiler, the module that contains "the main user-facing functionality used to compile Vyper source code and generate various compiler outputs". Every time a user runs the vyper command, they are essentially interacting only with the vyper.compiler module, which in turn contains the structure used to pass the given input to the functions that execute each compiler phase. From the vyper.compiler README:

  • __init__.py: Contains the compile_codes function, which is the primary function used for compiling Vyper source code.

  • phases.py: Pure functions for executing each compiler phase, as well as the CompilerData object that fetches and stores compiler output for each phase.

  • output.py: Functions that convert compiler data into the final formats to be outputted to the user.

  • utils.py: Various utility functions related to compilation.

As a general overview, the vyper.compiler.__init__ file contains the principal user-facing function for generating compiler output from any given Vyper source: vyper.compiler.compile_codes.

The @evm_wrapper decorator on compile_codes sets the target EVM version to use in vyper.evm.opcodes. Afterwards, a CompilerData object is created for each contract from its source code. As per the README, its @property methods trigger the different compiler phases.

After the source code is compiled, the compiler data is passed to vyper.compiler.output, which generates the requested output. For our testing contract, this is the bytecode generated by running vyper contract.vy:

0x3461010857600b6040527f48656c6c6f20576f726c6400000000000000000000000000000000000000000060605260408051806000556020820180516001555050506100b66100516000396100b66000f36003361161000c5761009e565b60003560e01c346100a45763c2985578811861009c57600436106100a4576020806040528060400160005480825260208201600082601f0160051c600481116100a457801561006e57905b80600101548160051b840152600101818118610057575b505050508051806020830101601f82600003163682375050601f19601f825160200101169050810190506040f35b505b60006000fd5b600080fda165767970657283000307000b005b600080fd

CompilerData: Parsing and converting source code to a Vyper Abstract Syntax Tree.

A CompilerData object is generated in vyper.compiler.__init__ from the given Vyper source and other optional arguments:

  • contract name,

  • interfaces,

  • source id (ID number used to identify this contract in the source map)

  • boolean values (for flexibility/testing):

    • turning off optimizations (default == false),

    • showing gas estimates for the ABI (Application Binary Interface) and Intermediate Representation (IR) output modes (default == false),

    • not adding metadata to bytecode (default == false).

The CompilerData "acts as a wrapper over the pure compiler functions, triggering compilation phases as needed and providing the data for use when generating the final compiler outputs." For our testing contract, these are the CompilerData arguments:

Contract name: contract.vy
Source code:
foo: public(String[100])

@external
def __init__():
        self.foo = "Hello World"
Interfaces: {}
Source id: 0
No optimize: False
Storage layout override: None
Show gas estimates: False
No bytecode metadata: False
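The idea of a wrapper whose properties trigger phases on demand can be sketched with the standard library alone. The class below is hypothetical (ToyCompilerData is not Vyper's CompilerData), and Python's own ast.parse and compile stand in for the Vyper phases:

```python
import ast
from functools import cached_property


class ToyCompilerData:
    """Hypothetical stand-in for CompilerData; not Vyper's actual class."""

    def __init__(self, source_code, contract_name="<toy>"):
        self.source_code = source_code
        self.contract_name = contract_name
        self.phases_run = []  # records which phases actually executed

    @cached_property
    def syntax_tree(self):
        self.phases_run.append("parse")
        return ast.parse(self.source_code)

    @cached_property
    def code_object(self):
        tree = self.syntax_tree  # triggers the parse phase if it has not run yet
        self.phases_run.append("codegen")
        return compile(tree, self.contract_name, "exec")


data = ToyCompilerData("x = 1 + 2")
print(data.phases_run)  # nothing has run yet
data.code_object        # asking for the last phase runs the earlier one first
data.code_object        # a second access reuses the cached result
print(data.phases_run)
```

Each phase runs at most once and only when some output needs it, which is the behavior the README describes for the real object.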

The first step in generating the CompilerData object is to generate a Vyper Abstract Syntax Tree (AST) using the generate_ast function from vyper.compiler.phases inside the compiler module. An AST is a tree data structure that serves as a high-level representation of source code. Here's an example of a piece of Python code and a dump of the corresponding AST produced by the standard ast module, where each node of the tree denotes a construct occurring in the source code:

x = 123
f(x)

$ python -m ast example1.py
Module(
   body=[
      Assign(
         targets=[
            Name(id='x', ctx=Store())],
         value=Constant(value=123)),
      Expr(
         value=Call(
            func=Name(id='f', ctx=Load()),
            args=[
               Name(id='x', ctx=Load())],
            keywords=[]))],
   type_ignores=[])

An important clarification from Wikipedia (read more there):

The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then statement may be denoted by means of a single node with three branches.

The AST representation is handy to work with because it tells what a source code does, hiding all non-essential information such as indentation, punctuation, and other syntactic features.
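A quick way to see the "abstract" part in practice, using only Python's standard ast module: redundant grouping parentheses leave no trace in the tree.

```python
import ast

# Same call, with and without redundant grouping parentheses.
with_parens = ast.dump(ast.parse("f((x))"))
without_parens = ast.dump(ast.parse("f(x)"))
print(with_parens == without_parens)

# The top-level statements of the earlier example, by node type.
module = ast.parse("x = 123\nf(x)")
statement_types = [type(node).__name__ for node in module.body]
print(statement_types)
```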

Good! With this in mind, the generate_ast function calls the parse_to_ast function from vyper.ast.utils, which parses a Vyper source string and generates basic Vyper AST nodes. parse_to_ast first passes the input to the pre_parse function in vyper.ast.pre_parser, which re-formats the Vyper source string into a Python source string before validating it. The re-formatting includes:

  • translating "interface", "struct" and "event" keywords into the Python "class" keyword

  • validating the "@version" pragma against the current compiler version

  • preventing direct use of the Python "class" keyword and the semi-colon statement separator.
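As a rough illustration of two of these steps, here is a simplified sketch. It is not Vyper's pre_parse: toy_pre_parse is a made-up name, the keyword rewrite uses a regular expression on line starts rather than token rewriting, and the version pragma check is left out.

```python
import ast
import io
import re
import tokenize

# Assumption for this sketch: the keywords only matter at the start of a line.
CLASS_LIKE = re.compile(r"^(interface|struct|event)(?=\s)", re.MULTILINE)


def toy_pre_parse(vyper_source):
    # Reject constructs that are valid Python but not allowed in the source.
    for tok in tokenize.generate_tokens(io.StringIO(vyper_source).readline):
        if tok.type == tokenize.NAME and tok.string == "class":
            raise SyntaxError("the `class` keyword is not allowed")
        if tok.type == tokenize.OP and tok.string == ";":
            raise SyntaxError("semi-colon statement separators are not allowed")
    # Rewrite Vyper-only keywords so that Python's parser accepts the source.
    return CLASS_LIKE.sub("class", vyper_source)


reformatted = toy_pre_parse("struct Point:\n    x: int128\n")
print(reformatted)
print(type(ast.parse(reformatted).body[0]).__name__)
```

Because the result is plain Python syntax, the standard parser can take over from here.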

For our contract, the reformatted code is identical to the source code. This reformatted code is then passed by parse_to_ast to Python's parse function. Python's parser is written in C and can be found here: Parser/pegen/parse.c.

The symbols of Python's grammar are tokens and not individual characters. A token is represented by its type (such as NUMBER, NAME, or NEWLINE), its value, and its position in the source code. CPython distinguishes 63 types of tokens, all of which are listed in Grammar/Tokens. We can see what a tokenized program looks like using the standard tokenize module:

def x_plus(x):
    if x >= 0:
        return x
    return 0

$ python -m tokenize example2.py
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,3:            NAME           'def'          
1,4-1,10:           NAME           'x_plus'       
1,10-1,11:          OP             '('            
1,11-1,12:          NAME           'x'            
1,12-1,13:          OP             ')'            
1,13-1,14:          OP             ':'            
1,14-1,15:          NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,6:            NAME           'if'           
2,7-2,8:            NAME           'x'            
2,9-2,11:           OP             '>='           
2,12-2,13:          NUMBER         '0'            
2,13-2,14:          OP             ':'            
2,14-2,15:          NEWLINE        '\n'           
3,0-3,8:            INDENT         '        '     
3,8-3,14:           NAME           'return'       
3,15-3,16:          NAME           'x'            
3,16-3,17:          NEWLINE        '\n'           
4,4-4,4:            DEDENT         ''             
4,4-4,10:           NAME           'return'       
4,11-4,12:          NUMBER         '0'            
4,12-4,13:          NEWLINE        '\n'           
5,0-5,0:            DEDENT         ''             
5,0-5,0:            ENDMARKER      ''     

This is how a program looks to the parser. When the parser needs a token, it requests one from the tokenizer. The tokenizer reads one character at a time from the buffer and tries to match the seen prefix with some type of token. Here is the Python-parsed AST for the contract:

Module(
    body=[
        AnnAssign(
            target=Name(id='foo', ctx=Store()),
            annotation=Call(
                func=Name(id='public', ctx=Load()),
                args=[
                    Subscript(
                        value=Name(id='String', ctx=Load()),
                        slice=Constant(value=100),
                        ctx=Load())],
                keywords=[]),
            simple=1),
        FunctionDef(
            name='__init__',
            args=arguments(
                posonlyargs=[],
                args=[],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]),
            body=[
                Assign(
                    targets=[
                        Attribute(
                            value=Name(id='self', ctx=Load()),
                            attr='foo',
                            ctx=Store())],
                    value=Constant(value='Hello World'))],
            decorator_list=[
                Name(id='external', ctx=Load())])],
    type_ignores=[])

With this generated Python AST from the reformatted source, the parse_to_ast function now annotates the AST to aid the conversion to the future Vyper AST. The annotation is processed by the annotate_python_ast function in vyper.ast.annotation, which contains the AnnotatingVisitor class. Here is our annotated AST:

Module(
    body=[
        AnnAssign(
            target=Name(id='foo', ctx=Store()),
            annotation=Call(
                func=Name(id='public', ctx=Load()),
                args=[
                    Subscript(
                        value=Name(id='String', ctx=Load()),
                        slice=Index(),
                        ctx=Load())],
                keywords=[]),
            simple=1),
        FunctionDef(
            name='__init__',
            args=arguments(
                posonlyargs=[],
                args=[],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]),
            body=[
                Assign(
                    targets=[
                        Attribute(
                            value=Name(id='self', ctx=Load()),
                            attr='foo',
                            ctx=Store())],
                    value=Constant(value='Hello World'))],
            decorator_list=[
                Name(id='external', ctx=Load())])],
    type_ignores=[])

Afterwards, the annotated Python AST structure is converted to a Vyper AST node using the get_node function in vyper.ast.nodes. From the AST README, the conversion between a Python Node and a Vyper Node follows these rules:

  • The type of Vyper node is determined from the ast_type field of the Python node.

  • Fields listed in __slots__ (allowed field names for the node) may be included and may have a value.

  • Fields listed in _translated_fields have their key modified prior to being added. This is used to normalize and handle discrepancies in how nodes are structured between different Python versions.

  • Fields listed in _only_empty_fields, if present within the Python AST, must be None or a SyntaxException is raised. This attribute is used to exclude syntax that is valid in Python but not in Vyper.

  • All other fields are ignored.
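These rules can be mimicked in a few lines. The sketch below is hypothetical throughout: the Toy* classes, their field choices, and this get_node are illustrations, the Python class name stands in for the ast_type field, and child nodes are not converted recursively.

```python
import ast


class ToyNode:
    __slots__ = ()
    _only_empty_fields = ()
    _translated_fields = {}

    def __init__(self, **fields):
        for key, value in fields.items():
            # Translated fields have their key renamed before being stored.
            key = self._translated_fields.get(key, key)
            if key in self.__slots__:
                setattr(self, key, value)
            elif key in self._only_empty_fields and value:
                raise SyntaxError(f"{key!r} is valid in Python but not allowed here")
            # Any other field is silently ignored.


class ToyName(ToyNode):
    __slots__ = ("name",)
    _translated_fields = {"id": "name"}  # made-up rename, for illustration


class ToyCall(ToyNode):
    __slots__ = ("func", "args")
    _only_empty_fields = ("keywords",)   # made-up restriction, for illustration


TOY_NODE_CLASSES = {"Name": ToyName, "Call": ToyCall}


def get_node(py_node):
    # The toy picks the node class from the Python node's class name.
    node_class = TOY_NODE_CLASSES[type(py_node).__name__]
    fields = {name: getattr(py_node, name, None) for name in py_node._fields}
    return node_class(**fields)


name_node = get_node(ast.parse("foo").body[0].value)
print(name_node.name)
call_node = get_node(ast.parse("f(x)").body[0].value)
print(len(call_node.args))
```

Passing a call with keyword arguments, such as f(x, y=1), raises SyntaxError in this toy because the keywords field is then non-empty.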

The VyperNode class is the base class for all Vyper AST nodes and contains the following attributes: __slots__, _description (optional), _only_empty_fields (optional), _terminus (optional), and _translated_fields (optional).

Now we can head back to the origin of the call to generate the Vyper AST from the source code, which is the generate_ast function in vyper.compiler.phases located in the compiler module (same file that contains the CompilerData class). This generate_ast function returns the vyper.ast.nodes.Module, which is the top-level Vyper AST node for our contract.

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

Folding: Literal Vyper AST nodes are evaluated and replaced with the resulting values.

The next compiler phase is called AST folding as literal Vyper AST nodes are evaluated and replaced with the resulting values. This phase is mainly run by folding.py which is found in vyper.ast. From the README:

folding.py contains the fold function, a high-level method called to evaluate and replace literal nodes within the AST. Some examples of literal folding include:

  • arithmetic operations (3+2 becomes 5)

  • references to literal arrays (["foo", "bar"][1] becomes "bar")

  • builtin functions applied to literals (min(1,2) becomes 1)

The process of literal folding includes:

  • Foldable node classes are evaluated via their evaluate method, which attempts to create a new Constant from the content of the given node.

  • Replacement nodes are generated using the from_node class method within the new node class.

  • The modification of the tree is handled by Module.replace_in_tree, which locates the existing node and replaces it with a new one.
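The same evaluate-and-replace idea can be shown on Python's own AST. This is a sketch of the technique, not Vyper's folding code: ToyFolder and toy_fold are made-up names, and only integer +, - and * are handled.

```python
import ast
import operator

FOLDABLE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ToyFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        fold_op = FOLDABLE_OPS.get(type(node.op))
        left, right = node.left, node.right
        if (
            fold_op is not None
            and isinstance(left, ast.Constant)
            and isinstance(right, ast.Constant)
            and type(left.value) is int
            and type(right.value) is int
        ):
            # Evaluate the literal expression and replace the node in the tree.
            folded = ast.Constant(value=fold_op(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


def toy_fold(source):
    tree = ToyFolder().visit(ast.parse(source))
    return ast.fix_missing_locations(tree)


folded_tree = toy_fold("x = 3 + 2 * 4")
print(ast.dump(folded_tree.body[0].value))
```

Expressions that involve names, such as a + 1, are left untouched because their value is not known at compile time.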

From the CompilerData class, the folding is generated by the generate_folded_ast function with this process flow:

  1. Validating literal nodes: Individually validate Vyper AST nodes (calls the validate method of each node to verify that literal nodes do not contain invalid values).

  2. Folding: Performs literal folding operations on a Vyper AST.

  3. Expanding annotated AST: Performs expansion / simplification operations on an annotated Vyper AST. This pass uses annotated type information to modify the AST, simplifying logic and expanding subtrees to reduce the complexity during codegen.

Here is the symbol table for our contract (a symbol table contains information about code blocks and the symbols used within them; more on this in the last section):

{'storage_layout': {'foo': {'type': 'String[100]', 'slot': 0}}, 'code_layout': {}}

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

GlobalContext: Object is generated from the Vyper AST, analyzing and organizing the nodes prior to intermediate representation (IR).

The next step in the compiler process is to generate the GlobalContext object, which analyzes and organizes the nodes prior to Intermediate Representation (IR) generation (a term explained in the next section). Within vyper.compiler.phases, the generate_global_context function (called from global_ctx in the CompilerData class) generates a contextualized AST from the Vyper AST, given the previous top-level Vyper AST node (and an optional interface_codes dictionary which represents interfaces imported by contracts). This function calls the get_global_context function from vyper.codegen.global_context.

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

Codegen: The contextualized nodes are converted into IR nodes.

Before inspecting the GlobalContext object, let's first review what IR nodes are. An intermediate representation (IR) is any representation of a program "between" the source and target languages. An IR node is the IR state for the given Vyper source code.

IR nodes for Vyper source code and EVM bytecode.

Intermediate Representation is used for at least four reasons:

  1. Because translation appears to inherently require analysis and synthesis. Word-for-word translation does not work.

  2. To break the difficult problem of translation into two simpler, more manageable pieces.

  3. To build retargetable compilers:

    • We can build new back ends for an existing front end (making the source language more portable across machines).

    • We can build a new front-end for an existing back end (so a new machine can quickly get a set of compilers for different source languages).

    • We only have to write 2n half-compilers instead of n(n−1) full compilers. (Though this might be a bit of an exaggeration in practice!)

  4. To perform machine independent optimizations.

Now, what connects a given GlobalContext and the IR node generation is the _ir_output function in the CompilerData class, which fetches both deployment and runtime IR. This function calls the generate_ir_for_module function in vyper.codegen.module which takes a GlobalContext object (contextualized node).

generate_ir_for_module first sorts all the functions in the GlobalContext object so that each function is placed after all of its callees. Otherwise, errors such as NameError (a local or global reference is not found) may arise.
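Placing every function after its callees is a topological ordering of the call graph. The sketch below shows the idea with a depth-first traversal; the function name and the dictionary-based call graph are assumptions for illustration, not Vyper's data structures. It relies on the graph being acyclic, which holds here because Vyper has no recursive calling.

```python
def order_after_callees(calls):
    """Return function names so that each one appears after all of its callees.

    `calls` maps a function name to the names it calls (hypothetical format).
    """
    ordered = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for callee in calls.get(name, ()):
            visit(callee)      # emit callees first
        ordered.append(name)   # then the function itself

    for name in calls:
        visit(name)
    return ordered


call_graph = {"a": ["b", "c"], "b": ["c"], "c": []}
print(order_after_callees(call_graph))  # ['c', 'b', 'a']
```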

Afterwards, the module generates an ABI signature for each of the functions. On a side note, Vyper also adds a signature to the compiled bytecode to make it easier to identify on block explorers which Vyper compiler version was used for a certain contract. The signature can be calculated as: vyper (in hex) + major, minor, patch. For example, the Vyper 0.3.7 signature looks like this:

>>> sig = b"\xa1\x65vyper\x83".hex()
>>> version = "00" + "03" + "07"
>>> sig + version
'a165767970657283000307'

Note: thanks to @pcaversaccio for the reference.
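The REPL session above generalizes into a small helper. This is a sketch built only from that snippet: the function name is made up, and it assumes each version component fits in a single byte.

```python
def vyper_signature(major, minor, patch):
    """Build the hex signature from the prefix shown above plus the version bytes."""
    prefix = b"\xa1\x65vyper\x83"
    # bytes([...]) raises ValueError if a component is outside 0..255.
    return (prefix + bytes([major, minor, patch])).hex()


signature = vyper_signature(0, 3, 7)
print(signature)  # a165767970657283000307
```

The resulting string appears near the end of the bytecode generated earlier for our testing contract.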

The dictionary that contains all the FunctionSignatures for our testing contract is:

{'__init__': <vyper.ast.signatures.function_signature.FunctionSignature object at 0x000001B0DBA6B340>, 'foo': <vyper.ast.signatures.function_signature.FunctionSignature object at 0x000001B0DBA6B520>}

For the __init__ contract signature object:

name: __init__
args: []
return_type: None
mutability: Nonpayable
internal: False
gas_estimate: None
nonreentrant_key: None
funct_ast_code: vyper.ast.nodes.FunctionDef
is_from_json: False

For the foo contract signature object:

name: foo
args: []
return_type: String[100]
mutability: View
internal: False
gas_estimate: None
nonreentrant_key: None
funct_ast_code: vyper.ast.nodes.FunctionDef
is_from_json: False

Note: it is important to differentiate between runtime bytecode and initcode (init). Runtime bytecode is the code stored on-chain that describes a smart contract (including the Vyper signature); it doesn't include the contract's constructor logic or constructor parameters, since those are only needed while the contract is being created. On the other hand, init is the code that generates the runtime bytecode: it includes the constructor logic and constructor parameters of a smart contract.

Afterward, generate_ir_for_module separates the runtime and init functions and passes the runtime functions to the _runtime_ir function within the same file (this is done to allow the latter function to organize the runtime; more on this later). This _runtime_ir function runs the code generation for all runtime functions, callvalue/calldata checks and method selector routines. To do so, it separates all the input functions into different lists: internal functions, external functions, default functions, and the functions exposed in the selector section (regular functions, payables, and nonpayables). Next up, the function runs generate_ir_for_function for each given function.

Then, the function does a series of checks to ensure validity. For instance, there is a special case for a contract with no functions (most likely a "pure data" contract with immutables). Another check verifies all the nonpayable functions when a contract has a nonpayable default function. The function then creates a runtime list filled with the runtime information of the contract ("checks that calldatasize is at least 4, otherwise calldataload will load zeros", referencing the Ethereum Yellow Paper).

The _runtime_ir finally returns the runtime and internal_functions_map variables back to the generate_ir_for_module function in vyper.codegen.module.

Here is the runtime return from the _runtime_ir function:

['seq', ['if', ['lt', 'calldatasize', 4], ['goto', 'fallback']], ['with', '_calldata_method_id', ['shr', 224, ['calldataload', 0]], ['seq', ['assert', ['iszero', 'callvalue']], [seq,
  [if,
    [eq, _calldata_method_id, 3264763256 <0xc2985578: foo()>],
    [seq,
      [assert, [ge, calldatasize, 4]],
      [goto, external_foo____common],
      [seq,
        [label,
          external_foo____common,
          var_list,
          [seq,
            pass,
            [seq,
              # Line 1
              /* String[100] */
              [seq,
                pass <fill return buffer external_foo___>,
                seq,
                [exit_to,
                  _sym_external_foo____cleanup,
                  64,
                  /* abi_encode (String[100]) */
                  [with,
                    dyn_ofst,
                    32,
                    [seq,
                      seq,
                      [mstore, [add, 64, 0], dyn_ofst],
                      [set,
                        dyn_ofst,
                        [add,
                          dyn_ofst,
                          [with,
                            dst,
                            [add, 64, dyn_ofst],
                            /* abi_encode String[100] */
                            [seq,
                              [with,
                                len,
                                [sload, 0],
                                [seq,
                                  [mstore, dst, len],
                                  [with,
                                    dst,
                                    [add, dst, 32],
                                    /* copy up to 100 bytes from [add, 0, 1] to [add, dst, 32] */
                                    [repeat,
                                      copy_bytes_ix2,
                                      0,
                                      [div, [add, 31, len], 32],
                                      4,
                                      [mstore,
                                        [add, dst, [mul, copy_bytes_ix2, 32]],
                                        [sload, [add, [add, 0, 1], [mul, copy_bytes_ix2, 1]]]]]]]],
                              /* Zero pad */
                              [with,
                                len,
                                [mload, dst],
                                [with,
                                  dst,
                                  [add, [add, dst, 32], len],
                                  /* mzero */ [calldatacopy, dst, calldatasize, [mod, [sub, 0, len], 32]]]],
                              [ceil32, [add, 32, [mload, dst]]]]]]],
                      dyn_ofst]]]],
              pass]]],
        [label,
          external_foo____cleanup,
          [var_list, ret_ofst, ret_len],
          [seq, pass, [return, ret_ofst, ret_len]]]]]]]]], ['goto', 'fallback'], ['label', 'fallback', ['var_list'], /* Default function */ [revert, 0, 0]]]
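The foo() branch of this IR is mostly ABI encoding: it writes an offset, the string length, and the zero-padded string data into the return buffer. Here is a small Python sketch of the resulting layout for a String value. The helper names are my own and this is not how the compiler does it internally; it just mirrors the shape of the output.

```python
def ceil32(n: int) -> int:
    """Round n up to the next multiple of 32, like ceil32 in the IR."""
    return (n + 31) // 32 * 32


def abi_encode_string(value: bytes) -> bytes:
    """Hypothetical helper: ABI layout of a single dynamic string return value."""
    offset = (32).to_bytes(32, "big")          # where the string starts (dyn_ofst)
    length = len(value).to_bytes(32, "big")    # the length word
    data = value.ljust(ceil32(len(value)), b"\x00")  # data, zero padded
    return offset + length + data


encoded = abi_encode_string(b"Hello World")
for i in range(0, len(encoded), 32):
    print(encoded[i:i + 32].hex())
```

For "Hello World" this gives three 32-byte words, which lines up with the return length in the IR: 32 for the offset plus ceil32 of 32 plus the string length.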

The internal_functions return is empty as the contract doesn’t have any internal functions.

With the runtime and internal IR functions, the compiler then generates the IR for the init function using the same generate_ir_for_function function used in the _runtime_ir function. Here is the init IR for our contract:

[seq,
  [goto, external___init______common],
  [seq,
    [label,
      external___init______common,    
      var_list,
      [seq,
        [assert, [iszero, callvalue]],
        pass,
        [seq,
          # Line 5
          /* self.foo = "Hello World" */
          [with,
            src,
            /* "Hello World" */
            [seq,
              [mstore, 64, 11],
              [mstore,
                [add, 64, 32],
                32745724963520459128167607516703083632076522816298193357160756506792738947072],
              64],
            [with,
              len,
              [mload, src],
              [seq,
                [seq, [unique_symbol, sstore_4], [sstore, 0 <self.foo>, len]],
                [with,
                  src,
                  [add, src, 32],
                  [seq,
                    [unique_symbol, sstore_5],
                    [sstore, [add, 0 <self.foo>, 1], [mload, src]]]]]]],
          [seq,
            seq <fill return buffer external___init_____>,
            seq,
            [exit_to, _sym_external___init______cleanup, return_pc]],
          pass]]],
    [label, external___init______cleanup, var_list, [seq, pass]]]]
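The big integer in the middle of this IR is just "Hello World" left-aligned in a 32-byte word. Here is a quick Python sketch to check this and to model what the two sstore calls write. The dictionary standing in for contract storage and the read_foo helper are purely illustrative.

```python
# Literal copied from the init IR above
IR_LITERAL = 32745724963520459128167607516703083632076522816298193357160756506792738947072

text = b"Hello World"

# Pad the string on the right with zeros to fill one 32-byte word
word = int.from_bytes(text.ljust(32, b"\x00"), "big")

# Toy model of storage after the constructor: slot 0 holds the length,
# slot 1 holds the first (and here only) word of string data
storage = {0: len(text), 1: word}


def read_foo(slots: dict) -> bytes:
    """Hypothetical helper: rebuild self.foo from the toy storage."""
    length = slots[0]
    return slots[1].to_bytes(32, "big")[:length]


print(word == IR_LITERAL)
print(read_foo(storage))
```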

This IR return is then appended to the deploy_code list, which also contains the runtime and internal functions IR nodes. The deploy_code list is then returned to the initial generate_ir_nodes function in vyper.compiler.phases. At this point, the compiler may run optimizations if they are enabled. For the sake of simplicity, we will not cover IR node optimization despite it being a super interesting area (I may write about it in the future ;).

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

Compile IR: The IR nodes are converted to assembly instructions, and the assembly is converted to EVM bytecode.

As we saw with the IR node returns, the code now looks much more like the pure assembly code used in the EVM. The compiler only has to convert the IR nodes into assembly instructions, and then into EVM bytecode.

Below the generate_ir_nodes function in vyper.compiler.phases, we can find the generate_assembly function that generates the assembly instructions from a given IR node. This function mainly uses the compile_to_assembly function from vyper.ir.compile_ir, which in turn uses the _compile_to_assembly function (located below it in the same file).

The latter function translates the content of each IR node into the corresponding assembly instructions. For example, the continue keyword in Vyper is translated into the combination of the continuation code and the JUMP instruction. The code that translates and pushes all of the assembly instructions spans more than 450 lines, with a separate if/elif branch for each code value. If you are curious to know how a specific value is turned into assembly, search here.
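The general idea of this translation can be shown with a toy version. The sketch below is not the real _compile_to_assembly (which handles many special cases such as with, repeat, and labels); it only lowers plain nested opcode calls by pushing the arguments in reverse order and then emitting the opcode. The names compile_ir and push are invented for this example.

```python
def push(value: int) -> list:
    """Emit the smallest PUSHn for a non-negative integer, one list item per byte."""
    size = max(1, (value.bit_length() + 7) // 8)
    data = value.to_bytes(size, "big")
    return [f"PUSH{size}", *data]


def compile_ir(node) -> list:
    """Toy lowering of nested [opcode, arg, ...] lists into stack assembly."""
    if isinstance(node, int):
        return push(node)
    opcode, *args = node
    assembly = []
    # The EVM is a stack machine: the last argument is pushed first so that
    # the first argument ends up on top of the stack when the opcode runs
    for arg in reversed(args):
        assembly += compile_ir(arg)
    assembly.append(opcode.upper())
    return assembly


print(compile_ir(["mstore", 64, 11]))
print(compile_ir(["sstore", 0, ["mload", 64]]))
```

The first call produces the same PUSH1, 11, PUSH1, 64, MSTORE sequence that shows up near the start of the assembly for our contract.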

The list of assembly instructions for our contract is the following:

['_sym_external___init______common', 'JUMP', '_sym_external___init______common', 'JUMPDEST', 'CALLVALUE', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', 'PUSH1', 11, 'PUSH1', 64, 'MSTORE', 'PUSH32', 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'PUSH1', 96, 'MSTORE', 'PUSH1', 
64, 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 0, 'SSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'DUP1', 'MLOAD', 'PUSH1', 1, 'SSTORE', 'POP', 'POP', 'POP', '_sym_external___init______cleanup', 'JUMP', '_sym_external___init______cleanup', 'JUMPDEST', '_sym_subcode_size', '_sym_runtime_begin2', '_mem_deploy_start', 'CODECOPY', '_OFST', '_sym_subcode_size', 0, '_mem_deploy_start', 'RETURN', '_sym_runtime_begin2', 'BLANK', ['_DEPLOY_MEM_OFST_128', 'PUSH1', 3, 'CALLDATASIZE', 'GT', 'ISZERO', 'ISZERO', '_sym_join3', 'JUMPI', '_sym_fallback', 'JUMP', '_sym_join3', 'JUMPDEST', 'PUSH1', 0, 'CALLDATALOAD', 'PUSH1', 224, 'SHR', 'CALLVALUE', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', 'PUSH4', 194, 152, 85, 120, 'DUP2', 'XOR', 'ISZERO', 'ISZERO', '_sym_join4', 'JUMPI', 'PUSH1', 4, 'CALLDATASIZE', 'LT', 'ISZERO', 'ISZERO', '_sym_revert1', 'JUMPI', '_sym_external_foo____common', 'JUMP', '_sym_external_foo____common', 'JUMPDEST', 'PUSH1', 32, 'DUP1', 'PUSH1', 64, 'MSTORE', 'DUP1', 'PUSH1', 64, 'ADD', 'PUSH1', 0, 'SLOAD', 'DUP1', 'DUP3', 'MSTORE', 
'PUSH1', 32, 'DUP3', 'ADD', 'PUSH1', 0, 'DUP3', 'PUSH1', 31, 'ADD', 'PUSH1', 5, 'SHR', 'PUSH1', 4, 'DUP2', 'GT', '_sym_revert1', 'JUMPI', 'DUP1', 'ISZERO', '_sym_loop_exit7', 'JUMPI', 'SWAP1', '_sym_loop_start5', 'JUMPDEST', 'DUP1', 'PUSH1', 1, 'ADD', 'SLOAD', 'DUP2', 'PUSH1', 5, 'SHL', 'DUP5', 'ADD', 'MSTORE', '_sym_loop_continue6', 'JUMPDEST', 'PUSH1', 1, 'ADD', 'DUP2', 'DUP2', 'XOR', '_sym_loop_start5', 'JUMPI', '_sym_loop_exit7', 'JUMPDEST', 'POP', 'POP', 'POP', 'POP', 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 32, 
'DUP4', 'ADD', 'ADD', 'PUSH1', 31, 'DUP3', 'PUSH1', 0, 'SUB', 'AND', 'CALLDATASIZE', 'DUP3', 'CALLDATACOPY', 'POP', 'POP', 'PUSH1', 31, 'NOT', 'PUSH1', 31, 'DUP3', 'MLOAD', 'PUSH1', 32, 'ADD', 'ADD', 'AND', 'SWAP1', 'POP', 'DUP2', 'ADD', 'SWAP1', 'POP', 'DUP1', 'SWAP1', 'POP', 'PUSH1', 64, '_sym_external_foo____cleanup', 'JUMP', '_sym_external_foo____cleanup', 'JUMPDEST', 'RETURN', '_sym_join4', 'JUMPDEST', 'POP', '_sym_fallback', 'JUMP', '_sym_fallback', 'JUMPDEST', 'PUSH1', 0, 'PUSH1', 0, 'REVERT']]
['CALLVALUE', '_sym_revert1', 'JUMPI', 'PUSH1', 11, 'PUSH1', 64, 'MSTORE', 'PUSH32', 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'PUSH1', 96, 'MSTORE', 'PUSH1', 64, 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 0, 'SSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'DUP1', 'MLOAD', 'PUSH1', 1, 'SSTORE', 'POP', 'POP', 'POP', '_sym_subcode_size', '_sym_runtime_begin2', '_mem_deploy_start', 'CODECOPY', '_OFST', '_sym_subcode_size', 0, '_mem_deploy_start', 'RETURN', '_sym_runtime_begin2', 'BLANK', ['_DEPLOY_MEM_OFST_128', 'PUSH1', 3, 'CALLDATASIZE', 'GT', '_sym_join3', 'JUMPI', '_sym_fallback', 'JUMP', '_sym_join3', 'JUMPDEST', 'PUSH1', 0, 'CALLDATALOAD', 'PUSH1', 224, 'SHR', 'CALLVALUE', '_sym_revert1', 'JUMPI', 'PUSH4', 194, 152, 85, 120, 'DUP2', 'XOR', '_sym_join4', 'JUMPI', 'PUSH1', 4, 'CALLDATASIZE', 'LT', '_sym_revert1', 'JUMPI', 'PUSH1', 32, 'DUP1', 'PUSH1', 64, 'MSTORE', 'DUP1', 'PUSH1', 64, 'ADD', 'PUSH1', 0, 'SLOAD', 'DUP1', 'DUP3', 'MSTORE', 'PUSH1', 32, 'DUP3', 'ADD', 'PUSH1', 0, 'DUP3', 'PUSH1', 31, 'ADD', 'PUSH1', 5, 'SHR', 'PUSH1', 4, 'DUP2', 'GT', '_sym_revert1', 'JUMPI', 'DUP1', 'ISZERO', '_sym_loop_exit7', 'JUMPI', 'SWAP1', '_sym_loop_start5', 'JUMPDEST', 'DUP1', 'PUSH1', 1, 'ADD', 'SLOAD', 'DUP2', 'PUSH1', 5, 'SHL', 'DUP5', 'ADD', 'MSTORE', 'PUSH1', 1, 'ADD', 'DUP2', 'DUP2', 'XOR', '_sym_loop_start5', 'JUMPI', '_sym_loop_exit7', 'JUMPDEST', 'POP', 'POP', 'POP', 'POP', 'DUP1', 'MLOAD', 'DUP1', 'PUSH1', 32, 'DUP4', 'ADD', 'ADD', 'PUSH1', 31, 'DUP3', 'PUSH1', 0, 'SUB', 'AND', 'CALLDATASIZE', 'DUP3', 'CALLDATACOPY', 'POP', 'POP', 'PUSH1', 31, 'NOT', 'PUSH1', 31, 'DUP3', 'MLOAD', 'PUSH1', 32, 'ADD', 'ADD', 'AND', 'SWAP1', 'POP', 'DUP2', 
'ADD', 'SWAP1', 'POP', 'PUSH1', 64, 'RETURN', '_sym_join4', 'JUMPDEST', 'POP', '_sym_fallback', 'JUMPDEST', 'PUSH1', 0, 'PUSH1', 0, 'REVERT', '_sym_revert1', 'JUMPDEST', 'PUSH1', 0, 'DUP1', 'REVERT'], 'STOP', '_sym_revert1', 'JUMPDEST', 'PUSH1', 0, 'DUP1', 'REVERT']

With this, we are extremely close to having our EVM bytecode. The compiler just needs to turn the list of assembly instructions into bytecode using the generate_bytecode function in vyper.compiler.phases. This function uses the assembly_to_evm function from vyper.ir.compile_ir, which receives as input the following:

  • assembly: list of assembly (ASM) instructions,

  • pc_ofst: when constructing the source map, the amount to offset all pcs by (no effect until we add deploy code source map),

  • insert_vyper_signature: whether to append vyper metadata to output (should be true for runtime code); we referred to this earlier.

For Vyper to understand the context around certain assembly instructions, it uses a symbol map (table) before generating the EVM bytecode. As defined earlier, a symbol table contains information about code blocks (functions) and the symbols used within them. A symbol table entry contains the properties of a code block, including its name, its type and a dictionary that maps the names of variables used within the block to the flags indicating their scope and usage.

The Vyper compiler does a first pass (an iteration over the enumerated list of assembly instructions) to compile any runtime code and uses that to calculate mem_ofst_size, which allows it to use the smallest PUSH instructions that can support all memory symbols. For our contract, the start of the runtime code is 0 and the end is 182, which gives a length of 182 and a memory offset size of 1 (with a maximum memory offset of 128).

With the offsets in hand, the compiler does another iteration over the enumerated list of assembly instructions to resolve symbolic locations to actual code locations. One example of this task is replacing symbolic JUMPDEST locations with valid destinations for JUMP or JUMPI. Now, our symbol map looks like this:

{'_sym_join3': 12, '_sym_loop_start5': 87, '_sym_loop_exit7': 110, '_sym_join4': 156, '_sym_fallback': 158, '_sym_revert1': 164, '_sym_code_end': 182, '_mem_deploy_start': None, '_mem_deploy_end': None}

The map holds the offsets that tell the compiler where the code ends, where loops start, and more.
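The symbol resolution described above can be sketched with a toy two-pass assembler. This is a simplified model, not Vyper's assembly_to_evm: it assumes that a _sym_ name followed by JUMPDEST defines a label (the convention visible in the assembly list), that any other _sym_ name is a reference emitted as PUSH2 plus a 2-byte address (as the 61 01 08 bytes at the start of the final bytecode suggest), and it only knows the handful of opcodes needed for the example.

```python
# Opcode values from the EVM opcode table
OPCODES = {
    "STOP": 0x00, "CALLVALUE": 0x34, "JUMP": 0x56, "JUMPI": 0x57,
    "JUMPDEST": 0x5B, "PUSH1": 0x60, "PUSH2": 0x61, "DUP1": 0x80,
    "REVERT": 0xFD,
}


def is_symbol(item) -> bool:
    return isinstance(item, str) and item.startswith("_sym_")


def is_label_definition(assembly: list, i: int) -> bool:
    return (
        is_symbol(assembly[i])
        and i + 1 < len(assembly)
        and assembly[i + 1] == "JUMPDEST"
    )


def assemble(assembly: list):
    # Pass 1: walk the list, tracking the program counter, and record
    # the code location of every label
    symbols, pc = {}, 0
    for i, item in enumerate(assembly):
        if is_label_definition(assembly, i):
            symbols[item] = pc  # the label itself takes no space
        elif is_symbol(item):
            pc += 3             # PUSH2 + 2 address bytes
        else:
            pc += 1             # one opcode byte or one data byte

    # Pass 2: emit bytes, replacing symbol references with addresses
    code = bytearray()
    for i, item in enumerate(assembly):
        if is_label_definition(assembly, i):
            continue
        if is_symbol(item):
            code.append(OPCODES["PUSH2"])
            code += symbols[item].to_bytes(2, "big")
        elif isinstance(item, int):
            code.append(item)
        else:
            code.append(OPCODES[item])
    return symbols, bytes(code)


toy = ["CALLVALUE", "_sym_revert1", "JUMPI", "STOP",
       "_sym_revert1", "JUMPDEST", "PUSH1", 0, "DUP1", "REVERT"]
symbol_map, bytecode = assemble(toy)
print(symbol_map)
print(bytecode.hex())
```

The toy program is a cut-down version of the nonpayable check in our contract: jump to the revert block if any value was sent, otherwise stop.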

Now that all the symbols have been resolved, the compiler generates the bytecode using the symbol map by iterating once more over the enumerated list of assembly instructions. For our smart contract, we can see how this for loop adds the contract’s self.foo value “Hello World” to the EVM bytecode:

...
b''
b'4'
b''
b'a'
b'a\x01'
b'4a\x01\x08'
b'4a\x01\x08W'
b'4a\x01\x08W`'
b'4a\x01\x08W`\x0b'
b'4a\x01\x08W`\x0b`'
b'4a\x01\x08W`\x0b`@'
b'4a\x01\x08W`\x0b`@R'
b'4a\x01\x08W`\x0b`@R\x7f'
b'4a\x01\x08W`\x0b`@R\x7fH'
b'4a\x01\x08W`\x0b`@R\x7fHe'
b'4a\x01\x08W`\x0b`@R\x7fHel'
b'4a\x01\x08W`\x0b`@R\x7fHell'
b'4a\x01\x08W`\x0b`@R\x7fHello'
b'4a\x01\x08W`\x0b`@R\x7fHello '
b'4a\x01\x08W`\x0b`@R\x7fHello W'
b'4a\x01\x08W`\x0b`@R\x7fHello Wo'
b'4a\x01\x08W`\x0b`@R\x7fHello Wor'
b'4a\x01\x08W`\x0b`@R\x7fHello Worl'
b'4a\x01\x08W`\x0b`@R\x7fHello World'
b'4a\x01\x08W`\x0b`@R\x7fHello World\x00'
...
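If this output looks odd, keep in mind that Python prints any byte that happens to be printable ASCII as a character, so the opcode 0x34 shows up as 4 and 0x57 as W. A short sketch to line the two views up (the NAMES table only covers the opcodes in this chunk, and the lookup is deliberately naive):

```python
# The same bytes as the last "Hello World" line of the output, in hex
chunk = bytes.fromhex("3461010857600b6040527f48656c6c6f20576f726c64")

# Opcode names for the instruction bytes that appear in this chunk
NAMES = {0x34: "CALLVALUE", 0x61: "PUSH2", 0x57: "JUMPI",
         0x60: "PUSH1", 0x52: "MSTORE", 0x7F: "PUSH32"}

print(chunk)

# Naive view of the first ten bytes: anything not in NAMES is treated
# as immediate data belonging to the preceding PUSH
for byte in chunk[:10]:
    shown = chr(byte) if 32 <= byte < 127 else f"\\x{byte:02x}"
    print(f"0x{byte:02x}  printed as {shown}  {NAMES.get(byte, 'immediate data')}")
```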

To find the correct EVM opcode for a given assembly instruction, the function relies on helpers such as get_opcodes(). It then returns the EVM bytes and line_number_map, which contains the contract's breakpoints.

Finally, the bytecode is built from the bytes using:

f"0x{compiler_data.bytecode.hex()}"

returning our final EVM contract bytecode:

0x3461010857600b6040527f48656c6c6f20576f726c6400000000000000000000000000000000000000000060605260408051806000556020820180516001555050506100b66100516000396100b66000f36003361161000c5761009e565b60003560e01c346100a45763c2985578811861009c57600436106100a4576020806040528060400160005480825260208201600082601f0160051c600481116100a457801561006e57905b80600101548160051b840152600101818118610057575b505050508051806020830101601f82600003163682375050601f19601f825160200101169050810190506040f35b505b60006000fd5b600080fda165767970657283000307000b005b600080fd
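Near the end of this bytecode you can spot the bytes 76 79 70 65 72, which are ASCII for "vyper". This looks like the metadata that insert_vyper_signature appends. The sketch below decodes that 13-byte suffix by hand, reading it as a tiny CBOR map followed by a 2-byte length. The decoder is a hypothetical helper that only understands this exact shape, and reading the three numbers as a compiler version is my assumption.

```python
# Suffix copied from the final bytecode above
SUFFIX = bytes.fromhex("a165767970657283000307000b")


def decode_vyper_suffix(suffix: bytes):
    """Decode {text_key: [small ints]} followed by a 2-byte big-endian length."""
    body, declared = suffix[:-2], int.from_bytes(suffix[-2:], "big")
    if declared != len(body):
        raise ValueError("length field does not match the metadata size")

    # CBOR initial byte: top 3 bits are the major type, low 5 bits the count
    if body[0] >> 5 != 5 or body[0] & 0x1F != 1:
        raise ValueError("expected a map with one entry")
    if body[1] >> 5 != 3:
        raise ValueError("expected a text string key")
    key_len = body[1] & 0x1F
    key = body[2:2 + key_len].decode("ascii")

    array_header = body[2 + key_len]
    if array_header >> 5 != 4:
        raise ValueError("expected an array value")
    count = array_header & 0x1F
    # Integers from 0 to 23 are encoded as a single byte equal to their value
    values = list(body[3 + key_len:3 + key_len + count])
    return key, values


key, values = decode_vyper_suffix(SUFFIX)
print(key, values)
```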

🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

IV. Wrap up ✅

Woah, that’s it! We now have the bytecode to deploy our contract written in Vyper. There are some variations of the compiler process (other than the common one) that result in different types of output, which this article doesn’t cover. For example, the user may want to return the runtime bytecode instead of the deployment bytecode, or turn gas optimizations or metadata on or off.

Some very specific processes, like IR optimization, were skipped for the sake of the article’s length. I do recommend checking out Vyper’s READMEs and trying to follow the compiler flow alongside this article: there are some very interesting tricks and techniques to optimize code and have faster compilation times.

If you are interested in contributing, check out some of the GitHub issues (there is a lot of cool stuff to do).

I hope this was an insightful read! Follow me (@cairoeth) on Twitter to get notified of future interesting posts and articles. This work is licensed under CC BY-SA. Thanks to @big_tech_sux, @_tserg, @fubuloubu and @pcaversaccio for their feedback and comments.
