today i go over a deep learning experiment modeling trading agents (and MEV bots) with reinforcement learning, framed as a markov decision process (mdp).
my journey with machine learning started back in my phd, when i spent two good semesters mathematically and programmatically deriving all the ml classifiers, while researching complex networks (graph) classification. when i was an engineer at yelp and then at apple, i had the opportunity to build several in-house ml projects as well (some of my research can be found in my old blog, “singularity.sh” and in this repository).
however, it was only last year when i started looking at the applications of this knowledge to defi, including experiments using non-canonical features such as crowd sentiment analysis and astronomy.
in this post, i introduce the first pieces of this ongoing side project. the problem we are trying to solve is: “how do we build defi trading agents that maximize cumulative rewards in the form of realized profit?”
i assume you have some background in machine learning, neural networks, and reinforcement learning. if you need to refresh the basics, i suggest the resources linked at my (public) ml research repository (or chatgpt 🥲).
the difference between reinforcement learning and supervised or unsupervised learning is that the training information evaluates the actions the agent takes, rather than instructing it with the correct actions.
at a high level, the elements of a reinforcement learning system are:
the agent (the learning bot);
a model of the environment (anything outside the agent);
a policy, defining the learning agent's behavior at a given time (think a map of perceived states to actions to be taken). it might be stochastic (defining probabilities for each action);
a reward signal, modeling the goal of the agent. it’s the value the environment sends to the agent at each time step, and it’s what the agent needs to maximize;
a value function, specifying the total reward the agent can expect to accumulate over the future (and can be used to find better policies).
solving a reinforcement learning task means having your agent interact with the environment to find a policy that achieves great reward over the long run (maximizing the value function).
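to pin down the terms above, the cumulative (discounted) return and the state-value function under a policy pi are usually written as follows; the discount factor γ ∈ [0, 1] is something we would still have to pick (this is the standard textbook formulation, nothing defi-specific yet):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]
```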
markov decision processes (mdp) are a canonical framing of the problem of learning from interaction with an environment to achieve a goal. the environment’s role is to return rewards, i.e., some numerical values that the agent wants to maximize over time through its choices of actions.
back in statistical and quantum physics, we use markov chains to model random walks of particles and phase transitions. mdps extend that formalism with actions and rewards, making them a classical formalization of sequential decision-making, where actions influence immediate rewards AND subsequent situations. this means looking at the trade-offs of delayed reward by evaluating feedback and choosing different actions in different situations.
let’s ask our friend what they think:
for our defi agent, a reinforcement learning task can be formulated as a markov decision problem by the following process:
the agent acts in a trading environment, which is a non-stationary pvp game with thousands of agents acting simultaneously: market conditions change, and agents join, leave, or constantly change their strategies.
at each time step, the agent receives the current state (S_t) as input, takes an action (A_t: buy, hold, or sell), and receives a reward (R_{t+1}) and the next state (S_{t+1}). the agent chooses its action based on a policy (pi(S_t)).
the endgame is to find a policy that maximizes the cumulative reward sum (e.g., P&L) over some finite or infinite time.
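to make this formulation concrete, here is a minimal, hypothetical gym-style sketch of the trading mdp: the state is a window of log-returns plus the current position, the action space is {buy, hold, sell}, and the reward is the one-step mark-to-market p&l. the fixed trade size, the absence of fees, and the feature choice are all simplifying assumptions, not the final design.

```python
import numpy as np

BUY, HOLD, SELL = 0, 1, 2  # discrete action space

class TradingEnv:
    """minimal sketch of the trading mdp: one token pair, fixed trade size."""

    def __init__(self, prices: np.ndarray, window: int = 10):
        self.prices = prices      # historical mid-prices (assumed pre-cleaned)
        self.window = window
        self.reset()

    def reset(self):
        self.t = self.window
        self.position = 0.0       # units of the token currently held
        return self._state()

    def _state(self):
        # state S_t: last `window` log-returns plus the current position
        returns = np.diff(np.log(self.prices[self.t - self.window : self.t + 1]))
        return np.append(returns, self.position)

    def step(self, action: int):
        price = self.prices[self.t]
        if action == BUY:
            self.position += 1.0
        elif action == SELL:
            self.position -= 1.0
        self.t += 1
        next_price = self.prices[self.t]
        # reward R_{t+1}: one-step mark-to-market p&l of the position
        reward = self.position * (next_price - price)
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done
```

a random-policy rollout is then just a loop of `state, reward, done = env.step(np.random.choice([BUY, HOLD, SELL]))` until `done`.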
dynamic programming can be used to calculate optimal policies on mdp, usually by discretizing the state and action spaces.
generalized policy iteration (gpi) is the process of letting the policy-evaluation and policy-improvement processes interact through dp.
recall that we can improve a policy given a value function for that policy, through the following interaction:
one process makes the value function consistent with the current policy (policy evaluation),
and the other process makes the policy greedy for the current value function (policy improvement).
a possible issue, however, is the curse of dimensionality, i.e., when the number of states grows exponentially with the number of state variables in the learning space.
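as a toy illustration of gpi before the curse of dimensionality kicks in, here is tabular policy iteration on a tiny discretized mdp; `P[s, a, s']` and `R[s, a]` below are placeholder arrays, not anything market-related, and this exact tabular approach is what stops scaling once the state space explodes.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-8):
    """P[s, a, s'] = transition probabilities, R[s, a] = expected reward."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # policy evaluation: make V consistent with the current policy
        while True:
            V_new = np.array([
                R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                for s in range(n_states)
            ])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # policy improvement: act greedily with respect to V
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```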
at a high level, our learning agent must sense the state of the environment and then take actions that change that state.
in practical terms, our defi agent bot is developed and deployed through the following stages:
creating a data engineering infrastructure for extracting and cleansing historical blockchain data.
for instance, we might want to retrieve ethereum logs, trace the mempool, or even extract data such as account balance and open limit orders.
this part would also encompass feature selection, i.e., choosing the desired token pairs, dex venues, and which data is relevant.
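as a sketch of this first stage, and assuming web3.py (v6) plus an rpc endpoint of your own, pulling raw swap logs and an account balance could look like the snippet below; the rpc url and the pool address are placeholders, and the event signature shown is uniswap v3's Swap, used only as an example.

```python
from web3 import Web3

# assumption: your own (archive) rpc endpoint
w3 = Web3(Web3.HTTPProvider("https://eth.example-rpc.com"))

# placeholder pool address; swap in the pair/venue selected during feature selection
POOL = Web3.to_checksum_address("0x0000000000000000000000000000000000000000")
SWAP_TOPIC = Web3.to_hex(
    w3.keccak(text="Swap(address,address,int256,int256,uint160,uint128,int24)")
)

def fetch_swap_logs(from_block: int, to_block: int):
    """raw swap events for one pool; decoding and cleansing happen downstream."""
    return w3.eth.get_logs({
        "fromBlock": from_block,
        "toBlock": to_block,
        "address": POOL,
        "topics": [SWAP_TOPIC],
    })

def eth_balance(account: str) -> float:
    """current balance of an account, in ether units."""
    wei = w3.eth.get_balance(Web3.to_checksum_address(account))
    return float(w3.from_wei(wei, "ether"))
```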
the next step is to model an initial policy. this can be done by extracting pre-labels from the training data (the historical data up to that point), running a supervised training pass, and then studying the features.
the action space could be composed of three actions: buy, hold, and sell. the agent would initially trade a fixed amount of capital at each step.
in reality, however, the agent needs to learn how much money to invest in a continuous action space. in this case, we could introduce limit orders (with variables such as price and quantity) and the ability to cancel orders that are not matched.
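one (hypothetical) way to extract the pre-labels mentioned above is to bucket forward returns into buy/hold/sell around a dead-band threshold; the 30-step horizon and the 20 bps threshold below are arbitrary placeholders to be tuned.

```python
import pandas as pd

def make_labels(prices: pd.Series, horizon: int = 30, threshold: float = 0.002) -> pd.Series:
    """label each step by its forward return: buy above +threshold,
    sell below -threshold, hold otherwise."""
    fwd_return = prices.shift(-horizon) / prices - 1.0
    labels = pd.Series("hold", index=prices.index)
    labels[fwd_return > threshold] = "buy"
    labels[fwd_return < -threshold] = "sell"
    return labels[fwd_return.notna()]  # drop the tail with no forward window
```

any off-the-shelf classifier trained on the selected features against these labels could then seed the initial policy.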
the next stage is to optimize the agent’s policy, where the reward function could be represented by the net profit and loss (i.e., how much profit the bot makes over some period of time, with or without trading fees). we could also look at the net profit the agent would realize if it were to close all of its positions immediately.
moreover, adding real environmental restrictions is part of the optimization process. the simulation could take into account order book and network latencies, trading fees, the amount of available liquidity, etc.
making sure we separate validation and test sets is critical, since overfitting to historical data is a threat to the model.
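a minimal sketch of a fee-aware net p&l and of a chronological train/validation/test split, assuming per-trade fees quoted in basis points and same-length `position` / `prices` arrays; note the split is strictly time-ordered, since shuffling market data would leak future information into training:

```python
import numpy as np

def net_pnl(position: np.ndarray, prices: np.ndarray, fee_bps: float = 30.0) -> float:
    """net p&l of a position path: mark-to-market gains minus trading fees."""
    gross = np.sum(position[:-1] * np.diff(prices))
    traded_notional = np.sum(np.abs(np.diff(position, prepend=0.0)) * prices)
    fees = traded_notional * fee_bps / 10_000
    return float(gross - fees)

def chronological_split(data, train: float = 0.7, val: float = 0.15):
    """time-ordered train/validation/test split; never shuffle market data."""
    n = len(data)
    i, j = int(n * train), int(n * (train + val))
    return data[:i], data[i:j], data[j:]
```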
the next step is backtesting this policy. at this point, we could consider optimizing the parameters by scrutinizing environmental factors: order book liquidity, fee structures, latencies, etc.
finally, we would run paper-trading with real-time new market data, analyzing metrics such as sharpe ratio, maximum drawdown, and value at risk.
paper-trading should help prevent overfitting. by learning a model of the environment and running careful rollouts, we could predict potential market reactions and other agents' reactions.
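the paper-trading metrics above are straightforward to compute from a series of per-period returns; the annualization factor below assumes daily bars, which is just a placeholder.

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray, periods_per_year: int = 365) -> float:
    """annualized mean/volatility ratio of per-period returns."""
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std())

def max_drawdown(returns: np.ndarray) -> float:
    """largest peak-to-trough drop of the compounded equity curve."""
    equity = np.cumprod(1.0 + returns)
    peaks = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / peaks))

def value_at_risk(returns: np.ndarray, alpha: float = 0.05) -> float:
    """historical var: the loss exceeded only alpha of the time."""
    return float(-np.quantile(returns, alpha))
```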
the last step is to go live with the deployment of the strategy and watch the live trading on a decentralized exchange.
in following posts, i might go more in-depth into the code and the stack, but for now, here are a few questions for the anon friend to think about:
what token pairs and dex market should our agent trade on?
what are the heaviest features to train on and make predictions with?
can we discover hidden patterns from our feature extraction process?
what should be the agent’s target market, and what protocols and chains should we extract features from?
what is the right state for the agent to trigger a trade? in high-frequency trading (hft), decisions are based on nanosecond-level market signals. although our agent needs to be able to compete with this paradigm, neural nets are comparatively slow, with inference taking up to minutes.
can we train an agent that can transition from bear to bull markets (and vice-versa)?