The current airdrop system is broken. Sybils create multiple addresses to obtain as many airdrop claims as they can, and farmers game the system to get as much of the airdrop as possible before dumping it all. For Hop, 16% of recipients moved the airdrop out of their wallet within 30 days of receiving it, and 25% of recipients currently do not hold the token at all. What if addresses likely to dump a token could be detected before it happens, using on-chain data? In this experiment, a supervised machine learning model tries to predict whether a user will dump a token based on their previous on-chain address history.
All code can be found here.
On-chain data was gathered using Flipside’s shroomDK. The main queries I created covered the following:
Wallet History: When were their first and last transactions before the airdrop snapshot? How many transactions did they have, how many unique contracts did they call, and how many distinct addresses did they interact with?
Hop History: How many bridge transactions did they have, what was their volume, and how many distinct tokens did they bridge?
Previous Dumping History: Did they dump a token previously? How long did it take for them to dump it? How many previous airdrop tokens did they dump?
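The wallet-history pull might look roughly like this. The table and column names follow Flipside's Ethereum schema, but the exact query and the snapshot date here are assumptions, not the post's actual SQL:

```python
# Sketch of the wallet-history query run through shroomDK.
# '2022-01-01' is a placeholder for the airdrop snapshot date.
WALLET_HISTORY_SQL = """
SELECT
  from_address                AS wallet,
  MIN(block_timestamp)        AS first_tx,
  MAX(block_timestamp)        AS last_tx,
  COUNT(*)                    AS tx_count,
  COUNT(DISTINCT to_address)  AS distinct_addresses
FROM ethereum.core.fact_transactions
WHERE block_timestamp < '2022-01-01'
GROUP BY 1
"""

# Executing it requires an API key, e.g.:
# from shroomdk import ShroomDK
# sdk = ShroomDK("<api-key>")
# records = sdk.query(WALLET_HISTORY_SQL).records
```

The other three histories (Hop bridges, previous dumps) would follow the same pattern against the relevant event tables.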
Additional feature-engineering variables were also created, such as the percent of transactions that were Hop-related (a wallet near 100% may exist only to farm $Hop), the frequency of wallet use, the time between their first and last Hop bridge, and the time between the creation of the wallet and their first Hop bridge.
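As a minimal sketch of that feature engineering, assuming a pandas frame with hypothetical column names holding the raw query outputs:

```python
import pandas as pd

# Hypothetical per-address frame built from the query results above.
df = pd.DataFrame({
    "total_txs":        [120, 8],
    "hop_txs":          [6, 7],
    "first_tx":         pd.to_datetime(["2020-05-01", "2022-03-20"]),
    "first_hop_bridge": pd.to_datetime(["2022-02-01", "2022-03-21"]),
    "last_hop_bridge":  pd.to_datetime(["2022-03-15", "2022-03-28"]),
})

# Percent of transactions that were Hop-related (near 100% hints at a farm-only wallet).
df["pct_hop_txs"] = df["hop_txs"] / df["total_txs"]
# Span between first and last Hop bridge, in days.
df["hop_span_days"] = (df["last_hop_bridge"] - df["first_hop_bridge"]).dt.days
# Time between wallet creation (first tx) and first Hop bridge, in days.
df["days_to_first_bridge"] = (df["first_hop_bridge"] - df["first_tx"]).dt.days
```

The second row is the farmer pattern: a days-old wallet whose activity is almost entirely Hop-related.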
Unfortunately, there are a few aspects of the current dataset that may cause some inaccuracies:
Only Ethereum Hop bridges were recorded. Ideally, Arbitrum, Optimism, and Polygon would have been included to give a more accurate picture of Hop usage.
“Dumping” is defined as $Hop leaving the wallet. Users could be mislabeled if they transferred their tokens to another wallet they owned or decided to provide liquidity to an AMM like Uniswap. While this is unfortunate, accurately labeling those cases would be extremely difficult, so they are counted as dumping for now.
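Under that definition, the label is simple to compute. A sketch, assuming hypothetical claim and first-outflow timestamps per address (the dates here are illustrative, not the real claim date):

```python
import pandas as pd

# "Dumping" = any $Hop leaving the wallet within 30 days of the claim.
claims = pd.DataFrame({
    "address":    ["0xaaa", "0xbbb", "0xccc"],
    "claim_time": pd.to_datetime(["2022-06-09"] * 3),
    # First time $Hop left the wallet (NaT = never left).
    "first_hop_outflow": pd.to_datetime(["2022-06-15", "2022-09-01", None]),
})

window = pd.Timedelta(days=30)
delta = claims["first_hop_outflow"] - claims["claim_time"]
# NaT (never moved the token) correctly falls outside the window.
claims["dumped_30d"] = delta.notna() & (delta <= window)
```

Note the caveat above applies directly here: a transfer to the owner's second wallet flips this label just as a sale does.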
Looking at previous airdrops, quite a few addresses dumped earlier tokens, with SOS, ENS, and Looks being the most popular.
Fewer than 5,000 of these addresses, though, dumped more than one token.
Additionally, we can see that older wallets for the most part did not dump $Hop within the first 30 days. This may be because they did not need the money (a wallet that is a year or two old may have seen significant gains in its crypto assets) or because farmers create new wallets to avoid detection for shady on-chain behavior (most wallets that dumped are less than a year old).
To look for additional insights, a k-means clustering model was developed to find patterns among the addresses.
The y variables are strongly related to wallets that dumped a previous airdrop, whereas the x variables cover days since the last transaction before the snapshot, the number of tokens used on the bridge, and the percent of transactions that were Hop-related.
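A minimal sketch of that clustering step with scikit-learn, on synthetic stand-in features (the real matrix would hold the variables just described):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in feature matrix for 200 addresses: [prior_airdrops_dumped,
# days_since_last_tx, tokens_bridged, pct_hop_txs].
X = rng.random((200, 4))

# Scale first so no single feature dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
```

`n_clusters=4` matches the four groups discussed below; in practice the count would be chosen with an elbow or silhouette check.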
What’s interesting is that the dumpers are spread across all four groups, with varying ratios of dumpers to holders.
Looking at the number of airdrops dumped, the largest group is still those with zero airdrops dumped (perhaps because they did not qualify for a previous airdrop, or the wallet is new).
The ratio of holders to dumpers does decrease, however, as the number of previously dumped airdrops increases.
Four models were used to classify whether an address would dump the token within 30 days. Without any tuning of the model parameters, the following area-under-the-curve results were obtained for a support vector classifier, random forest, logistic regression, and XGBoost.
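The four-model comparison might be set up like this. This is a sketch on synthetic imbalanced data, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs only scikit-learn; `xgboost.XGBClassifier` would slot into the dict the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in imbalanced dataset; the real features come from the wallet,
# Hop, and previous-dump history queries.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "svc": SVC(probability=True, random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
# Area under the ROC curve for each untuned model.
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
```

AUC is the right scoreboard here precisely because the classes are unbalanced; raw accuracy would look good even for a model that predicts "holder" every time.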
The XGBoost model has a solid area-under-the-curve score. Since the dataset is quite unbalanced, over-sampling and under-sampling were implemented to see if the results would change. Once this was done, the area under the curve for XGBoost was in the 85%+ range. Overfitting may still be occurring due to the limited dataset size.
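Naive random over-sampling can be written in a few lines of NumPy; in practice imbalanced-learn's `RandomOverSampler` or `SMOTE` would be the usual tools. A sketch (note it should only ever be applied to the training split, or the AUC will be inflated by leaked duplicates):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Resample every class up to the majority-class count, with replacement."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# 80 holders vs 20 dumpers -> balanced 80/80 after over-sampling.
X = rng.random((100, 3))
y = np.array([0] * 80 + [1] * 20)
X_bal, y_bal = random_oversample(X, y)
```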
When looking at a SHAP plot (a plot that helps determine the feature importance of a model), a few important trends emerge. Wallets that had sent a lot of transactions on Ethereum were MORE likely to dump, as were those with very little activity on the network; the former may have been LP farmers or traders. The biggest indicator is the number of transactions in which the address received farming rewards. Other features, such as the age of the wallet on Ethereum and how long they had been using the Hop protocol, also played a factor. The model was also picking up on something missing from the data: other networks. Since the data only covered Ethereum, an address with no Ethereum transactions before the snapshot was counted as 0. This may have skewed the results, as someone very active on Polygon might not have been accurately represented. Interestingly, of all the previous airdrops, the two most important were DYDX and Bend.
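A SHAP summary plot would come from the `shap` package (e.g. `shap.TreeExplainer` on the fitted model). As a dependency-lighter sketch of the same underlying idea, ranking features by how much they drive predictions, here is permutation importance with scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Stand-in data; the real columns would be tx counts, wallet age,
# Hop usage span, prior airdrops dumped, etc.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score:
# big drop = important feature.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```

Unlike a SHAP plot, this gives only a global ranking, not the per-observation direction of effect (e.g. "high tx count pushes toward dumping"), which is why SHAP was the better choice for the analysis itself.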
One such interesting example of a farmer was 0x042a135bd342910ad7f67bbda74e3fd4125d1272
This address LP’d a bit into Uniswap to collect extra fees before swapping out to ETH and USDC.
How did this address get so much $Hop? Through farming. Taking the earnings they got from $Hop, they deposited into an Across LP (another protocol that may have an airdrop) for a day before swapping out.
This address has been farming other protocols such as Optimism, Euler, and Across to gain the maximum amount of airdrops and tokens before dumping into assets such as stables and ETH/BTC.
The question is: is this farmer skillful? Should they be rewarded for their actions via airdrops, just to dump them? I’m split.
In this analysis, the Hop airdrop was examined to see if a model could predict who would dump the token within 30 days. Using previous dump history, wallet history, and Hop usage history, both a clustering model and a supervised classification model were created to predict outcomes. Surprisingly, the model was able to predict who was going to dump the token from a handful of variables, such as previous dumping behavior and transaction activity. The dataset was rather limited (25k addresses and Ethereum only), and it would be interesting to rerun this on a larger dataset to see if the results hold up or whether this was an example of overfitting. The great thing about blockchain technology is that all of the data is available in some way or another, leading to an endless number of features that can be created. As more on-chain data analysis metrics are created, I believe the results of on-chain classification will only get better. This could help weed out users trying to game the system and promote good on-chain behavior!