Fighting Simple Sybils: Levenshtein Distance

May 19th, 2022

Sybils

Quadratic funding - the mechanism that currently determines the value of Gitcoin grant funding -is inherently vulnerable to Sybil attacks. Sybil attacks are individual humans dividing themselves into multiple “virtual humans” in order to gain additional voting weight. In traditional banking and voting systems, Sybil resistance comes from “KYC” (know-your-customer) which links personal identifying information to some action. In Web3, “KYC” is generaly minimized because it undermines the core ethos of censorship resistance and permissionlessness. This means other methods are required to identify which participants in a grant round are real individual humans, and which are not.

Sybil Strategies

The goal of Sybil defense is to increase the investment of time and money required for an attacker to convice a grant review system that they are > 1 person to the extent that as rational attacker would not do it. Defenders constantly attempt to push this cost up while minimizing their own expenses, while attackers constantly try to pull the attack cost down. The greater the size of the exploitable pool of funds, the higher cost an attacker will be willing to pay. At the same time, extremely low-cost Sybil attacks are often worthwhile for attackers because even a low success rate can still be profitable if the attack cost is sufficiently low. This means that a robust Sybil defense structure requires systems that identify cheap, simple attacks very effectively and efficiently as well as more complex defenses against sophisticated attacks.

Simple Sybils

The simplest, cheapest form of Sybil attack is simply to generate a large number of addresses and try to vote with all of them. Ordinarily, these will be sifted out of the grant review system by human reviewers because they usually fail even basic proof-of-personhood checks. However, there is a substantial cost associated with these human reviews. To optimize the Sybil defense mechanism for high efficacy and low cost, detection of these cheap Sybil attacks must be automated using computationally inexpensive algorithms.

Levenshtein Distance

Levenshtein distance is a concept from information theory that can measure the difference between two text strings. It is commonly used in natural language processing. The Levenshtein distance between string a and string b is the minimum number of edits required to transform a into b. The fewer changes required to change one string to the other, the smaller the Levenshtein distance. The algorithm, in its naive recursive form, is as follows:

where a and b are the strings to be compared, |a| and |b| are the lengths of the strings

In the context of Gitcoin grants, this idea can be applied to account names that are suspiciously similar. Very similar names might indicate low effort Sybil attempts, for example following account names stand out as very similar and worth investigating to ensure they are genuine individual voters:

david1103920
david1103921
david1103922
david1103923
david1103924
david1103925
david1103926

Accounts separated by a single edit might not be quite as obvious as those above. Here are some more examples of some slightly more subtle examples of accounts separated by a single edit:

ahmeddle
bhmeddle
chmeddle

hombre
h0mbre
hombr3

j1lly
j2lly
j3lly

All of these accounts are easily identified with the constraint Levenshtein distance <= 1.

Algorithm performance

The Levenshtein algorithm is computionally expensive in its naive form because it requires recursion over all possible pairs of accounts. However, it can be made performant using matrix algebra. This approach was recently taken by the FDD Sybil defense squad using a list of ~298,000 registered Gitcoin accounts (meaning the Levenshtein distance algorithm was applied to a matrix of 298,00002 elements). The result was a file with 129,297,647 rows indicating all pairs with edit distances less than 4. As the distance threshold decreases, the number of accounts flagged also decreases, but the likelihood that the flagged accounts are Sybils increases. The threshold distance therefore acts as a calibration slider that the Sybil defense squad can use to balance Sybil detection rate against cost of computation.

Use cases

The technique has been applied to the full dataset of all Gitcoin accounts as a way to quantify the bulk number of suspicious accounts, but it could also run as a service that runs periodically to flag accounts that get created in some time window. This could become integrated as a gadget in the grant application workflow.

To demonstrate that the Levenshtein distance really is identifying Sybil-like behaviour, some of the accounts identified by the algorithm can be examined manually. These three accounts were flagged by the Levenshtein algorithm as potential Sybils:

Examine these accounts in more detail using links: Account 1, Account 2, Account 3.

The account names are separated by a single edit (Levenshtein distance ==1). All three accounts have donated a total of precisely $37, divided identically between the same eight projects in the same round. None of the accounts have participated in bounties, hackathons, tips or kudos. All three were created at about the same time and have no activity prior to, or following, those eight votes. These seem to be obvious Sybils that were correctly flagged by the Levenshtein algorithm. Furthermore, the three accounts above were just three of eight accounts that were all separated by a single edit, and all shared the same Gitcoin activity profile and voting record.

Summary

Creating multiple accounts with very similar names is a very lazy attack, and probably easily identified humans in the grant review system. However, identifying them by implementing a Levenshtein distance algorithm to the entire set of accounts in a grant round removes some of the load on the human reviewers and acts as a cost-optimization because the algorithmic apporoach is fast and cheap.