Retrieval System Constraints

Navigating the constraints of retrieval systems

Building incentivization systems to ensure data availability has turned out to be a tricky problem. This document describes the building blocks that compose such a system, and the constraints that have been observed in their design. One of the major hurdles that the Filecoin decentralized storage network continues to grapple with is how to align the network with user-understandable retrieval properties.

Definitions

Data Availability

We use data availability here not to mean the presence of data on a sufficient number of nodes to infer that it has been "made available", but rather to describe the property that if a client wants to access a piece of data from a provider, they can get that data under pre-negotiated and understood terms.

Retrieval

We use the term retrieval both to refer to a single instance of a client fetching a piece of data from a provider, and for a general system allowing data to be retrieved by clients. A retrieval system can be broken into a set of individual components:

  • Clients, or end users, attempt to get data, either directly in a browser or through an application. The user agent will impact the quality of service because of factors including whether previous state can be preserved between retrieval attempts, and what level of complexity can be supported.

  • Providers, holding data that clients want to retrieve. Data storage is often segmented by size and access pattern. A movie that is streamed linearly at infrequent intervals may be stored on a spinning disk, while a small image asset on a popular home page would more likely be stored in memory for faster delivery.

  • Content Routing refers to the subsystem that determines where content is located: which provider(s) should be contacted by the client? In current web CDNs, this is handled by DNS, while in p2p systems it is often handled by DHTs.

  • Reputation generally describes a feedback loop that involves observability into the system. If a component is behaving poorly, it needs to be avoided in order to maximize overall performance. Visibility into the system informs decisions about how to route requests and data.
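
To make these components a little more concrete, here is a minimal sketch of one possible decomposition as Go interfaces. The names and method signatures (ContentRouter, ReputationStore, and so on) are illustrative assumptions, not part of any existing Filecoin or IPFS API.

```go
package retrieval

import (
	"context"
	"io"
)

// Provider identifies a node that holds data clients may want.
type Provider struct {
	ID   string // e.g. a peer ID or hostname
	Addr string // network address to dial
}

// ContentRouter answers "who has this content?" - DNS in a CDN,
// a DHT in a p2p network.
type ContentRouter interface {
	FindProviders(ctx context.Context, contentID string) ([]Provider, error)
}

// ReputationStore feeds observations back into routing decisions.
type ReputationStore interface {
	Score(p Provider) float64 // higher is better
	Record(p Provider, success bool, bytes int64)
}

// Client ties the pieces together: route, rank, fetch.
type Client interface {
	Fetch(ctx context.Context, contentID string) (io.ReadCloser, error)
}
```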

Capacity

We often want to talk about how much bandwidth is available to a node in a networked system.

A basic use of this term may be familiar from internet service offerings around the world. When you purchase a '1Gbit' internet service, you are paying for transit capacity. Transit capacity means that the provider is offering to transit (send) up to 1Gbit/s of traffic from you to the rest of the internet.

There are, however, some important caveats.

  • There may be capacity / over-subscription limitations that mean a given connection you make does not operate at 1Gbit/s.

  • There may be limiting relationships multiple hops away from you, such that even if you have capacity for your continent, you’re unlikely to observe that same capacity for intercontinental traffic.

A scenario to consider, and not an uncommon one, is that you could have a gigabit line from your server to the interconnection point of your transit ISP, as could their other customers. Datacenter routers commonly have "full-plane switching", meaning there would not be a capacity limitation in connections to other nodes within the datacenter. (Full-plane switching here means that the router has enough internal bandwidth to provide full line-rate connectivity between any pair of ports.) While there may be enough bandwidth for connections within the datacenter, the ISP will estimate overall load to provision its upstream capacity from the datacenter to other parts of the internet. As such, not all customers would be able to simultaneously receive their full provisioned capacity.

In this scenario one would receive a gigabit result on a speed test against a service like speedtest.net or even Netflix's fast.com. Both of these services measure the bandwidth available at the edge of a server or customer's "local ISP". Despite that, no customer would be able to receive gigabit speeds from the server.
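
To make the over-subscription point concrete, here is a small back-of-the-envelope sketch. The numbers (a 1 Gbit/s line per customer, a shared 10 Gbit/s upstream link, 400 customers) are illustrative assumptions, not measurements of any real ISP.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers only: each customer buys a 1 Gbit/s line,
	// but the ISP provisions a shared 10 Gbit/s upstream link.
	const perCustomerGbps = 1.0
	const upstreamGbps = 10.0
	const customers = 400

	// Oversubscription ratio: how much capacity was sold vs. provisioned.
	ratio := perCustomerGbps * customers / upstreamGbps
	// If every customer pulls traffic through the upstream link at once,
	// this is the fair share each one actually sees.
	fairShare := upstreamGbps / customers

	fmt.Printf("oversubscription ratio: %.0f:1\n", ratio) // 40:1
	fmt.Printf("worst-case fair share:  %.3f Gbit/s (%.0f Mbit/s)\n",
		fairShare, fairShare*1000) // 0.025 Gbit/s = 25 Mbit/s
}
```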

Reputation

We use reputation in this context to talk about a relative measure of 'how likely is it that an abstract retrieval attempt will be successful'. We can tailor a more specific metric for a specific client, and use reputation to talk about the abstract portion of the metric that is general across all clients. Reputation, then, is made up of several component factors:

  • Capacity of a provider (is there sufficient bandwidth?)

  • Stability of a provider (is it under-provisioned or unreliable?)

  • How well the client-specific metric can predict the policy of the provider.
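
As a sketch of how these factors might be folded into a single number, the snippet below uses an equal-weight linear combination. The weights and the normalization are assumptions for illustration, not a proposed standard.

```go
package reputation

// Factors are normalized observations about a provider, each in [0, 1].
type Factors struct {
	Capacity       float64 // fraction of requests served with sufficient bandwidth
	Stability      float64 // fraction of time the provider was reachable
	Predictability float64 // how well a client-specific model predicted its policy
}

// Score combines the factors into one number in [0, 1].
// The equal weighting is an illustrative assumption; a real system
// would tune these against observed retrieval success.
func Score(f Factors) float64 {
	const wCap, wStab, wPred = 1.0 / 3, 1.0 / 3, 1.0 / 3
	return wCap*f.Capacity + wStab*f.Stability + wPred*f.Predictability
}
```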

The simplest protocol

When thinking about retrieval of data, HTTP can provide a starting place for protocol design.

  • Client connects to a provider of data

  • Client asks for data: GET /object/url

  • Provider responds with data.
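
In code, the whole protocol is little more than a single HTTP GET. A minimal sketch, with a placeholder host and path:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder URL: any provider that speaks plain HTTP will do.
	resp, err := http.Get("https://provider.example/object/url")
	if err != nil {
		log.Fatal(err) // client connects; provider is unreachable or refuses
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body) // provider responds with data
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d bytes (status %s)\n", len(data), resp.Status)
}
```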

Incentivization

In the above protocol, we have not touched on why the provider would respond.

Deviation 1: Provider doesn't provide data.

In the simple HTTP protocol we have left out any concept of a stateful relationship between the client and provider. Without reputation, an iterated game structure, or state, there is no incentive for the provider to give data to the client. Sending data consumes a finite bandwidth resource, so it is cheaper for a provider to 'defect' and not pay that cost.

There is not a single standard for how directly providers account for this cost. In many cases it's an abstract cost that providers balance with their primary business model revenue.

As bandwidth costs become more dominant in a provider's business - think movie streaming or file sharing - it would be expected that a provider would track or 'meter' the amount of data transferred by individual customers, and would implement policies about the acceptable amount of traffic on a per-client basis. Even when the cost is not individually accounted for, providers may use abuse mitigation systems like DDoS prevention that impose limits on individual user bandwidth costs.

In both of these cases, the provider will typically use a 'cookie' for logged-in users to identify the different data transfers that are made within the context of that user relationship. For public content where the user has not officially registered or logged in, these metering systems will use a heuristic like the IP address of the client as a way to approximate the same policies.
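
A minimal sketch of that kind of metering, keyed by a session cookie when one is present and falling back to the client's IP address otherwise. The cookie name, quota, and key scheme are assumptions for illustration, not any particular provider's policy.

```go
package metering

import (
	"net"
	"net/http"
	"sync"
)

// Meter tracks bytes served per client key and enforces a simple quota.
type Meter struct {
	mu    sync.Mutex
	used  map[string]int64
	quota int64 // illustrative per-client byte quota
}

func NewMeter(quota int64) *Meter {
	return &Meter{used: make(map[string]int64), quota: quota}
}

// clientKey prefers a session cookie (logged-in users) and falls back
// to the client IP address for anonymous traffic.
func clientKey(r *http.Request) string {
	if c, err := r.Cookie("session"); err == nil {
		return "cookie:" + c.Value
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	return "ip:" + host
}

// Allow records a transfer of n bytes and reports whether the client
// is still within its quota.
func (m *Meter) Allow(r *http.Request, n int64) bool {
	key := clientKey(r)
	m.mu.Lock()
	defer m.mu.Unlock()
	m.used[key] += n
	return m.used[key] <= m.quota
}
```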

Reputation

We also have not yet scoped in the concept of reputation - why does the client go to this provider, and more importantly, why was the data stored with this provider in the first place?

This is a difficult property to quantify, and one where, today, reputation is most commonly based on authority rather than any quantitative metric.

When you consider storing data on a web host, you do not typically get insight into their abuse mitigation policies. As such, you can't fully quantify what percentage of your users will and will not be able to actually access the content you pay for. In many cases, the provider may not even know - that final understanding of availability is a property of the path that traffic takes over the internet, and can be disrupted by any of the ISPs along that path. That your server in an AWS data center is not accessible to users in China is not something that AWS would typically be held accountable for. AWS does not prevent the traffic from China; rather, access is a hybrid of both the provider's abuse mitigations and the firewall policies of client organizations or countries.

Similarly, when you put content behind Cloudflare to keep it available and mitigate excessive bandwidth costs, you don't get particularly precise accounting from Cloudflare of which connections are blocked and which of your users are not able to access the content. Instead, we use the reputation of Cloudflare as a legitimate DDoS prevention service to understand that the bulk of blocked connections are probably abusive and not legitimate users.

Reputation constraints

There is a set of properties that we would hope to be true of a reputation feedback loop that could replace the need for authority or other forms of trust in a retrieval system.

  • global & permissionless - the general reputation metric should be deterministic and able to be recalculated by the different participants in the system (a minimal sketch of such a recomputation follows this list).

  • resilient - the metric should reflect the relative performance of providers even in the presence of:

    • Storage providers acting in their rational financial best interest

    • A fraction of storage providers acting in a Byzantine way with regard to the protocol

    • Clients may be operated by providers. The specific threat model for clients is itself associated with a complex set of constraints, discussed next.
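
As a sketch of the 'deterministic and recalculable' property: if the reputation metric is a pure function of a shared set of reports, any participant holding those reports can recompute the same score. The report fields and the simple success-rate metric below are assumptions for illustration.

```go
package reputation

// Report is a single publicly visible observation of a retrieval attempt.
// The fields are illustrative; a real system would define these precisely.
type Report struct {
	Provider string
	Success  bool
}

// SuccessRate recomputes a provider's score from the full public report set.
// Because it is a pure function of the reports, any participant holding the
// same reports derives the same score - no trusted scorer is required.
func SuccessRate(reports []Report, provider string) float64 {
	var total, ok int
	for _, r := range reports {
		if r.Provider != provider {
			continue
		}
		total++
		if r.Success {
			ok++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ok) / float64(total)
}
```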

Constraints on clients

The retrieval problem is two-sided. As such, some model of clients is needed. This comes in a couple of forms.

First, there needs to be some stream of data coming from clients. This may be directly through client reports to a decentralized system, or it could be through their relationships with services or providers.

Second, we would need some model for the cost associated with bad client data. This cost/adversary model will differ depending on the structure of reports. For instance, if client data is reported via providers, then a model would center on how many of those providers are reporting accurate telemetry. A model where clients have a longer-term cryptographic identity, and especially one where the client identity has a cost to establish - from reports over time, staking, or another direct monetary mechanism - would potentially lead to different modeling of what can be understood from client data.
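
As one illustration of what a longer-term, costly-to-establish client identity could enable, the sketch below signs a retrieval report with an ed25519 key held by the client. The report format and field names are assumptions, not an existing protocol message.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// RetrievalReport is an illustrative client report; the fields are
// assumptions, not a defined protocol message.
type RetrievalReport struct {
	Provider  string    `json:"provider"`
	ContentID string    `json:"content_id"`
	Success   bool      `json:"success"`
	Bytes     int64     `json:"bytes"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	// A long-lived client keypair: generating (and perhaps staking on)
	// this key is where a cost to establish identity could attach.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	report := RetrievalReport{
		Provider:  "provider-123", // hypothetical provider ID
		ContentID: "bafy...",      // placeholder content identifier
		Success:   true,
		Bytes:     1 << 20,
		Timestamp: time.Now().UTC(),
	}
	payload, _ := json.Marshal(report)
	sig := ed25519.Sign(priv, payload)

	// Anyone holding the client's public key can verify the report came
	// from that identity, tying its weight to the identity's history.
	fmt.Println("signature valid:", ed25519.Verify(pub, payload, sig))
}
```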

One of the major limitations encountered in reputation systems to date is that the incentives that may be structured around clients and client reports are typically very lopsided compared to the incentives for providers. Providers have a direct financial incentive to be seen as 'high reputation'. In contrast, there are many more clients, so the financial reward for individual reports is unlikely to be high. The web model, where client identities are transient and initial retrieval attempts typically carry no cost, makes it difficult to prevent a provider from running many clients and using them either to report good data about themselves, or to report adversarial data about the system overall.

What this means for retrieval systems

There are three take-aways that come out of the constraints described here.

The first is that practical decentralized retrieval systems should start with a 'thick client' - a mobile application library or a fully integrated experience - an approach that has been successful, as evidenced by clients like IPFS-desktop and BitTorrent. Once provider policies can be predicted with high accuracy in these types of systems, they can become sufficiently reliable defaults to allow extension to more limited web client environments. Starting with a limited web client is less advisable for an MVP because of the limitations around observability and adaptability.

The second is that a reputation feedback loop is critical to any performance-focused data transfer system. In centralized CDNs today, this is accomplished through health monitoring and log analysis. Production systems integrate monitoring both on each end node and through reports of bandwidth and other metrics from the routers and gateways that requests flow through. Together these systems allow for dynamic reconfiguration to provide resilience in the presence of individual component failures. This subsystem remains one of the largest unsolved challenges in a decentralized setting.

Conclusion

The ultimate take-away from this description of constraints is that there is a tension, primarily at the reputation level, between how well-defined the reputation metric is and how well-modeled a client measurement system needs to be. There are some components of a general reputation, like provider capacity, that can be measured in the abstract. A well-modeled client-specific metric - more feasible with a client application that records and calculates local preferences for providers over time - makes it easier to identify anomalies and arrive at a stable abstract reputation. In contrast, the transient, identity-less clients in a web context provide limited context to build a strong model, and require a more powerful abstract provider reputation in order to offer a good experience.

Creating retrieval systems at a global scale is a monumental endeavor. The systems that exist today have done so with significant physical infrastructure investments, and have required the implementation of complex custom software, political negotiation, and business development. At the same time, there is a tantalizing promise in the success of systems like BitTorrent that cheaper infrastructure could be sufficient in many cases. While we may work towards this optimization, we should remain sober about the different requirements and reasons for the complexity and infrastructure that has emerged in the existing CDN space.

Further Reading

  • The Measurement Lab has a good set of tooling and documentation around capacity measurement.

  • Iperf is a tool and community for two-party network measurements.

  • This article dives further into the difference between 'Data availability' and data storage.
