Data is valueless unless people can extract the information hidden within it. In the traditional Web2.0 era, huge amounts of data are stored and processed in clusters managed by centralized companies. The volume is so large that the data must be spread across tens of thousands of servers and processed in a distributed way: computation is divided into millions of small tasks, each dispatched to a server close to the data it needs. These small tasks exchange intermediate results to produce the final output of the entire job. This model is often called dataflow computation, and it is the key to extracting value from data at scale. The Web2.0 giants, e.g., Microsoft, Google, and Meta, have spent years building large-scale dataflow computation engines to turn their data into enormous value for their companies.
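To make the dataflow model concrete, here is a minimal sketch of its classic instance, a word count: map tasks run next to each data shard, their intermediate key-value pairs are exchanged (shuffled), and reduce tasks combine them into the final result. The shard data and function names are illustrative, not from any particular engine.

```python
from collections import defaultdict

# Hypothetical input: each "server" holds one shard of the data set.
shards = [
    ["apple banana apple"],
    ["banana cherry"],
]

def map_task(lines):
    """Small task dispatched next to the data: emit (word, 1) pairs."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(intermediate):
    """Exchange intermediate results between tasks: group pairs by key."""
    groups = defaultdict(list)
    for pairs in intermediate:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Combine the intermediate values into the final result for one key."""
    return key, sum(values)

intermediate = [map_task(shard) for shard in shards]  # one map task per shard
result = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'apple': 2, 'banana': 2, 'cherry': 1}
```

In a real engine the three stages run on different machines and the shuffle crosses the network; the point here is only the shape of the computation.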
But the era is changing: people are starting to distrust these giants and want ownership of their own data. This desire has driven the birth of decentralized storage infrastructure, a key part of the Web3.0 world, with Filecoin and Arweave as its pioneering representatives. A mass of data is already stored on these systems, where it will persist permanently; no central party can erase any of it. This is achieved by large numbers of decentralized servers contributing storage resources and being rewarded through some form of consensus. It is a great milestone toward Web3.0 storage infrastructure. But is it enough?
As pointed out above, data has no value if it cannot be processed. This is a huge missing piece of the existing decentralized infrastructures: it shrinks the value of the data they store and, in turn, makes the systems themselves far less valuable than they could be. We need a distributed computation framework, in the dataflow style, on top of the decentralized storage infrastructure to wake the data up and let it emit the valuable information inside; otherwise, it just sleeps there.
Designing and developing such a computation runtime on decentralized storage infrastructure is far from trivial. It may require years of work from an excellent engineering team, and new challenges will be encountered along the way. First, economic mechanisms need to be integrated into the distributed computation runtime: since the servers storing the data are decentralized, a computation sub-task running on the server that owns the needed data must pay for that server's computation resources. Second, it may require much subtler and trickier access control, because ownership of the data belongs to specific individuals rather than to a single giant company. Third, valuable information is usually extracted from structured data, which calls for a key-value-like abstraction; however, current infrastructures such as Filecoin and Arweave provide no such utility to help users organize their data in an effective format. Fourth, since structured data must be written into decentralized storage during the analysis itself, the ingest latency of the storage needs to be extremely low, and existing systems are far from satisfactory in this respect.
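As a thought experiment for the third challenge, one could layer a key-value abstraction on top of an append-only permanent store: each put appends a new immutable blob, and a mutable index maps each key to the content identifier of its latest version. This is a hedged sketch under simplifying assumptions (an in-memory stand-in for the store, SHA-256 as the content identifier), not the API of Filecoin or Arweave, neither of which offers such a layer today.

```python
import hashlib

class AppendOnlyStore:
    """Stand-in for a permanent decentralized store: blobs can only be
    appended, never modified or erased (hypothetical, in-memory)."""
    def __init__(self):
        self._blobs = {}

    def append(self, data: bytes) -> str:
        cid = hashlib.sha256(data).hexdigest()  # content identifier
        self._blobs[cid] = data
        return cid

    def fetch(self, cid: str) -> bytes:
        return self._blobs[cid]

class KVLayer:
    """Key-value abstraction over append-only storage: put() appends a new
    version; the index maps each key to the CID of its latest version."""
    def __init__(self, store: AppendOnlyStore):
        self.store = store
        self.index = {}  # key -> latest content identifier

    def put(self, key: str, value: bytes) -> str:
        cid = self.store.append(value)
        self.index[key] = cid  # older versions remain in the store forever
        return cid

    def get(self, key: str) -> bytes:
        return self.store.fetch(self.index[key])

kv = KVLayer(AppendOnlyStore())
kv.put("user:42/profile", b'{"name": "alice"}')
print(kv.get("user:42/profile"))  # b'{"name": "alice"}'
```

A real design would also have to decentralize the index itself and enforce the per-user access control discussed above; the sketch only shows why a key-value layer composes naturally with immutable storage.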
There may be many other challenges on the road to this goal, but I believe that one day it will be reached by some brave explorers, because the Web3.0 world should not be a place where the real value of valuable data stays buried.