No fancy GPUs or complex installs are needed: you can complete this guide in under 5 minutes.
This guide covers how to generate a Protein Data Bank (PDB) structure file from an input amino acid sequence using a decentralized computing network. The only requirement is running a bash install script. The technologies highlighted in this guide are the InterPlanetary File System (IPFS), Bacalhau, and Evolutionary Scale Modeling (ESM) Fold.
Biologics are therapeutics derived from living organisms. These therapeutics are much larger than traditional small molecules: a typical small molecule like aspirin is around 20 atoms, whereas a typical monoclonal antibody (the most common biologic by revenue) is around 25,000 atoms. This size allows much more complexity in building specific mechanisms of action to tackle challenging diseases. Biologics are having a growing impact on pharmaceuticals and are expected to surpass small molecules in revenue around the year 2027. The additional complexity of biologics makes computer-aided design even more important for efficiently navigating the large design search space. The release of protein folding models by large tech companies like Alphabet and Meta is just one example of growing interest in the space. This tutorial uses Meta’s ESM protein folding model because the data required to run the trained model, 10GB, fits into a Docker image for Bacalhau, whereas the AlphaFold model requires 2TB of data.
Common methods in computer-aided protein design, such as force field molecular dynamics and deep learning, require high-throughput computing. Most small groups or individuals do not have the resources to set up a compute cluster at the start of a project. Cloud computing services such as Amazon Web Services (AWS) make it easy for anyone to rent exactly the computational resources their project needs, which has helped many software startups scale quickly.

Cloud providers also offer an ecosystem of software tools for managing compute infrastructure. Although this proprietary tooling is easy to use, it makes moving projects away from the cloud provider technically time-consuming. Additionally, the costs can add up quickly, especially for the GPU resources that are most efficient for many comp bio programs. For example, the AWS g4dn.4xlarge instance that I used to develop the software behind this tutorial costs roughly $1.20 an hour, or $876 a month. Buying a new computer with similar hardware would currently cost me around $7,500, so I would break even in less than a year if I bought instead of rented.

Additionally, the technology behind peer-to-peer computing networks allows the development of incentive systems for completing jobs, letting developers earn money on their infrastructure when it is not in use. The company behind the decentralized computing network used in this guide has already successfully launched an incentive layer for decentralized data storage called Filecoin. Overall, the open tooling behind decentralized infrastructure helps change computing infrastructure from a privatized expense to a common good for developers.
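The break-even arithmetic above can be sanity-checked in a few lines of Python (the figures are the estimates from the text, not exact AWS pricing):

```python
hourly_rate = 1.20                      # AWS g4dn.4xlarge, approx. USD per hour
monthly_cost = hourly_rate * 24 * 30.4  # hours in an average month -> ~876 USD
hardware_cost = 7500                    # comparable workstation estimate, USD

months_to_break_even = hardware_cost / monthly_cost
print(f"~${monthly_cost:.0f}/month, break even in {months_to_break_even:.1f} months")
```

At these rates, buying pays for itself in well under a year of continuous use.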
Bacalhau is a peer-to-peer compute system that can run any public Docker container and access data on IPFS. The script below installs the command-line tool for submitting jobs to the network and downloading results.
curl -sL https://get.bacalhau.org/install.sh | bash
First, set the sequence input to any amino acid sequence of interest.
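For example, in a bash shell (the peptide below is only a placeholder I made up for illustration; substitute your own sequence, such as the GFP sequence used for the example output later in this tutorial):

```shell
# Placeholder amino acid sequence in single-letter codes; replace with
# your sequence of interest (the tutorial's example output uses GFP).
sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
echo "Sequence length: ${#sequence}"
```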
Then use the Bacalhau command-line interface to submit the job to the network. Openzyme has a public Docker image on Docker Hub with ESM installed and a Python wrapper script. Both the Dockerfile and the code for this interface are in the Openzyme repository. Bacalhau uses this Docker image to complete the job.
bacalhau docker run --gpu 1 --memory 30gb openzyme/compbio:a0.3 python ./workflows/fold-protein.py $sequence
You should see an output similar to below:
Job successfully submitted. Job ID: 3ba00839-b8bf-4558-9e6b-f1ab51badd1e
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):
	Creating job for submission ... done ✅
	Finding node(s) for the job ... done ✅
	Node accepted the job ... done ✅
	Job finished, verifying results ... done ✅
	Results accepted, publishing ...
Job Results By Node:

To download the results, execute:
  bacalhau get 3ba00839-b8bf-4558-9e6b-f1ab51badd1e
To get more details about the run, execute:
  bacalhau describe 3ba00839-b8bf-4558-9e6b-f1ab51badd1e
Next, download the results using the job ID from the output above:

bacalhau get <job-id>
You should see an output similar to below:
Fetching results of job '8dcb1258-32d0-4cd0-a27f-e4d8910f02a4'...
Results for job '8dcb1258-32d0-4cd0-a27f-e4d8910f02a4' have been written to...
/home/ubuntu/job-8dcb1258
Note the directory path of the result download.
If you get an error about "failed to sufficiently increase receive buffer size", try the below command:
sudo sysctl -w net.core.rmem_max=2500000
Lastly, navigate to ./combined_results/outputs/result.pdb inside the printed output directory, for example ./job-8dcb1258/combined_results/outputs/result.pdb. I recommend using Molstar.org to view the PDB file output. Molstar even has a great extension for VSCode.
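If you would like a quick sanity check before opening a viewer, a few lines of Python can count the atom records in a PDB file. This is a minimal sketch; in practice, read the text of your downloaded result.pdb instead of the inline example:

```python
def count_atoms(pdb_text: str) -> int:
    # Each "ATOM" record in a PDB file describes one atom's coordinates.
    return sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))

# Tiny two-atom example standing in for the contents of result.pdb.
example = "ATOM      1  N   MET A   1\nATOM      2  CA  MET A   1\nTER\nEND"
print(count_atoms(example))  # -> 2
```

A folded protein of a few hundred residues should yield on the order of thousands of atom records.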
Congrats, you made it to the end of the tutorial 🎉.
An astute reader may have noticed that the amino acid sequence used in this example is for green fluorescent protein (GFP) and that the predicted output structure is only partially correct compared to the experimentally determined structure. Future posts will add more comp bio features and cover analyzing output results for confidence in their correctness. Additionally, the current interface between comp bio workflows and Bacalhau is involved and cumbersome. Future work will improve this user experience.
Give a follow here or on Twitter if you would like to stay up to date with the development of comp bio on decentralized computing.
Additionally, if you found this code tutorial interesting or helpful then please star the Openzyme repo.