Monitor Nodes With Just The Graph

Introduction

I am not a server management professional, but I’ve managed to teach myself a few things over the years. Running a node involves a lot of system management, because 100% uptime is what the clients of any protocol expect. There are highly advanced process management tools from AWS and others that let you look directly at the system and decide whether it is behaving the way you want. But, for a quick and dirty solution, I’ll use The Graph’s hosted API, a simple Python script, and cron to manage a node instead, simply checking whether or not the system is active on the network at a given moment. The nice aspect of this approach is that it generalizes to any system running on or with a chain that has a graph deployment; the downside is that standalone blockchains, like Arweave or Pocket, do not apply here. I’ll use a Livepeer node as an example, but there are many other systems that are similar and potentially applicable (such as The Graph itself).

Initial Setup

First, we’ll need a node actually set up and running. This involves securing the necessary hardware and the capital for staking your tokens with said node. Once set up with a wallet, some tokens to stake, and an RPC endpoint, everything should be ready to go.

TheGraph

The Graph is a really great system because it exposes all kinds of data about what is going on in smart contracts through an API that you query with GraphQL. With traditional API requests, which many are used to, you get all the information the server has for a particular request, potentially bounded by a few parameters. This is fine for something like price data, which tends to be a pretty small dataset, but it can still exceed how much bandwidth someone actually wants to allocate to you. As a result, many APIs paginate their results, making it more time consuming to get at exactly the data you want. Enter GraphQL, which makes querying big datasets, like the entire state history of a smart contract, feel like querying a database. You construct a JSON-like object that specifies exactly which data you need: the field names, the block number, the address in question, and so on. This way we can easily sort through the volumes of data a smart contract generates every few seconds on the Ethereum main chain.

The Graph itself is a decentralized network made up of nodes that index these data for different smart contracts and stake their GRT tokens to ensure that they play by the rules. Alongside the decentralized network, which is now live but still in the process of bootstrapping, there is also a hosted service that gives the user free access to most of the data on the network through a single API endpoint. All that is needed is to create a POST request with the GraphQL query as the payload and send it to the API. Today, many popular DeFi frontends, including the Livepeer Explorer and Uniswap’s info.uniswap.org, are powered by this service.
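
To make this concrete, here is a minimal sketch, using Python’s requests library, of what such a request looks like against the same Livepeer subgraph endpoint the monitoring script below uses; it asks for the protocol’s current round:

import requests

# The Graph hosted-service endpoint for the Livepeer subgraph on Arbitrum One
# (the same endpoint the monitoring script below uses).
url = "https://api.thegraph.com/subgraphs/name/livepeer/arbitrum-one"

# A GraphQL query is just a string describing exactly which fields we want.
query = """
query protocols {
  protocol(id: "0") {
    id
    currentRound {
      id
    }
  }
}
"""

# The query goes out as the JSON payload of a POST request.
response = requests.post(url, json={"query": query, "variables": {}})
print(response.json())
# -> {"data": {"protocol": {"id": "0", "currentRound": {"id": "<current round>"}}}}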

Monitoring

The monitor for the Livepeer node will consist of three parts: a shell script that can turn the node on or off, a Python script that checks The Graph’s API for us, and a cron job that runs at regular intervals.

The Shell Script

This can be as complicated or as simple as it needs to be. For me, all that is needed is to run /path/to/livepeer {runtime parameters}, where the parameters include things like the password to unlock the wallet keystore, the RPC URL for Arbitrum One, and the directory where the data are stored. For obvious reasons I won’t put that information in here. What will become important later is that all file paths must be absolute paths from root, or else cron will not function properly.

The Python Script

The script needs to do two things: query the appropriate subgraph, then check whether the node needs a restart. The first of these will be accomplished by making a POST request to the livepeer/arbitrum-one subgraph on the hosted service (note Livepeer recently migrated to Arbitrum as an L2 solution). Next, we need to build a query for this subgraph. Livepeer does not have a pulse that is pushed on-chain (although this would be cool, especially now that there are cheap tx via Arbitrum), but what we can check is whether the node has submitted its reward tx in the current reward round, each of which lasts about 17 hours. Of course, I want the node to be running more than once every 17 hours (preferably all the time), so we will check far more frequently than this when the cron job is set up. However, crashes, usually due to overloading the RPC endpoint or the resources of the server itself, are infrequent enough (from my own observations running this node for more than a year) that the right checking frequency translates to really good uptime.

To get the information I want, I will do two queries: one to find out what the current round of the protocol is, and one to ask the Orchestrator in question what its last reward round was. If these numbers are the same, then everything is good. If not, we need to restart the node. I wrote the following queries:

"query protocols {
   protocol(id: \"0\") {
     id
     currentRound {
       id
     }
  }
}"

and

"query transcoders {
   transcoder(id: \"<TRANSCODER_ETH_ADDR>\") {
     lastRewardRound {
       id 
    }
  }
}"

Then, parsing the result (which is returned as JSON), we obtain two integers, which hopefully are the same. If they are not, all that is needed is a call to os.system('bash /path/to/shell/script') to restart the node. The shell script really is an extra step, but it keeps the various runtime parameters out of the Python, letting it look nicer.
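
For reference, the responses come back shaped roughly like the objects below (the round numbers here are placeholders), and the core of the check boils down to one comparison and one os.system call; the full script at the end wraps this in error handling and logging:

import os

# Illustrative response shapes; the round numbers are placeholders.
protocol_result = {"data": {"protocol": {"id": "0", "currentRound": {"id": "2500"}}}}
transcoder_result = {"data": {"transcoder": {"lastRewardRound": {"id": "2499"}}}}

current_round = int(protocol_result["data"]["protocol"]["currentRound"]["id"])
last_reward_round = int(transcoder_result["data"]["transcoder"]["lastRewardRound"]["id"])

if last_reward_round < current_round:
    # The orchestrator has not called reward this round, so restart the node.
    os.system("bash /path/to/shell/script")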

The Cron Job

Cron is the task scheduler for Unix. It allows you to easily create tasks that the system will execute at specified times. All you have to do is edit the crontab file by running crontab -e, then add a line to the bottom of the file. Because jobs are run as if there were an imaginary user sitting at / on your system, you need to specify absolute paths to everything. I use crontab.guru to define the schedule I want for the job. You just need to type ‘cron job every x hours’ in plain English and the result is the right series of numbers and asterisks that produce it. For example, a job every 30 minutes would be */30 * * * * /usr/bin/python3.6 /path/to/script.py. I’ve found running this job every 30 minutes to 2 hours or so to be optimal, because the node’s typical run time between crashes exceeds one reward round by a lot. So when the script catches a crash, the node will not have been down long, and we avoid sending too many requests to The Graph’s servers.

Wrapping Up

With all this together, I can rest easy knowing the Livepeer node is running the way it should. While this is not the most professional solution to a monitoring problem, this very simple system is a way for me to accomplish the goal in terms that I clearly understand. Additionally, there are a number of other use cases, such as historical reward and fee monitoring, that could be built out from this base. The rough idea should also generalize to other protocols, and it fits nicely alongside other analytics tools I have built on top of The Graph. Hopefully this is helpful to someone out there!

The full script:

import os
from datetime import datetime

import requests

# The Graph's hosted-service endpoint for the Livepeer subgraph on Arbitrum One.
url = "https://api.thegraph.com/subgraphs/name/livepeer/arbitrum-one"

# Every check appends a line here so there is a record of restarts.
log_path = "/home/ubuntu/cron_log.txt"

def get_current_round():
    """Return the protocol's current reward round, or None if the query fails."""
    query = """
    query protocols {
      protocol(id: "0") {
        id
        currentRound {
          id
        }
      }
    }
    """
    response = requests.post(url, json={"query": query, "variables": {}})
    try:
        return int(response.json()['data']['protocol']['currentRound']['id'])
    except (KeyError, TypeError):
        return None

def get_orchestrator_current_round():
    """Return the last round in which the orchestrator called reward, or None."""
    query = """
    query transcoders {
      transcoder(id: "<TRANSCODER_ADDRESS>") {
        lastRewardRound {
          id
        }
      }
    }
    """
    response = requests.post(url, json={"query": query, "variables": {}})
    try:
        return int(response.json()['data']['transcoder']['lastRewardRound']['id'])
    except (KeyError, TypeError):
        return None

def log(status):
    """Append a timestamped status line to the log file."""
    with open(log_path, 'a') as f:
        f.write(f"Time: {datetime.now()} Status: {status}\n")

def check():
    protocol_round = get_current_round()
    orchestrator_round = get_orchestrator_current_round()
    if protocol_round is None or orchestrator_round is None:
        # A query failed; don't restart on bad data, just note it.
        log("QUERY_FAILED")
        return -1
    if orchestrator_round < protocol_round:
        # The orchestrator has not called reward this round: restart the node.
        os.system('nohup bash /path/to/LIVEPEER_RUN_SCRIPT.sh &')
        log("RESTART")
        return -3
    log("OK")
    return 0

if __name__ == "__main__":
    check()