Monitoring multiple Celestia nodes

This will cover an attempt at monitoring a series of light nodes and full storage nodes across different hardware profiles, hosted locally and externally.

To use any of the monitoring tools discussed here, see the setup details in the Monitoring Setup section below.

Dashboard Overview - All Connected Nodes

Hardware - Nodes

The idea is to set up light nodes and full storage nodes across different hardware profiles and to set up monitoring for them.

Full Storage Nodes

Light Nodes

Node Setup

To quickly deploy nodes for testing, I used my `multi-client` deployment scripts.

This was useful for deploying and re-deploying nodes quickly across different hardware types for testing, such as ARM-based devices (PinePhone / RPi 4).

Monitoring Setup

SNMP (Simple Network Management Protocol): for hardware-based monitoring.

SNMP is a widely used protocol designed for managing devices on IP networks. It acts as a remote probe and can be deployed to monitor most devices.

The SNMP daemon can be deployed to any device (the following is for Linux systems):

sudo apt update
sudo apt install snmpd snmp libsnmp-dev

See example configuration settings for connecting to local network devices here:

To Open Externally:

To connect monitoring to servers outside the local network, edit the config file:

nano /etc/snmp/snmpd.conf

Under the rocommunity section, allow access for localhost and for the server you want to connect to.
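A minimal sketch of the relevant snmpd.conf lines; the community string `public` and the placeholder IP 203.0.113.10 for the monitoring server are assumptions, so use your own values:

# /etc/snmp/snmpd.conf (sketch)
agentAddress udp:161                 # listen on all IPv4 interfaces instead of localhost only
rocommunity public localhost         # allow local queries
rocommunity public 203.0.113.10      # allow the external monitoring server (placeholder IP)

Then restart the daemon so the changes take effect:

sudo systemctl restart snmpd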

Ensure port `161` is open on the device and on the monitoring device

ufw allow 161
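SNMP uses UDP, so if you want a tighter rule you can restrict it to UDP and to the monitoring server's address (placeholder IP again):

sudo ufw allow from 203.0.113.10 to any port 161 proto udp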

PRTG: for alerts and dashboards

PRTG is a powerful Windows-based network monitoring tool that can monitor any device with an IP address. SNMP is an open and widely supported protocol for hardware-based monitoring, and PRTG is one example of a monitoring dashboard compatible with SNMP (as well as other protocols).

Download the PRTG server to a Windows device:

Setup: it is very straightforward to add local network devices.
Set up an account (default login: prtgadmin / prtgadmin), simply ‘add devices’ using the local IP, and use ‘recommend sensors’ to auto-discover what is available. With SNMP enabled on the target device, the sensors should appear for selection.

Connecting the External Sensors

This is more difficult to set up; some extra configuration is required (also see the SNMP side above).

Windows Firewall settings: add rules to allow incoming and outgoing traffic on port 443, and allow port 161, which is used for SNMP.

Port forwarding: if the Windows server is locally hosted, forward ports 161 and 443 via the router settings to the local IP of the PRTG server.

Notes on Windows settings: it is worth reviewing the network settings, as well as the sleep and auto-update settings, so the machine stays reachable.

In PRTG settings: ensure that under ‘probe connection settings’, ‘all IP addresses available on this computer’ is selected.

Selecting Sensors

There may be a lot of redundant or duplicate sensors, so this was reduced to a list of the most useful ones.

Arrange devices into groups

So long as PRTG remains active, sensor data will be logged and can be reviewed. Examples:

System memory usage - from PinePhone running light client

Select device > sensor. Example: CPU Load

CPU load over 7 days – Light node Device 4

Monitoring Celestia Service - Liveness

One such problem, the service failing to start or restarting, can be a frequent occurrence, especially during testnets. An example is an issue I encountered myself during ‘blockspacerace’.

It is useful to monitor the service liveness directly, so that when the node itself encounters errors without any hardware-based failure, you can be alerted immediately.

SNMP monitors system sensors and cannot tell whether the celestia-node service is active (the same applies if deployed with Docker).

This can be achieved by setting up a script on the server that simply checks whether celestia-full.service is active or in an error state.

Run `sudo systemctl status celestia-full` and output 1 for active or 0 for inactive.
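A minimal sketch of such a check, assuming the systemd unit is named celestia-full.service (adjust for light nodes):

#!/bin/bash
# Output 1 if the celestia-full service is active, 0 otherwise.
if systemctl is-active --quiet celestia-full.service; then
    echo 1
else
    echo 0
fi

Depending on the PRTG sensor type used (for example an SSH Script sensor), the output may need to be wrapped in the returncode:value:message format that PRTG expects.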

Then have PRTG run the script by adding it as a custom sensor and connecting via SSH.

Monitoring DA node performance

I wanted to monitor DA node metrics across the setup in order to compare performance between devices.

This uses the RPC API to query node metrics directly on the device; there are many RPC metrics available.

Sampling Stats

export CELESTIA_NODE_AUTH_TOKEN=$(celestia full auth admin --p2p.network blockspacerace)
celestia rpc das SamplingStats

Celestia data availability nodes perform sampling on the data availability network, and this is a good measure of performance: nodes that are struggling to sync will have trouble keeping up, and `head_of_sampled_chain` will be far behind `network_head_height`.
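As a rough illustration, the sampling lag can be computed from those two fields (assuming jq is installed and that the CLI prints the JSON-RPC response with a result object):

celestia rpc das SamplingStats | jq '.result.network_head_height - .result.head_of_sampled_chain'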

A script was set up to query the API every 15 minutes and store the results in a logfile; this can be deployed to any DA node to capture the data so it can be reviewed and graphed later.
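A sketch of such a logger, assuming a full node on the blockspacerace network (the script name and logfile path are placeholders):

#!/bin/bash
# sampling-stats-logger.sh: append a timestamped SamplingStats snapshot to a logfile
export CELESTIA_NODE_AUTH_TOKEN=$(celestia full auth admin --p2p.network blockspacerace)
LOGFILE="$HOME/sampling_stats.log"
echo "$(date -u +%FT%TZ) $(celestia rpc das SamplingStats)" >> "$LOGFILE"

An example crontab entry to run it every 15 minutes:

*/15 * * * * /home/user/sampling-stats-logger.sh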


Monitoring Analysis

The second part of this document is an attempt at analysing the data captured using the setup in part 1. It was difficult to get anything really concrete due to problems encountered with remote access: despite the setup working and hardware metrics being accessible externally, I found that alerts need to be properly utilised, access properly configured, and better standards applied in order to monitor and maintain a node cluster.

1. CPU Temp problem - ARM devices

With the ARM devices I noticed regular downtime, with gaps in monitoring and then the devices being off.

SNMP CPU (Mobian left and RPi right). Red is downtime where the device would have shut down.

Coupled with the high CPU usage seen on the charts and the devices being very hot when physically checked, it seemed that the ARM devices did not like running light clients, which put the CPU under a high load.

What seemed to curb this behaviour was adding more extreme cooling: I had to remove the back cover and place the phone on top of a fan, and the Pi, despite having a heatsink enclosure, had to be placed in a PC enclosure with more cooling.

Note on the version: this was running v0.9.1, which was known to have shrex errors causing higher than normal CPU usage.

2. Updating Nodes

Updating and the effect on CPU

Upgrading devices seems to have an inconsistent effect on CPU load. v0.9.3 did not entirely fix the shrex errors (as seen in the release notes), and as seen in the captures below, some cases show worse CPU load and some a reduction.

Full Storage Nodes

Light Nodes

It’s hard to tell, but this may be the result of other, more device-related processes, which makes sense for non-dedicated devices (PinePhone / Steam Deck).

3. General Metrics comparison

Running v0.9.3 and monitoring while abroad had many issues, such as downtime and setup problems. This shows the importance of properly setting up alerts and being able to address them: static IPs should have been set for local devices to avoid losing access because of port-forwarding rules, and alerts should have been set up via PRTG rather than relying on manual checks.

Light Nodes

Device 4: VPS

Running v0.9.3 and the v0.9.4 upgrade, captured over 2 days.

Device 6: PinePhone

Mobian was in a down state, hence the large gaps, until it was restarted. The load was high while playing catch-up, and the upgrade spike (like the others) pales compared to the high load under these conditions; I'm still not sure what went wrong here.

Device 5: Pi 4b

Memory was consistent during v0.9.3 operation for 2 days. Note: the Pi chart is messy; this is noise from the downtime field (it should have been off and is incorrect). RPi CPU load is much more volatile compared to the VPS.

Full Nodes

Running v0.9.4 and the v0.9.5 upgrade.

v0.9.5 is the upgrade which fixed many of the previous issues, such as shrex errors leading to higher CPU load.

4. PFB Transactions

I set up automated PFBs on the Celestia nodes every minute to see the effect on hardware strain and performance.

This uses a script I made, available here: https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/payforblob
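For example, a crontab entry along these lines will fire the script every minute and keep a logfile (the script name, path, and logfile are placeholders; use whatever the repo provides):

* * * * * /home/user/celestia-node-scripts/multi-network/payforblob/payforblob.sh >> "$HOME/pfb.log" 2>&1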

This was active since the v0.9.4 upgrade on the light nodes, and there was no apparent or noticeable effect on any hardware metrics. Perhaps single PFBs at minute intervals are not enough, though I suspect it's mostly the difficulties with versions and trying to capture data remotely.

Note: as all PFBs are saved to a logfile, I've noticed that there are occasional failed PFBs; there might be some insight here to look into at a later date.

5. DAS Performance comparison

This was extremely difficult to capture and compare. Because of the number of frequent updates and the troubleshooting needed to get monitoring and access right, I had to settle (for now) for capturing a 24-hour sample on the recently updated v0.9.5.

Full Nodes:

Note: difference in time zone, UTC / local UTC-1

Device 2: ThinkCentre on the left and VPS on the right. It seemed the locally run server had some trouble keeping up, but it was not far behind and eventually caught up.

Note: data from both the RPi and Steam Deck were omitted due to issues with remote access at the time.

Checking when home – days later

The Deck was still on v0.9.4 as it could not be upgraded at the time. Once the v0.9.5 update was applied and the nodes were allowed to sync, there was no real difference in performance between running locally and on a virtual private server, although it would be interesting to get a longer capture and graph the output. It is still the case that the dedicated server was the most stable throughout.

Light Nodes:

Problems: as mentioned above, this is still an early testnet, so it is hard to tell whether the lag in sampling was device-related or caused by the PFBs. While the phone had issues, the light node run on the VPS (device 4) had near-perfect sampling stats.

Example from the PinePhone: the sampled head is stuck at 489,908.

This seemed to be cleared by running an unsafe reset, but the node then lagged very far behind.

After v0.9.5 and reset

It seemed to fix itself after being allowed to run on its own for a while.

Conclusion:

The lessons here are about the importance of a properly set up configuration for system monitoring of node clusters, and of managing remote data access for maintenance. The setup covered in the first part of this document is a good starting point for monitoring multiple nodes, but it needs to be further optimised.

This needs to run for a longer period on a stable release, with the same version across all devices, to be able to graph, compare, and get something more valuable.

For both node types it is the dedicated server that appears to have the best performance, which is not surprising.

I thought spamming PFBs would have a more noticeable effect, but this appears not to have been the case, although there were failed PFBs in the logfiles, which might be worth looking into later.
