This document covers an attempt at monitoring a series of light nodes and full storage nodes across different hardware profiles, hosted locally and externally.
To use any of the monitoring tools discussed here see:
Dashboard Overview - All Connected Nodes
The idea: set up light nodes and full storage nodes across different hardware profiles, and set up monitoring for them.
Full Storage Nodes
Light Nodes
To deploy nodes quickly for testing I used my `multi-client` deployment scripts.
These were useful for deploying and re-deploying nodes quickly across different hardware types, such as ARM-based devices (PinePhone / Raspberry Pi 4).
SNMP (Simple Network Management Protocol): for hardware-based monitoring.
SNMP is a widely used protocol designed for managing devices on IP networks, and a remote probe can be deployed to monitor most devices. To install the agent on Linux systems:
sudo apt update
sudo apt install snmpd snmp libsnmp-dev
See example configuration settings for connecting to local network devices here:
To Open Externally:
To connect monitoring to servers outside the network, edit the config file:
nano /etc/snmp/snmpd.conf
Under the `rocommunity` section, allow access for localhost and for the server that will connect.
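For example, a minimal sketch of the relevant lines (the community string `public` and the external IP address are placeholders for your own values):

```
# allow read-only queries from this machine
rocommunity public localhost
# allow read-only queries from the external monitoring server
rocommunity public 203.0.113.10
```

After editing, restart the agent with `sudo systemctl restart snmpd`; you can then verify from the monitoring host with `snmpwalk -v2c -c public <device-ip> system`.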
Ensure port `161` is open on the target device and on the monitoring device:
ufw allow 161
PRTG: for alerts and dashboards
PRTG is powerful Windows-based network monitoring software that can monitor any device with an IP address. SNMP is an open, widely supported protocol for hardware-based monitoring, and PRTG is one example of a monitoring dashboard compatible with SNMP (among other protocols).
Download the PRTG server to a Windows device:
Setup: adding local network devices is straightforward. Set up an account (default login: `prtgadmin` / `prtgadmin`), then simply ‘add devices’ using the local IP and run ‘recommend sensors’ to auto-discover what is available. With SNMP enabled on the target device, the sensors should appear for selection.
Connecting the external sensors is more difficult to set up; some extra configuration is required (also see the SNMP side):
Windows Firewall settings: add rules to allow incoming and outgoing traffic on port 443, and allow port 161, used for SNMP.
Port forwarding: if the Windows server is locally hosted, forward ports 161 and 443 via your router settings to the local IP of the PRTG server.
Notes on Windows settings: I recommend reviewing the network, sleep, and auto-update settings, as any of these can interrupt monitoring.
In PRTG settings: ensure that under ‘Probe Connection Settings’, ‘all IP addresses available on this computer’ is selected.
Selecting Sensors
There may be many redundant or duplicate sensors; this is a reduced list of the most useful:
Arrange devices into groups
As long as PRTG remains active, sensor data will be logged and can be reviewed. Examples:
Select device > sensor: example: CPU Load
Problems such as the service failing to start or restarting can be frequent occurrences, especially during testnets; here is an example issue that I encountered myself during ‘blockspacerace’.
It is useful to monitor service liveness directly, so that when the node itself encounters errors, without any hardware-based failure, you can be alerted immediately.
SNMP monitors system sensors and cannot tell whether the Celestia node service is active (the same applies if deployed with Docker).
This can be achieved by setting up a script on the server; the script simply checks whether `celestia-full.service` is active or in an error state:
sudo systemctl status celestia-full
and outputs 1 for active or 0 for inactive.
Then have PRTG run the script by adding it as a custom sensor and connecting via SSH.
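A minimal sketch of such a check, assuming a systemd unit named `celestia-full` (the `service_up` helper name is my own):

```shell
#!/bin/sh
# Prints 1 if the given systemd unit is active, 0 otherwise.
# Errors (e.g. a missing systemctl) are suppressed so the check fails safe.
service_up() {
  if systemctl is-active --quiet "$1" 2>/dev/null; then
    echo 1
  else
    echo 0
  fi
}

# Check the celestia-full service
service_up celestia-full
```

Note that PRTG’s SSH Script sensor expects output in the form `returncode:value:message`, so for PRTG the echo lines would become e.g. `echo "0:1:celestia-full active"`.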
I also wanted to monitor DA node metrics across the setup, in order to compare the nodes’ performance against each other,
using the RPC API to query node metrics directly on the device. There are many RPC metrics available here:
Sampling Stats
export CELESTIA_NODE_AUTH_TOKEN=$(celestia full auth admin --p2p.network blockspacerace)
celestia rpc das SamplingStats
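The output of this call includes, among other fields, the two discussed here; a rough illustrative sketch (values made up):

```
{
  "head_of_sampled_chain": 489000,
  "network_head_height": 512000
}
```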
Celestia data availability nodes perform sampling on the data availability network, and this is a good measure of performance: nodes that are struggling to sync will have trouble keeping up, and `head_of_sampled_chain`
will be far behind `network_head_height`.
A script was set up to query the API every 15 minutes and store the results in a logfile. This can be deployed to any DA node to capture the data so it can be reviewed and graphed later.
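A minimal sketch of such a logging script (the logfile path is a placeholder, and the auth-token line assumes the `blockspacerace` setup shown above):

```shell
#!/bin/sh
# log_sampling_stats.sh - append a timestamped SamplingStats snapshot to a logfile.
LOGFILE="${LOGFILE:-$HOME/sampling_stats.log}"

# Auth token for the RPC, as in the setup above (errors suppressed if the CLI is absent).
export CELESTIA_NODE_AUTH_TOKEN="${CELESTIA_NODE_AUTH_TOKEN:-$(celestia full auth admin --p2p.network blockspacerace 2>/dev/null)}"

# Query the node; if the node or CLI is unavailable, record that instead,
# so gaps in the data remain visible in the log.
STATS=$(celestia rpc das SamplingStats 2>/dev/null) \
  || STATS='{"error":"rpc unavailable"}'

printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$STATS" >> "$LOGFILE"
```

Scheduled every 15 minutes via cron with a line such as `*/15 * * * * /path/to/log_sampling_stats.sh`.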
The second part of this document is an attempt at analysing the data captured using the setup from part 1. It was difficult to get anything really concrete due to problems encountered with remote access: despite the setup working and hardware metrics being accessible externally, I found that alerts need to be properly utilised, access properly configured, and better standards followed in order to monitor and maintain a node cluster.
With the ARM devices I noticed regular downtime: gaps in monitoring, and then the devices being off.
Coupled with high CPU usage (as seen on the charts) and the devices being very hot when physically checked, it seemed that running light clients put the ARM devices’ CPUs under high load.
What seemed to curb this behaviour was more extreme cooling: I had to remove the back cover and place the phone on top of a fan, and the Pi, despite having a heat-sink enclosure, had to be placed in a PC enclosure with more cooling.
Note on the version: this was running v0.9.1, which was known to have `shrex` errors causing higher-than-normal CPU usage.
Updating and its effect on CPU
Upgrading devices seems to have a variable effect on CPU load. v0.9.3 did not entirely fix the `shrex` errors (as noted in the release notes), but as seen in the captures below, some cases showed worse CPU load and others a reduction.
Full Storage Nodes
Light Nodes
It’s hard to tell, but this may be the result of other, more device-related processes, which would make sense for non-dedicated devices (PinePhone / Steam Deck).
Running v0.9.3 and monitoring while abroad brought many issues, such as downtime and setup problems. This shows the importance of properly setting up alerts and being able to address them: static IPs should have been set for local devices to avoid losing access through port-forwarding rules, and alerts should have been set up via PRTG rather than relying on manual checks.
Light Nodes
Running v0.9.3 and the v0.9.4 upgrade, captured over 2 days.
Mobian was in a down state (hence the large gaps) until restarted, and under high load while playing catch-up. The upgrade spike (like the others) pales in comparison to the high load under these conditions; I’m still not sure what went wrong here.
Memory was consistent during v0.9.3 operation over 2 days. Note: the Pi chart is messy; this is noise from the downtime field (the device should have registered as off, so the data is incorrect). RPi CPU load is much more volatile compared to the VPS.
Full Nodes
Running v0.9.4 and v0.9.5 upgrade
v0.9.5 is the upgrade that fixed many of the previous issues, such as the `shrex` errors leading to higher CPU load.
I set up automated PFBs on the Celestia nodes every minute to see the effects on hardware strain and performance,
using a script I made here: https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/payforblob
This was active since the v0.9.4 upgrade on the light nodes, and there was no apparent or noticeable effect on any hardware metrics. Perhaps single PFBs at one-minute intervals are not enough; I suspect it’s mostly down to the version difficulties and trying to capture data remotely.
Note: as all PFBs are saved to a logfile, I noticed occasional failed PFBs; there might be some insight there to look into at a later date.
This was extremely difficult to capture and compare because of the frequent updates and the device troubleshooting needed to get monitoring and access right. I had to settle (for now) for capturing a 24-hour sample on the recently updated v0.9.5.
Full Nodes:
Device 2: ThinkCentre on the left and VPS on the right. The locally run server seemed to have some trouble keeping up, but it was not far behind and eventually caught up.
NOTE: data from both the RPi and the Steam Deck were omitted due to issues with remote access at the time.
Checking when home – days later
The Deck was still on v0.9.4, as I was unable to upgrade it at the time. Once the v0.9.5 update was applied and the nodes were allowed to sync, there was no real performance difference between running locally and on a virtual private server. It would be interesting, though, to get a longer capture and graph the output. It is still the case that the dedicated server was the most stable throughout.
Light Nodes:
Problems: as mentioned above, this is still an early testnet, so it is hard to tell whether the lag in sampling was device-related or caused by the PFBs. While the phone had issues, the light node run on the VPS (device 4) had near-perfect sampling stats.
Example from the PinePhone: the sampled height was stuck at 489,908,
but this seemed to have been cleared by running an unsafe reset.
The node then lagged very far behind,
but seemed to fix itself after being allowed to run on its own for a while.
Conclusion:
Lessons were learned on the importance of properly configuring system monitoring for node clusters, and of managing remote data access for maintenance. The setup covered in the first part of this document is a good starting point for monitoring multiple nodes, but it needs to be further optimised.
This needs to run for a longer period on a stable release, with the same version across all devices, to be able to graph, compare, and get something more valuable.
For both node types it’s the dedicated server that appears to have the best performance, which is not surprising.
I thought spamming PFBs would have a more noticeable effect, but it appears this was not the case, although there were failed PFBs in the logfiles which might be worth looking into later.