MultiversX

Mainnet Gateway

AWS:

  • m5a.2xlarge or any equivalent instance type

Bare Metal:

  • 32GB RAM

  • 8 vCPUs

  • At least 1 TB of storage - make sure it's extendable
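Before installing anything, it can be useful to compare the machine against the minimums listed above. The sketch below is only an illustration: the function names check_min and preflight are hypothetical, and the RAM and disk thresholds are set slightly below the nominal 32 GB / 1 TB on the assumption that the operating system reports somewhat less than the advertised size.

```shell
#!/bin/bash
# Minimums derived from the hardware list above. The RAM and disk values are
# deliberately a little lower than nominal (assumption, adjust as needed).
MIN_CPUS=8
MIN_MEM_GB=30
MIN_DISK_GB=900

# check_min NAME ACTUAL REQUIRED
# Prints a verdict line and returns non-zero when ACTUAL is below REQUIRED.
check_min() {
  local name="$1" actual="$2" required="$3"
  if [ "$actual" -ge "$required" ]; then
    echo "OK   ${name}: ${actual} (minimum ${required})"
  else
    echo "LOW  ${name}: ${actual} (minimum ${required})"
    return 1
  fi
}

# preflight [PATH]
# Collects CPU count, total RAM and the total size of the filesystem holding
# PATH (default /) on a Linux host, then checks each value.
preflight() {
  local path="${1:-/}" cpus mem_gb disk_gb rc=0
  cpus=$(nproc)
  mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  disk_gb=$(df -P -k "$path" | awk 'NR==2 {print int($2 / 1024 / 1024)}')
  check_min "vCPUs" "$cpus" "$MIN_CPUS" || rc=1
  check_min "RAM (GiB)" "$mem_gb" "$MIN_MEM_GB" || rc=1
  check_min "Disk (GiB)" "$disk_gb" "$MIN_DISK_GB" || rc=1
  return "$rc"
}
```

Running preflight /home/username (or whichever path will hold the blockchain data) prints one line per resource.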

Assumptions

We're going to assume you are already logged into your Virtual Machine as a privileged user or as the root user. The easiest way to set up an Observing Squad for Elrond is via the installation scripts provided by the Elrond Team.

Setup

First of all, we will need to clone the GitHub project containing this set of scripts (make sure you have git installed for this step):

git clone https://github.com/ElrondNetwork/elrond-go-scripts-mainnet.git

After cloning the repository, change into the cloned directory and edit config/variables.cfg:

CUSTOM_HOME="/home/username"
CUSTOM_USER="username"

Please replace username above with your VM user's name (e.g. ubuntu). If you're running as the root user, then CUSTOM_HOME should be set to "/root" and CUSTOM_USER should be set to "root".

If you would like to enable logging to the VM's filesystem, under the logs folder inside your user's home directory, set the following in config/variables.cfg as well:

NODE_EXTRA_FLAGS="-log-save"
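If you prefer to script these edits rather than open an editor, a small helper along the following lines could be used. This is a minimal sketch: set_cfg_var is a hypothetical function (it is not part of the installation scripts), it assumes each variable sits on its own line in KEY="VALUE" form, and it relies on GNU sed's -i option.

```shell
#!/bin/bash
# set_cfg_var FILE KEY VALUE
# Hypothetical helper: sets KEY="VALUE" in FILE, replacing an existing
# KEY=... line or appending a new one when the key is not present yet.
set_cfg_var() {
  local file="$1" key="$2" value="$3" escaped
  # Escape the characters that are special in a sed replacement (\ & |).
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=\"${escaped}\"|" "$file"
  else
    printf '%s="%s"\n' "$key" "$value" >> "$file"
  fi
}
```

For example, set_cfg_var config/variables.cfg CUSTOM_USER "ubuntu" followed by set_cfg_var config/variables.cfg CUSTOM_HOME "/home/ubuntu" would apply the two values discussed above.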

Finally, we are now ready to run the script that will set up our Observing Squad. Please make sure your CUSTOM_HOME directory exists beforehand:

./script.sh observing_squad

After the installation process is finished, the VM will now have 5 new systemd units enabled. In order to start the Observing Squad, please run:

./script.sh start

That's pretty much it. Now you have a running Observing Squad on Elrond Mainnet.

Before running any health checks, it helps to know that the Elrond blockchain is split into 4 separate shards, each handling its own set of transactions (Shard 0, Shard 1, Shard 2 and the Metachain Shard, whose ID is 4294967295). Each of the Observers that we set up runs on exactly one shard - so there is only one Shard 0 Observer, only one Shard 1 Observer and so on.

In addition to the termui tool that was installed in the previous steps, we can check their status by querying each Observer's REST API, as below:

curl http://localhost:8080/node/status | jq    # Observer - Shard 0
curl http://localhost:8081/node/status | jq    # Observer - Shard 1
curl http://localhost:8082/node/status | jq    # Observer - Shard 2
curl http://localhost:8083/node/status | jq    # Observer - Shard Metachain
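The four calls above can also be wrapped in a loop that reports which Observers respond. The following is a sketch only: fetch and observer_report are hypothetical names, and the check merely confirms that the /node/status endpoint answers on each port without inspecting the response.

```shell
#!/bin/bash
# fetch URL
# Thin wrapper around curl, kept separate so it can be replaced (for example
# in tests). --fail makes curl return non-zero on HTTP errors.
fetch() {
  curl --silent --fail --max-time 5 "$1"
}

# observer_report
# Prints one line per Observer and returns non-zero if any is unreachable.
observer_report() {
  local entry shard port rc=0
  for entry in "0:8080" "1:8081" "2:8082" "metachain:8083"; do
    shard="${entry%%:*}"
    port="${entry##*:}"
    if fetch "http://localhost:${port}/node/status" > /dev/null; then
      echo "shard ${shard} (port ${port}): reachable"
    else
      echo "shard ${shard} (port ${port}): UNREACHABLE"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling observer_report on the VM prints four lines, one per shard, and its exit code can be used in a cron job or a simple health check.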

In order to obtain a behaviour similar to https://gateway.elrond.com/, we also need to keep in mind that any requests coming from the outside to these observers first go through a component called elrond-proxy, which is responsible for balancing the requests and sending them to the appropriate observer.

The status of this component can be checked by running

journalctl -f -u elrond-proxy.service

In order to check sync status on a certain observer, we can run the following API call:

curl --silent "http://localhost:8079/network/status/<shard_id>" | jq '.data.status.erd_nonce'

and compare it to the result of:

curl --silent "https://gateway.elrond.com/network/status/<shard_id>" | jq '.data.status.erd_nonce'

If the numbers coincide, it means the Observer is fully synced.
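The comparison itself is simple arithmetic and can be scripted once both nonce values are at hand. The sketch below assumes you have already stored the two numbers in shell variables; nonce_lag and is_synced are hypothetical helper names, and the optional tolerance argument is an addition of ours, since two requests made a moment apart may straddle a new block.

```shell
#!/bin/bash
# nonce_lag LOCAL REFERENCE
# Prints how many blocks LOCAL is behind REFERENCE (0 when it is not behind).
nonce_lag() {
  local lag=$(( $2 - $1 ))
  if [ "$lag" -lt 0 ]; then
    lag=0
  fi
  echo "$lag"
}

# is_synced LOCAL REFERENCE [TOLERANCE]
# Succeeds when LOCAL is at most TOLERANCE blocks behind REFERENCE.
# TOLERANCE defaults to 0, i.e. the numbers must coincide.
is_synced() {
  local tolerance="${3:-0}"
  [ "$(nonce_lag "$1" "$2")" -le "$tolerance" ]
}
```

For instance, with the two results stored in local_nonce and reference_nonce, is_synced "$local_nonce" "$reference_nonce" 2 && echo "synced" would accept a lag of up to two blocks.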

Elrond does not support WS or RPC calls, only REST calls.

The usual REST port for Elrond Gateway is 8079.

Monitoring Guidelines

In order to maintain a healthy node that passes the Integrity Protocol's checks, you should have a monitoring system in place. Blockchain nodes usually offer metrics regarding the node's behaviour and health - a popular way to offer these metrics is Prometheus-like metrics. The most popular monitoring stack, which is also open source, consists of:

  • Prometheus - scrapes and stores metrics as time series data (blockchain nodes can expose their metrics to it);

  • Grafana - allows querying, visualization and alerting based on metrics (can use Prometheus as a data source);

  • Alertmanager - handles alerting (can use Prometheus metrics as data for creating alerts);

  • Node Exporter - exposes hardware and kernel-related metrics (can send the metrics to Prometheus).

We will assume that Prometheus/Grafana/Alertmanager are already installed (we will provide a detailed guide of how to set up monitoring and alerting with the Prometheus + Grafana stack at a later time; for now, if you do not have the stack already installed, please follow this official basic guide here).

We recommend installing the Node Exporter utility, since it offers valuable information regarding CPU, RAM & storage. This way, you will be able to spot possible hardware bottlenecks or check whether your node is underutilized - you can use these insights to make decisions about scaling the allocated hardware resources up or down.

Below, you can find a script that installs Node Exporter as a systemd service.

#!/bin/bash

# set the Node Exporter version to install (check the releases page for the latest one)
VERSION=1.3.1

# download and untar the binary
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-*.tar.gz
sudo cp ./node_exporter-${VERSION}.linux-amd64/node_exporter /usr/local/bin/

# create system user
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter

# change ownership of node exporter binary
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# remove temporary files
rm -rf ./node_exporter*

# create systemd service file (written through sudo tee, since the target directory is root-owned)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF

# enable the node exporter service and start it
sudo systemctl daemon-reload
sudo systemctl enable node_exporter.service
sudo systemctl start node_exporter.service
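
Once the service is started, it is worth confirming that Node Exporter actually serves metrics. The helper below is a minimal sketch (has_metric is a hypothetical name): it reads metrics in the Prometheus text format from standard input and checks that at least one sample line for the given metric is present.

```shell
#!/bin/bash
# has_metric NAME
# Reads Prometheus text-format metrics on stdin and succeeds if there is at
# least one sample line for NAME, i.e. a line starting with NAME followed by
# either a label set ("{") or a space. Comment lines (# HELP, # TYPE) are
# ignored because they start with "#".
has_metric() {
  grep -Eq "^$1(\{| )"
}
```

A possible usage on the node itself is curl --silent http://localhost:9100/metrics | has_metric node_cpu_seconds_total && echo "Node Exporter is serving metrics".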

As a reminder, Node Exporter uses port 9100 by default, so be sure to expose this port to the machine which holds the Prometheus server. The same should be done for the metrics port(s) of the blockchain node (in this case, we should expose ports 8080, 8081, 8082, 8083, depending on the shard).
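
How these ports are exposed depends on your environment (security groups, a host firewall and so on). As one possible illustration, assuming the host uses ufw, the sketch below prints the rules that would restrict access to a single Prometheus server; print_allow_rules is a hypothetical helper and it deliberately does not execute anything, so the output can be reviewed first.

```shell
#!/bin/bash
# print_allow_rules PROMETHEUS_IP
# Prints (does not run) one ufw rule per metrics port, allowing TCP access
# only from PROMETHEUS_IP. Ports: 9100 for Node Exporter, 8080-8083 for the
# four Observers.
print_allow_rules() {
  local prom_ip="$1" port
  for port in 9100 8080 8081 8082 8083; do
    echo "sudo ufw allow from ${prom_ip} to any port ${port} proto tcp"
  done
}
```

For example, print_allow_rules 203.0.113.10 lists five commands for a Prometheus server at that (placeholder) address.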

Having installed Node Exporter and exposed the node's metrics, add both as targets under the scrape_configs section of your Prometheus configuration file (e.g. /etc/prometheus/prometheus.yml), then apply the new configuration by either restarting Prometheus or reloading its config (please check the official documentation). The result should look similar to this:

scrape_configs:
  - job_name: 'elrond-shard-0-node'
    scrape_interval: 10s
    metrics_path: /node/metrics
    static_configs:
      - targets:
        - '<NODE0_IP>:8080'
        - '<NODE1_IP>:8080' # you can add any number of nodes as targets
  - job_name: 'elrond-shard-1-node'
    scrape_interval: 10s
    metrics_path: /node/metrics
    static_configs:
      - targets:
        - '<NODE0_IP>:8081'
        - '<NODE1_IP>:8081' # you can add any number of nodes as targets
  - job_name: 'elrond-shard-2-node'
    scrape_interval: 10s
    metrics_path: /node/metrics
    static_configs:
      - targets:
        - '<NODE0_IP>:8082'
        - '<NODE1_IP>:8082' # you can add any number of nodes as targets
  - job_name: 'elrond-shard-metachain-node'
    scrape_interval: 10s
    metrics_path: /node/metrics
    static_configs:
      - targets:
        - '<NODE0_IP>:8083'
        - '<NODE1_IP>:8083' # you can add any number of nodes as targets
  - job_name: 'elrond-node-exporter'
    scrape_interval: 10s
    metrics_path: /metrics
    static_configs:
      - targets:
        - '<NODE0_IP>:9100'
        - '<NODE1_IP>:9100' # you can add any number of nodes as targets

In the configuration file above, please replace:

  • <NODE0_IP> - node 0's IP

  • <NODE1_IP> - node 1's IP (you can add any number of nodes as targets)

  • ...

  • <NODEN_IP> - node N's IP (you can add any number of nodes as targets)
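
Since every job repeats the same structure and only the port and IP list change, the scrape_configs section can also be generated rather than typed by hand. The following is a sketch under that assumption: render_scrape_job and render_scrape_configs are hypothetical helpers that reproduce the job names, ports and metrics paths shown in the configuration above for whatever IPs are passed in.

```shell
#!/bin/bash
# render_scrape_job JOB_NAME PORT METRICS_PATH IP [IP...]
# Prints one scrape job with a target entry per IP.
render_scrape_job() {
  local job="$1" port="$2" path="$3" ip
  shift 3
  printf "  - job_name: '%s'\n" "$job"
  printf "    scrape_interval: 10s\n"
  printf "    metrics_path: %s\n" "$path"
  printf "    static_configs:\n"
  printf "      - targets:\n"
  for ip in "$@"; do
    printf "        - '%s:%s'\n" "$ip" "$port"
  done
}

# render_scrape_configs IP [IP...]
# Prints the whole scrape_configs section for the given node IPs.
render_scrape_configs() {
  echo "scrape_configs:"
  render_scrape_job "elrond-shard-0-node" 8080 /node/metrics "$@"
  render_scrape_job "elrond-shard-1-node" 8081 /node/metrics "$@"
  render_scrape_job "elrond-shard-2-node" 8082 /node/metrics "$@"
  render_scrape_job "elrond-shard-metachain-node" 8083 /node/metrics "$@"
  render_scrape_job "elrond-node-exporter" 9100 /metrics "$@"
}
```

Running render_scrape_configs 10.0.0.1 10.0.0.2 prints a section that can be pasted into the Prometheus configuration file after review.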

That being said, the most important metrics that should be checked are:

  • node_cpu_seconds_total - CPU metrics exposed by Node Exporter - for monitoring purposes, you could use the following expression:

    • 100 - (avg by (instance) (rate(node_cpu_seconds_total{job="elrond-node-exporter",mode="idle"}[5m])) * 100), which means the average percentage of CPU usage over the last 5 minutes;

  • node_memory_MemTotal_bytes/node_memory_MemAvailable_bytes - RAM metrics exposed by Node Exporter - for monitoring purposes, you could use the following expression:

    • (node_memory_MemTotal_bytes{job="elrond-node-exporter"} - node_memory_MemAvailable_bytes{job="elrond-node-exporter"}) / 1073741824, which means the amount of RAM (in GB) used, excluding cache/buffers;

  • node_network_receive_bytes_total - network traffic metrics exposed by Node Exporter - for monitoring purposes, you could use the following expression:

    • rate(node_network_receive_bytes_total{job="elrond-node-exporter"}[1m]), which means the average network traffic received, per second, over the last minute (in bytes);

  • node_filesystem_avail_bytes - FS metrics exposed by Node Exporter - for monitoring purposes, you could use the following expression:

    • node_filesystem_avail_bytes{job="elrond-node-exporter",device="<DEVICE>"} / 1073741824, which means the filesystem space available to non-root users (in GB) for a certain device <DEVICE> (e.g. /dev/sda or wherever the blockchain data is stored) - this can be used to get an alert whenever the available space left is below a certain threshold (please be careful how you choose this threshold: if you have storage that can easily be increased - for example, EBS storage from AWS - you can set a lower threshold, but if you run your node on a bare metal machine which is not easily upgradable, you should set a higher threshold so that you have time to find a solution before the disk fills up);

  • up - Prometheus automatically generated metrics - for monitoring purposes, you could use the following expression:

    • up{job=~"elrond-shard-.*-node"}, which has 2 possible values: 1, if the node is up (its last scrape succeeded), or 0, if the node is down - this can be used to get an alert whenever a node goes down (note that it can also be triggered at each restart of the node);

  • erd_nonce - the node's current block height (nonce), which can be used to check whether the local block height of our node has fallen behind the chain's - for monitoring purposes, you could use the following expressions:

    • increase(erd_nonce{job=~"elrond-shard-0-node"}[5m]), which shows by how many blocks the nonce of the Shard 0 node has advanced over the last 5 minutes - this can be used to get an alert whenever the node is no longer syncing blocks (e.g. fewer than 5 blocks in the past 5 minutes);

    • increase(erd_nonce{job=~"elrond-shard-1-node"}[5m]), which shows by how many blocks the nonce of the Shard 1 node has advanced over the last 5 minutes - this can be used to get an alert whenever the node is no longer syncing blocks (e.g. fewer than 5 blocks in the past 5 minutes);

    • increase(erd_nonce{job=~"elrond-shard-2-node"}[5m]), which shows by how many blocks the nonce of the Shard 2 node has advanced over the last 5 minutes - this can be used to get an alert whenever the node is no longer syncing blocks (e.g. fewer than 5 blocks in the past 5 minutes);

    • increase(erd_nonce{job=~"elrond-shard-metachain-node"}[5m]), which shows by how many blocks the nonce of the Metachain Shard node has advanced over the last 5 minutes - this can be used to get an alert whenever the node is no longer syncing blocks (e.g. fewer than 5 blocks in the past 5 minutes);

  • erd_num_connected_peers - for monitoring purposes, you could use the following expression:

    • erd_num_connected_peers{job="elrond-shard-0-node"}, which means the number of peers connected to the Shard 0 node - this can be used to get an alert whenever there are fewer peers than a certain threshold for a certain period of time (e.g. fewer than 3 peers for 5 minutes);

    • erd_num_connected_peers{job="elrond-shard-1-node"}, which means the number of peers connected to the Shard 1 node - this can be used to get an alert whenever there are fewer peers than a certain threshold for a certain period of time (e.g. fewer than 3 peers for 5 minutes);

    • erd_num_connected_peers{job="elrond-shard-2-node"}, which means the number of peers connected to the Shard 2 node - this can be used to get an alert whenever there are fewer peers than a certain threshold for a certain period of time (e.g. fewer than 3 peers for 5 minutes);

    • erd_num_connected_peers{job="elrond-shard-metachain-node"}, which means the number of peers connected to the Metachain Shard node - this can be used to get an alert whenever there are fewer peers than a certain threshold for a certain period of time (e.g. fewer than 3 peers for 5 minutes);

You can use the above metrics to create both Grafana dashboards and Alertmanager alerts.
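
As an illustration of turning one of these expressions into an alert, the sketch below prints a Prometheus alerting rules file for the "not syncing" condition on each shard. It is only a starting point under stated assumptions: render_sync_rules is a hypothetical helper, the group name, alert name, severity label and 5m "for" duration are placeholders of ours, and the job names are the ones used in the scrape_configs example.

```shell
#!/bin/bash
# render_sync_rules [MIN_BLOCKS]
# Prints a Prometheus rules file with one alert per shard that fires when the
# nonce advanced by fewer than MIN_BLOCKS (default 5) over the last 5 minutes.
render_sync_rules() {
  local min_blocks="${1:-5}" shard
  cat <<EOF
groups:
  - name: elrond-observer-sync
    rules:
EOF
  for shard in 0 1 2 metachain; do
    cat <<EOF
      - alert: ObserverNotSyncing
        expr: increase(erd_nonce{job="elrond-shard-${shard}-node"}[5m]) < ${min_blocks}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Observer for shard ${shard} synced fewer than ${min_blocks} blocks in 5 minutes"
EOF
  done
}
```

The output can be redirected to a file, validated with promtool check rules, and then referenced from the rule_files section of the Prometheus configuration.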

Please also check the Official Documentation and the GitHub repository mentioned above to make sure you keep your node up to date.
