Setup
In order to monitor the system’s various parts, we use Prometheus and Grafana.
The Prometheus setup has two parts: exporters and scrapers.
Using various Prometheus exporters (see below), we expose metrics as local web services. Each exporter runs on a different port (e.g. 9100).
The Prometheus scrapers then visit these web services, extract all the metrics and ingest them into the Prometheus database.
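As a rough illustration of what a scraper does, the Python sketch below fetches an exporter's metrics page and parses the samples out of it. It assumes the exporter serves the standard Prometheus text format under a `/metrics` path on localhost; the function names are hypothetical, and the real Prometheus scraper handles many more cases (escaping, histograms, staleness) than this simplified parser.

```python
import re
import urllib.request

# Matches sample lines such as: metric_name{label="value"} 12.5 [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        try:
            samples.append((name, labels or "", float(value)))
        except ValueError:
            # Ignore lines whose value is not a valid number.
            continue
    return samples


def scrape(url="http://localhost:9100/metrics"):
    """Fetch one exporter endpoint (assumed URL) and parse its samples."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `scrape()` against a running exporter would return one tuple per exported sample; Prometheus performs the equivalent step on a schedule and stores the results in its time-series database.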
Various Grafana dashboards are built on top of the data ingested into the Prometheus database.
This gives us a visual way to interpret the data, as well as alerting functionality:
- Whenever an alert is triggered (e.g. high CPU usage), Grafana posts a message to a Keybase channel through a webhook.
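Grafana evaluates the alert rules itself, so no custom code is involved in the setup above. Purely to illustrate the logic of a threshold alert such as "high CPU usage", here is a minimal Python sketch. The function name, the 90% threshold and the message wording are assumptions, and posting to the webhook is left out because the webhook URL and payload format are not described here.

```python
def cpu_alert_message(host, cpu_percent, threshold=90.0):
    """Return an alert message if CPU usage exceeds the threshold, else None.

    The threshold value and message format are illustrative assumptions.
    """
    if cpu_percent <= threshold:
        return None
    return (
        f"[ALERT] High CPU usage on {host}: "
        f"{cpu_percent:.1f}% (threshold {threshold:.1f}%)"
    )
```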
Monitoring
We have various exporters running on our servers that export a broad range of metrics, namely:
- A node exporter that we use to measure our servers’ resources, such as memory, CPU and disk utilization.
- A postgres exporter that exports metrics about the Postgres database, such as memory usage, active sessions and transactions. On top of the database metrics, we have created some “business” metrics that use custom SQL queries to monitor assumptions we make about the data (see “Block time lag” below).
- A process exporter that mainly monitors process uptime. We use this to verify that our Cardano node and Cardano db-sync processes are up and running.
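The process check can be sketched in Python as follows. This assumes the process exporter is the commonly used `process-exporter`, which reports a `namedprocess_namegroup_num_procs` gauge per process group; the group names `cardano-node` and `cardano-db-sync` and the function name are hypothetical and depend on how the exporter is configured.

```python
import re

# One sample line per process group, e.g.:
# namedprocess_namegroup_num_procs{groupname="cardano-node"} 1
NUM_PROCS_RE = re.compile(
    r'^namedprocess_namegroup_num_procs\{groupname="([^"]+)"\}\s+(\S+)$'
)


def down_groups(metrics_text, expected_groups):
    """Return the expected process groups that have no running process."""
    running = {}
    for line in metrics_text.splitlines():
        match = NUM_PROCS_RE.match(line.strip())
        if match:
            running[match.group(1)] = float(match.group(2))
    # A group that is absent from the output is treated as down.
    return [g for g in expected_groups if running.get(g, 0.0) < 1]
```

For example, `down_groups(text, ["cardano-node", "cardano-db-sync"])` returns an empty list when both groups report at least one running process.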
Dashboards
Grafana can be accessed here (it requires authentication). There are also some public dashboards that you can view:
- Block time lag
  Here you can view two main metrics:
  - The difference between the latest block’s time and the current time
  - The difference between a block’s creation time and its insertion time in the database
  These metrics are shown across all our databases.
- A Postgres monitoring dashboard (requires auth). Here you can view all metrics for all our database instances.
- A Node exporter dashboard (requires auth), depicting various metrics about the resources used by our nodes.
- A Named processes dashboard (requires auth), showing the uptime of the db-sync and node processes.
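The two Block time lag metrics boil down to simple timestamp differences. The Python sketch below shows the arithmetic only; in our setup the values come from custom SQL queries run by the postgres exporter, and the function and key names used here are hypothetical.

```python
from datetime import datetime, timezone


def block_time_lags(block_time, inserted_at, now=None):
    """Compute both lag metrics, in seconds, for the latest block.

    block_time  -- when the block was created on chain
    inserted_at -- when the block was inserted into the database
    now         -- current time (defaults to the current UTC time)
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        # How far the latest block is behind the current time.
        "tip_lag_seconds": (now - block_time).total_seconds(),
        # How long it took for the block to reach the database.
        "insert_lag_seconds": (inserted_at - block_time).total_seconds(),
    }
```

A steadily growing tip lag suggests that new blocks are no longer arriving, while a large insert lag suggests that the database is falling behind the chain.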