Monitoring dimensions
A validator operator covers four dimensions. Each has its own metrics, thresholds, and response playbook.
Node resources
CPU, memory, disk, network — the floor underneath everything else.
Consensus
Block production, vote participation, finality, fork events.
Trading engine
TPS, latency, queue depth, liquidation and ADL activity.
Network and peers
P2P connection count, RPC latency, geographic distribution.
Node resources
| Signal | Alert threshold | Why |
|---|---|---|
| CPU usage | > 85% for 5 minutes | Close to saturation; consensus and engine will begin to slip |
| Memory usage | > 70% for 3 minutes or > 90% for 1 minute | Approaching OOM; trading-engine memory pressure is the usual cause |
| Swap | Any use | Swap should be disabled; any activity is a misconfiguration |
| Disk IOPS | > 80% of device ceiling | State writes and block import will slow |
| Disk free | < 20% | Block and state growth can stall |
| Disk await | > 10ms | Block import latency is about to rise |
| Network bandwidth | > 50% of link capacity | Block propagation risk |
| RPC latency | > 100ms | Client-facing slowness |
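The table's thresholds can be read as a simple severity check per metrics sample. The sketch below covers a subset of the rows; the sample-dict keys and alert names are assumptions for illustration, and the duration windows ("for 5 minutes") are left to the alerting system's hold/`for` logic rather than checked here.

```python
# Illustrative sketch: map one node-resource sample to the alerts that
# would fire per the table above. Key names are assumptions, not a real
# exporter schema; sustained-duration windows are handled elsewhere.

def check_node_resources(sample: dict) -> list[str]:
    """Return the alert names triggered by a single metrics sample."""
    alerts = []
    if sample.get("cpu_pct", 0) > 85:            # near saturation
        alerts.append("high_cpu")
    mem = sample.get("mem_pct", 0)
    if mem > 90:                                 # imminent OOM
        alerts.append("memory_critical")
    elif mem > 70:                               # approaching OOM
        alerts.append("memory_warning")
    if sample.get("swap_bytes", 0) > 0:          # swap should be disabled
        alerts.append("swap_in_use")
    if sample.get("disk_free_pct", 100) < 20:    # growth can stall
        alerts.append("disk_space_low")
    if sample.get("disk_await_ms", 0) > 10:      # import latency about to rise
        alerts.append("disk_latency_high")
    if sample.get("rpc_latency_ms", 0) > 100:    # client-facing slowness
        alerts.append("rpc_slow")
    return alerts
```

For example, `check_node_resources({"cpu_pct": 92, "swap_bytes": 4096})` returns both `high_cpu` and `swap_in_use`.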
Consensus signals
IntentionBFT is a HotStuff-family protocol with canonical sequencing commitments. Validators expose:
- Block production time. A design target is published per network; alert when the rolling average exceeds 1.5× that target.
- Node voting rate. < 90% over the voting window is a P1; this indicates the node is missing votes and affects finality.
- Finality state. Alert whenever finality stalls for more than one block interval.
- Fork count. Any fork event is a P0 — investigate immediately.
- Validator online count. Dropping below the critical threshold is catastrophic; a P1 fires at a higher warning threshold so operators can react before the cliff.
- P2P connection count. Stay inside [20, 300]. Below 20 isolates the node; above 300 burns resources.
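The per-window consensus checks above can be sketched as one evaluation function. The numeric thresholds come straight from the bullets; the function shape, alert names, and the severities not stated above (e.g. for the peer-count band) are assumptions.

```python
# Illustrative sketch of the consensus checks above for one evaluation
# window. Thresholds are from the text; names and the unstated
# severities are assumed.

P2P_MIN, P2P_MAX = 20, 300

def consensus_alerts(avg_block_time_s: float, target_block_time_s: float,
                     voting_rate: float, fork_events: int,
                     p2p_peers: int) -> list[tuple[str, str]]:
    """Return (severity, alert) pairs for the window."""
    out = []
    if avg_block_time_s > 1.5 * target_block_time_s:
        out.append(("P1", "block_time_slow"))
    if voting_rate < 0.90:                    # missed votes affect finality
        out.append(("P1", "voting_rate_low"))
    if fork_events > 0:                       # any fork is a P0
        out.append(("P0", "fork_detected"))
    if p2p_peers < P2P_MIN:                   # node risks isolation
        out.append(("P1", "peer_count_low"))
    elif p2p_peers > P2P_MAX:                 # wasted resources
        out.append(("P2", "peer_count_high"))
    return out
```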
Trading engine signals
The headline engine-layer numbers validators watch are:
- TPS (target 50K steady-state, 100K goal), with P1 at 25K and P0 at 10K.
- Latency: p50 < 500ms, p99 < 1s.
- Mempool pending < 50,000 (hard ceiling 100,000).
- Matching queue < 25,000 (hard ceiling 50,000).
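Taken together, the bullets above define a worst-severity classifier over the engine signals. A minimal sketch, with the caveat that mapping the latency and soft-ceiling breaches to P1, and the hard ceilings to P0, is an assumption not stated in the text:

```python
# Sketch of the engine-layer thresholds as a severity classifier.
# Numbers come from the bullets above; the severity mapping for
# latency and the queue ceilings is assumed.

def engine_severity(tps: float, p99_latency_ms: float,
                    mempool_pending: int, matching_queue: int) -> str:
    """Return the worst severity across the engine signals."""
    if tps < 10_000 or mempool_pending > 100_000 or matching_queue > 50_000:
        return "P0"   # floor/hard-ceiling breached
    if (tps < 25_000 or p99_latency_ms > 1_000
            or mempool_pending > 50_000 or matching_queue > 25_000):
        return "P1"   # degradation thresholds
    return "OK"
```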
Alerting discipline
The standard stack is Prometheus + Alertmanager + Grafana, with optional log analytics for post-hoc investigation. Alerts are categorized:
- P0 — service unavailable or safety-affecting. Pages the on-call immediately.
- P1 — performance degradation or business-visible impact. Pages in working hours, escalates out-of-hours.
- P2 — latent risk. Shows up on the dashboard; no page.
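The three tiers above amount to a small routing policy. A sketch under stated assumptions: the "working hours" window (here 09:00–18:00) and the function and return names are illustrative, not part of the spec.

```python
# Sketch of the paging policy above. The working-hours window and the
# action names ("page", "escalate", "dashboard") are assumptions.
from datetime import datetime

def route_alert(severity: str, now: datetime) -> str:
    working_hours = 9 <= now.hour < 18       # assumed on-call day window
    if severity == "P0":
        return "page"                        # page immediately, any hour
    if severity == "P1":
        return "page" if working_hours else "escalate"
    return "dashboard"                       # P2: visible, no page
```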
Inhibition and correlation rules keep page volume sane: node_down suppresses high_cpu on the same host, and network_congestion and latency_high are correlated into a single incident.
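In Alertmanager this suppression is expressed with inhibit rules; the logic itself is simple enough to sketch directly. The alert-dict shape below is an assumption for illustration:

```python
# Sketch of the same-host suppression rule above: when node_down fires
# on a host, drop high_cpu for that host before paging. The alert dict
# shape ("name"/"host" keys) is assumed.

INHIBITS = {"node_down": {"high_cpu"}}   # source alert -> alerts it suppresses

def suppress(alerts: list[dict]) -> list[dict]:
    """Filter out alerts inhibited by another alert on the same host."""
    suppressed = set()
    for a in alerts:
        for target in INHIBITS.get(a["name"], ()):
            suppressed.add((target, a["host"]))
    return [a for a in alerts if (a["name"], a["host"]) not in suppressed]
```

With node_down firing on host v1, a high_cpu alert on v1 is dropped while high_cpu on v2 still pages.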
Edge-level defenses — connection limits, per-IP weight, address budgets, congestion fair-share — are described in DDoS and Sybil defense. For where metrics come from in the stack, see Network topology.