Monitoring dimensions
A validator operator covers four dimensions. Each has its own metrics, thresholds, and response playbook.
Node resources
CPU, memory, disk, network — the floor underneath everything else.
Consensus
Block production, vote participation, finality, fork events.
Trading engine
TPS, latency, queue depth, liquidation and ADL activity.
Network and peers
P2P connection count, RPC latency, geographic distribution.
Node resources
| Signal | Alert threshold | Why |
|---|---|---|
| CPU usage | > 85% for 5 minutes | Close to saturation; consensus and engine will begin to slip |
| Memory usage | > 70% for 3 minutes or > 90% for 1 minute | Approaching OOM; trading-engine memory pressure is the usual cause |
| Swap | Any use | Swap should be disabled; any activity is a misconfiguration |
| Disk IOPS | > 80% of device ceiling | State writes and block import will slow |
| Disk free | < 20% | Block and state growth can stall |
| Disk await | > 10ms | Block import latency is about to rise |
| Network bandwidth | > 50% of link capacity | Block propagation risk |
| RPC latency | > 100ms | Client-facing slowness |
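The table's thresholds can be read as a simple severity check per metrics sample. The sketch below covers a subset of the rows; the sample-dict keys and alert names are assumptions for illustration, and the duration windows ("for 5 minutes") are left to the alerting system's hold/`for` logic rather than checked here.

```python
# Illustrative sketch: map one node-resource sample to the alerts that
# would fire per the table above. Key names are assumptions, not a real
# exporter schema; sustained-duration windows are handled elsewhere.

def check_node_resources(sample: dict) -> list[str]:
    """Return the alert names triggered by a single metrics sample."""
    alerts = []
    if sample.get("cpu_pct", 0) > 85:            # near saturation
        alerts.append("high_cpu")
    mem = sample.get("mem_pct", 0)
    if mem > 90:                                 # imminent OOM
        alerts.append("memory_critical")
    elif mem > 70:                               # approaching OOM
        alerts.append("memory_warning")
    if sample.get("swap_bytes", 0) > 0:          # swap should be disabled
        alerts.append("swap_in_use")
    if sample.get("disk_free_pct", 100) < 20:    # growth can stall
        alerts.append("disk_space_low")
    if sample.get("disk_await_ms", 0) > 10:      # import latency about to rise
        alerts.append("disk_latency_high")
    if sample.get("rpc_latency_ms", 0) > 100:    # client-facing slowness
        alerts.append("rpc_slow")
    return alerts
```

For example, `check_node_resources({"cpu_pct": 92, "swap_bytes": 4096})` returns both `high_cpu` and `swap_in_use`.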
Consensus signals
IntentionBFT is a HotStuff-family protocol with canonical sequencing commitments. Validators expose:
- Block production time. A design target is published per network; alert when the rolling average exceeds 1.5× that target.
- Node voting rate. < 90% over the voting window is a P1; this indicates the node is missing votes and affects finality.
- Finality state. Alert whenever finality stalls for more than one block interval.
- Fork count. Any fork event is a P0 — investigate immediately.
- Validator online count. Dropping below the critical threshold is catastrophic; a P1 fires at a higher warning threshold so operators can react before the cliff.
- P2P connection count. Stay inside [20, 300]. Below 20 isolates the node; above 300 burns resources.
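The per-window consensus checks above can be sketched as one evaluation function. The numeric thresholds come straight from the bullets; the function shape, alert names, and the severities not stated above (e.g. for the peer-count band) are assumptions.

```python
# Illustrative sketch of the consensus checks above for one evaluation
# window. Thresholds are from the text; names and the unstated
# severities are assumed.

P2P_MIN, P2P_MAX = 20, 300

def consensus_alerts(avg_block_time_s: float, target_block_time_s: float,
                     voting_rate: float, fork_events: int,
                     p2p_peers: int) -> list[tuple[str, str]]:
    """Return (severity, alert) pairs for the window."""
    out = []
    if avg_block_time_s > 1.5 * target_block_time_s:
        out.append(("P1", "block_time_slow"))
    if voting_rate < 0.90:                    # missed votes affect finality
        out.append(("P1", "voting_rate_low"))
    if fork_events > 0:                       # any fork is a P0
        out.append(("P0", "fork_detected"))
    if p2p_peers < P2P_MIN:                   # node risks isolation
        out.append(("P1", "peer_count_low"))
    elif p2p_peers > P2P_MAX:                 # wasted resources
        out.append(("P2", "peer_count_high"))
    return out
```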
Trading engine signals
The headline engine-layer numbers validators watch are:
- TPS (target 50K steady-state, 100K goal), with P1 at 25K and P0 at 10K.
- Latency: p50 < 500ms, p99 < 1s.
- Mempool pending < 50,000 (hard ceiling 100,000).
- Matching queue < 25,000 (hard ceiling 50,000).
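Taken together, the bullets above define a worst-severity classifier over the engine signals. A minimal sketch, with the caveat that mapping the latency and soft-ceiling breaches to P1, and the hard ceilings to P0, is an assumption not stated in the text:

```python
# Sketch of the engine-layer thresholds as a severity classifier.
# Numbers come from the bullets above; the severity mapping for
# latency and the queue ceilings is assumed.

def engine_severity(tps: float, p99_latency_ms: float,
                    mempool_pending: int, matching_queue: int) -> str:
    """Return the worst severity across the engine signals."""
    if tps < 10_000 or mempool_pending > 100_000 or matching_queue > 50_000:
        return "P0"   # floor/hard-ceiling breached
    if (tps < 25_000 or p99_latency_ms > 1_000
            or mempool_pending > 50_000 or matching_queue > 25_000):
        return "P1"   # degradation thresholds
    return "OK"
```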
Alerting discipline
The standard stack is Prometheus + Alertmanager + Grafana, with optional log analytics for post-hoc investigation. Alerts are categorized:
- P0 — service unavailable or safety-affecting. Pages the on-call immediately.
- P1 — performance degradation or business-visible impact. Pages in working hours, escalates out-of-hours.
- P2 — latent risk. Shows up on the dashboard; no page.
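The three tiers above amount to a small routing policy. A sketch under stated assumptions: the "working hours" window (here 09:00–18:00) and the function and return names are illustrative, not part of the spec.

```python
# Sketch of the paging policy above. The working-hours window and the
# action names ("page", "escalate", "dashboard") are assumptions.
from datetime import datetime

def route_alert(severity: str, now: datetime) -> str:
    working_hours = 9 <= now.hour < 18       # assumed on-call day window
    if severity == "P0":
        return "page"                        # page immediately, any hour
    if severity == "P1":
        return "page" if working_hours else "escalate"
    return "dashboard"                       # P2: visible, no page
```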
Inhibition and correlation rules keep page volume sane: node_down suppresses high_cpu on the same host, and network_congestion and latency_high are correlated into a single incident.
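In Alertmanager this suppression is expressed with inhibit rules; the logic itself is simple enough to sketch directly. The alert-dict shape below is an assumption for illustration:

```python
# Sketch of the same-host suppression rule above: when node_down fires
# on a host, drop high_cpu for that host before paging. The alert dict
# shape ("name"/"host" keys) is assumed.

INHIBITS = {"node_down": {"high_cpu"}}   # source alert -> alerts it suppresses

def suppress(alerts: list[dict]) -> list[dict]:
    """Filter out alerts inhibited by another alert on the same host."""
    suppressed = set()
    for a in alerts:
        for target in INHIBITS.get(a["name"], ()):
            suppressed.add((target, a["host"]))
    return [a for a in alerts if (a["name"], a["host"]) not in suppressed]
```

With node_down firing on host v1, a high_cpu alert on v1 is dropped while high_cpu on v2 still pages.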
Edge-level defenses — connection limits, per-IP weight, address budgets, congestion fair-share — are described in DDoS and Sybil defense. For where metrics come from in the stack, see Network topology.