Observability
ResilientDNS follows a logs-first approach: logs capture behavior, and metrics provide low-overhead trend visibility.
Metrics HTTP Endpoint
The metrics endpoint is a small, dependency-free HTTP server that exposes read-only counters for operational insight.
For safety, the default bind address is 127.0.0.1.
Enabling the endpoint
resilientdns \
--metrics-host 127.0.0.1 \
--metrics-port 9100
Endpoints
- GET
/metrics: plain text lines ofname value, sorted by name - GET
/healthz: returnsok - GET
/readyz: returnsokwhen ready; otherwise 503 - GET
/cache/stats: JSON cache statistics - Any other path: 404 not found
Example response:
cache_entries 42
evictions_total 3
Metrics Semantics
| Metric | Meaning |
|---|---|
cache_entries |
Current cache size (gauge). |
evictions_total |
Entries evicted due to capacity enforcement. |
dropped_total |
Packets or responses dropped due to capacity/size limits. |
malformed_total |
Malformed DNS packets observed. |
Refresh Metrics
| Metric | Meaning |
|---|---|
swr_refresh_triggered_total |
Stale-while-revalidate refreshes started due to serving stale. |
cache_refresh_enqueued_total |
Background refresh jobs enqueued. |
cache_refresh_dropped_total{reason="queue_full"} |
Refresh enqueue dropped because the queue is full. |
cache_refresh_dropped_total{reason="duplicate"} |
Refresh enqueue dropped due to dedupe. |
cache_refresh_started_total |
Refresh jobs started by workers. |
cache_refresh_completed_total{result="success"} |
Refresh jobs that completed successfully. |
cache_refresh_completed_total{result="fail"} |
Refresh jobs that failed (no retries). |
cache_refresh_completed_total{result="skipped"} |
Refresh jobs skipped without an upstream attempt. |
Upstream Metrics Semantics
upstream_requests_total: number of actual upstream attempts (after inflight admission).dropped_total: requests dropped due to policy or saturation (e.g.max_inflight, oversize responses). Drops are not upstream failures.upstream_udp_errors_total: UDP upstream failures (timeouts or exceptions after an attempt was made).upstream_tcp_errors_total: TCP upstream failures (connect, read/write errors, protocol violations, oversize drops).upstream_tcp_reuses_total: number of times an existing TCP upstream connection was reused from the pool.upstream_relay_requests_total: relay upstream HTTP requests.upstream_relay_http_4xx_total: relay HTTP 4xx responses.upstream_relay_http_5xx_total: relay HTTP 5xx responses.upstream_relay_timeouts_total: relay timeouts.upstream_relay_client_errors_total: relay HTTP client/transport/request failures (connect/timeouts/TLS/non-2xx/decode failures/etc.).upstream_relay_protocol_errors_total: relay response shape/contract violations after a successful HTTP exchange and JSON decoding.
Client errors reflect transport-side failures before a valid Relay response is received (including non-2xx or decode failures). Protocol errors indicate the Relay responded with 200 and valid JSON, but the response was invalid or incompatible.
A request can be dropped without being an upstream error. Errors imply an upstream attempt was made.
Reasoned Drop and Error Metrics
These counters are additive and do not replace existing totals.
dropped_max_inflight_total: drops due to admission limits.dropped_oversize_total: drops due to oversize requests or responses.dropped_malformed_total: drops due to malformed DNS packets.dropped_policy_total: drops due to other policy enforcement.upstream_udp_timeouts_total: UDP upstream timeouts.upstream_tcp_timeouts_total: TCP upstream timeouts.upstream_tcp_connect_errors_total: TCP connection failures.upstream_tcp_protocol_errors_total: TCP read/write/protocol errors and oversize protocol drops.
Tuning Guidance
- Set
max_inflightto protect upstreams under burst load and to bound concurrent work. - A high
dropped_totalwith low*_errors_totalusually indicates admission policy pressure, not upstream instability. - A high
*_errors_totalindicates upstream failures after an attempt was made and should be investigated separately. - Use
upstream_tcp_reuses_totalto evaluate TCP pool effectiveness; higher reuse generally means fewer connects. - Start with small limits and tune explicitly for unreliable networks to keep failure modes predictable.
Readiness vs Liveness
/healthzis liveness only and returns 200 if the process is running./readyzis readiness and returns 200 only after DNS listeners and the metrics server are ready.
Additional Metrics
resilientdns_build_info{version="<package_version>"}: constant build label for the running version.resilientdns_uptime_seconds: process uptime in seconds (monotonic).
Cache Clear (SIGHUP)
- Sending
SIGHUPclears the in-memory cache without stopping the server. - An INFO log line is emitted on clear.
cache_clears_totalincrements on each clear.
Design Principles
- Read-only endpoint
- Fail-safe behavior (errors drop requests without side effects)
- Low overhead
- Deterministic output ordering
- Explicit enablement
Operational Notes
- Bind to
127.0.0.1by default - Consider firewall rules when exposing externally
- Use a local scraper or sidecar to collect metrics