Observability

ResilientDNS follows a logs-first approach: logs capture behavior, and metrics provide low-overhead trend visibility.

Metrics HTTP Endpoint

The metrics endpoint is a small, dependency-free HTTP server that exposes read-only counters for operational insight.

For safety, the default bind address is 127.0.0.1.

Enabling the endpoint

resilientdns \
  --metrics-host 127.0.0.1 \
  --metrics-port 9100

Endpoints

  • GET /metrics: plain-text lines in "name value" form, one metric per line, sorted by name
  • GET /healthz: returns ok
  • GET /readyz: returns ok (200) when ready; otherwise 503
  • GET /cache/stats: JSON cache statistics
  • Any other path: 404 not found

Example response:

cache_entries 42
evictions_total 3
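
The "name value" format above is simple enough to parse by hand. The following is a minimal sketch, assuming every value is numeric and that a labeled name such as cache_refresh_dropped_total{reason="queue_full"} appears as a single token before the value. The function name parse_metrics is illustrative and not part of ResilientDNS.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' lines into a dict keyed by the full metric name."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        # Split on the last run of whitespace so that any label block
        # (for example {reason="queue_full"}) stays attached to the name.
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line; skip it
        name, raw_value = parts
        try:
            metrics[name] = float(raw_value)
        except ValueError:
            continue  # assumption: non-numeric values are ignored
    return metrics


sample = "cache_entries 42\nevictions_total 3\n"
print(parse_metrics(sample))
```

Keeping the label block as part of the key avoids guessing at a label grammar the endpoint does not document.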

Metrics Semantics

Metric           Meaning
cache_entries    Current cache size (gauge).
evictions_total  Entries evicted due to capacity enforcement.
dropped_total    Packets or responses dropped due to capacity/size limits.
malformed_total  Malformed DNS packets observed.

Refresh Metrics

Metric                                            Meaning
swr_refresh_triggered_total                       Stale-while-revalidate refreshes started due to serving stale.
cache_refresh_enqueued_total                      Background refresh jobs enqueued.
cache_refresh_dropped_total{reason="queue_full"}  Refresh enqueue dropped because the queue is full.
cache_refresh_dropped_total{reason="duplicate"}   Refresh enqueue dropped due to dedupe.
cache_refresh_started_total                       Refresh jobs started by workers.
cache_refresh_completed_total{result="success"}   Refresh jobs that completed successfully.
cache_refresh_completed_total{result="fail"}      Refresh jobs that failed (no retries).
cache_refresh_completed_total{result="skipped"}   Refresh jobs skipped without an upstream attempt.
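
To read these counters together, a small summary helper can be useful. This is a sketch only: it assumes the metrics have already been loaded into a dict keyed by the exact names shown in the table, labels included, and the function name refresh_summary and its output keys are hypothetical.

```python
def refresh_summary(metrics: dict[str, float]) -> dict[str, object]:
    """Summarize refresh drops and completion results from a metrics dict."""
    # Assumption: keys match the names in the table verbatim, labels included.
    dropped = {
        reason: metrics.get(f'cache_refresh_dropped_total{{reason="{reason}"}}', 0.0)
        for reason in ("queue_full", "duplicate")
    }
    completed = {
        result: metrics.get(f'cache_refresh_completed_total{{result="{result}"}}', 0.0)
        for result in ("success", "fail", "skipped")
    }
    total_completed = sum(completed.values())
    return {
        "enqueued": metrics.get("cache_refresh_enqueued_total", 0.0),
        "dropped": dropped,
        "completed": completed,
        # None when no refresh has completed yet, to avoid dividing by zero.
        "success_ratio": (
            completed["success"] / total_completed if total_completed else None
        ),
    }
```

A growing queue_full count relative to enqueued jobs suggests the refresh queue is undersized for the workload, while duplicate drops are expected whenever dedupe is doing its job.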

Upstream Metrics Semantics

  • upstream_requests_total: number of actual upstream attempts (after inflight admission).
  • dropped_total: requests dropped due to policy or saturation (e.g. max_inflight, oversize responses). Drops are not upstream failures.
  • upstream_udp_errors_total: UDP upstream failures (timeouts or exceptions after an attempt was made).
  • upstream_tcp_errors_total: TCP upstream failures (connect, read/write errors, protocol violations, oversize drops).
  • upstream_tcp_reuses_total: number of times an existing TCP upstream connection was reused from the pool.
  • upstream_relay_requests_total: relay upstream HTTP requests.
  • upstream_relay_http_4xx_total: relay HTTP 4xx responses.
  • upstream_relay_http_5xx_total: relay HTTP 5xx responses.
  • upstream_relay_timeouts_total: relay timeouts.
  • upstream_relay_client_errors_total: relay HTTP client/transport/request failures (connect/timeouts/TLS/non-2xx/decode failures/etc.).
  • upstream_relay_protocol_errors_total: relay response shape/contract violations after a successful HTTP exchange and JSON decoding.

Client errors reflect transport-side failures before a valid Relay response is received (including non-2xx or decode failures). Protocol errors indicate the Relay responded with 200 and valid JSON, but the response was invalid or incompatible.

A request can be dropped without being an upstream error. Errors imply an upstream attempt was made.

Reasoned Drop and Error Metrics

These counters are additive and do not replace existing totals.

  • dropped_max_inflight_total: drops due to admission limits.
  • dropped_oversize_total: drops due to oversize requests or responses.
  • dropped_malformed_total: drops due to malformed DNS packets.
  • dropped_policy_total: drops due to other policy enforcement.
  • upstream_udp_timeouts_total: UDP upstream timeouts.
  • upstream_tcp_timeouts_total: TCP upstream timeouts.
  • upstream_tcp_connect_errors_total: TCP connection failures.
  • upstream_tcp_protocol_errors_total: TCP read/write/protocol errors and oversize protocol drops.
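
The reasoned drop counters lend themselves to a quick breakdown by cause. The sketch below assumes a metrics dict keyed by the names listed above and computes each reason's share of the sum of the reasoned drop counters. It deliberately does not compare against dropped_total, since this page does not state that the reasoned counters sum to it. The name drop_breakdown is illustrative.

```python
DROP_REASONS = (
    "dropped_max_inflight_total",
    "dropped_oversize_total",
    "dropped_malformed_total",
    "dropped_policy_total",
)


def drop_breakdown(metrics: dict[str, float]) -> dict[str, float]:
    """Return each drop reason's share (0.0 to 1.0) of all reasoned drops."""
    counts = {name: metrics.get(name, 0.0) for name in DROP_REASONS}
    total = sum(counts.values())
    if total == 0:
        # No reasoned drops recorded; report zero shares rather than divide by zero.
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}
```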

Tuning Guidance

  • Set max_inflight to protect upstreams under burst load and to bound concurrent work.
  • A high dropped_total with low *_errors_total usually indicates admission policy pressure, not upstream instability.
  • A high *_errors_total indicates upstream failures after an attempt was made and should be investigated separately.
  • Use upstream_tcp_reuses_total to evaluate TCP pool effectiveness; higher reuse generally means fewer connects.
  • Start with small limits and tune explicitly for unreliable networks to keep failure modes predictable.
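
The drop-versus-error guidance above can be turned into a rough check between two scrapes. This is a sketch under stated assumptions: the threshold of 10 is arbitrary, the function names are hypothetical, and only the four top-level error counters are summed so that the reasoned counters (which are additive) are not counted twice.

```python
TOP_LEVEL_ERRORS = (
    "upstream_udp_errors_total",
    "upstream_tcp_errors_total",
    "upstream_relay_client_errors_total",
    "upstream_relay_protocol_errors_total",
)


def counter_deltas(prev: dict[str, float], curr: dict[str, float]) -> dict[str, float]:
    """Per-metric increase between two scrapes (metrics missing from prev count as 0)."""
    # Gauges such as cache_entries are included too; classify() ignores them.
    return {name: value - prev.get(name, 0.0) for name, value in curr.items()}


def classify(deltas: dict[str, float], threshold: float = 10.0) -> list[str]:
    """Label the interval as admission pressure and/or upstream failures."""
    dropped = deltas.get("dropped_total", 0.0)
    errors = sum(deltas.get(name, 0.0) for name in TOP_LEVEL_ERRORS)
    findings = []
    if dropped >= threshold and errors < threshold:
        findings.append("admission pressure: review max_inflight and size limits")
    if errors >= threshold:
        findings.append("upstream failures: investigate upstream health")
    return findings
```

Working on deltas rather than raw totals matters because the counters are cumulative: a large dropped_total may reflect a burst from hours ago rather than current pressure.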

Readiness vs Liveness

  • /healthz is liveness only and returns 200 if the process is running.
  • /readyz is readiness and returns 200 only after DNS listeners and the metrics server are ready.
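
A probe script can treat the two endpoints separately. The following is a minimal sketch using only the Python standard library; the base URL in the usage comment assumes the host and port from the enabling example, and the function names are illustrative.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for non-2xx responses such as the 503 from /readyz.
        return err.code
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return None


def check(base_url, opener=urllib.request.urlopen):
    """Report liveness and readiness as booleans."""
    return {
        "live": probe(base_url + "/healthz", opener=opener) == 200,
        "ready": probe(base_url + "/readyz", opener=opener) == 200,
    }


# Usage (assumes the endpoint was enabled on 127.0.0.1:9100):
#   print(check("http://127.0.0.1:9100"))
```

A result of live but not ready is expected briefly during startup, before the DNS listeners are up.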

Additional Metrics

  • resilientdns_build_info{version="<package_version>"}: constant build label for the running version.
  • resilientdns_uptime_seconds: process uptime in seconds (monotonic).

Cache Clear (SIGHUP)

  • Sending SIGHUP clears the in-memory cache without stopping the server.
  • An INFO log line is emitted on clear.
  • cache_clears_total increments on each clear.
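
Sending the signal from a script is a one-liner on POSIX systems. The sketch below assumes you already know the process ID of the running server; how you obtain it (pid file, service manager) is outside the scope of this page, and the function name request_cache_clear is hypothetical.

```python
import os
import signal


def request_cache_clear(pid: int) -> None:
    """Send SIGHUP to the given process (POSIX only)."""
    # Raises ProcessLookupError if no such process exists and
    # PermissionError if the caller may not signal it.
    os.kill(pid, signal.SIGHUP)
```

To confirm the clear took effect, compare cache_clears_total before and after sending the signal.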

Design Principles

  • Read-only endpoint
  • Fail-safe behavior (errors drop requests without side effects)
  • Low overhead
  • Deterministic output ordering
  • Explicit enablement

Operational Notes

  • The endpoint binds to 127.0.0.1 by default
  • If you bind to a non-loopback address, restrict access with firewall rules
  • Use a local scraper or sidecar to collect metrics