Cloud Observability

ArcadeDB exposes a vendor-neutral observability stack for production (Kubernetes) deployments: always-on RED metrics, OpenTelemetry distributed tracing, structured JSON logging with correlation IDs, and standard health probes. Everything described here is opt-in and behavior-preserving: with no new configuration, an upgraded server behaves exactly as before. The existing /prometheus scrape endpoint, /api/v1/server JSON, and /api/v1/ready semantics are left unchanged; the new features only add to them.

The design follows an instrument-once principle: each hot path (HTTP request, query/command, transaction) is wrapped in a single Micrometer Observation. When only metrics are enabled, an Observation is a timer; when the optional tracing plugin is loaded, the same instrumentation also emits a span. A correlation ID then ties the resulting metric, span, and log line together.

Metrics depth

ArcadeDB builds on the Micrometer framework already in the server. Metrics are enabled by default (arcadedb.serverMetrics=true) and exposed on the Prometheus scrape endpoint /prometheus (see Monitoring for scrape and Grafana setup). The series below are additive — existing metrics and the /prometheus output shape are unchanged.

RED latency timers

Two always-on timers record request/query latency with a percentile histogram (server-side p50/p95/p99). Query text is never used as a tag, keeping cardinality bounded.

Timer	Tags	Prometheus series
`arcadedb.http.requests`	`method`, `path` (route template, e.g. `/query/{database}` — never the raw URI), `status`, `db`	`arcadedb_http_requests_seconds_*`
`arcadedb.query.duration`	`protocol` (`http`, `bolt`, `postgres`, `mongo`, `grpc`, `redis`, `internal`), `db`, `language` (`sql`, `opencypher`, `gremlin`, `graphql`, `mongo`), `type` (`query` or `command`)	`arcadedb_query_duration_seconds_*`

The protocol tag defaults to internal for embedded/in-process callers; dashboards typically filter protocol!="internal" to isolate client traffic. Redis commands and unfiltered Mongo scans do not pass through the query engine and are therefore not recorded by arcadedb.query.duration.
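Because the timers publish a percentile histogram, a latency percentile can also be estimated offline from the cumulative bucket counts, in the same spirit as Prometheus's histogram_quantile. The sketch below is a simplified illustration, not ArcadeDB code: it assumes you have already collected (upper bound, cumulative count) pairs from a `_bucket` series such as arcadedb_http_requests_seconds_bucket (the exact series name is an assumption based on Micrometer's usual Prometheus naming), and the sample numbers in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: iterable of (upper_bound_seconds, cumulative_count) pairs, as found
    in the `le` label and value of Prometheus `_bucket` series. The list should
    include the +Inf bucket (use float("inf")).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")  # no observations recorded yet

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical buckets [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)], the 0.95 quantile lands in the 0.5 to 1.0 second bucket and is interpolated from the counts on either side.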

Engine gauges

EngineMetricsBinder bridges the internal engine Profiler into Micrometer gauges (engine-wide aggregates, refreshed from a one-second-memoized snapshot):

Gauge	Meaning
`arcadedb.engine.page.cache.hits` / `.page.cache.misses`	Page cache hit/miss counters
`arcadedb.engine.pages.read` / `.pages.written`	Pages read from / written to disk
`arcadedb.engine.wal.bytes.written` / `.wal.files`	Write-ahead-log bytes and file count
`arcadedb.engine.mvcc.conflicts`	MVCC concurrent-modification conflicts
`arcadedb.engine.files.open`	Open file handles
`arcadedb.engine.tx.write` / `.tx.read` / `.tx.rollbacks`	Write/read transactions and rollbacks
`arcadedb.engine.queries` / `.commands`	Query and command counts
`arcadedb.engine.databases`	Number of open databases
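
As an illustration of how these gauges can be consumed, the following sketch derives a page cache hit ratio from the text scraped off /prometheus. It is a minimal example under stated assumptions: the Prometheus series names (arcadedb_engine_page_cache_hits, arcadedb_engine_page_cache_misses) are inferred from the usual dot-to-underscore conversion and may differ in a real scrape, and the parser ignores labels and optional timestamps.

```python
def parse_samples(text):
    """Parse Prometheus text exposition into {series_name: value}.

    Simplification: labels are dropped and lines are assumed to carry no
    trailing timestamp, so series differing only by labels overwrite each other.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name] = float(value)
    return samples


def page_cache_hit_ratio(samples):
    """Return hits / (hits + misses), or None when nothing has been counted."""
    # Assumed series names; check them against your own /prometheus output.
    hits = samples.get("arcadedb_engine_page_cache_hits", 0.0)
    misses = samples.get("arcadedb_engine_page_cache_misses", 0.0)
    total = hits + misses
    return hits / total if total else None
```

A ratio that drops over time while arcadedb.engine.pages.read climbs would suggest the working set no longer fits in the page cache.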

Optional OTLP metrics export

To push metrics to an OpenTelemetry Collector (in addition to the unchanged /prometheus scrape), enable the OTLP registry:

arcadedb.serverMetrics.otlp.enabled=true
arcadedb.serverMetrics.otlp.endpoint=http://otel-collector:4317

This registers an OTLP meter registry alongside Prometheus; the scrape path is unaffected whether or not OTLP is enabled. The OTLP exporter lives in the optional metrics module bundled with the standard distributions.

Distributed tracing

Distributed tracing is an optional plugin (the tracing module, shipped only in the full distribution). When absent or disabled, the in-core Observation calls are no-ops with zero span overhead. When enabled, the same instrumentation points used by the metrics timers also emit OpenTelemetry spans.

Enable tracing and point it at your OTLP collector:

arcadedb.serverMetrics.tracing.enabled=true
arcadedb.serverMetrics.tracing.endpoint=http://otel-collector:4317
arcadedb.serverMetrics.tracing.samplingRate=0.1

tracing.endpoint is an OTLP/gRPC endpoint (default port 4317).
tracing.samplingRate is a parent-based ratio in [0.0, 1.0]: 1.0 samples every trace, 0.0 (default) disables sampling.
If the endpoint is unreachable, tracing self-disables and logs an error — server startup is never blocked.
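
To make "parent-based ratio" concrete, here is a small sketch of how such a sampling decision is commonly made in OpenTelemetry-style samplers. It is illustrative only and not taken from the ArcadeDB plugin: the function name is hypothetical, and the choice of comparing the low 64 bits of the trace ID against the ratio is one common convention that the actual sampler may implement differently.

```python
def should_sample(trace_id_hex, ratio, parent_sampled=None):
    """Decide whether to record a trace.

    trace_id_hex:   32 hex characters (the W3C trace ID).
    ratio:          sampling rate in [0.0, 1.0].
    parent_sampled: sampled flag of the upstream span, or None for a root span.
    """
    if len(trace_id_hex) != 32:
        raise ValueError("trace ID must be 32 hex characters")
    low_bits = int(trace_id_hex[-16:], 16)  # raises ValueError on non-hex input

    # Parent-based: when an upstream decision exists, follow it.
    if parent_sampled is not None:
        return parent_sampled

    # Root span: fall back to a deterministic ratio check on the trace ID,
    # so every service that sees this ID reaches the same decision.
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    return low_bits < int(ratio * (1 << 64))
```

The practical consequence is that with samplingRate=0.1 a request arriving with an already-sampled upstream trace is still traced, while roughly one in ten root requests starts a new sampled trace.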

Two span types are produced and nest correctly (query spans are children of the HTTP span within one trace):

Span	Attributes
`arcadedb.http.server.requests`	`method`, `path` (route template), `db`, `status`
`arcadedb.query`	`protocol`, `db`, `language`, `type`, and `db.statement` (the full query/command text)

The db.statement span attribute contains the full query/command text, which may include sensitive data. Secure your OTLP collector endpoint and apply collector-side redaction if required.

ArcadeDB continues an upstream trace by reading the inbound W3C traceparent HTTP header. Resource attributes (service.name, service.instance.id), OTEL_RESOURCE_ATTRIBUTES, and trace-context propagation over the replication (Raft) and binary wire protocols are not yet implemented; each non-HTTP request starts a fresh root span.
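
For reference, a version-00 W3C traceparent header has four dash-separated lowercase hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters, where bit 0 means "sampled"). The sketch below parses and validates such a header; it illustrates the header format only and is not ArcadeDB's implementation, and the function name is hypothetical.

```python
import re

# version - trace-id - parent-id - flags, all lowercase hex (version 00 layout).
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the parsed fields of a traceparent header, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A client that wants its ArcadeDB query spans to appear under its own trace sends this header on the HTTP request; a missing or malformed header simply results in a new root trace.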

Structured logging and correlation IDs

By default the server logs human-readable text, byte-identical to previous releases. Set arcadedb.server.logFormat=json to emit one JSON object per line instead, suitable for log shippers (Loki, ELK):

arcadedb.server.logFormat=json

See Logging for the configuration underneath this setting: log levels, rotation, log directory, and routing the engine through SLF4J when embedding.

Each JSON line contains timestamp (local time, yyyy-MM-dd'T'HH:mm:ss.SSS), level, logger, thread, and message, plus, when present, requestId, db, traceId, spanId, and exception (full stack trace). Example:

{"timestamp":"2026-06-16T10:15:30.123","level":"INFO","logger":"com.arcadedb.server.ArcadeDBServer","thread":"http-1","message":"Query executed","requestId":"req-abc","db":"mydb","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"0123456789abcdef"}
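
Since each line is a standalone JSON object, pulling every log record for one request needs nothing more than a JSON parser. The following sketch shows the idea with a hypothetical helper; it relies only on the field names listed above and skips any line that is not valid JSON (for example, text output captured before the format was switched).

```python
import json


def lines_for_request(log_lines, request_id):
    """Return the parsed log records whose requestId equals request_id."""
    matches = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not a JSON log line; ignore it
        if isinstance(record, dict) and record.get("requestId") == request_id:
            matches.append(record)
    return matches
```

The same filter works on traceId when the tracing plugin is active, which is how a log line can be matched back to its span.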

Correlation IDs

Every HTTP request is tagged with a requestId: ArcadeDB reuses the inbound X-Request-Id header when present (sanitized, capped at 128 characters) or generates a UUID, and echoes it on the response. The db field is taken from the request’s database path parameter. When the tracing plugin is active, traceId/spanId are added too — but requestId works for correlation even with tracing disabled.
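
The reuse-or-generate rule can be pictured as the small function below. This is a sketch of the behavior described above, not the server's code: the 128-character cap comes from the text, but the exact sanitization rule is not documented here, so the allowed character set (letters, digits, dot, underscore, hyphen) is purely an assumption for illustration.

```python
import re
import uuid

MAX_REQUEST_ID_LENGTH = 128  # cap stated in the documentation
# Assumed sanitization rule: drop anything outside a conservative safe set.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._\-]")


def resolve_request_id(inbound_header):
    """Reuse a sanitized inbound X-Request-Id, or generate a fresh UUID."""
    if inbound_header:
        cleaned = _UNSAFE_CHARS.sub("", inbound_header)[:MAX_REQUEST_ID_LENGTH]
        if cleaned:
            return cleaned
    return str(uuid.uuid4())
```

Sanitizing before echoing matters because the value ends up in both a response header and log output, where control characters such as CR/LF could otherwise be used for header or log injection.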

For operators who keep text logs, arcadedb.server.logIncludeTrace=true appends ` [traceId=…]` to each line while a trace is active (default off, preserving current output):

arcadedb.server.logIncludeTrace=true

Health probes

ArcadeDB exposes two complementary, unauthenticated endpoints so orchestrators can tell a live process from a ready one. See the HTTP API reference for response details.

Endpoint	Probe	Behavior
`GET /api/v1/health`	liveness	Returns `204` whenever the HTTP layer is up. Performs no database I/O and never returns a busy/`503` status, so a node that is merely warming up is not killed.
`GET /api/v1/ready`	readiness	Returns `204` once the server is `ONLINE`; `503` (`Server not started yet`) before that. Unchanged by default.

Use /api/v1/health for the liveness probe and /api/v1/ready for the readiness probe:

livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 2480
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/v1/ready
    port: 2480
  initialDelaySeconds: 20
  periodSeconds: 10

HA-aware readiness

On a High Availability cluster you can make readiness reflect Raft membership so traffic is withheld until a node has joined the group:

arcadedb.server.readinessRequiresHA=true

When this flag is true and HA is active, /api/v1/ready applies two gates in order and returns 204 only after both pass:

Leader election: the endpoint returns 503 (Node has not yet joined the Raft group) until a leader has been elected.
Membership and catch-up: once a leader is known, the endpoint checks that the node is a member of the current Raft configuration and, for a follower, that it has caught up to within arcadedb.server.readinessHAMaxLag log entries (default 100) of the commit index; until then it keeps returning 503 (Node is not yet in the Raft configuration or has not caught up).

This prevents a (re)joined follower with a wiped or lagging log from being marked ready mid-catch-up, which would otherwise let a rolling restart drop the write quorum. The default, readinessRequiresHA=false, preserves current readiness behavior. See Kubernetes deployment for the full StatefulSet example.
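
The gate ordering described above can be summarized as a decision function. The sketch below is a model of the documented behavior for reasoning and testing purposes only; the NodeState fields and the way lag is computed (commit index minus the node's local log index) are assumptions, not ArcadeDB internals.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    # Hypothetical snapshot of what the readiness check would inspect.
    online: bool
    ha_active: bool
    leader_known: bool
    in_raft_configuration: bool
    is_leader: bool
    commit_index: int
    local_index: int  # assumed: last log index present on this node


def readiness(state, requires_ha=False, max_lag=100):
    """Return (http_status, message) for GET /api/v1/ready."""
    if not state.online:
        return 503, "Server not started yet"
    # Default behavior: HA state is ignored unless the flag is set and HA is active.
    if not (requires_ha and state.ha_active):
        return 204, ""
    # Gate 1: a leader must have been elected.
    if not state.leader_known:
        return 503, "Node has not yet joined the Raft group"
    # Gate 2: membership, plus catch-up for followers.
    lagging = (not state.is_leader) and (state.commit_index - state.local_index > max_lag)
    if not state.in_raft_configuration or lagging:
        return 503, "Node is not yet in the Raft configuration or has not caught up"
    return 204, ""
```

For example, a follower that is in the configuration but 500 entries behind the commit index stays at 503 with the default lag of 100, and becomes ready once the gap is 100 entries or fewer.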