Troubleshooting & Logs
This handbook is the single operational troubleshooting hub for PADAS: runtime diagnostics, incident triage, and Core/UI operational visibility in one place. Use it as a runtime troubleshooting playbook, incident response cheat sheet, and companion to runtime observability—not merely a “where log files live” appendix.
Related: Runtime configurations · Security · Monitoring · Testing · Control Tower · Cores · REST API Reference · Glossary
Operational troubleshooting strategy
Use this sequence during incident investigation to avoid random log tailing:
- Start with Monitoring — runtime telemetry, stream congestion, drops, EPS imbalance, and throughput imbalance surface faster than grepping `padas.log`.
- Confirm runtime state — Control Tower pipeline diagnostics and stage health show runtime state vs intent; catch deployment/runtime drift before deep dives.
- Verify deployment parity — check Management → Pipelines / Cores for needs-deploy flags, assignment gaps, and stale topology when runtime verification on the graph disagrees with expectations.
- Inspect Core logs — after hypotheses exist: correlate timestamps with metrics windows and Monitoring rows (Core logs and observability).
- Validate runtime APIs — `GET /api/v1/status`, `GET /api/v1/metrics`, and targeted stream/task routes for runtime API validation and backpressure hints (REST API and Core runtime).
- Replay traffic in Testing — replay investigation, retained replay datasets, and PDL diffing for malformed-replay or silent-filtering hypotheses.
- Escalate with the bundle described in When to open a ticket.
Runtime diagnostic toolkit
| Tool | Operational use |
|---|---|
| Monitoring | Live telemetry, stream congestion, connector saturation, runtime observability overlays; first stop for throughput imbalance and backpressure triage. |
| Testing | Replay investigation, transformation validation, retained replay after incidents; isolates silent filtering vs engine failure. |
| Control Tower | Pipeline diagnostics, stage health, throughput overlays—runtime state on the graph before raw log deep-dives. |
| REST (`/api/v1/*`) | Runtime verification, auth/token health, consume/query probes, runtime API exposure checks from the same network path as automation. |
| Core logs | Deep runtime diagnostics when metrics and Monitoring narrow the blast radius. |
| UI logs | UI/API orchestration failures, session/auth issues, and correlation with browser-side runtime requests. |
Core logs and observability
[observability.logging] — production impact
Core logging is configured in `padas.toml` / the shipped `padas.default.toml` (padas-core `configs/padas.default.toml`).
| Key | Default (reference) | Operational notes |
|---|---|---|
| `level` | `info` | `trace` / `debug` increase formatting and I/O on hot paths—logging impacts throughput during storms; raise the level only on focused reproduction hosts. |
| `format` | `Text` | `Json` is preferred for aggregation, SIEM/SOAR pipelines, and structured logging workflows (field extraction, correlation IDs). |
| `file` | `./var/log/padas.log` | Resolved from `PADAS_HOME` unless absolute—size the filesystem for peaks, not averages. |
| `rotation` | `daily` | Works with size caps—avoid unbounded single files during incidents. |
| `retention_days` | `7` | Operational retention for on-host forensics; legal/compliance may require longer ship-to-object-store policies. |
| `max_file_size_bytes` | `10485760` | Rotation threshold under load—tune with EPS and burstiness. |
| `max_file_count` | `10` | Trade disk space against incident-investigation history. |
Stdout vs file: when `file` is omitted or only console appenders apply (per deployment), logs land on stdout—Kubernetes / systemd pipelines must capture it; do not assume `padas.log` exists without checking the unit or chart.
[observability.logging]
level = "info"
format = "Json"
file = "/var/lib/padas/var/log/padas.log"
rotation = "daily"
retention_days = 14
max_file_size_bytes = 10485760
max_file_count = 20
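A quick capacity check for the override above: the worst-case on-disk footprint is bounded by the product of the rotation caps (a sketch; actual pruning also depends on `retention_days`):

```python
# Worst-case on-disk footprint of rotated Core logs under the override above:
# rotation keeps at most max_file_count files of at most max_file_size_bytes each.
max_file_size_bytes = 10_485_760  # 10 MiB per file, as configured
max_file_count = 20

worst_case_bytes = max_file_size_bytes * max_file_count
print(f"worst case: {worst_case_bytes / (1024 * 1024):.0f} MiB")  # worst case: 200 MiB
```

Size the filesystem for this bound plus headroom, not for the average file size observed on a quiet day.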
Master switch, metrics, and logs — together
Logs alone are insufficient for a streaming runtime: pair Core logs with runtime telemetry.
- `[observability].enabled` gates metrics, logging integration, and system streams (Runtime configurations — Observability).
- `GET /api/v1/metrics` — Prometheus text for throughput diagnostics, drops, and subsystem counters; scrape or curl during incidents for telemetry correlation with log windows.
- `GET /api/v1/status` — runtime verification snapshot (resources, embedded health signals); use it before and after restarts to prove runtime state movement.
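`GET /api/v1/metrics` returns Prometheus text exposition; as a sketch of incident-time triage, the snippet below pulls drop-related counters out of a saved scrape. The metric names here are illustrative stand-ins, not PADAS's actual counter names:

```python
# Pull drop-related counters out of a saved /metrics scrape (Prometheus text format).
# Metric names below are illustrative, not PADAS's actual counters.
sample_scrape = """\
# HELP padas_stream_events_dropped_total Events dropped per stream
# TYPE padas_stream_events_dropped_total counter
padas_stream_events_dropped_total{stream="auth"} 1042
padas_stream_events_dropped_total{stream="netflow"} 0
padas_stream_events_in_total{stream="auth"} 99120
"""

def dropped_counters(text: str) -> dict:
    """Map every series whose name mentions 'dropped' to its sampled value."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        if "dropped" in series:
            out[series] = float(value)
    return out

print(dropped_counters(sample_scrape))
```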
Workflow: anchor a Monitoring spike to a UTC window → pull /metrics for the same window → tail Json lines with matching timestamps → only then widen trace if still ambiguous.
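The tail step of that workflow can be scripted. A minimal sketch that narrows `Json` log lines to the incident window, assuming an ISO-8601 `timestamp` field (adjust the key to whatever your Core build actually emits):

```python
import json
from datetime import datetime, timezone

# Filter structured (Json-format) log lines down to the incident window
# identified in Monitoring. Sample lines are illustrative; the "timestamp"
# field name is an assumption.
lines = [
    '{"timestamp": "2024-05-01T12:00:01Z", "level": "INFO",  "msg": "task started"}',
    '{"timestamp": "2024-05-01T12:05:30Z", "level": "ERROR", "msg": "sink write failed"}',
    '{"timestamp": "2024-05-01T13:00:00Z", "level": "INFO",  "msg": "task stopped"}',
]

def in_window(line: str, start: datetime, end: datetime) -> bool:
    ts = datetime.fromisoformat(json.loads(line)["timestamp"].replace("Z", "+00:00"))
    return start <= ts < end  # half-open UTC window

start = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 5, 1, 12, 10, tzinfo=timezone.utc)
hits = [l for l in lines if in_window(l, start, end)]
print(len(hits))  # 2
```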
Where to read logs (Linux / systemd)
- `journalctl -u padas-core` (unit name per package) when the process also logs to journal.
- Tail `file` on the Core host during a reproduce; align timezones with SIEM for `Json` correlation.
UI registration and Core connectivity
Operational failures almost always trace to network path validation, TLS trust mismatch, container localhost confusion, or stale tokens—not "PADAS is down." The UI server must be able to open the same TLS/scheme/port your operators type into the browser.
| Symptom | Likely cause | Moves |
|---|---|---|
| Cannot save; TLS errors | TLS trust mismatch or a blocked path from the UI server to the Core's `https://host:port` | Fix Host to a route the UI runtime can open; install the corporate CA; avoid laptop-only `curl --insecure` proofs for production decisions. |
| API errors after save | Core down, wrong Port, or stale tokens vs `padas.toml` | Check runtime API reachability with `GET /api/v1/status` using the stored Bearer; refresh `service-account.token` and update the Core row. |
| Works from laptop, fails from UI | `localhost` means the pod/container loopback, not the Core host | Re-test from the UI pod's network namespace; use service DNS or a LAN IP. |
| Token missing on create | Core never materialized `data/security/service-account.token` | Startup ordering: start the Core once; confirm the file under `PADAS_HOME`. |
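The loopback and missing-token rows can be pre-checked before saving the Core row. A minimal sketch, assuming the token path from the table above; the `preflight` helper is illustrative, not part of PADAS:

```python
from pathlib import Path
from urllib.parse import urlparse

# Pre-flight checks before registering a Core in the UI: flag loopback hosts
# (which resolve inside the UI pod/container, not on the Core host) and a
# missing service-account.token under PADAS_HOME.
LOOPBACK = {"localhost", "127.0.0.1", "::1"}

def preflight(core_url: str, padas_home: str) -> list[str]:
    problems = []
    host = urlparse(core_url).hostname
    if host in LOOPBACK:
        problems.append(f"{host} is loopback: from the UI pod this is not the Core host")
    token = Path(padas_home) / "data" / "security" / "service-account.token"
    if not token.is_file():
        problems.append(f"missing {token}: start the Core once to materialize it")
    return problems

print(preflight("https://localhost:9000", "/opt/padas"))
```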
Full registration semantics: Cores.
Quickstart: Core only (API bundle)
| Symptom | Investigation |
|---|---|
| No `[SINK]` output | Startup ordering: TCP sink helper listening on 8081 before the bundle push; Core can route to 127.0.0.1 from its network namespace. |
| Source / port errors | Runtime reachability: port 8080 collisions; `PADAS_TCP_SOURCE_*` drift vs the helper. |
| Low volume | Silent filtering: the sample PDL's `action = "login"` filter drops most synthetic lines—not a silent engine stall. |
| curl / push TLS failures | TLS handshake mismatch vs `[api.tls]`; Bearer missing vs `[api.auth]`; the script's self-signed retry vs production CA policy. |
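The low-volume row deserves a number: if the synthetic generator spreads events across several actions, the sample filter's pass rate is simply one part of that mix. A sketch with an assumed four-way action mix:

```python
# Why "low volume" after the Core-only quickstart is usually expected filtering:
# the sample PDL keeps only action = "login" events. With a 4-way action mix
# (an assumption for illustration), 75% of lines are dropped by design,
# not lost to a stalled engine.
actions = ["login", "logout", "read", "write"]
synthetic = [{"action": actions[i % len(actions)]} for i in range(1000)]

kept = [e for e in synthetic if e["action"] == "login"]  # the PDL filter, in miniature
print(f"{len(kept)}/{len(synthetic)} events pass")  # 250/1000 events pass
```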
Guide: Quickstart: Core only.
Quickstart: Core + UI (pipeline)
| Symptom | Investigation |
|---|---|
| UI cannot reach Core | Same as UI registration—plus firewall paths for management calls. |
| Sink stuck | Connector startup failures or helper not bound before deploy; routing mismatches between Core 127.0.0.1 and real sink host env. |
| No `[SINK]` output | Silent filtering plus the task not running on Home; allow wall time after deploy drift clears. |
| Import / assign errors | Deployment/runtime drift: leftover Core-only objects—delete bundle, reconcile Registry, retry clean. |
Guide: Quickstart: Core + UI.
REST API and Core runtime
| Symptom | Investigation |
|---|---|
| 401 storm | Missing/expired Bearer; `max_auth_attempts` auth lockouts per forwarded IP; wait `lockout_duration_secs` or fix the proxy's `X-Forwarded-For`. |
| TLS / TCP failures | TLS handshake mismatch, SNI, wrong port, or LB probes without a Bearer under runtime API exposure. |
| Slow consumes | `duration` caps; consumer backlog; StreamRouter vs WAL path (retained stream lag); `[core.subscriber.*]` lag thresholds. |
| Replay latency | WAL read path vs router; large `limit`/`offset` jumps; subscriber tuning implications after config pushes. |
| WAL pressure / disk | `retention_*`, `sync_writes`, segment caps; aggregation cleanup; compare `/metrics` disk-related counters with `df`. |
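For the 401-storm row, a client that keeps hammering a locked-out IP only extends the outage. A hedged sketch of a client-side guard that mirrors the server's lockout settings (`max_auth_attempts`, `lockout_duration_secs`); the request callable is a stand-in for your actual API call:

```python
import time

# Client-side guard mirroring the server's auth lockout: after
# max_auth_attempts consecutive 401s, stop hammering the API and wait out
# lockout_duration_secs before retrying. Defaults here are illustrative.
def call_with_lockout_guard(request, max_auth_attempts=5,
                            lockout_duration_secs=60, sleep=time.sleep):
    failures = 0
    while True:
        status = request()  # stand-in returning an HTTP status code
        if status != 401:
            return status
        failures += 1
        if failures >= max_auth_attempts:
            sleep(lockout_duration_secs)  # wait out the lockout window
            failures = 0

# Simulated responses: four 401s, then success.
responses = iter([401, 401, 401, 401, 200])
print(call_with_lockout_guard(lambda: next(responses)))  # 200
```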
Reference: REST API Reference.
Monitoring and pipeline playbooks
Treat this as an operational incident triage handbook layered on Monitoring—use Monitoring screenshots, row filters, and runtime metrics overlays as evidence.
| Incident pattern | Triage moves |
|---|---|
| Runaway drops | Sort by dropped events; identify stream congestion vs task; open Monitor on hottest stream; correlate EPS in vs out. |
| Backpressure investigation | Throughput imbalance across source → task → sink; check connector retry storms in logs after Monitoring narrows the hop. |
| Stalled downstream sink | Sink events out flat while upstream EPS stable—destination health, auth, blocked consumer downstream, or sink backpressure policy. |
| Blocked consumer | Consumer counts vs producer EPS; WAL fallback indicators; Registry stream attachments. |
| EPS collapse | Recent deploy clock; connectors not running; CPU saturation vs EPS imbalance (host telemetry charts in Monitoring). |
| Malformed replay | Monitor tail → narrow Query → export JSON; move to Testing for replay verification after incident. |
| Post-deploy regression | Compare system telemetry windows; Query canonical PDL; capture Testing diff. |
| WAL / retention issues | Query windows vs Registry WAL flags vs Streams definitions—prove runtime state vs intent. |
| Connector saturation | Connector EPS + error columns + CPU/network overlays; plan capacity or batching outside Monitoring if sustained. |
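Several of the moves above reduce to the same comparison: EPS in vs EPS out per hop. A minimal sketch that localizes the worst hop from per-stage readings (numbers illustrative):

```python
# Localize a throughput imbalance: given per-stage EPS readings from
# Monitoring (illustrative numbers), find the hop where the smallest
# fraction of upstream events makes it through.
eps = {"source": 12_000, "task": 11_800, "sink": 2_100}

def worst_hop(eps_by_stage: dict) -> tuple:
    names = list(eps_by_stage)
    worst, worst_ratio = None, 1.0
    for a, b in zip(names, names[1:]):  # adjacent hops in pipeline order
        ratio = eps_by_stage[b] / eps_by_stage[a] if eps_by_stage[a] else 1.0
        if ratio < worst_ratio:
            worst, worst_ratio = (a, b), ratio
    return worst, worst_ratio

hop, ratio = worst_hop(eps)
print(hop, f"{ratio:.0%} of upstream EPS survives")
```

Here the task → sink hop loses most of the traffic, pointing at destination health or sink backpressure rather than the source.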
Control Tower (Home graph)
Control Tower gives graph state as operational hinting: broken edges, grey stages, or pipeline diagnostics that disagree with Monitoring often signal deployment/runtime drift before logs explain why.
- Throughput overlays help bottleneck localization along the pipeline without opening `padas.log`.
- Pipeline state may diverge from runtime health—registry intent can be green while stages error; trust stage badges first, then cross-check Management deploy.
- Use Control Tower before raw log deep-dives; escalate to Core logs and `/metrics` when the graph and Monitoring disagree or time-skew hides the fault.
UI server logs
Under `PADAS_UI_HOME` (padas-ui `server/config/paths.js`), `var/log/` carries `padas.out` and rotated `padas.%DATE%.log` (names per packaging).
- UI API traces and auth/session debugging when the browser succeeds but Configurations/Management calls fail—correlate timestamps with Core windows.
- Frontend/backend operational correlation: pair UI log lines with the same UTC slice from Core `Json` logs and `/metrics` scrapes.
- Operational retention: UI audit and access logs can grow; size `var/log` like Core log volumes.
When to open a ticket
Engineering reproduction expects a single bundle, not narrative alone:
- `padas.log` / journal excerpt with timezone-accurate timestamps and the log level used during capture.
- `GET /api/v1/status` and `/metrics` snapshots (redact secrets) covering the incident window—runtime API snapshots prove runtime verification state.
- Monitoring table export or screenshots showing drops/EPS/stage context.
- Testing capture or replay dataset when replay investigation matters.
- WAL/runtime state: relevant Registry stream JSON and `padas.toml` deltas if deployment/runtime drift is suspected.
That package mirrors how runtime issues are bisected across UI ↔ Core boundaries in production operations.
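Collecting that bundle is scriptable. A minimal sketch that tars up whichever artifacts were captured; the file names are suggestions, and the API snapshots are assumed to have been saved (and redacted) beforehand:

```python
import tarfile
from pathlib import Path

# Assemble the escalation bundle into a single tarball. Missing artifacts are
# reported rather than failing the whole collection, so a partial bundle still
# ships.
def make_bundle(artifacts, out="padas-incident.tar.gz") -> Path:
    out_path = Path(out)
    with tarfile.open(out_path, "w:gz") as tar:
        for name in artifacts:
            p = Path(name)
            if p.is_file():
                tar.add(p, arcname=p.name)  # flatten into the bundle root
            else:
                print(f"skipping missing artifact: {p}")
    return out_path

# Example (file names are suggestions, not a required layout):
# make_bundle(["padas.log", "status.json", "metrics.txt", "monitoring.csv"])
```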