Troubleshooting & Logs
This handbook is the single operational troubleshooting hub for PADAS: runtime diagnostics, incident triage, and Core/UI operational visibility in one place. Use it as a runtime troubleshooting playbook, incident response cheat sheet, and companion to runtime observability—not merely a “where log files live” appendix.
Related: Runtime configurations · Security · Monitoring · Testing · Control Tower · Cores · REST API Reference · Glossary
Operational troubleshooting strategy
Use this sequence during incident investigation to avoid random log tailing:
- Start with Monitoring — runtime telemetry, stream congestion, drops, EPS imbalance, and throughput imbalance surface faster than grepping `padas.log`.
- Confirm runtime state — Control Tower pipeline diagnostics and stage health show runtime state vs intent; catch deployment/runtime drift before deep dives.
- Verify deployment parity — check Management → Pipelines / Cores for needs-deploy flags, assignment gaps, and stale topology when runtime verification on the graph disagrees with expectations.
- Inspect Core logs — after hypotheses exist: correlate timestamps with metrics windows and Monitoring rows (Core logs and observability).
- Validate runtime APIs — `GET /api/v1/status`, `GET /api/v1/metrics`, and targeted stream/task routes for runtime API validation and backpressure hints (REST API and Core runtime).
- Replay traffic in Testing — replay investigation, retained replay datasets, and PDL diffing for malformed-replay or silent-filtering hypotheses.
- Escalate with the bundle described in When to open a ticket.
Runtime diagnostic toolkit
| Tool | Operational use |
|---|---|
| Monitoring | Live telemetry, stream congestion, connector saturation, runtime observability overlays; first stop for throughput imbalance and backpressure triage. |
| Testing | Replay investigation, transformation validation, retained replay after incidents; isolates silent filtering vs engine failure. |
| Control Tower | Pipeline diagnostics, stage health, throughput overlays—runtime state on the graph before raw log deep-dives. |
| REST (`/api/v1/*`) | Runtime verification, auth/token health, consume/query probes, runtime API exposure checks from the same network path as automation. |
| Core logs | Deep runtime diagnostics when metrics and Monitoring narrow the blast radius. |
| UI logs | UI/API orchestration failures, session/auth issues, and correlation with browser-side runtime requests. |
Core logs and observability
[observability.logging] — production impact
Core logging is configured in `padas.toml` / the shipped `padas.default.toml` (padas-core `configs/padas.default.toml`).
| Key | Default (reference) | Operational notes |
|---|---|---|
| `level` | `info` | `trace` / `debug` increase formatting and I/O on hot paths—logging impacts throughput during storms; raise the level only on focused reproduction hosts. |
| `format` | `Text` | `Json` is preferred for aggregation, SIEM/SOAR pipelines, and structured logging workflows (field extraction, correlation IDs). |
| `file` | `./var/log/padas.log` | Resolved from `PADAS_HOME` unless absolute—size the filesystem for peaks, not averages. |
| `rotation` | `daily` | Works with size caps—avoid unbounded single files during incidents. |
| `retention_days` | `7` | Operational retention for on-host forensics; legal/compliance may require longer ship-to-object-store policies. |
| `max_file_size_bytes` | `10485760` | Rotation threshold under load—tune with EPS and burstiness. |
| `max_file_count` | `10` | Trade disk space against incident-investigation history. |
Stdout vs file: when `file` is omitted or only console appenders apply (per deployment), logs land on stdout—Kubernetes / systemd pipelines must capture it; do not assume `padas.log` exists without checking the unit or chart.
[observability.logging]
level = "info"
format = "Json"
file = "/var/lib/padas/var/log/padas.log"
rotation = "daily"
retention_days = 14
max_file_size_bytes = 10485760
max_file_count = 20
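A quick capacity check for the override above: the worst-case on-disk footprint is bounded by the product of the rotation caps (a sketch; actual pruning also depends on `retention_days`):

```python
# Worst-case on-disk footprint of rotated Core logs under the override above:
# rotation keeps at most max_file_count files of at most max_file_size_bytes each.
max_file_size_bytes = 10_485_760  # 10 MiB per file, as configured
max_file_count = 20

worst_case_bytes = max_file_size_bytes * max_file_count
print(f"worst case: {worst_case_bytes / (1024 * 1024):.0f} MiB")  # worst case: 200 MiB
```

Size the filesystem for this bound plus headroom, not for the average file size observed on a quiet day.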
Master switch, metrics, and logs — together
Logs alone are insufficient for a streaming runtime: pair Core logs with runtime telemetry.
- `[observability].enabled` gates metrics, logging integration, and system streams (Runtime configurations — Observability).
- `GET /api/v1/metrics` — Prometheus text for throughput diagnostics, drops, and subsystem counters; scrape or curl during incidents for telemetry correlation with log windows.
- `GET /api/v1/status` — runtime verification snapshot (resources, embedded health signals); use it before and after restarts to prove runtime state movement.
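`GET /api/v1/metrics` returns Prometheus text exposition; as a sketch of incident-time triage, the snippet below pulls drop-related counters out of a saved scrape. The metric names here are illustrative stand-ins, not PADAS's actual counter names:

```python
# Pull drop-related counters out of a saved /metrics scrape (Prometheus text format).
# Metric names below are illustrative, not PADAS's actual counters.
sample_scrape = """\
# HELP padas_stream_events_dropped_total Events dropped per stream
# TYPE padas_stream_events_dropped_total counter
padas_stream_events_dropped_total{stream="auth"} 1042
padas_stream_events_dropped_total{stream="netflow"} 0
padas_stream_events_in_total{stream="auth"} 99120
"""

def dropped_counters(text: str) -> dict:
    """Map every series whose name mentions 'dropped' to its sampled value."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        if "dropped" in series:
            out[series] = float(value)
    return out

print(dropped_counters(sample_scrape))
```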
Workflow: anchor a Monitoring spike to a UTC window → pull /metrics for the same window → tail Json lines with matching timestamps → only then widen trace if still ambiguous.
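The tail step of that workflow can be scripted. A minimal sketch that narrows `Json` log lines to the incident window, assuming an ISO-8601 `timestamp` field (adjust the key to whatever your Core build actually emits):

```python
import json
from datetime import datetime, timezone

# Filter structured (Json-format) log lines down to the incident window
# identified in Monitoring. Sample lines are illustrative; the "timestamp"
# field name is an assumption.
lines = [
    '{"timestamp": "2024-05-01T12:00:01Z", "level": "INFO",  "msg": "task started"}',
    '{"timestamp": "2024-05-01T12:05:30Z", "level": "ERROR", "msg": "sink write failed"}',
    '{"timestamp": "2024-05-01T13:00:00Z", "level": "INFO",  "msg": "task stopped"}',
]

def in_window(line: str, start: datetime, end: datetime) -> bool:
    ts = datetime.fromisoformat(json.loads(line)["timestamp"].replace("Z", "+00:00"))
    return start <= ts < end  # half-open UTC window

start = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 5, 1, 12, 10, tzinfo=timezone.utc)
hits = [l for l in lines if in_window(l, start, end)]
print(len(hits))  # 2
```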
Where to read logs (Linux / systemd)
- `journalctl -u padas-core` (unit name per package) when the process also logs to journal.
- Tail `file` on the Core host during a reproduce; align timezones with SIEM for `Json` correlation.
UI registration and Core connectivity
Operational failures almost always trace to network path validation, TLS trust mismatch, container localhost confusion, or stale tokens—not "PADAS is down." The UI server must be able to open the same TLS/scheme/port your operators type into the browser.
| Symptom | Likely cause | Moves |
|---|---|---|
| Cannot save; TLS errors | TLS trust mismatch or a blocked path from the UI server to the Core's `https://host:port` | Fix Host to a route the UI runtime can open; install the corporate CA; avoid laptop-only `curl --insecure` proofs for production decisions. |
| API errors after save | Core down, wrong Port, or stale tokens vs `padas.toml` | Check runtime API reachability with `GET /api/v1/status` using the stored Bearer; refresh `service-account.token` and update the Core row. |
| Works from laptop, fails from UI | `localhost` means the pod/container loopback, not the Core host | Re-test from the UI pod's network namespace; use service DNS or a LAN IP. |
| Token missing on create | Core never materialized `data/security/service-account.token` | Startup ordering: start the Core once; confirm the file under `PADAS_HOME`. |
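The loopback and missing-token rows can be pre-checked before saving the Core row. A minimal sketch, assuming the token path from the table above; the `preflight` helper is illustrative, not part of PADAS:

```python
from pathlib import Path
from urllib.parse import urlparse

# Pre-flight checks before registering a Core in the UI: flag loopback hosts
# (which resolve inside the UI pod/container, not on the Core host) and a
# missing service-account.token under PADAS_HOME.
LOOPBACK = {"localhost", "127.0.0.1", "::1"}

def preflight(core_url: str, padas_home: str) -> list[str]:
    problems = []
    host = urlparse(core_url).hostname
    if host in LOOPBACK:
        problems.append(f"{host} is loopback: from the UI pod this is not the Core host")
    token = Path(padas_home) / "data" / "security" / "service-account.token"
    if not token.is_file():
        problems.append(f"missing {token}: start the Core once to materialize it")
    return problems

print(preflight("https://localhost:9000", "/opt/padas"))
```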
Full registration semantics: Cores.
Quickstart: Core only (API bundle)
| Symptom | Investigation |
|---|---|
| No `[SINK]` output | Startup ordering: TCP sink helper listening on 8081 before the bundle push; Core can route to 127.0.0.1 from its network namespace. |
| Source / port errors | Runtime reachability: port 8080 collisions; `PADAS_TCP_SOURCE_*` drift vs the helper. |
| Low volume | Silent filtering: the sample PDL's `action = "login"` filter drops most synthetic lines—not a silent engine stall. |
| curl / push TLS failures | TLS handshake mismatch vs `[api.tls]`; Bearer missing vs `[api.auth]`; the script's self-signed retry vs production CA policy. |
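The low-volume row deserves a number: if the synthetic generator spreads events across several actions, the sample filter's pass rate is simply one part of that mix. A sketch with an assumed four-way action mix:

```python
# Why "low volume" after the Core-only quickstart is usually expected filtering:
# the sample PDL keeps only action = "login" events. With a 4-way action mix
# (an assumption for illustration), 75% of lines are dropped by design,
# not lost to a stalled engine.
actions = ["login", "logout", "read", "write"]
synthetic = [{"action": actions[i % len(actions)]} for i in range(1000)]

kept = [e for e in synthetic if e["action"] == "login"]  # the PDL filter, in miniature
print(f"{len(kept)}/{len(synthetic)} events pass")  # 250/1000 events pass
```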
Guide: Quickstart: Core only.
Quickstart: Core + UI (pipeline)
| Symptom | Investigation |
|---|---|
| UI cannot reach Core | Same as UI registration—plus firewall paths for management calls. |
| Sink stuck | Connector startup failures or helper not bound before deploy; routing mismatches between Core 127.0.0.1 and real sink host env. |
| No `[SINK]` output | Silent filtering plus the task not running on Home; allow wall time after deploy drift clears. |
| Import / assign errors | Deployment/runtime drift: leftover Core-only objects—delete bundle, reconcile Registry, retry clean. |
Guide: Quickstart: Core + UI.
REST API and Core runtime
| Symptom | Investigation |
|---|---|
| 401 storm | Missing/expired Bearer; `max_auth_attempts` auth lockouts per forwarded IP; wait `lockout_duration_secs` or fix the proxy's `X-Forwarded-For`. |
| TLS / TCP failures | TLS handshake mismatch, SNI, wrong port, or LB probes without a Bearer under runtime API exposure. |
| Slow consumes | `duration` caps; consumer backlog; StreamRouter vs WAL path (retained stream lag); `[core.subscriber.*]` lag thresholds. |
| Replay latency | WAL read path vs router; large `limit`/`offset` jumps; subscriber tuning implications after config pushes. |
| WAL pressure / disk | `retention_*`, `sync_writes`, segment caps; aggregation cleanup; compare `/metrics` disk-related counters with `df`. |
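For the 401-storm row, a client that keeps hammering a locked-out IP only extends the outage. A hedged sketch of a client-side guard that mirrors the server's lockout settings (`max_auth_attempts`, `lockout_duration_secs`); the request callable is a stand-in for your actual API call:

```python
import time

# Client-side guard mirroring the server's auth lockout: after
# max_auth_attempts consecutive 401s, stop hammering the API and wait out
# lockout_duration_secs before retrying. Defaults here are illustrative.
def call_with_lockout_guard(request, max_auth_attempts=5,
                            lockout_duration_secs=60, sleep=time.sleep):
    failures = 0
    while True:
        status = request()  # stand-in returning an HTTP status code
        if status != 401:
            return status
        failures += 1
        if failures >= max_auth_attempts:
            sleep(lockout_duration_secs)  # wait out the lockout window
            failures = 0

# Simulated responses: four 401s, then success.
responses = iter([401, 401, 401, 401, 200])
print(call_with_lockout_guard(lambda: next(responses)))  # 200
```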
Reference: REST API Reference.
Monitoring and pipeline playbooks
Treat this as an operational incident triage handbook layered on Monitoring—use Monitoring screenshots, row filters, and runtime metrics overlays as evidence.
| Incident pattern | Triage moves |
|---|---|
| Runaway drops | Sort by dropped events; identify stream congestion vs task; open Monitor on hottest stream; correlate EPS in vs out. |
| Backpressure investigation | Throughput imbalance across source → task → sink; check connector retry storms in logs after Monitoring narrows the hop. |
| Stalled downstream sink | Sink events out flat while upstream EPS stable—destination health, auth, blocked consumer downstream, or sink backpressure policy. |
| Blocked consumer | Consumer counts vs producer EPS; WAL fallback indicators; Registry stream attachments. |
| EPS collapse | Recent deploy clock; connectors not running; CPU saturation vs EPS imbalance (host telemetry charts in Monitoring). |
| Malformed replay | Monitor tail → narrow Query → export JSON; move to Testing for replay verification after incident. |
| Post-deploy regression | Compare system telemetry windows; Query canonical PDL; capture Testing diff. |
| WAL / retention issues | Query windows vs Registry WAL flags vs Streams definitions—prove runtime state vs intent. |
| Connector saturation | Connector EPS + error columns + CPU/network overlays; plan capacity or batching outside Monitoring if sustained. |
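Several of the moves above reduce to the same comparison: EPS in vs EPS out per hop. A minimal sketch that localizes the worst hop from per-stage readings (numbers illustrative):

```python
# Localize a throughput imbalance: given per-stage EPS readings from
# Monitoring (illustrative numbers), find the hop where the smallest
# fraction of upstream events makes it through.
eps = {"source": 12_000, "task": 11_800, "sink": 2_100}

def worst_hop(eps_by_stage: dict) -> tuple:
    names = list(eps_by_stage)
    worst, worst_ratio = None, 1.0
    for a, b in zip(names, names[1:]):  # adjacent hops in pipeline order
        ratio = eps_by_stage[b] / eps_by_stage[a] if eps_by_stage[a] else 1.0
        if ratio < worst_ratio:
            worst, worst_ratio = (a, b), ratio
    return worst, worst_ratio

hop, ratio = worst_hop(eps)
print(hop, f"{ratio:.0%} of upstream EPS survives")
```

Here the task → sink hop loses most of the traffic, pointing at destination health or sink backpressure rather than the source.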
Control Tower (Home graph)
Control Tower gives graph state as operational hinting: broken edges, grey stages, or pipeline diagnostics that disagree with Monitoring often signal deployment/runtime drift before logs explain why.
- Throughput overlays help bottleneck localization along the pipeline without opening `padas.log`.
- Pipeline state may diverge from runtime health—registry intent can be green while stages error; trust stage badges first, then cross-check Management deploy.
- Use Control Tower before raw log deep-dives; escalate to Core logs and `/metrics` when the graph and Monitoring disagree or time-skew hides the fault.
UI server logs
Under `PADAS_UI_HOME` (padas-ui `server/config/paths.js`), `var/log/` carries `padas.out` and rotated `padas.%DATE%.log` (names per packaging).
- UI API traces and auth/session debugging when the browser succeeds but Configurations/Management calls fail—correlate timestamps with Core windows.
- Frontend/backend operational correlation: pair UI log lines with the same UTC slice from Core `Json` logs and `/metrics` scrapes.
- Operational retention: UI audit and access logs can grow; size `var/log` like Core log volumes.
When to open a ticket
Engineering reproduction expects a single bundle, not narrative alone:
- `padas.log` / journal excerpt with timezone-accurate timestamps and the log level used during capture.
- `GET /api/v1/status` and `/metrics` snapshots (redact secrets) covering the incident window—runtime API snapshots prove runtime verification state.
- Monitoring table export or screenshots showing drops/EPS/stage context.
- Testing capture or replay dataset when replay investigation matters.
- WAL/runtime state: relevant Registry stream JSON and `padas.toml` deltas if deployment/runtime drift is suspected.
That package mirrors how runtime issues are bisected across UI ↔ Core boundaries in production operations.
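Collecting that bundle is scriptable. A minimal sketch that tars up whichever artifacts were captured; the file names are suggestions, and the API snapshots are assumed to have been saved (and redacted) beforehand:

```python
import tarfile
from pathlib import Path

# Assemble the escalation bundle into a single tarball. Missing artifacts are
# reported rather than failing the whole collection, so a partial bundle still
# ships.
def make_bundle(artifacts, out="padas-incident.tar.gz") -> Path:
    out_path = Path(out)
    with tarfile.open(out_path, "w:gz") as tar:
        for name in artifacts:
            p = Path(name)
            if p.is_file():
                tar.add(p, arcname=p.name)  # flatten into the bundle root
            else:
                print(f"skipping missing artifact: {p}")
    return out_path

# Example (file names are suggestions, not a required layout):
# make_bundle(["padas.log", "status.json", "metrics.txt", "monitoring.csv"])
```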