Version: 2.0.0 (Latest)

Troubleshooting & Logs

This handbook is the single operational troubleshooting hub for PADAS: runtime diagnostics, incident triage, and Core/UI operational visibility in one place. Use it as a runtime troubleshooting playbook, incident response cheat sheet, and companion to runtime observability—not merely a “where log files live” appendix.

Related: Runtime configurations · Security · Monitoring · Testing · Control Tower · Cores · REST API Reference · Glossary


Operational troubleshooting strategy

Use this sequence during incident investigation to avoid random log tailing:

  1. Start with Monitoring — runtime telemetry, stream congestion, drops, EPS imbalance, and throughput imbalance surface faster than grepping padas.log.
  2. Confirm runtime state — Control Tower pipeline diagnostics and stage health show runtime state vs intent; catch deployment/runtime drift before deep dives.
  3. Verify deployment parity — Management → Pipelines / Cores for needs deploy, assign gaps, and stale topology when runtime verification on the graph disagrees with expectations.
  4. Inspect Core logs — after hypotheses exist: correlate timestamps with metrics windows and Monitoring rows (Core logs and observability).
  5. Validate runtime APIs — GET /api/v1/status, GET /api/v1/metrics, targeted stream/task routes for runtime API validation and backpressure hints (REST API and Core runtime); a curl sketch follows this list.
  6. Replay traffic in Testing — replay investigation, retained replay datasets, and PDL diffing for malformed replay or silent filtering hypotheses.
  7. Escalate with the bundle described in When to open a ticket.
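For step 5, a minimal probe pair can rule reachability and auth in or out before deeper digging. This is a sketch, not a prescribed procedure: core-host:8443 is a placeholder for your Core API endpoint, and the token path follows the data/security/service-account.token convention described under UI registration below.

```bash
# Minimal runtime API probes for step 5; host/port and token path are placeholders.
TOKEN="$(cat "$PADAS_HOME/data/security/service-account.token")"

# Runtime verification snapshot (resources, embedded health signals)
curl -fsS -H "Authorization: Bearer $TOKEN" "https://core-host:8443/api/v1/status"

# Prometheus text; head it to confirm the endpoint answers before a full scrape
curl -fsS -H "Authorization: Bearer $TOKEN" "https://core-host:8443/api/v1/metrics" | head -n 40
```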

Runtime diagnostic toolkit

| Tool | Operational use |
| --- | --- |
| Monitoring | Live telemetry, stream congestion, connector saturation, runtime observability overlays; first stop for throughput imbalance and backpressure triage. |
| Testing | Replay investigation, transformation validation, retained replay after incidents; isolates silent filtering vs engine failure. |
| Control Tower | Pipeline diagnostics, stage health, throughput overlays—runtime state on the graph before raw log deep-dives. |
| REST (/api/v1/*) | Runtime verification, auth/token health, consume/query probes, runtime API exposure checks from the same network path as automation. |
| Core logs | Deep runtime diagnostics when metrics and Monitoring narrow the blast radius. |
| UI logs | UI/API orchestration failures, session/auth issues, and correlation with browser-side runtime requests. |

Core logs and observability

[observability.logging] — production impact

Core logging is configured in padas.toml; the shipped reference defaults live in padas.default.toml (padas-core configs/padas.default.toml).

| Key | Default (reference) | Operational notes |
| --- | --- | --- |
| level | info | trace / debug increase formatting and I/O on hot paths—logging impacts throughput during storms; raise level only on focused reproduction hosts. |
| format | Text | Json is preferred for aggregation, SIEM/SOAR pipelines, and structured logging workflows (field extraction, correlation IDs). |
| file | ./var/log/padas.log | Resolved relative to PADAS_HOME unless absolute—size the filesystem for peaks, not averages. |
| rotation | daily | Works with size caps—avoid unbounded single files during incidents. |
| retention_days | 7 | Operational retention for on-host forensics; legal/compliance may require longer ship-to-object-store policies. |
| max_file_size_bytes | 10485760 | 10 MiB rotation threshold under load—tune with EPS and burstiness. |
| max_file_count | 10 | Trades disk space against incident investigation history. |

Stdout vs file: when file is omitted or only console appenders apply (per deployment), logs land on stdout; Kubernetes / systemd pipelines must capture it. Do not assume padas.log exists without checking the unit or chart.
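A quick way to confirm where logs actually land, as a sketch; the unit and deployment names are placeholders per your packaging:

```bash
# systemd: does the unit log to journal, a file, or inherit stdout?
systemctl cat padas-core | grep -E 'Standard(Output|Error)'

# Kubernetes: stdout is captured by the container runtime; read it directly
# (deployment name is a placeholder)
kubectl logs deploy/padas-core --tail=20
```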

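An example [observability.logging] block with tuned values rather than the shipped defaults: absolute file path, Json format, and longer retention (14 days, 20 files) than the reference table above.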
```toml
[observability.logging]
level = "info"
format = "Json"
file = "/var/lib/padas/var/log/padas.log"
rotation = "daily"
retention_days = 14
max_file_size_bytes = 10485760
max_file_count = 20
```

Master switch, metrics, and logs — together

Logs alone are insufficient for streaming runtime: pair Core logs with runtime telemetry.

  • [observability].enabled gates metrics, logging integration, and system streams (Runtime configurations — Observability).
  • GET /api/v1/metrics — Prometheus text for throughput diagnostics, drops, and subsystem counters; scrape or curl during incidents for telemetry correlation with log windows.
  • GET /api/v1/status — runtime verification snapshot (resources, embedded health signals); use before and after restarts to prove runtime state movement.

Workflow: anchor a Monitoring spike to a UTC window → pull /metrics for the same window → tail Json lines with matching timestamps → only then widen the log level to trace if the fault is still ambiguous.
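A sketch of that correlation loop, assuming Json logs with an ISO-8601 timestamp field; the field name, host, and log path are assumptions, so adjust them to your padas.toml:

```bash
# Example window: Monitoring spike anchored to 12:00-12:10 UTC
CORE="https://core-host:8443"   # placeholder
TOKEN="$(cat "$PADAS_HOME/data/security/service-account.token")"

# Pull /metrics twice across the window to see counter movement
curl -fsS -H "Authorization: Bearer $TOKEN" "$CORE/api/v1/metrics" > metrics-1.prom
sleep 60
curl -fsS -H "Authorization: Bearer $TOKEN" "$CORE/api/v1/metrics" > metrics-2.prom

# Tail Json lines in the same UTC slice; "timestamp" is an assumed field name
grep '"timestamp":"2024-01-01T12:0' /var/lib/padas/var/log/padas.log
```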

Where to read logs (Linux / systemd)

  • journalctl -u padas-core (unit name per package) when the process also logs to journal; a time-boxed example follows this list.
  • Tail the file on the Core host during reproduction; align timezones with SIEM for Json correlation.
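A hedged example of a time-boxed journal read; the unit name and window are illustrative:

```bash
# Keep everything in UTC so the slice lines up with SIEM and /metrics windows
journalctl -u padas-core --utc -o short-iso \
  --since "2024-01-01 12:00" --until "2024-01-01 12:30"
```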

UI registration and Core connectivity

Operational failures almost always trace to network path validation gaps, TLS trust mismatch, container localhost confusion, or stale tokens—not to “PADAS is down.” The UI server must be able to open the same TLS/scheme/port your operators type into the browser.

| Symptom | Likely cause | Moves |
| --- | --- | --- |
| Cannot save; TLS errors | TLS trust mismatch or blocked path from UI server to Core https://host:port | Fix Host to a route the UI runtime can open; install corporate CA; avoid laptop-only curl --insecure proofs for production decisions. |
| API errors after save | Core down, wrong Port, or stale tokens vs padas.toml | Runtime API reachability with GET /api/v1/status using the stored Bearer; refresh service-account.token and update the Core row. |
| Works from laptop, fails from UI | localhost means pod/container loopback, not the Core host | Re-test from the UI pod network namespace; use service DNS or LAN IP (probe sketch below). |
| Token missing on create | Core never materialized data/security/service-account.token | Startup ordering: start Core once; confirm the file under PADAS_HOME. |
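When the UI runs in a pod, prove reachability from inside that pod rather than from a laptop. A sketch assuming Kubernetes, curl available in the UI image, and placeholder deployment/service names:

```bash
# Expect 200. A timeout means the path is blocked from the UI network namespace;
# curl exit 35/60 points at TLS trust, not at Core being down.
kubectl exec deploy/padas-ui -- curl -sS -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer <stored-token>" \
  "https://padas-core.internal:8443/api/v1/status"
```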

Full registration semantics: Cores.


Quickstart: Core only (API bundle)

| Symptom | Investigation |
| --- | --- |
| No [SINK] | Startup ordering: TCP sink helper listening on 8081 before bundle push; Core can route to 127.0.0.1 from its network namespace. |
| Source / port errors | Runtime reachability: port 8080 collisions; PADAS_TCP_SOURCE_* drift vs helper. |
| Low volume | Silent filtering: the sample PDL action = "login" path drops most synthetic lines—not a silent engine stall. |
| curl / push TLS failures | TLS handshake mismatch vs [api.tls]; Bearer missing vs [api.auth]; script self-signed retry vs production CA policy. |
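Two pre-flight shell checks cover the first two rows; ports 8080/8081 come from the quickstart itself:

```bash
# Is the sink helper bound on 8081, and is 8080 already taken?
ss -ltn | grep -E ':(8080|8081)\b'

# Can this host reach the helper over loopback, the way the Core will?
nc -vz 127.0.0.1 8081
```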

Guide: Quickstart: Core only.


Quickstart: Core + UI (pipeline)

| Symptom | Investigation |
| --- | --- |
| UI cannot reach Core | Same as UI registration—plus firewall paths for management calls. |
| Sink stuck | Connector startup failures or helper not bound before deploy; routing mismatches between Core 127.0.0.1 and the real sink host env. |
| No [SINK] | Silent filtering plus a task not running on Home; allow wall time after deploy for drift to clear. |
| Import / assign errors | Deployment/runtime drift: leftover Core-only objects—delete the bundle, reconcile the Registry, retry clean. |

Guide: Quickstart: Core + UI.


REST API and Core runtime

| Symptom | Investigation |
| --- | --- |
| 401 storm | Missing/expired Bearer; max_auth_attempts auth lockouts per forwarded IP; wait lockout_duration_secs or fix proxy X-Forwarded-For. |
| TLS / TCP failures | TLS handshake mismatch, SNI, wrong port, or LB probes without Bearer under runtime API exposure. |
| Slow consumes | duration caps; consumer backlog; StreamRouter vs WAL path (retained stream lag); [core.subscriber.*] lag thresholds. |
| Replay latency | WAL read path vs router; large limit/offset jumps; subscriber tuning implications after config pushes. |
| WAL pressure / disk | retention_*, sync_writes, segment caps; aggregation cleanup; compare /metrics disk-related counters with df. |
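A status-code-only probe separates the first two rows quickly. A sketch with placeholder host/port and the token path used elsewhere on this page:

```bash
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $(cat "$PADAS_HOME/data/security/service-account.token")" \
  "https://core-host:8443/api/v1/status"
# 200: token and TLS path are fine; look at the consume/WAL rows instead.
# 401: stale or missing Bearer; repeated retries burn max_auth_attempts per forwarded IP.
# curl exit 35/60 (http_code 000): TLS handshake or trust failure, not auth.
```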

Reference: REST API Reference.


Monitoring and pipeline playbooks

Treat this as an operational incident triage handbook layered on Monitoring—use Monitoring screenshots, row filters, and runtime metrics overlays as evidence.

| Incident pattern | Triage moves |
| --- | --- |
| Runaway drops | Sort by dropped events; identify stream congestion vs task; open Monitor on the hottest stream; correlate EPS in vs out. |
| Backpressure investigation | Throughput imbalance across source → task → sink; check connector retry storms in logs after Monitoring narrows the hop. |
| Stalled downstream sink | Sink events out flat while upstream EPS stays stable—destination health, auth, blocked consumer downstream, or sink backpressure policy. |
| Blocked consumer | Consumer counts vs producer EPS; WAL fallback indicators; Registry stream attachments. |
| EPS collapse | Recent deploy clock; connectors not running; CPU saturation vs EPS imbalance (host telemetry charts in Monitoring). |
| Malformed replay | Monitor tail → narrow Query → export JSON; move to Testing for replay verification after the incident. |
| Post-deploy regression | Compare system telemetry windows; Query canonical PDL; capture a Testing diff. |
| WAL / retention issues | Query windows vs Registry WAL flags vs Streams definitions—prove runtime state vs intent. |
| Connector saturation | Connector EPS + error columns + CPU/network overlays; plan capacity or batching outside Monitoring if sustained. |
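When Monitoring points at drops or lag, a loose /metrics grep confirms which counters are actually moving. Counter names vary by build, so the pattern below is deliberately broad; it is an assumption, not a documented naming scheme:

```bash
CORE="https://core-host:8443"   # placeholder
TOKEN="$(cat "$PADAS_HOME/data/security/service-account.token")"
# Broad match on purpose; refine once the real counter names are visible
curl -fsS -H "Authorization: Bearer $TOKEN" "$CORE/api/v1/metrics" | grep -Ei 'drop|lag|eps'
```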

Control Tower (Home graph)

Control Tower gives graph state as operational hinting: broken edges, grey stages, or pipeline diagnostics that disagree with Monitoring often signal deployment/runtime drift before logs explain why.

  • Throughput overlays help bottleneck localization along the pipeline without opening padas.log.
  • Pipeline state may diverge from runtime health—registry intent can be green while stages error; trust stage badges first, then cross-check Management deploy.
  • Use Control Tower before raw log deep-dives; escalate to Core logs and /metrics when the graph and Monitoring disagree or time-skew hides the fault.

UI server logs

Under PADAS_UI_HOME (padas-ui server/config/paths.js), var/log/ carries padas.out and rotated padas.%DATE%.log (names per packaging).

  • UI API traces and auth/session debugging when the browser succeeds but Configurations/Management calls fail—correlate timestamps with Core windows.
  • Frontend/backend operational correlation: pair UI log lines with the same UTC slice from Core Json logs and /metrics scrapes.
  • Operational retention: UI audit and access logs can grow; size var/log like Core log volumes.
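A minimal cross-correlation sketch; the log paths come from this page, while the timestamp slice format is an assumption about your Json layout:

```bash
SLICE="2024-01-01T12:0"   # ten-minute UTC slice around the failure

# Same slice from both sides of the UI <-> Core boundary
grep "$SLICE" "$PADAS_UI_HOME/var/log/padas.out"
grep "$SLICE" /var/lib/padas/var/log/padas.log
```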

When to open a ticket

Engineering reproduction expects a single bundle, not narrative alone:

  • padas.log / journal excerpt with timezone-accurate timestamps and log level used during capture.
  • GET /api/v1/status and /metrics snapshots (redact secrets) covering the incident window—runtime API snapshots prove runtime verification state.
  • Monitoring table export or screenshots showing drops/EPS/stage context.
  • Testing capture or replay dataset when replay investigation matters.
  • WAL/runtime state: relevant Registry stream JSON and padas.toml deltas if deployment/runtime drift is suspected.

That package mirrors how runtime issues are bisected across UI ↔ Core boundaries in production operations.
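A collection sketch that mirrors the bullet list above; the host, paths, and incident window are placeholders, and secrets must still be redacted by hand before attaching:

```bash
CORE="https://core-host:8443"   # placeholder
TOKEN="$(cat "$PADAS_HOME/data/security/service-account.token")"
OUT="padas-incident-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$OUT"

# Runtime API snapshots covering the incident window
curl -fsS -H "Authorization: Bearer $TOKEN" "$CORE/api/v1/status"  > "$OUT/status.json"
curl -fsS -H "Authorization: Bearer $TOKEN" "$CORE/api/v1/metrics" > "$OUT/metrics.prom"

# Journal excerpt, UTC, matching the Monitoring window
journalctl -u padas-core --utc \
  --since "2024-01-01 12:00" --until "2024-01-01 12:30" > "$OUT/core-journal.log"

tar czf "$OUT.tar.gz" "$OUT"
```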