Tuning Guide

This page helps operators systematically reduce false positives, recover missed detections, and control resource usage. Start with the decision flowchart to identify which lever to pull, then consult the relevant section for parameter details.

Decision Flowchart

Symptom	First thing to try	Section
Too many alerts — mostly false positives	Raise `dspot.risk_level` or use `seerflow feedback <id> fp`	False Positives
Too many alerts — correct but noisy	Increase `dedup_window_seconds` or lower detector weights	Dedup & Weights
Too few alerts	Lower the relevant detector threshold	Detector Tuning
Wrong alerts — correlation misfires	Adjust `window_duration_seconds` or `late_tolerance_seconds`	Correlation Tuning
Wrong alerts — wrong detector emphasis	Rebalance `weights_*` parameters	Detector Tuning
High memory / CPU	Tune LRU caps and `score_interval`	Performance

flowchart TD
    Start[Alert volume feels wrong] --> TooMany{Too many or too few?}
    TooMany -->|Too many| Quality{Mostly correct<br/>but noisy?}
    TooMany -->|Too few| Detector[Lower relevant detector<br/>threshold]
    Quality -->|Noisy but correct| Dedup[Increase<br/>dedup_window_seconds<br/>or lower detector weights]
    Quality -->|Mostly false positives| FPFlow{Which lever?}
    FPFlow --> DspotRisk[Raise dspot.risk_level<br/>from 0.0001 to 0.001]
    FPFlow --> Feedback[Use seerflow feedback fp<br/>for per-entity adjustment]
    Start --> WrongKind{Wrong alerts?}
    WrongKind -->|Correlation groups wrong| Correlation[Tune window_duration_seconds<br/>and late_tolerance_seconds]
    WrongKind -->|Wrong detector fires| Weights[Rebalance weights_*<br/>parameters]
    Start --> Performance{High memory<br/>or CPU?}
    Performance -->|Yes| PerfTune[Tune LRU caps<br/>and score_interval]

    classDef action fill:#402aa1,stroke:#402aa1,color:#fff
    class Detector,Dedup,DspotRisk,Feedback,Correlation,Weights,PerfTune action

Flowchart Walkthrough

Too Many Alerts — Mostly False Positives

The DSPOT algorithm sets anomaly thresholds automatically using extreme-value theory. Its sensitivity is controlled by detection.dspot.risk_level, which is the tail-probability cutoff (default 0.0001, meaning 1-in-10,000 chance of a legitimate value exceeding the threshold). Raising this to 0.001 or higher makes the threshold more permissive, cutting false positives at the cost of slightly reduced recall.

detection:
  dspot:
    risk_level: 0.001   # was 0.0001 — 10x more permissive

For sustained improvement without manual threshold tinkering, use the operator feedback CLI. Marking an alert as a false positive nudges the affected detector's threshold upward by 5% for that entity:

seerflow feedback <alert-id> fp

Repeated feedback compounds multiplicatively: three FP marks on the same entity raise the threshold by about 16% (1.05³ ≈ 1.16×), and doubling it takes about 15 marks (1.05¹⁵ ≈ 2.08×). The adjustment persists across restarts because model state is saved to disk every detection.model_save_interval_seconds (default 300 s).
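The compounding arithmetic can be checked with a short sketch. It assumes each false-positive mark multiplies the per-entity threshold by exactly 1.05, as the text describes; the function names are illustrative and are not part of Seerflow.

```python
import math

FP_STEP = 1.05  # 5% upward nudge per false-positive mark (taken from the text)


def threshold_multiplier(fp_marks: int) -> float:
    """Cumulative threshold multiplier after `fp_marks` false-positive marks."""
    return FP_STEP ** fp_marks


def marks_to_reach(target: float) -> int:
    """Smallest number of marks whose cumulative multiplier is >= target."""
    return math.ceil(math.log(target) / math.log(FP_STEP))


for n in (1, 3, 10):
    print(f"{n} marks -> {threshold_multiplier(n):.3f}x")
print("marks needed to double:", marks_to_reach(2.0))
```

Under this assumption, a handful of marks only shifts the threshold modestly, so persistent false positives on one entity are usually better handled by combining feedback with the global levers above.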

Too Many Alerts — Not False Positives, Just Noisy

When alerts are technically correct but operationally overwhelming (for example, a single flapping service triggering dozens of alerts), the first lever is alert deduplication. The default deduplication window is 900 seconds (15 minutes): any alert with the same dedup key within that window is suppressed.

alerting:
  dedup_window_seconds: 1800   # extend to 30 minutes globally

For per-type control without changing the global default, use dedup_window_overrides:

alerting:
  dedup_window_overrides:
    ssh_brute_force: 3600        # 1 hour for brute-force alerts
    disk_usage_high: 300         # 5 minutes for disk alerts
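To make the interaction between the global window and the per-type overrides concrete, here is a minimal sketch of window-based suppression. It is not Seerflow's implementation: the `Deduplicator` class, the dedup key strings, and the choice to measure the window from the last emitted alert (rather than the last seen alert) are all assumptions made for illustration.

```python
from typing import Dict, Optional

DEFAULT_WINDOW = 900  # alerting.dedup_window_seconds default from the text
OVERRIDES = {"ssh_brute_force": 3600, "disk_usage_high": 300}


class Deduplicator:
    """Hypothetical suppressor: one emitted alert per dedup key per window."""

    def __init__(self, default_window: int = DEFAULT_WINDOW,
                 overrides: Optional[Dict[str, int]] = None) -> None:
        self.default_window = default_window
        self.overrides = dict(overrides or {})
        self._last_emitted: Dict[str, float] = {}

    def should_emit(self, dedup_key: str, alert_type: str, ts: float) -> bool:
        # A per-type override wins over the global default when present.
        window = self.overrides.get(alert_type, self.default_window)
        last = self._last_emitted.get(dedup_key)
        if last is not None and ts - last < window:
            return False  # still inside the window: suppress
        self._last_emitted[dedup_key] = ts
        return True


dedup = Deduplicator(overrides=OVERRIDES)
print(dedup.should_emit("host-a:disk", "disk_usage_high", ts=0))    # first alert
print(dedup.should_emit("host-a:disk", "disk_usage_high", ts=120))  # inside 300 s
```

In this sketch a disk alert repeats at most every 5 minutes, a brute-force alert at most every hour, and every other type falls back to the 15-minute default.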

If noise comes from a specific detector producing high scores, reduce its blending weight. Weights are relative — only their ratios matter because the pipeline divides each weight by the sum:

detection:
  weights_volume: 0.10    # was 0.25; cuts volume detector influence
  weights_content: 0.40   # was 0.30; shifts emphasis to the content detector
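Because only the ratios matter, the effective share of each detector depends on every weight, not just the one being edited. The sketch below normalizes a weight set the way the text describes (each weight divided by the sum). The `weights_other` entry is a hypothetical placeholder standing in for all remaining detector weights, which are not listed here.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Divide each weight by the sum so the shares add up to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


# `weights_other` is a made-up stand-in for the detectors not shown above.
before = {"weights_volume": 0.25, "weights_content": 0.30, "weights_other": 0.45}
after = {"weights_volume": 0.10, "weights_content": 0.40, "weights_other": 0.45}

for label, weights in (("before", before), ("after", after)):
    shares = normalize(weights)
    print(label, {name: round(share, 3) for name, share in shares.items()})
```

With these illustrative numbers the volume detector's effective share drops from 25% to about 10.5%, and the content detector's rises from 30% to about 42%. Multiplying every weight by the same constant leaves the shares unchanged.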

Too Few Alerts

The most targeted fix is to identify which detector class is responsible for the events you are missing, then lower its sensitivity threshold. See the Detector Tuning section below and the per-detector deep-dive pages for details.

Right Volume, Wrong Alerts — Correlation Issues

When individual detector scores look reasonable but correlated alerts are incorrect (for example, grouping unrelated events together, or splitting a real incident across multiple alerts), the problem is usually in the correlation time window or entity late-arrival tolerance.

Increasing correlation.window_duration_seconds (default 1800 s) allows more events to be grouped into the same incident. Increasing correlation.late_tolerance_seconds (default 30 s) accommodates clock skew between log sources.

correlation:
  window_duration_seconds: 3600   # extend to 1 hour
  late_tolerance_seconds: 120     # tolerate up to 2 minutes of clock skew

If the grouping logic is sound but the wrong detectors are driving the final score, rebalance the weights_* parameters as described above.

Detector Tuning

The table below lists common tuning goals with the exact parameter to change and the expected effect. All parameters live under the detection: YAML key.

Goal	Parameter	Direction	Effect
Catch subtle content anomalies	`hst_window_size`	Lower (e.g. 500)	Smaller reference window — HST adapts faster but may increase FPs
Reduce HST sensitivity on stable sources	`hst_window_size`	Raise (e.g. 2000)	Larger reference window — more stable baseline, fewer FPs
Tighten volume spike detection	`hw_n_std`	Lower (e.g. 2.0)	Narrower normal band — fires on smaller volume changes
Reduce volume alert noise	`hw_n_std`	Raise (e.g. 4.0)	Wider normal band — only fires on large spikes
Detect gradual drift / slow mean shift	`cusum_drift`	Lower (e.g. 0.2)	More sensitive to small persistent shifts
Score sequences with sparse data sooner	`markov_min_events`	Lower (e.g. 50)	Starts scoring after fewer observed events
Prevent noisy DSPOT thresholds early on	`dspot.calibration_window`	Raise (e.g. 2000)	Longer calibration phase before thresholds activate
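As an illustration of how `hw_n_std` widens or narrows the normal band, the sketch below assumes the volume detector flags a count when it deviates from the Holt-Winters forecast by more than `hw_n_std` residual standard deviations. That band definition, the function name, and the sample numbers are assumptions for illustration; consult the volume detector page for the actual scoring rule.

```python
def outside_band(observed: float, forecast: float,
                 residual_std: float, n_std: float) -> bool:
    """True when `observed` falls outside forecast +/- n_std * residual_std."""
    return abs(observed - forecast) > n_std * residual_std


# Hypothetical interval: forecast of 100 events, residual std of 10, 130 seen.
forecast, residual_std, observed = 100.0, 10.0, 130.0

for n_std in (2.0, 4.0):
    fired = outside_band(observed, forecast, residual_std, n_std)
    print(f"hw_n_std={n_std}: fires={fired}")
```

In this example a 30% jump fires with `hw_n_std` at 2.0 (band of ±20) but not at 4.0 (band of ±40), which matches the direction of the two volume rows in the table.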

For detailed parameter semantics and worked examples, see the per-detector pages.

Correlation Tuning

Parameters under correlation: and detection.kill_chain / detection.risk_* control how events are grouped into incidents and how entity risk accumulates over time.

Parameter	Default	Tuning Advice
`correlation.window_duration_seconds`	`1800`	Increase (up to 7200) for slow-moving attacks; decrease for high-throughput environments where grouping should be tighter
`correlation.max_events_per_entity`	`1000`	Lower to reduce memory per active entity; raise if legitimate bursts are being truncated
`correlation.max_entities`	`10000`	Sets the LRU cap for active entity windows; lower in memory-constrained environments
`correlation.late_tolerance_seconds`	`30`	Raise to 120–300 for distributed systems with significant clock skew
`detection.kill_chain.tactic_threshold`	`3`	Minimum distinct ATT&CK tactics needed to trigger a kill-chain alert; lower to 2 for high-security environments, raise to 4–5 to reduce noise
`detection.kill_chain.window_seconds`	`86400`	Observation window for tactic progression (24 h default); raise for slow APT scenarios
`detection.risk_half_life_hours`	`4`	Controls how quickly accumulated risk decays; lower (e.g. 2) for fast-moving environments; raise (e.g. 12) for persistent threat tracking
`detection.risk_threshold`	`50.0`	Risk score at which a risk-accumulation alert fires; lower to catch earlier accumulation; raise to reduce noise from minor repeated events
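The interplay between `detection.risk_half_life_hours` and `detection.risk_threshold` is easier to reason about with numbers. The sketch below assumes plain exponential half-life decay with no new events arriving; the exact accumulation formula Seerflow uses is not given here, so treat the functions as illustrative.

```python
import math


def decayed_risk(score: float, elapsed_hours: float,
                 half_life_hours: float = 4.0) -> float:
    """Risk remaining after `elapsed_hours`, halving every `half_life_hours`."""
    return score * 0.5 ** (elapsed_hours / half_life_hours)


def hours_to_decay_to(score: float, threshold: float = 50.0,
                      half_life_hours: float = 4.0) -> float:
    """Hours of quiet needed for `score` to decay down to `threshold`."""
    if score <= threshold:
        return 0.0
    return half_life_hours * math.log2(score / threshold)


for half_life in (2.0, 4.0, 12.0):
    hours = hours_to_decay_to(80.0, threshold=50.0, half_life_hours=half_life)
    print(f"half-life {half_life:>4} h: 80 -> 50 after {hours:.1f} h of quiet")
```

Under this model an entity at risk 80 drops back to the default threshold of 50 after roughly 2.7 hours with the default 4-hour half-life, and after roughly 8.1 hours with a 12-hour half-life, which is why longer half-lives suit persistent threat tracking.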

For deeper guidance see:

Performance Tuning

When Seerflow is under memory or CPU pressure, the parameters below are the primary levers. Most of them are capacity limits enforced by an LRU cache, which evicts the least recently used entries once the limit is reached.

Resource	Parameter	Default	Tuning Advice
CPU (ingestion)	`receivers.queue_maxsize`	`10000`	Lower to apply back-pressure on log sources sooner; raise (up to 50,000) on high-throughput pipelines with sufficient RAM
CPU (scoring)	`detection.score_interval`	`1`	Set to `N` to score every Nth event per source — `score_interval: 5` cuts scoring CPU by ~80% with minimal recall loss on high-volume sources
Memory (per-source models)	`detection.max_sources`	`256`	LRU cap on sources with active detector state; lower on constrained hosts
Memory (template Holt-Winters)	`detection.max_template_hw`	`500`	Maximum number of Drain3 templates tracked by the volume detector; lower to reduce peak RSS
Memory (entity Holt-Winters)	`detection.max_entity_hw`	`500`	Maximum number of entities tracked by the entity-volume detector
Memory (correlation entities)	`correlation.max_entities`	`10000`	LRU cap on entity correlation windows; lower when RAM is limited
Disk I/O (model checkpoints)	`detection.model_save_interval_seconds`	`300`	Raise to 600–1800 to reduce checkpoint write frequency; increases potential state loss on crash
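The ~80% figure for `score_interval: 5` follows from scoring one event in five (1 − 1/5 = 0.8). A minimal sketch, assuming a simple per-source counter decides which events get scored; the `IntervalScorer` class and the source name are hypothetical, and the real scheduler may differ.

```python
from collections import defaultdict
from typing import DefaultDict


class IntervalScorer:
    """Hypothetical gate that lets every Nth event per source through."""

    def __init__(self, score_interval: int = 1) -> None:
        if score_interval < 1:
            raise ValueError("score_interval must be >= 1")
        self.score_interval = score_interval
        self._seen: DefaultDict[str, int] = defaultdict(int)

    def should_score(self, source: str) -> bool:
        self._seen[source] += 1
        return self._seen[source] % self.score_interval == 0


scorer = IntervalScorer(score_interval=5)
scored = sum(scorer.should_score("nginx") for _ in range(1000))
print(f"scored {scored} of 1000 events ({1 - scored / 1000:.0%} skipped)")
```

The saving applies only to the scoring step; ingestion and parsing still run for every event, so total CPU drops by less than the scoring fraction.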

Monitoring eviction

When an LRU cache hits its capacity limit, Seerflow logs a WARNING message on the seerflow.detection or seerflow.correlation logger containing the text evicting oldest entry. If you see this frequently, either raise the relevant cap or investigate whether the number of active sources or entities is unexpectedly large (a possible misconfiguration or log flood). Set log_level: DEBUG temporarily to see eviction counts per minute.
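To judge how often eviction happens without switching to DEBUG, the warnings can be tallied from an existing log. The sketch below matches only on the level, logger names, and message text quoted above; the sample line layout (timestamp, level, logger, message) is a hypothetical format and should be adapted to your logging configuration.

```python
from collections import Counter
from typing import Iterable

EVICTION_TEXT = "evicting oldest entry"
LOGGERS = ("seerflow.detection", "seerflow.correlation")


def count_evictions(lines: Iterable[str]) -> Counter:
    """Count eviction warnings per logger by plain substring matching."""
    counts: Counter = Counter()
    for line in lines:
        if "WARNING" not in line or EVICTION_TEXT not in line:
            continue
        for logger in LOGGERS:
            if logger in line:
                counts[logger] += 1
                break
    return counts


# Hypothetical log lines; the real layout depends on your log formatter.
sample = [
    "2024-01-01 12:00:01 WARNING seerflow.detection evicting oldest entry",
    "2024-01-01 12:00:02 INFO seerflow.detection model checkpoint saved",
    "2024-01-01 12:00:03 WARNING seerflow.correlation evicting oldest entry",
    "2024-01-01 12:00:04 WARNING seerflow.detection evicting oldest entry",
]
print(dict(count_evictions(sample)))
```

A count that is consistently higher on one logger points to which cap to revisit: `detection.max_sources` and the Holt-Winters caps for the detection logger, `correlation.max_entities` for the correlation logger.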