Tuning Guide

This page helps operators systematically reduce false positives, recover missed detections, and control resource usage. Start with the decision flowchart to identify which lever to pull, then consult the relevant section for parameter details.


Decision Flowchart

| Symptom | First thing to try | Section |
| --- | --- | --- |
| Too many alerts: mostly false positives | Raise dspot.risk_level or use seerflow feedback <id> fp | False Positives |
| Too many alerts: correct but noisy | Increase dedup_window_seconds or lower detector weights | Dedup & Weights |
| Too few alerts | Lower the relevant detector threshold | Detector Tuning |
| Wrong alerts: correlation misfires | Adjust window_duration_seconds or late_tolerance_seconds | Correlation Tuning |
| Wrong alerts: wrong detector emphasis | Rebalance weights_* parameters | Detector Tuning |
| High memory / CPU | Tune LRU caps and score_interval | Performance |

flowchart TD
    Start[Alert volume feels wrong] --> TooMany{Too many or too few?}
    TooMany -->|Too many| Quality{Mostly correct<br/>but noisy?}
    TooMany -->|Too few| Detector[Lower relevant detector<br/>threshold]
    Quality -->|Noisy but correct| Dedup[Increase<br/>dedup_window_seconds<br/>or lower detector weights]
    Quality -->|Mostly false positives| FPFlow{Which lever?}
    FPFlow --> DspotRisk[Raise dspot.risk_level<br/>from 0.0001 to 0.001]
    FPFlow --> Feedback[Use seerflow feedback fp<br/>for per-entity adjustment]
    Start --> WrongKind{Wrong alerts?}
    WrongKind -->|Correlation groups wrong| Correlation[Tune window_duration_seconds<br/>and late_tolerance_seconds]
    WrongKind -->|Wrong detector fires| Weights[Rebalance weights_*<br/>parameters]
    Start --> Performance{High memory<br/>or CPU?}
    Performance -->|Yes| PerfTune[Tune LRU caps<br/>and score_interval]

    classDef action fill:#402aa1,stroke:#402aa1,color:#fff
    class Detector,Dedup,DspotRisk,Feedback,Correlation,Weights,PerfTune action

Flowchart Walkthrough

Too Many Alerts — Mostly False Positives

The DSPOT algorithm sets anomaly thresholds automatically using extreme-value theory. Its sensitivity is controlled by detection.dspot.risk_level, which is the tail-probability cutoff (default 0.0001, meaning a 1-in-10,000 chance of a legitimate value exceeding the threshold). Raising this to 0.001 or higher makes the threshold more permissive, cutting false positives at the cost of slightly reduced recall.

detection:
  dspot:
    risk_level: 0.001   # was 0.0001 — 10x more permissive

For sustained improvement without manual threshold tinkering, use the operator feedback CLI. Marking an alert as a false positive nudges the affected detector's threshold upward by 5% for that entity:

seerflow feedback <alert-id> fp

Repeated feedback compounds: three FP marks on the same entity raise the threshold by about 16% (1.05³ ≈ 1.16×), and it takes roughly 15 marks to double it (1.05¹⁵ ≈ 2.08×). The adjustment persists across restarts because model state is saved to disk every detection.model_save_interval_seconds seconds (default 300 s).
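
The compounding arithmetic can be checked with a short sketch. It assumes each fp mark multiplies the threshold by exactly 1.05 with no cap or floor; the function names are illustrative and not part of the Seerflow CLI or API.

```python
import math

FP_STEP = 1.05  # each fp mark raises the threshold by 5% (per the text above)


def threshold_multiplier(fp_marks: int) -> float:
    """Cumulative threshold multiplier after `fp_marks` false-positive marks."""
    return FP_STEP ** fp_marks


def marks_to_reach(multiplier: float) -> int:
    """Smallest number of fp marks whose combined effect reaches `multiplier`."""
    return math.ceil(math.log(multiplier) / math.log(FP_STEP))


for n in (1, 3, 5, 10):
    print(f"{n:>2} marks -> {threshold_multiplier(n):.3f}x")
print("marks needed to double:", marks_to_reach(2.0))
```

If Seerflow caps the cumulative adjustment, the real multiplier will flatten out earlier than this sketch suggests.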

Too Many Alerts — Not False Positives, Just Noisy

When alerts are technically correct but operationally overwhelming (for example, a single flapping service triggering dozens of alerts), the first lever is alert deduplication. The default deduplication window is 900 seconds (15 minutes): any alert with the same dedup key within that window is suppressed.

alerting:
  dedup_window_seconds: 1800   # extend to 30 minutes globally

For per-type control without changing the global default, use dedup_window_overrides:

alerting:
  dedup_window_overrides:
    ssh_brute_force: 3600        # 1 hour for brute-force alerts
    disk_usage_high: 300         # 5 minutes for disk alerts
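
To make the interaction between the global window and the per-type overrides concrete, here is a minimal sketch of window-based suppression. It is not Seerflow's implementation: the class, the method names, and the example dedup key are hypothetical, and it assumes the window is measured from the last emitted alert (suppressed duplicates do not extend it).

```python
from typing import Dict, Optional


class Deduplicator:
    """Hypothetical sketch of dedup-window suppression, not Seerflow's code."""

    def __init__(self, default_window: float = 900.0,
                 overrides: Optional[Dict[str, float]] = None) -> None:
        self.default_window = default_window      # dedup_window_seconds
        self.overrides = overrides or {}          # dedup_window_overrides
        self._last_emitted: Dict[str, float] = {}  # dedup key -> last emit time

    def should_emit(self, alert_type: str, dedup_key: str, now: float) -> bool:
        # A per-type override wins over the global default.
        window = self.overrides.get(alert_type, self.default_window)
        last = self._last_emitted.get(dedup_key)
        if last is not None and now - last < window:
            return False  # duplicate inside the window: suppress
        self._last_emitted[dedup_key] = now
        return True


dedup = Deduplicator(overrides={"ssh_brute_force": 3600, "disk_usage_high": 300})
key = "disk_usage_high:host-1"  # illustrative dedup key format
print(dedup.should_emit("disk_usage_high", key, now=0))    # first alert is emitted
print(dedup.should_emit("disk_usage_high", key, now=200))  # inside 300 s: suppressed
print(dedup.should_emit("disk_usage_high", key, now=400))  # window elapsed: emitted
```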

If noise comes from a specific detector producing high scores, reduce its blending weight. Weights are relative — only their ratios matter because the pipeline divides each weight by the sum:

detection:
  weights_volume: 0.10    # was 0.25; cuts the volume detector's raw weight by 60%
  weights_content: 0.40   # was 0.30; shifts emphasis to the content detector
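
Because only the ratios matter, it helps to look at the normalized shares rather than the raw numbers. The sketch below assumes the normalization is a plain division by the sum, as described above; the "other" entry is a hypothetical stand-in for whatever remaining weights_* parameters your configuration defines.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Divide each weight by the sum so the shares add up to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


# "other" is a placeholder for the remaining detector weights (assumed value).
before = normalize({"volume": 0.25, "content": 0.30, "other": 0.45})
after = normalize({"volume": 0.10, "content": 0.40, "other": 0.45})

for name in before:
    print(f"{name:<8} {before[name]:.3f} -> {after[name]:.3f}")

# Scaling every weight by the same factor leaves the shares unchanged.
scaled = normalize({"volume": 1.0, "content": 4.0, "other": 4.5})
print(all(abs(scaled[k] - after[k]) < 1e-9 for k in after))
```

Note that the new weights sum to 0.95 rather than 1.0, so under this assumption the effective volume share is 0.10 / 0.95, slightly above 0.10.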

Too Few Alerts

The most targeted fix is to identify which detector class is responsible for the events you are missing, then lower its sensitivity threshold. See the Detector Tuning section below and the per-detector deep-dive pages for details.

Right Volume, Wrong Alerts — Correlation Issues

When individual detector scores look reasonable but correlated alerts are incorrect (for example, grouping unrelated events together, or splitting a real incident across multiple alerts), the problem is usually in the correlation time window or entity late-arrival tolerance.

Increasing correlation.window_duration_seconds (default 1800 s) allows more events to be grouped into the same incident. Increasing correlation.late_tolerance_seconds (default 30 s) accommodates clock skew between log sources.

correlation:
  window_duration_seconds: 3600   # extend to 1 hour
  late_tolerance_seconds: 120     # tolerate up to 2 minutes of clock skew
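
One way to picture how the two settings interact is the sketch below. It assumes a fixed window per entity in which an event is grouped if its own timestamp falls inside the window and it arrives no later than late_tolerance_seconds after the window closes. Seerflow's actual windowing logic may differ; the Window class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical correlation window, not Seerflow's implementation."""
    start: float                   # window start (seconds)
    duration: float = 1800.0       # window_duration_seconds
    late_tolerance: float = 30.0   # late_tolerance_seconds

    @property
    def end(self) -> float:
        return self.start + self.duration

    def accepts(self, event_time: float, arrival_time: float) -> bool:
        """Group the event if it belongs to the window and is not too late."""
        in_range = self.start <= event_time < self.end
        not_too_late = arrival_time <= self.end + self.late_tolerance
        return in_range and not_too_late


default = Window(start=0.0)
relaxed = Window(start=0.0, late_tolerance=120.0)

# An event stamped at t=1790 s that only arrives at t=1900 s:
print(default.accepts(event_time=1790, arrival_time=1900))  # too late for 30 s
print(relaxed.accepts(event_time=1790, arrival_time=1900))  # fits within 120 s
```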

If the grouping logic is sound but the wrong detectors are driving the final score, rebalance the weights_* parameters as described above.


Detector Tuning

The table below lists common tuning goals with the exact parameter to change and the expected effect. All parameters live under the detection: YAML key.

| Goal | Parameter | Direction | Effect |
| --- | --- | --- | --- |
| Catch subtle content anomalies | hst_window_size | Lower (e.g. 500) | Smaller reference window; HST adapts faster but may increase FPs |
| Reduce HST sensitivity on stable sources | hst_window_size | Raise (e.g. 2000) | Larger reference window; more stable baseline, fewer FPs |
| Tighten volume spike detection | hw_n_std | Lower (e.g. 2.0) | Narrower normal band; fires on smaller volume changes |
| Reduce volume alert noise | hw_n_std | Raise (e.g. 4.0) | Wider normal band; only fires on large spikes |
| Detect gradual drift / slow mean shift | cusum_drift | Lower (e.g. 0.2) | More sensitive to small persistent shifts |
| Score sequences with sparse data sooner | markov_min_events | Lower (e.g. 50) | Starts scoring after fewer observed events |
| Prevent noisy DSPOT thresholds early on | dspot.calibration_window | Raise (e.g. 2000) | Longer calibration phase before thresholds activate |
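
The effect of hw_n_std in the table above can be illustrated with a small sketch. It assumes the volume detector flags a value when it deviates from the Holt-Winters forecast by more than hw_n_std residual standard deviations; that rule and the numbers used are assumptions for illustration, not taken from Seerflow's source.

```python
def outside_band(observed: float, forecast: float,
                 residual_std: float, n_std: float) -> bool:
    """True if `observed` falls outside forecast +/- n_std * residual_std."""
    return abs(observed - forecast) > n_std * residual_std


# Illustrative numbers: forecast of 100 events/min, residual std of 10.
forecast, residual_std, observed = 100.0, 10.0, 135.0

for n_std in (2.0, 3.0, 4.0):
    low = forecast - n_std * residual_std
    high = forecast + n_std * residual_std
    flagged = outside_band(observed, forecast, residual_std, n_std)
    print(f"hw_n_std={n_std}: band [{low:.0f}, {high:.0f}] -> flagged={flagged}")
```

With these numbers, a reading of 135 is outside the band at 2.0 and 3.0 but inside it at 4.0, which is the trade-off the two hw_n_std rows describe.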

For detailed parameter semantics and worked examples, see the per-detector pages.


Correlation Tuning

Parameters under correlation: and detection.kill_chain / detection.risk_* control how events are grouped into incidents and how entity risk accumulates over time.

| Parameter | Default | Tuning Advice |
| --- | --- | --- |
| correlation.window_duration_seconds | 1800 | Increase (up to 7200) for slow-moving attacks; decrease for high-throughput environments where grouping should be tighter |
| correlation.max_events_per_entity | 1000 | Lower to reduce memory per active entity; raise if legitimate bursts are being truncated |
| correlation.max_entities | 10000 | Sets the LRU cap for active entity windows; lower in memory-constrained environments |
| correlation.late_tolerance_seconds | 30 | Raise to 120–300 for distributed systems with significant clock skew |
| detection.kill_chain.tactic_threshold | 3 | Minimum distinct ATT&CK tactics needed to trigger a kill-chain alert; lower to 2 for high-security environments, raise to 4–5 to reduce noise |
| detection.kill_chain.window_seconds | 86400 | Observation window for tactic progression (24 h default); raise for slow APT scenarios |
| detection.risk_half_life_hours | 4 | Controls how quickly accumulated risk decays; lower (e.g. 2) for fast-moving environments; raise (e.g. 12) for persistent threat tracking |
| detection.risk_threshold | 50.0 | Risk score at which a risk-accumulation alert fires; lower to catch earlier accumulation; raise to reduce noise from minor repeated events |
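
To see how detection.risk_half_life_hours and detection.risk_threshold interact, the sketch below assumes plain exponential decay, where accumulated risk halves every half-life. The event scores (40 and 35) are made-up illustration values, and the function is not part of Seerflow.

```python
def decayed_risk(risk: float, elapsed_hours: float,
                 half_life_hours: float = 4.0) -> float:
    """Risk remaining after `elapsed_hours` of exponential decay."""
    return risk * 0.5 ** (elapsed_hours / half_life_hours)


RISK_THRESHOLD = 50.0  # detection.risk_threshold default from the table above

# Scenario: an event adds 40 risk, then a second event adds 35 four hours later.
for half_life in (2.0, 4.0, 12.0):
    total = decayed_risk(40.0, elapsed_hours=4.0, half_life_hours=half_life) + 35.0
    fires = total >= RISK_THRESHOLD
    print(f"half-life {half_life:>4} h: risk {total:.1f} -> alert={fires}")
```

Under these assumptions the same two events stay below the threshold with a 2-hour half-life but cross it at 4 or 12 hours, which is why longer half-lives suit persistent threat tracking.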



Performance Tuning

When Seerflow is under memory or CPU pressure, the parameters below are the primary levers. Most have an upper bound enforced by an LRU cache that evicts the oldest entries when the limit is hit.

| Resource | Parameter | Default | Tuning Advice |
| --- | --- | --- | --- |
| CPU (ingestion) | receivers.queue_maxsize | 10000 | Lower to apply back-pressure on log sources sooner; raise (up to 50,000) on high-throughput pipelines with sufficient RAM |
| CPU (scoring) | detection.score_interval | 1 | Set to N to score every Nth event per source; score_interval: 5 cuts scoring CPU by ~80% with minimal recall loss on high-volume sources |
| Memory (per-source models) | detection.max_sources | 256 | LRU cap on sources with active detector state; lower on constrained hosts |
| Memory (template Holt-Winters) | detection.max_template_hw | 500 | Maximum number of Drain3 templates tracked by the volume detector; lower to reduce peak RSS |
| Memory (entity Holt-Winters) | detection.max_entity_hw | 500 | Maximum number of entities tracked by the entity-volume detector |
| Memory (correlation entities) | correlation.max_entities | 10000 | LRU cap on entity correlation windows; lower when RAM is limited |
| Disk I/O (model checkpoints) | detection.model_save_interval_seconds | 300 | Raise to 600–1800 to reduce checkpoint write frequency; increases potential state loss on crash |
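
The ~80% figure for score_interval: 5 follows from scoring one event in five. A minimal sketch of per-source sampling, assuming a simple per-source counter (the ScoreSampler class is hypothetical, not Seerflow's scheduler):

```python
from collections import defaultdict
from typing import DefaultDict


class ScoreSampler:
    """Hypothetical sketch: score every Nth event per source."""

    def __init__(self, score_interval: int = 1) -> None:
        if score_interval < 1:
            raise ValueError("score_interval must be >= 1")
        self.score_interval = score_interval
        self._seen: DefaultDict[str, int] = defaultdict(int)  # events per source

    def should_score(self, source: str) -> bool:
        self._seen[source] += 1
        return self._seen[source] % self.score_interval == 0


sampler = ScoreSampler(score_interval=5)
scored = sum(sampler.should_score("nginx") for _ in range(1000))
print(f"scored {scored} of 1000 events ({1 - scored / 1000:.0%} fewer scoring calls)")
```

The saving applies only to the scoring step; parsing and ingestion still process every event.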

Monitoring eviction

When an LRU cache hits its capacity limit, Seerflow logs a WARNING message on the seerflow.detection or seerflow.correlation logger containing the text evicting oldest entry. If you see this frequently, either raise the relevant cap or investigate whether the number of active sources or entities is unexpectedly large (a possible misconfiguration or log flood). Set log_level: DEBUG temporarily to see eviction counts per minute.
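
For readers unfamiliar with LRU caps, the sketch below shows the general eviction pattern using Python's OrderedDict. It is a generic illustration, not Seerflow's cache: the class, the logger name, and the exact message format are assumptions.

```python
import logging
from collections import OrderedDict
from typing import Any, Optional

logger = logging.getLogger("lru_sketch")  # hypothetical logger name


class LRUCache:
    """Generic LRU cache that warns when it evicts an entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted, _ = self._items.popitem(last=False)  # least recently used
            logger.warning("evicting oldest entry: %s", evicted)


cache = LRUCache(capacity=2)
cache.put("source-a", "model-a")
cache.put("source-b", "model-b")
cache.get("source-a")             # touch source-a so source-b becomes oldest
cache.put("source-c", "model-c")  # exceeds the cap and evicts source-b
print(cache.get("source-b"))      # None: its state was dropped
```

An evicted source or entity loses its in-memory state, so frequent eviction warnings usually mean the cap is too low for the number of active keys.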