Markov Chains¶
Security: Compromised Service Account — Impossible Command Sequence
svc-deploy normally follows predictable patterns: login → pull image → start container → health check. The attacker's sequence — login → sudo → cat /etc/shadow → scp — has near-zero transition probability in the learned model. Each step is individually plausible, but the sequence is impossible for this entity. Score: 0.95.
Operations: Service Restart Sequence Deviation
After OOM kills at T+30, api-gateway pods restart — but the init sequence is abnormal. The container starts the health check endpoint before the database migration completes, causing a cascade of failed readiness probes.
EVENT 03:30:14 Pod api-gateway-7f8d9 OOMKilled (exit code 137)
EVENT 03:30:16 Pod api-gateway-7f8d9 Pulling image api-gateway:v2.3.1
EVENT 03:30:22 Pod api-gateway-7f8d9 Started container api-gateway
EVENT 03:30:23 Pod api-gateway-7f8d9 Readiness probe failed: connection refused
EVENT 03:30:24 Pod api-gateway-7f8d9 Started migration runner
The expected restart sequence is pull → start → migrate → healthcheck. The observed sequence — start → healthcheck → migrate — means P(healthcheck | start) is near-zero in the learned model: healthcheck has never followed start directly. Each event is individually familiar, but the order is anomalous. Markov score: 0.88. See the Ops Primer for more on sequence-based failure detection.
Interactive: Markov sequence anomaly
Transition probability scores. A rare login → write transition at minute 180 scores above the threshold.
Theory¶
Intuition¶
A first-order Markov chain asks: "Given the last thing that happened, how surprising is this?" It models the probability of each template_id following another. If svc-deploy always follows "pull image" with "start container", but suddenly follows "pull image" with "cat /etc/shadow", the transition probability is near zero.
Per-entity tracking is critical: what's normal for a user account is abnormal for a service account. A human developer might legitimately sudo occasionally — but a CI/CD service account that has only ever pulled images and started containers has no business escalating privileges. The detector maintains a separate learned transition matrix for every entity it has observed, so each entity's baseline is judged on its own history.
Unlike content-based detectors (HST) or volume-based detectors (Holt-Winters), the Markov detector is blind to what any single event looks like in isolation — it only cares about order. This makes it uniquely sensitive to behavioral sequencing attacks that blend individually normal events into an impossible narrative.
Key Equations¶
Transition probability — the probability of template B following template A, with Laplace smoothing to handle unseen transitions:
Anomaly score — normalized negative log-probability, clamped to \([0, 1]\):
Where:
- \( A, B \) = consecutive
template_idvalues for the same entity - \( \varepsilon \) = Laplace smoothing constant (default
1e-6) - \( |V| \) = vocabulary size — number of distinct "from" templates seen for this entity
- \( \text{count}(A \rightarrow *) \) = total transitions out of template A
- The denominator \( -\log \varepsilon \) is the maximum possible surprisal — the score an entirely unseen transition would receive — which normalizes the score to \([0, 1]\)
A fully unseen transition (count = 0) yields score ≈ 1.0. A frequent, expected transition yields score ≈ 0.0. Smoothing ensures the score never literally reaches 1.0 or produces a division-by-zero error.
Seerflow Implementation¶
Configuration¶
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
markov_smoothing |
float |
1e-6 |
1e-9–0.01 | Laplace smoothing for unseen transitions. Lower values make unseen transitions score higher (closer to 1.0). |
markov_min_events |
int |
100 |
10–1000 | Minimum events per entity before scoring begins. Prevents noisy early scores when the transition matrix is sparse. |
markov_max_entities |
int |
1000 |
100–10000 | LRU cap on the number of tracked entities. When the cap is reached, the least-recently-used entity is evicted. |
Per-Entity Tracking¶
Each entity gets its own _EntityModel instance containing:
| Field | Type | Description |
|---|---|---|
prev_template |
int |
The template_id of the most recent event for this entity (-1 until first event) |
transitions |
dict[int, dict[int, int]] |
Nested dict: transitions[A][B] = count of A→B transitions observed |
total_from |
dict[int, int] |
total_from[A] = total transitions out of template A |
event_count |
int |
Total events seen for this entity |
Entity selection: The primary entity is entity_refs[0]. Events with no entity references, or with template_id == -1 (unrecognized by Drain3), return a score of 0.0 and are not learned.
Warmup: Score returns 0.0 until event_count >= min_events. This prevents anomaly noise during the initial observation period when the transition matrix contains very few counts.
LRU Eviction¶
Entity models are stored in a collections.OrderedDict keyed by entity ID. When max_entities is reached, the least-recently-used entity (the leftmost entry) is evicted via popitem(last=False). Each call to score() moves the entity to the end of the dict, refreshing its recency without creating the entity.
Note: score() performs a read-only lookup (no LRU promotion) — only _get_model() called from learn() moves an entity to the end. This matches the ensemble pipeline's pattern of calling score() before learn().
Serialization¶
Model state serializes to msgpack bytes via serialize() / deserialize(). The serialized payload includes:
- All hyperparameters (
smoothing,min_events,max_entities) - All per-entity models (transition dicts, total_from, prev_template, event_count)
- Entity insertion order is preserved (OrderedDict semantics)
Unlike the HST detector (which uses a restricted pickle unpickler), the Markov detector uses msgspec.msgpack for safe, schema-aware serialization with no deserialization attack surface.
Memory Footprint¶
Approximately 10 KB per entity, depending on vocabulary size. At default max_entities = 1000, total memory is approximately 10 MB. Each additional unique template_id in the transition matrix adds roughly 80 bytes (two int keys + one int value in the nested dict).
Practical Examples¶
Security Walkthrough¶
svc-deploy has been observed for 10,000 events. Its transition matrix reflects a highly repetitive, predictable CI/CD workflow:
| From template | To template | Count | P(B | A) |
|---|---|---|---|
| login (T1) | pull image (T2) | 4,982 | ≈ 0.998 |
| pull image (T2) | start container (T3) | 4,981 | ≈ 0.998 |
| start container (T3) | health check (T4) | 4,979 | ≈ 0.997 |
| login (T1) | sudo (T99) | 0 | ≈ 1.25e-10 |
When the attacker follows login with sudo: P(sudo | login) ≈ 1.25e-10. Anomaly score:
Sample detector output:
{
"detector": "markov",
"score": 0.95,
"entity": "svc-deploy",
"prev_template_id": 1,
"curr_template_id": 99,
"prev_template_label": "Accepted password for * from * port *",
"curr_template_label": "sudo: * : TTY=* ; PWD=* ; USER=root ; COMMAND=*",
"transition_count": 0,
"transition_probability": 1.25e-10,
"interpretation": "Near-zero sequence probability — this transition has never been observed for this entity"
}
Ops Walkthrough¶
api-gateway restarts normally follow pull → start → migrate → healthcheck. After the OOM kill, the container starts health checking before migration completes. The transition start → healthcheck has count 0 in the learned model (healthcheck has never directly followed start — migrate always comes first).
P(healthcheck | start) ≈ 1.25e-10. Score ≈ 0.88, firing well before the cascade of readiness probe failures produces volume-level signals in Holt-Winters or CUSUM.
Tuning Guide¶
When to Adjust¶
-
False positives from new entities: New entities start with empty transition matrices and reach
min_eventsbefore scoring begins, but early counts may still be sparse. Increasemin_eventsto 200 to require a more populated baseline before the detector fires. -
Missing sequence anomalies (novel transitions scoring too low): Decrease
markov_smoothingto1e-9. Lower smoothing increases the surprisal of unseen transitions, pushing scores closer to 1.0 for events the entity has never performed. -
High memory usage: Decrease
max_entitiesto 500. Entities with infrequent access will be evicted sooner. For deployments with a large number of ephemeral entities (short-lived containers, transient sessions), consider a smaller cap combined with a highermin_eventsso that ephemeral entities are unlikely to reach the scoring threshold before eviction.
Sensitivity Tradeoffs¶
| Smoothing | Unseen transition score | Best For |
|---|---|---|
1e-9 |
≈ 1.0 | High-security environments; maximum sensitivity to novel sequences |
1e-6 (default) |
≈ 0.95 | Balanced — catches rare transitions, tolerates minor vocabulary gaps |
1e-3 |
≈ 0.5 | Noisy environments with frequent template vocabulary churn |
Common Patterns¶
- Service accounts with predictable workflows: The Markov detector is most powerful here. A service account with 10,000 events and 5 distinct templates will have near-certainty for observed transitions and near-zero probability for anything else.
- Human users with variable workflows: Increase
min_events(e.g., 500) and smoothing (e.g.,1e-4). Human behavior is more varied; a lower smoothing value will generate false positives as users explore less common but legitimate paths. - Ephemeral entities (pods, containers): These rarely accumulate enough events to clear
min_events. The detector naturally skips them. If sequence anomalies in ephemeral entities are important, reducemin_eventsand accept more noisy early scores. - Post-deployment template shifts: New software versions often introduce new
template_idvalues. The Markov detector will score new templates as anomalous until their transitions are learned. This is expected — use HST's novelty signal for the initial burst, and allow Markov to catch sequence deviations once the new templates are established.
See Also¶
- Anomaly Detection concepts
- Ensemble overview
- Half-Space Trees (HST)
- Holt-Winters
- CUSUM
- Scoring & Attack Mapping
- Configuration Reference
Next: DSPOT → — adaptive thresholds via Extreme Value Theory and GPD tail fitting.