Building the Graph¶
Seerflow builds its entity graph from log events in three steps: extract entities from each event, resolve them to deterministic IDs, and infer edges from co-occurring entities. This page covers each step.
Interactive explorer: sample entity graph showing an SSH pivot attack.
Entity Types¶
Seerflow recognizes six entity types. Each maps to a struct in the codebase and a UUID5 namespace for deterministic ID generation.
| Type | Example | Canonical Form | UUID5 Namespace |
|---|---|---|---|
| User | alice, CORP\admin | lowercase, domain-normalized | a1b2c3d4-0001-... |
| IP | 10.0.0.5, 2001:db8::1 | normalized string | a1b2c3d4-0002-... |
| Host | web-server-01 | lowercase FQDN | a1b2c3d4-0003-... |
| Process | sshd (pid 1234) | name:pid:host | a1b2c3d4-0004-... |
| File | /etc/passwd | absolute path | a1b2c3d4-0005-... |
| Domain | evil-c2.example.com | lowercase, no trailing dot | a1b2c3d4-0006-... |
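To make the Canonical Form column concrete, here is a minimal sketch of how a few of these forms could be produced with the Python standard library. The function names are hypothetical and the exact rules are assumptions inferred from the table, not Seerflow's actual code.

```python
import ipaddress


def canonical_ip(raw: str) -> str:
    """Normalized string form: lowercase, zero-compressed for IPv6."""
    return str(ipaddress.ip_address(raw.strip()))


def canonical_domain(raw: str) -> str:
    """Lowercase with no trailing dot."""
    return raw.strip().rstrip(".").lower()


def canonical_process(name: str, pid: int, host: str) -> str:
    """name:pid:host, with the host lowercased (an assumption)."""
    return f"{name}:{pid}:{host.lower()}"
```

Normalizing before hashing matters because two spellings of the same address or domain would otherwise produce two different IDs.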
Deterministic IDs with UUID5¶
Every entity gets a deterministic UUID — the same entity always produces the same ID, regardless of which log source it appears in or when it's seen. This is how Seerflow correlates across sources.
The formula is simple:
entity_id = uuid5(namespace_for_type, canonical_form)
For example, user alice always produces:
uuid5(NS_USER, "alice") → always the same UUID
This means when alice appears in an SSH log, a sudo log, and a web access log, all three events link to the same graph node:
graph LR
SSH["SSH log:<br/><i>Failed password for alice</i>"] --> N((alice<br/><small>uuid5: a1b2...c3d4</small>))
SUDO["sudo log:<br/><i>alice ran apt install</i>"] --> N
WEB["web log:<br/><i>alice accessed /admin</i>"] --> N
style N fill:#4285f4,color:#fff
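The formula can be sketched with Python's standard `uuid` module. The namespace values below are placeholders derived for illustration only, because the table shows Seerflow's real namespaces in truncated form; `NS_ROOT`, `NAMESPACES`, and `entity_id` are hypothetical names.

```python
import uuid

# Placeholder namespaces (assumption): one UUID per entity type,
# derived from an arbitrary root so the example is reproducible.
NS_ROOT = uuid.uuid5(uuid.NAMESPACE_DNS, "seerflow.example")
NAMESPACES = {
    entity_type: uuid.uuid5(NS_ROOT, entity_type)
    for entity_type in ("user", "ip", "host", "process", "file", "domain")
}


def entity_id(entity_type: str, canonical_form: str) -> uuid.UUID:
    """Deterministic ID: the same type and canonical form give the same UUID."""
    return uuid.uuid5(NAMESPACES[entity_type], canonical_form)


# The same user seen in three different log sources maps to one ID.
ssh_id = entity_id("user", "alice")
sudo_id = entity_id("user", "alice")
web_id = entity_id("user", "alice")
```

Using a separate namespace per type keeps a user named `web-server-01` from colliding with a host of the same name.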
Username Normalization¶
Windows-style (CORP\admin) and email-style (admin@corp.local) usernames resolve to the same identity. Seerflow strips the domain prefix/suffix and lowercases:
- CORP\Admin → admin
- admin@corp.local → admin
- Admin → admin
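A minimal sketch of this normalization, assuming a single backslash-separated domain prefix or a single `@` suffix; `normalize_username` is a hypothetical name, and the real implementation may handle more formats.

```python
def normalize_username(raw: str) -> str:
    """Strip a DOMAIN\\ prefix or @domain suffix, then lowercase."""
    name = raw.strip()
    if "\\" in name:
        # Windows style: keep the part after the last backslash.
        name = name.rsplit("\\", 1)[1]
    if "@" in name:
        # Email style: keep the part before the first @.
        name = name.split("@", 1)[0]
    return name.lower()
```

The result is what would be passed as the canonical form when computing the user's UUID5.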
Entity Attributes¶
Each entity type carries additional attributes beyond its ID:
- User: domain, email, SID, UID, groups, is_service_account
- IP: version (4/6), is_private, is_tor_exit, ASN, geo (country/city)
- Host: FQDN, OS family, IP addresses, MAC addresses
- Process: PID, command line, image path, hashes, parent PID
- File: path, name, hashes, size, owner
- Domain: registrar, creation date, is_dga (domain generation algorithm)
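As an illustration of how such attributes might sit alongside the ID, the sketch below models the IP entity as a dataclass. The class and field names are hypothetical; only the attribute list comes from this page. `version` and `is_private` are derived with the standard `ipaddress` module, while the remaining fields would have to come from enrichment sources.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IPEntity:
    address: str
    version: int = field(init=False)      # 4 or 6, derived from the address
    is_private: bool = field(init=False)  # derived from the address
    is_tor_exit: bool = False             # would come from a threat feed
    asn: Optional[int] = None
    country: Optional[str] = None
    city: Optional[str] = None

    def __post_init__(self) -> None:
        parsed = ipaddress.ip_address(self.address)
        self.version = parsed.version
        self.is_private = parsed.is_private
```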
Edge Inference¶
When an event mentions multiple entities, Seerflow infers edges between them. A single log line can create multiple edges.
Example: SSH Login¶
The log line:
Failed password for alice from 10.0.0.5 port 22 on web-server-01
This line contains three entities: user alice, IP 10.0.0.5, and host web-server-01. Seerflow creates three edges:
graph LR
U[alice<br/><small>User</small>] -->|authenticated_from| I[10.0.0.5<br/><small>IP</small>]
U -->|logged_into| H[web-server-01<br/><small>Host</small>]
I -->|has_ip| H
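The extraction step for this particular log line could look like the sketch below. The regular expression and the `extract_entities` function are assumptions written for this one message format; this page does not describe how Seerflow actually parses events.

```python
import re

# Pattern for the example line only (hypothetical, not Seerflow's parser).
SSH_FAILED = re.compile(
    r"Failed password for (?P<user>\S+) from (?P<ip>\S+) "
    r"port (?P<port>\d+) on (?P<host>\S+)"
)


def extract_entities(line: str) -> list:
    """Return (entity_type, value) pairs found in the line, or an empty list."""
    match = SSH_FAILED.search(line)
    if match is None:
        return []
    return [
        ("user", match["user"]),
        ("ip", match["ip"]),
        ("host", match["host"]),
    ]
```

The port is matched but not returned, since it is not one of the six entity types.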
Relationship Types¶
The full set of edge types, defined in the EDGE_TYPE_MAP:
| Source Type | Target Type | Relationship |
|---|---|---|
| User | IP | authenticated_from |
| User | Host | logged_into |
| IP | Host | has_ip |
| User | File | accessed |
| IP | Domain | resolved_to |
| Process | Process | spawned_by |
Edges are bidirectional in lookup — if an event has (IP, User), Seerflow checks both (ip, user) and (user, ip) in the map.
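The table and the two-way lookup can be sketched as follows. The map contents mirror the table above; the `infer_edges` function, its input shape, and the lowercase type keys are assumptions rather than the contents of edges.py.

```python
from itertools import combinations

EDGE_TYPE_MAP = {
    ("user", "ip"): "authenticated_from",
    ("user", "host"): "logged_into",
    ("ip", "host"): "has_ip",
    ("user", "file"): "accessed",
    ("ip", "domain"): "resolved_to",
    ("process", "process"): "spawned_by",
}


def infer_edges(entities):
    """entities: list of (type, value). Returns (source, target, rel_type) tuples."""
    edges = []
    for (type_a, val_a), (type_b, val_b) in combinations(entities, 2):
        if (type_a, type_b) in EDGE_TYPE_MAP:
            edges.append((val_a, val_b, EDGE_TYPE_MAP[(type_a, type_b)]))
        elif (type_b, type_a) in EDGE_TYPE_MAP:
            # Reverse order matched: swap so the edge follows the map's direction.
            edges.append((val_b, val_a, EDGE_TYPE_MAP[(type_b, type_a)]))
    return edges
```

For same-type pairs such as process to process, this sketch simply uses the order in which the entities appear; deciding which process is the parent would need extra context from the event.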
Core Set
The EDGE_TYPE_MAP above shows the relationship types currently implemented in edges.py. Additional relationship types (e.g., host-to-process, process-to-file) may be inferred by custom rules or future extensions. The interactive explorer includes illustrative relationship types beyond this core set.
Edge Deduplication¶
The same edge can be inferred from many events. Rather than creating duplicate edges, Seerflow merges them:
- first_seen = earliest timestamp across all events
- last_seen = latest timestamp across all events
- event_count = total number of events that produced this edge
This means a single edge between alice and web-server-01 might represent hundreds of SSH sessions, with metadata showing when the first and last sessions occurred:
graph LR
U((alice)) -->|"logged_into<br/><small>first: Jan 3 09:12<br/>last: Apr 9 14:30<br/>count: 347</small>"| H((web-server-01))
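A minimal sketch of the merge logic, keyed on source, target, and relationship type. `EdgeStats` and `record_edge` are hypothetical names, and timestamps are assumed to be numeric (for example, Unix seconds).

```python
from dataclasses import dataclass


@dataclass
class EdgeStats:
    first_seen: float
    last_seen: float
    event_count: int = 1


def record_edge(edges: dict, source_id: str, target_id: str,
                rel_type: str, timestamp: float) -> EdgeStats:
    """Create the edge on first sight, otherwise merge into the existing one."""
    key = (source_id, target_id, rel_type)
    stats = edges.get(key)
    if stats is None:
        stats = EdgeStats(first_seen=timestamp, last_seen=timestamp)
        edges[key] = stats
    else:
        # min/max keep the bounds correct even if events arrive out of order.
        stats.first_seen = min(stats.first_seen, timestamp)
        stats.last_seen = max(stats.last_seen, timestamp)
        stats.event_count += 1
    return stats
```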
igraph Implementation¶
Seerflow uses igraph as its graph engine rather than the more commonly known NetworkX.
Why igraph?¶
| Metric | igraph | NetworkX |
|---|---|---|
| Speed | 40-250x faster | Baseline |
| Memory per edge | 32 bytes | ~200+ bytes |
| Community detection (10K nodes) | ~50ms | ~5 seconds |
| PageRank (10K nodes) | ~20ms | ~2 seconds |
For a streaming SIEM processing thousands of events per second, this performance difference matters. igraph is written in C with a Python binding, giving near-native performance with a Python API.
Graph Data Structure¶
# Seerflow's EntityGraph wraps igraph.Graph
import igraph
graph = igraph.Graph(directed=True)  # directed; parallel edges are allowed, so it acts as a multigraph
Key implementation details:
- O(1) vertex lookup: An internal _vertex_map (dict of str → int) maps entity UUIDs to igraph vertex indices. Adding or finding a vertex is constant time.
- Multigraph support: Multiple edges between the same pair of nodes are allowed if they have different rel_type values. A user can both logged_into and accessed the same host.
- Edge attributes: Each edge stores rel_type, first_seen, last_seen, and event_count.
- Vertex attributes: Arbitrary key-value pairs (community ID, centrality scores, risk scores) can be stored on vertices via set_vertex_attr().
Persistence¶
The entity graph survives restarts through export/import:
- export_edges() returns a list of tuples: (source_id, target_id, rel_type, first_seen, last_seen, event_count)
- load() rebuilds the graph from these tuples
This format is stored in the configured storage backend (SQLite or PostgreSQL).
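For the SQLite case, a round trip of these tuples might look like the sketch below. The table name, column types, and the `save_edges` and `load_edges` helpers are assumptions for illustration; Seerflow's actual schema is not shown on this page.

```python
import sqlite3


def save_edges(conn: sqlite3.Connection, edges: list) -> None:
    """Persist (source_id, target_id, rel_type, first_seen, last_seen, event_count)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS edges (
               source_id TEXT, target_id TEXT, rel_type TEXT,
               first_seen REAL, last_seen REAL, event_count INTEGER,
               PRIMARY KEY (source_id, target_id, rel_type))"""
    )
    # The primary key makes a re-export overwrite rather than duplicate an edge.
    conn.executemany("INSERT OR REPLACE INTO edges VALUES (?, ?, ?, ?, ?, ?)", edges)
    conn.commit()


def load_edges(conn: sqlite3.Connection) -> list:
    """Read the tuples back in a stable order."""
    cursor = conn.execute(
        "SELECT source_id, target_id, rel_type, first_seen, last_seen, event_count "
        "FROM edges ORDER BY source_id, target_id, rel_type"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
exported = [("uuid-alice", "uuid-host", "logged_into", 1.0, 2.0, 347)]
save_edges(conn, exported)
restored = load_edges(conn)
```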
Next: Algorithms & Detection covers how Seerflow analyzes the graph to detect threats.