Building the Graph

Seerflow builds its entity graph from log events in three steps: extract entities from each event, resolve them to deterministic IDs, and infer edges from co-occurring entities. This page covers each step.

Interactive: explore a sample entity graph

Sample entity graph showing an SSH pivot attack. Click any node for details, drag to reposition, scroll to zoom.


Entity Types

Seerflow recognizes six entity types. Each maps to a struct in the codebase and a UUID5 namespace for deterministic ID generation.

| Type | Example | Canonical Form | UUID5 Namespace |
| --- | --- | --- | --- |
| User | alice, CORP\admin | lowercase, domain-normalized | a1b2c3d4-0001-... |
| IP | 10.0.0.5, 2001:db8::1 | normalized string | a1b2c3d4-0002-... |
| Host | web-server-01 | lowercase FQDN | a1b2c3d4-0003-... |
| Process | sshd (pid 1234) | name:pid:host | a1b2c3d4-0004-... |
| File | /etc/passwd | absolute path | a1b2c3d4-0005-... |
| Domain | evil-c2.example.com | lowercase, no trailing dot | a1b2c3d4-0006-... |
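
As a rough illustration of the canonical forms in the table, the sketch below normalizes an IP, a domain, and a process. The function names are hypothetical and the exact rules Seerflow applies are not shown on this page; lowercasing the host part of the process key is an assumption based on the Host row.

```python
import ipaddress


def canonical_ip(raw: str) -> str:
    # The stdlib parser yields one normalized spelling per address
    # (lowercase hex and compressed zero groups for IPv6).
    return str(ipaddress.ip_address(raw.strip()))


def canonical_domain(raw: str) -> str:
    # Lowercase and drop a trailing dot, as in the Domain row.
    return raw.strip().lower().rstrip(".")


def canonical_process(name: str, pid: int, host: str) -> str:
    # name:pid:host, with the host lowercased (assumption).
    return f"{name}:{pid}:{host.lower()}"
```

Normalizing before hashing matters because two spellings of the same address or domain would otherwise produce two different entity IDs.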

Deterministic IDs with UUID5

Every entity gets a deterministic UUID — the same entity always produces the same ID, regardless of which log source it appears in or when it's seen. This is how Seerflow correlates across sources.

The formula is simple:

entity_id = uuid5(namespace_for_type, canonical_form)

For example, user alice always produces:

uuid5(NS_USER, "alice") → always the same UUID

This means when alice appears in an SSH log, a sudo log, and a web access log, all three events link to the same graph node:

graph LR
    SSH["SSH log:<br/><i>Failed password for alice</i>"] --> N((alice<br/><small>uuid5: a1b2...c3d4</small>))
    SUDO["sudo log:<br/><i>alice ran apt install</i>"] --> N
    WEB["web log:<br/><i>alice accessed /admin</i>"] --> N
    style N fill:#4285f4,color:#fff
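
The formula above can be sketched with Python's standard uuid module. The namespace values here are placeholders derived for illustration only; the real per-type namespaces are elided in the table and are not reproduced.

```python
import uuid

# Placeholder namespaces (assumption): Seerflow's actual values are not shown.
NS_USER = uuid.uuid5(uuid.NAMESPACE_DNS, "user.entities.example")
NS_HOST = uuid.uuid5(uuid.NAMESPACE_DNS, "host.entities.example")


def entity_id(namespace: uuid.UUID, canonical_form: str) -> uuid.UUID:
    # uuid5 hashes (namespace, name) with SHA-1, so the result is stable
    # across processes, machines, and time.
    return uuid.uuid5(namespace, canonical_form)


# The same user seen in two different log sources maps to one ID.
from_ssh_log = entity_id(NS_USER, "alice")
from_sudo_log = entity_id(NS_USER, "alice")
assert from_ssh_log == from_sudo_log

# A per-type namespace keeps a user and a host with the same name apart.
assert entity_id(NS_HOST, "alice") != from_ssh_log
```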

Username Normalization

Windows-style (CORP\admin) and email-style (admin@corp.local) usernames resolve to the same identity. Seerflow strips the domain prefix/suffix and lowercases:

  • CORP\Admin → admin
  • admin@corp.local → admin
  • Admin → admin
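
A minimal sketch of this normalization, assuming the only rules are the three listed above (strip a DOMAIN\ prefix, strip an @domain suffix, lowercase). The function name normalize_username is hypothetical.

```python
def normalize_username(raw: str) -> str:
    name = raw.strip()
    # Windows-style: keep what follows the last backslash.
    if "\\" in name:
        name = name.rsplit("\\", 1)[1]
    # Email-style: keep what precedes the first "@".
    if "@" in name:
        name = name.split("@", 1)[0]
    return name.lower()
```

Note that stripping the domain means accounts with the same name in different domains collapse into one identity under this sketch.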

Entity Attributes

Each entity type carries additional attributes beyond its ID:

  • User: domain, email, SID, UID, groups, is_service_account
  • IP: version (4/6), is_private, is_tor_exit, ASN, geo (country/city)
  • Host: FQDN, OS family, IP addresses, MAC addresses
  • Process: PID, command line, image path, hashes, parent PID
  • File: path, name, hashes, size, owner
  • Domain: registrar, creation date, is_dga (domain generation algorithm)
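
To make the attribute lists concrete, here is a hedged sketch of how the IP attributes might be modeled. The class and field names are assumptions, not Seerflow's actual struct. Only version and is_private can be derived from the address itself; Tor, ASN, and geo data need an external source, so they default to None here.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class IPAttributes:
    version: int                       # 4 or 6
    is_private: bool
    is_tor_exit: Optional[bool] = None  # requires an external feed
    asn: Optional[int] = None           # requires an external lookup
    country: Optional[str] = None
    city: Optional[str] = None


def ip_attributes(raw: str) -> IPAttributes:
    addr = ipaddress.ip_address(raw)
    return IPAttributes(version=addr.version, is_private=addr.is_private)
```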

Edge Inference

When an event mentions multiple entities, Seerflow infers edges between them. A single log line can create multiple edges.

Example: SSH Login

The log line:

Failed password for alice from 10.0.0.5 port 22 on web-server-01

This line contains three entities: user alice, IP 10.0.0.5, and host web-server-01. Seerflow creates three edges:

graph LR
    U[alice<br/><small>User</small>] -->|authenticated_from| I[10.0.0.5<br/><small>IP</small>]
    U -->|logged_into| H[web-server-01<br/><small>Host</small>]
    I -->|has_ip| H

Relationship Types

The full set of edge types, defined in the EDGE_TYPE_MAP:

| Source Type | Target Type | Relationship |
| --- | --- | --- |
| User | IP | authenticated_from |
| User | Host | logged_into |
| IP | Host | has_ip |
| User | File | accessed |
| IP | Domain | resolved_to |
| Process | Process | spawned_by |

Lookup in the map is order-insensitive: if an event yields the pair (IP, User), Seerflow checks both (ip, user) and (user, ip), so the order in which entities are extracted does not matter.
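
The pairwise lookup could be sketched as follows. The dictionary layout and the infer_edges function are assumptions for illustration; only the six type pairs and relationship names come from the table above.

```python
from itertools import combinations

# Assumed layout: (source_type, target_type) -> relationship name.
EDGE_TYPE_MAP = {
    ("user", "ip"): "authenticated_from",
    ("user", "host"): "logged_into",
    ("ip", "host"): "has_ip",
    ("user", "file"): "accessed",
    ("ip", "domain"): "resolved_to",
    ("process", "process"): "spawned_by",
}


def infer_edges(entities):
    """entities: list of (entity_type, value) pairs from a single event.

    Returns (source, target, rel_type) triples. For same-type pairs such as
    process/process, this sketch keeps extraction order; choosing the real
    direction would need extra information (for example a parent PID).
    """
    edges = []
    for (a_type, a), (b_type, b) in combinations(entities, 2):
        if (a_type, b_type) in EDGE_TYPE_MAP:
            edges.append((a, b, EDGE_TYPE_MAP[(a_type, b_type)]))
        elif (b_type, a_type) in EDGE_TYPE_MAP:
            # Reverse key matched: emit the edge in the map's direction.
            edges.append((b, a, EDGE_TYPE_MAP[(b_type, a_type)]))
    return edges
```

Calling infer_edges with the three entities of the SSH example, in any order, yields the same three edges shown in the diagram.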

Core Set

The EDGE_TYPE_MAP above shows the relationship types currently implemented in edges.py. Additional relationship types (e.g., host-to-process, process-to-file) may be inferred by custom rules or future extensions. The interactive explorer includes illustrative relationship types beyond this core set.

Edge Deduplication

The same edge can be inferred from many events. Rather than creating duplicate edges, Seerflow merges them:

  • first_seen = earliest timestamp across all events
  • last_seen = latest timestamp across all events
  • event_count = total number of events that produced this edge

This means a single edge between alice and web-server-01 might represent hundreds of SSH sessions, with metadata showing when the first and last sessions occurred:

graph LR
    U((alice)) -->|"logged_into<br/><small>first: Jan 3 09:12<br/>last: Apr 9 14:30<br/>count: 347</small>"| H((web-server-01))
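
The merge rule can be sketched with a dictionary keyed by (source, target, relationship). EdgeStats and record_edge are hypothetical names, and timestamps are assumed to be numeric (for example Unix seconds).

```python
from dataclasses import dataclass


@dataclass
class EdgeStats:
    first_seen: float
    last_seen: float
    event_count: int = 1


def record_edge(edges, source_id, target_id, rel_type, timestamp):
    key = (source_id, target_id, rel_type)
    stats = edges.get(key)
    if stats is None:
        # First sighting: both bounds start at this event's timestamp.
        edges[key] = EdgeStats(first_seen=timestamp, last_seen=timestamp)
    else:
        # Repeat sighting: widen the time window and bump the counter.
        # min/max keep the bounds correct even for out-of-order events.
        stats.first_seen = min(stats.first_seen, timestamp)
        stats.last_seen = max(stats.last_seen, timestamp)
        stats.event_count += 1
    return edges[key]
```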

igraph Implementation

Seerflow uses igraph as its graph engine rather than the more commonly known NetworkX.

Why igraph?

| Metric | igraph | NetworkX |
| --- | --- | --- |
| Speed | 40-250x faster | Baseline |
| Memory per edge | 32 bytes | ~200+ bytes |
| Community detection (10K nodes) | ~50ms | ~5 seconds |
| PageRank (10K nodes) | ~20ms | ~2 seconds |

For a streaming SIEM processing thousands of events per second, this performance difference matters. igraph is written in C with a Python binding, giving near-native performance with a Python API.

Graph Data Structure

import igraph

# Seerflow's EntityGraph wraps igraph.Graph
graph = igraph.Graph(directed=True)  # directed; parallel edges are allowed, so it acts as a multigraph

Key implementation details:

  • O(1) vertex lookup: An internal _vertex_map (dict of str → int) maps entity UUIDs to igraph vertex indices, so finding an existing vertex, or determining that a new one must be added, is a constant-time dictionary lookup on average.
  • Multigraph support: Multiple edges between the same pair of nodes are allowed if they have different rel_type values. For example, a user can have both logged_into and accessed edges to the same host.
  • Edge attributes: Each edge stores rel_type, first_seen, last_seen, and event_count.
  • Vertex attributes: Arbitrary key-value pairs (community ID, centrality scores, risk scores) can be stored on vertices via set_vertex_attr().

Persistence

The entity graph survives restarts through export/import:

  • export_edges() returns a list of tuples: (source_id, target_id, rel_type, first_seen, last_seen, event_count)
  • load() rebuilds the graph from these tuples

This format is stored in the configured storage backend (SQLite or PostgreSQL).
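
As an illustration of the round trip, the sketch below writes edge tuples in the format above to SQLite and reads them back. The table name, column types, and the save_edges and load_edges helpers are assumptions; Seerflow's actual schema and storage API are not shown on this page.

```python
import sqlite3

# Tuple layout from the text:
# (source_id, target_id, rel_type, first_seen, last_seen, event_count)


def save_edges(conn, edges):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS edges (
               source_id TEXT, target_id TEXT, rel_type TEXT,
               first_seen REAL, last_seen REAL, event_count INTEGER,
               PRIMARY KEY (source_id, target_id, rel_type))"""
    )
    # The primary key mirrors the deduplication key, so re-saving an edge
    # overwrites its row instead of adding a duplicate.
    conn.executemany(
        "INSERT OR REPLACE INTO edges VALUES (?, ?, ?, ?, ?, ?)", edges
    )
    conn.commit()


def load_edges(conn):
    rows = conn.execute(
        "SELECT source_id, target_id, rel_type, first_seen, last_seen, "
        "event_count FROM edges ORDER BY source_id, target_id, rel_type"
    )
    return [tuple(row) for row in rows]
```

Because vertices can be recreated from the source and target IDs in each tuple, persisting only edges is enough to rebuild the graph topology in this sketch; vertex attributes would need separate storage.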

Next: Algorithms & Detection, which covers how Seerflow analyzes the graph to detect threats.