DataPoints: Atomic Units of Knowledge

DataPoints are the smallest building blocks in Cognee.
They represent atomic units of knowledge — carrying both your actual content and the context needed to process, index, and connect it. They’re the reason Cognee can turn raw documents into something that’s both searchable (via vectors) and connected (via graphs).

What are DataPoints

Atomic — each DataPoint represents one concept or unit of information.
Structured — implemented as Pydantic models for validation and serialization.
Contextual — carry provenance, versioning, and indexing hints so every step downstream knows where data came from and how to use it.

Core Structure

A DataPoint is just a Pydantic model with a set of standard fields.

See example class definition

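from typing import List, Optional
from uuid import UUID, uuid4

from pydantic import BaseModel, Field

class DataPoint(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    created_at: int = ...  # default elided here; value is ms since epoch
    updated_at: int = ...  # default elided here; value is ms since epoch
    version: int = 1
    topological_rank: Optional[int] = 0
    metadata: Optional[dict] = {"index_fields": []}
    type: str = "DataPoint"
    belongs_to_set: Optional[List["DataPoint"]] = None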

Key fields:

id — unique identifier (shared across all three stores, linking vector, graph, and relational records for the same DataPoint)
created_at, updated_at — timestamps (ms since epoch)
version — for tracking changes and schema evolution
topological_rank — an integer indicating the DataPoint’s position in a dependency hierarchy. Lower ranks mean fewer dependencies. For example, an Entity that other DataPoints reference would have a lower rank than a TextSummary that depends on it. Defaults to 0.
metadata.index_fields — critical: determines which fields are embedded for vector search
type — the Python class name of the DataPoint subclass (e.g., "Person", "Book")
belongs_to_set — groups related DataPoints
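
To make the field defaults concrete, here is a minimal stdlib-only sketch. It uses a dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies. `SketchPoint` and `now_ms` are hypothetical names, not part of Cognee, and the real defaults may be computed differently.

```python
import time
from dataclasses import dataclass, field
from uuid import UUID, uuid4


def now_ms() -> int:
    """Current time as integer milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


@dataclass
class SketchPoint:
    """Hypothetical stand-in for DataPoint; not Cognee's implementation."""
    id: UUID = field(default_factory=uuid4)
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)
    version: int = 1
    topological_rank: int = 0
    metadata: dict = field(default_factory=lambda: {"index_fields": []})
    type: str = ""

    def __post_init__(self) -> None:
        # Follow the documented convention: type is the subclass name.
        if not self.type:
            self.type = self.__class__.__name__


@dataclass
class Person(SketchPoint):
    name: str = ""


alice = Person(name="Alice")
bob = Person(name="Bob")
print(alice.type)          # Person
print(alice.id != bob.id)  # True
```

Each instance gets its own id, timestamps, and metadata dict, which is why the mutable defaults go through `default_factory` rather than a shared literal.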

Indexing & Embeddings

The metadata.index_fields list tells Cognee which fields to embed into the vector store. This is the mechanism behind semantic search.

Fields in index_fields → converted into embeddings
Each indexed field → its own vector collection named Class_field (e.g., a Person DataPoint with index_fields=["name"] creates a Person_name vector collection). The Class part comes from the Python class name of your DataPoint subclass.
Non-indexed fields → stay as regular properties in the graph and relational stores
Choosing what to index controls search granularity
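
The naming rule and the indexed/non-indexed split above can be sketched in a few lines. The helper names `vector_collections` and `split_fields` are hypothetical and only illustrate the convention; they are not Cognee functions.

```python
def vector_collections(class_name: str, index_fields: list[str]) -> list[str]:
    """Return one collection name per indexed field, as Class_field."""
    return [f"{class_name}_{field_name}" for field_name in index_fields]


def split_fields(properties: dict, index_fields: list[str]) -> tuple[dict, dict]:
    """Separate fields to embed from fields kept only as plain properties."""
    indexed = {k: v for k, v in properties.items() if k in index_fields}
    plain = {k: v for k, v in properties.items() if k not in index_fields}
    return indexed, plain


names = vector_collections("Person", ["name"])
indexed, plain = split_fields({"name": "Alice", "age": 30}, ["name"])
print(names)    # ['Person_name']
print(indexed)  # {'name': 'Alice'}
print(plain)    # {'age': 30}
```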

Cross-store retrieval: When a vector search finds a match, Cognee uses the shared id to retrieve the full DataPoint from the graph store, which holds all properties (not just the indexed field). This is how Cognee returns complete results from a semantic search.
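
As a rough illustration of this lookup, the sketch below uses two in-memory dictionaries in place of real vector and graph stores. The store layout, the sample records, and the `search` helper are all assumptions made for the example, not Cognee's API.

```python
import math

# Toy stand-ins for the two stores, linked by a shared id (hypothetical data).
vector_store = {
    "Person_name": [
        ("id-1", [1.0, 0.0]),
        ("id-2", [0.0, 1.0]),
    ]
}
graph_store = {
    "id-1": {"type": "Person", "name": "Alice", "age": 30},
    "id-2": {"type": "Person", "name": "Bob", "age": 41},
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(collection: str, query_vector: list[float]) -> dict:
    """Find the closest vector, then load the full record by its id."""
    best_id, _ = max(
        vector_store[collection],
        key=lambda item: cosine(item[1], query_vector),
    )
    return graph_store[best_id]


result = search("Person_name", [0.9, 0.1])
print(result["name"], result["age"])  # Alice 30
```

The vector store only knows about the indexed field, yet the result includes `age` because the full record comes from the graph side.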

From DataPoints to the Graph

When you call add_data_points(), Cognee automatically:

Embeds the indexed fields into vectors
Converts the object into nodes and edges in the knowledge graph
Stores provenance in the relational store

This is how Cognee creates both semantic similarity (vector) and structural reasoning (graph) from the same unit.
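
The three steps can be pictured with a toy fan-out function. Everything here is a simplified assumption: `toy_add_data_points` and `toy_embed` are hypothetical names, the "embedding" is a placeholder rather than a real model, and the shape of the provenance row is invented for illustration.

```python
import time
from uuid import uuid4

vector_store: dict = {}     # collection name -> {id: vector}
graph_nodes: dict = {}      # id -> all properties
relational_rows: list = []  # provenance records


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding (not a real model): text length and vowel count."""
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]


def toy_add_data_points(points: list[dict], source: str) -> None:
    for point in points:
        point_id = point["id"]
        # 1. Embed each indexed field into its own Class_field collection.
        for field_name in point["metadata"]["index_fields"]:
            collection = f'{point["type"]}_{field_name}'
            vector_store.setdefault(collection, {})[point_id] = toy_embed(point[field_name])
        # 2. Store the whole object as a graph node.
        graph_nodes[point_id] = point
        # 3. Record provenance (assumed shape) in the relational store.
        relational_rows.append(
            {"id": point_id, "source": source, "stored_at": int(time.time() * 1000)}
        )


person = {
    "id": str(uuid4()),
    "type": "Person",
    "name": "Alice",
    "age": 30,
    "metadata": {"index_fields": ["name"]},
}
toy_add_data_points([person], source="example.txt")
print(list(vector_store))  # ['Person_name']
```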

Examples and details

Example: indexing only one field

class Person(DataPoint):
    name: str
    age: int
    metadata: dict = {"index_fields": ["name"]}

Only "name" is semantically searchable

Example: Book → Author transformation

# Author is assumed to be another DataPoint subclass
class Author(DataPoint):
    name: str

class Book(DataPoint):
    title: str
    author: Author
    metadata: dict = {"index_fields": ["title"]}

# Produces:
# Node(Book) with {title, type, ...}
# Node(Author) with {name, type, ...}
# Edge(Book → Author, type="author")
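
One way to picture this transformation is a small recursive walk over the object: plain values become node properties, and nested objects become their own nodes plus an edge named after the field. The sketch below uses stdlib dataclasses and a hypothetical `to_graph` helper. It identifies nodes by class name for brevity, whereas a real store would use the DataPoint id.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Author:
    name: str


@dataclass
class Book:
    title: str
    author: Author


def to_graph(obj) -> tuple[list, list]:
    """Flatten a nested object into (nodes, edges); field names label the edges."""
    nodes, edges = [], []
    props = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if is_dataclass(value):
            # A nested object becomes its own node plus an edge named after the field.
            child_nodes, child_edges = to_graph(value)
            nodes += child_nodes
            edges += child_edges
            edges.append((type(obj).__name__, f.name, type(value).__name__))
        else:
            props[f.name] = value
    nodes.insert(0, (type(obj).__name__, props))
    return nodes, edges


nodes, edges = to_graph(Book(title="Dune", author=Author(name="Frank Herbert")))
print(nodes)  # [('Book', {'title': 'Dune'}), ('Author', {'name': 'Frank Herbert'})]
print(edges)  # [('Book', 'author', 'Author')]
```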

Relationship syntax options

# Simple relationship
author: Author

# With edge metadata
has_items: (Edge(weight=0.8), list[Item])

# List relationship
chapters: list[Chapter]
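
To show how these three shapes might be interpreted, the sketch below expands a field value into edge tuples. `EdgeInfo` and `edges_for_field` are hypothetical stand-ins, targets are plain strings, and the default weight of 1.0 is an assumption. Cognee's own handling of `Edge` may differ.

```python
from dataclasses import dataclass


@dataclass
class EdgeInfo:
    """Hypothetical stand-in for edge metadata such as a weight."""
    weight: float = 1.0


def edges_for_field(source: str, field_name: str, value) -> list[tuple]:
    """Expand one relationship value into (source, label, target, weight) tuples."""
    info = EdgeInfo()
    if isinstance(value, tuple):
        # Edge metadata form: (EdgeInfo, target or list of targets)
        info, value = value
    targets = value if isinstance(value, list) else [value]
    return [(source, field_name, target, info.weight) for target in targets]


simple = edges_for_field("Book", "author", "Author")
listed = edges_for_field("Book", "chapters", ["Chapter1", "Chapter2"])
weighted = edges_for_field("Cart", "has_items", (EdgeInfo(weight=0.8), ["ItemA", "ItemB"]))
print(simple)    # [('Book', 'author', 'Author', 1.0)]
print(weighted)  # [('Cart', 'has_items', 'ItemA', 0.8), ('Cart', 'has_items', 'ItemB', 0.8)]
```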

Built-in DataPoint types

Cognee ships with several built-in DataPoint types:

Documents — wrappers for source files (Text, PDF, Audio, Image)
- Document (metadata.index_fields=["name"])
Chunks — segmented portions of documents
- DocumentChunk (metadata.index_fields=["text"])
Summaries — generated text or code summaries
- TextSummary / CodeSummary (metadata.index_fields=["text"])
Entities — named objects (people, places, concepts)
- Entity, EntityType (metadata.index_fields=["name"])
Edges — relationships between DataPoints
- Edge — links between DataPoints

Example: custom DataPoint with best practices

class Product(DataPoint):
    name: str
    description: str
    price: float
    category: Category  # relationship to a Category DataPoint, assumed defined elsewhere

    # Index name + description for search
    metadata: dict = {"index_fields": ["name", "description"]}

Best Practices:

Keep it small — one concept per DataPoint
Index carefully — only fields that matter for semantic search
Use built-in types first — extend with custom subclasses when needed
Version deliberately — track changes with version
Group related points — with belongs_to_set

Tasks

Learn how DataPoints are created and processed

Pipelines

See how DataPoints flow through processing workflows

Main Operations

Understand how DataPoints are used in Add, Cognify, and Search
