DataPoints: Atomic Units of Knowledge
DataPoints are the smallest building blocks in Cognee. They represent atomic units of knowledge, carrying both your actual content and the context needed to process, index, and connect it. They're the reason Cognee can turn raw documents into something that's both searchable (via vectors) and connected (via graphs).
What are DataPoints
- Atomic — each DataPoint represents one concept or unit of information.
- Structured — implemented as Pydantic models for validation and serialization.
- Contextual — carry provenance, versioning, and indexing hints so every step downstream knows where data came from and how to use it.
Core Structure
A DataPoint is just a Pydantic model with a set of standard fields.
- id: unique identifier (shared across all three stores, linking vector, graph, and relational records for the same DataPoint)
- created_at, updated_at: timestamps (ms since epoch)
- version: for tracking changes and schema evolution
- topological_rank: an integer indicating the DataPoint's position in a dependency hierarchy. Lower ranks mean fewer dependencies. For example, an Entity that other DataPoints reference would have a lower rank than a TextSummary that depends on it. Defaults to 0.
- metadata.index_fields: critical, determines which fields are embedded for vector search
- type: the Python class name of the DataPoint subclass (e.g., "Person", "Book")
- belongs_to_set: groups related DataPoints
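To make the field layout concrete, here is a minimal sketch that mirrors the standard fields using only the Python standard library. It is a stand-in, not Cognee's actual class: the real base class is a Pydantic model, and the name DataPointSketch, the helper now_ms, and the starting value of version are assumptions made for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


def now_ms() -> int:
    """Current time in milliseconds since the Unix epoch (hypothetical helper)."""
    return int(time.time() * 1000)


@dataclass
class DataPointSketch:
    """Stand-in that mirrors the standard fields; NOT Cognee's real base class."""
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)
    version: int = 1  # assumed starting value
    topological_rank: int = 0
    metadata: dict = field(default_factory=lambda: {"index_fields": []})
    type: str = ""
    belongs_to_set: Optional[list] = None

    def __post_init__(self):
        # type records the Python class name of the concrete subclass.
        if not self.type:
            self.type = self.__class__.__name__


@dataclass
class Person(DataPointSketch):
    """Example subclass: one content field, which is also the indexed field."""
    name: str = ""
    metadata: dict = field(default_factory=lambda: {"index_fields": ["name"]})


person = Person(name="Ada Lovelace")
print(person.type, person.topological_rank, person.metadata["index_fields"])
```

A subclass only adds its own content fields and overrides metadata; the identifier, timestamps, and type are filled in automatically.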
Indexing & Embeddings
The metadata.index_fields setting tells Cognee which fields to embed into the vector store.
This is the mechanism behind semantic search.
- Fields in index_fields → converted into embeddings
- Each indexed field → its own vector collection named Class_field (e.g., a Person DataPoint with index_fields=["name"] creates a Person_name vector collection). The Class part comes from the Python class name of your DataPoint subclass.
- Non-indexed fields → stay as regular properties in the graph and relational stores
- Choosing what to index controls search granularity
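The naming rule and the indexed/non-indexed split can be sketched in a few lines of plain Python. The function names vector_collection_names and split_fields are hypothetical helpers written for this illustration; they are not part of Cognee's API.

```python
from typing import Dict, List, Tuple


def vector_collection_names(class_name: str, index_fields: List[str]) -> List[str]:
    """One collection per indexed field, named Class_field."""
    return [f"{class_name}_{field_name}" for field_name in index_fields]


def split_fields(properties: Dict[str, str], index_fields: List[str]) -> Tuple[dict, dict]:
    """Separate fields that get embedded from those kept only as properties."""
    indexed = {k: v for k, v in properties.items() if k in index_fields}
    stored_only = {k: v for k, v in properties.items() if k not in index_fields}
    return indexed, stored_only


person_properties = {"name": "Ada Lovelace", "bio": "Mathematician"}
index_fields = ["name"]

collections = vector_collection_names("Person", index_fields)
indexed, stored_only = split_fields(person_properties, index_fields)

print(collections)   # ['Person_name']
print(indexed)       # {'name': 'Ada Lovelace'}
print(stored_only)   # {'bio': 'Mathematician'}
```

Adding "bio" to index_fields in this sketch would produce a second collection, Person_bio, which is the sense in which the choice of indexed fields controls search granularity.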
Cross-store retrieval: When a vector search finds a match, Cognee uses the shared id to retrieve the full DataPoint from the graph store, which holds all properties (not just the indexed field). This is how Cognee returns complete results from a semantic search.
From DataPoints to the Graph
When you call add_data_points(), Cognee automatically:
- Embeds the indexed fields into vectors
- Converts the object into nodes and edges in the knowledge graph
- Stores provenance in the relational store
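The three writes above, together with the id-based lookup used for cross-store retrieval, can be sketched with in-memory dictionaries. Everything here is a stand-in under stated assumptions: store_data_point and search are hypothetical functions rather than Cognee's implementation, toy_embed is a placeholder and not a real embedding model, and the "source" key is an invented example of a provenance record.

```python
import uuid

vector_store = {}      # collection name -> {id: vector}
graph_store = {}       # id -> full set of properties
relational_store = {}  # id -> provenance record


def toy_embed(text):
    """Placeholder embedding: deterministic numbers, NOT a real model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def store_data_point(class_name, properties, index_fields, source):
    """Write one record to all three stores under a single shared id."""
    point_id = str(uuid.uuid4())
    for field_name in index_fields:
        collection = vector_store.setdefault(f"{class_name}_{field_name}", {})
        collection[point_id] = toy_embed(properties[field_name])
    graph_store[point_id] = {"type": class_name, **properties}
    relational_store[point_id] = {"source": source}  # hypothetical provenance
    return point_id


def search(collection_name, query):
    """Find the nearest vector, then fetch the full record by shared id."""
    query_vector = toy_embed(query)
    collection = vector_store[collection_name]

    def squared_distance(point_id):
        return sum((a - b) ** 2 for a, b in zip(query_vector, collection[point_id]))

    best_id = min(collection, key=squared_distance)
    return graph_store[best_id]


store_data_point("Person", {"name": "Ada Lovelace", "bio": "Mathematician"}, ["name"], "doc_a.txt")
store_data_point("Person", {"name": "Alan Turing", "bio": "Logician"}, ["name"], "doc_b.txt")

result = search("Person_name", "Ada Lovelace")
print(result["bio"])  # the non-indexed field comes back too
```

Only name is embedded, yet the result includes bio, because the match in the vector collection is resolved to the full record through the shared id.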
Examples and details
Example: indexing only one field
With index_fields=["name"], only "name" is semantically searchable.
Example: Book → Author transformation
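As a rough illustration of how an object that references another object might become nodes and edges, the sketch below walks a Book that holds an Author. It uses standard-library dataclasses as stand-ins for DataPoints, and it assumes the edge label is taken from the field name ("author"); that labeling, the to_graph helper, and the string ids are assumptions for this example, not confirmed Cognee behavior.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Author:
    id: str
    name: str


@dataclass
class Book:
    id: str
    title: str
    author: Author  # a reference to another record becomes an edge


def to_graph(root):
    """Flatten nested records into nodes plus (source, label, target) edges.

    Cycles are not handled in this sketch.
    """
    nodes, edges = {}, []

    def visit(point):
        properties = {}
        for f in fields(point):
            value = getattr(point, f.name)
            if is_dataclass(value):
                # Nested record: emit an edge labeled with the field name.
                edges.append((point.id, f.name, value.id))
                visit(value)
            else:
                properties[f.name] = value
        nodes[point.id] = {"type": type(point).__name__, **properties}

    visit(root)
    return nodes, edges


book = Book(id="b1", title="Example Book", author=Author(id="a1", name="Example Author"))
nodes, edges = to_graph(book)

print(edges)        # [('b1', 'author', 'a1')]
print(nodes["a1"])  # {'type': 'Author', 'id': 'a1', 'name': 'Example Author'}
```

Scalar fields stay on the node as properties, while the nested Author is lifted out into its own node and connected by a single edge.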
Relationship syntax options
Built-in DataPoint types
Cognee ships with several built-in DataPoint types:
- Documents: wrappers for source files (Text, PDF, Audio, Image)
  - Document (metadata.index_fields=["name"])
- Chunks: segmented portions of documents
  - DocumentChunk (metadata.index_fields=["text"])
- Summaries: generated text or code summaries
  - TextSummary / CodeSummary (metadata.index_fields=["text"])
- Entities: named objects (people, places, concepts)
  - Entity, EntityType (metadata.index_fields=["name"])
- Edges: relationships between DataPoints
  - Edge: links between DataPoints
Example: custom DataPoint with best practices
- Keep it small: one concept per DataPoint
- Index carefully: only fields that matter for semantic search
- Use built-in types first: extend with custom subclasses when needed
- Version deliberately: track changes with version
- Group related points: with belongs_to_set
Tasks
Learn how DataPoints are created and processed
Pipelines
See how DataPoints flow through processing workflows
Main Operations
Understand how DataPoints are used in Add, Cognify, and Search