

Overview

Context engineering is the process of turning unstructured documents into structured, queryable knowledge that AI agents can reason over. A raw PDF, scanned invoice, or legal contract is opaque to a language model: it needs to be parsed, decomposed, and indexed before an agent can use it to answer questions accurately.

Meibel provides a complete pipeline from raw document to agent-ready context. Documents go through parsing (OCR, layout analysis, table extraction), structure detection (sections, headers, relationships), and data element extraction (atomic chunks with metadata). The output is a searchable knowledge base that agents query at inference time.

This pipeline is the foundation of everything else in the platform. Without high-quality context, agents produce low-quality outputs: they hallucinate, miss relevant information, or cite the wrong source. Context engineering is where accuracy starts.

The Document Intelligence Pipeline

The pipeline has four stages, each preserving and adding information:
Document → Parse → Structure Detection → Data Elements → Knowledge Base
Parsing is the first stage: the system reads the raw file and extracts text, images, and layout information. For scanned documents, this means OCR. For digital PDFs, it means extracting the text layer directly. For complex formats, it means interpreting embedded tables, headers, footers, and page structure.

Structure detection identifies the logical organization of the document. Where are the section boundaries? Which text is a heading versus body content? Are there lists, tables, or nested structures? This stage produces a document outline that preserves the author's intended hierarchy.

Data element extraction breaks the document into atomic units of knowledge. Each data element is a chunk of content, such as a paragraph, a table row, or a section, with metadata about where it came from, what type of content it is, and how confidently it was extracted. These are the units that agents search over.

Knowledge base storage indexes the data elements for fast retrieval. When an agent needs context to answer a question, it searches the knowledge base and retrieves the most relevant data elements, along with their metadata and provenance.
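To make the data flow through the four stages concrete, here is a minimal sketch of the pipeline as plain functions. Every function name and heuristic here is illustrative only; the platform performs these stages internally and exposes them through its SDK, not as these functions:

```python
# Conceptual sketch of the four-stage pipeline. All names and
# heuristics are illustrative, not the platform's actual internals.

def parse(raw_bytes: bytes) -> dict:
    # Stage 1: extract text plus layout information (OCR for scans,
    # direct text-layer extraction for digital PDFs).
    return {"text": raw_bytes.decode("utf-8"), "layout": []}

def detect_structure(parsed: dict) -> dict:
    # Stage 2: separate headings from body text to build an outline.
    lines = parsed["text"].splitlines()
    outline = [ln for ln in lines if ln.isupper()]  # toy heading rule
    return {**parsed, "outline": outline}

def extract_elements(doc: dict) -> list[dict]:
    # Stage 3: break the document into atomic chunks with metadata.
    return [
        {"content": ln, "type": "header" if ln in doc["outline"] else "paragraph"}
        for ln in doc["text"].splitlines()
        if ln.strip()
    ]

def index(elements: list[dict]) -> dict[str, list[dict]]:
    # Stage 4: index elements for retrieval (here, naively by type).
    kb: dict[str, list[dict]] = {}
    for el in elements:
        kb.setdefault(el["type"], []).append(el)
    return kb

doc = b"TERMS\nEither party may terminate."
kb = index(extract_elements(detect_structure(parse(doc))))
print(sorted(kb))  # ['header', 'paragraph']
```

Note how each stage preserves what the previous one produced and adds a new layer of information, which is the property the real pipeline guarantees.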

Adaptive Ingest

Not all documents are the same, and the platform adapts its parsing strategy based on what it encounters. A scanned invoice needs OCR to read the text, table extraction to identify line items, and layout analysis to distinguish headers from values. The system detects that the document is an image-based scan and routes it through the appropriate pipeline. A text-heavy legal contract needs section boundary detection to identify clauses, party references, and defined terms. OCR is unnecessary because the text layer is already present, but structural analysis is critical. A scientific paper may require formula handling, citation extraction, and figure caption parsing. The system identifies the academic format and applies specialized extractors.
You don’t need to specify the document type manually. The platform inspects the file format and content to select the right processing strategy automatically.
This adaptive approach means you can upload a mixed corpus — invoices alongside contracts alongside technical reports — and each document receives appropriate treatment.
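The routing behavior described above can be pictured as a dispatcher that classifies each document and selects a processing strategy. The detection heuristics and pipeline names below are invented for exposition; the platform does this automatically and its internal classifier is far more sophisticated:

```python
# Illustrative sketch of adaptive ingest routing. Heuristics and
# pipeline names are invented for exposition only.

def classify(file_name: str, has_text_layer: bool) -> str:
    # Toy detection; real detection inspects format and content.
    if not has_text_layer:
        return "scan"       # image-based: needs OCR
    if "contract" in file_name:
        return "legal"      # text-heavy: clause and term detection
    if "paper" in file_name:
        return "academic"   # formulas, citations, figure captions
    return "generic"

PIPELINES = {
    "scan": ["ocr", "table_extraction", "layout_analysis"],
    "legal": ["section_boundaries", "defined_terms"],
    "academic": ["formula_handling", "citation_extraction", "caption_parsing"],
    "generic": ["text_extraction"],
}

def route(file_name: str, has_text_layer: bool) -> list[str]:
    return PIPELINES[classify(file_name, has_text_layer)]

print(route("invoice_0042.pdf", has_text_layer=False))
# ['ocr', 'table_extraction', 'layout_analysis']
```

The key point survives the simplification: the same ingest entry point yields different processing for different document types, so a mixed corpus needs no per-file configuration.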

Datasources as Context Containers

Datasources are organizational containers that group related documents into a single queryable knowledge base. Think of a datasource as a folder with intelligence: it doesn’t just store files, it processes them into searchable knowledge. The workflow is:
  1. Create a datasource with a name and description
  2. Upload files to the datasource
  3. Trigger ingestion to process uploaded files through the document intelligence pipeline
  4. Query the datasource from agents or directly via the data elements API
When you bind a datasource to an agent, the agent searches that datasource’s data elements to find relevant context for each query. You can bind multiple datasources to a single agent, giving it access to different knowledge domains. Datasources also support table descriptions — you can annotate tables and columns with human-readable descriptions that help the agent understand the structure of tabular data within your documents.

Data Elements

Data elements are the atomic units of extracted knowledge. Each one represents a discrete piece of content pulled from a source document during ingestion. A data element carries:
  • Content — the actual text or structured data
  • Source provenance — which document it came from, where in the document
  • Content type — paragraph, table, list, header, etc.
  • Extraction confidence — how confident the system is in the accuracy of the extraction
  • Metadata — additional structured fields extracted by metadata models
Data elements can be listed, searched by semantic query, and individually inspected. When an agent cites a source in its response, it references specific data elements, giving you a traceable path from answer back to source document.
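The fields listed above can be modeled as a simple record. The field names below mirror the list, but the class itself is an illustrative sketch, not the SDK's actual data element type:

```python
from dataclasses import dataclass, field

# Illustrative model of a data element's shape; not the SDK's class.

@dataclass
class DataElement:
    content: str                  # the extracted text or structured data
    source_document: str          # provenance: originating file
    location: str                 # provenance: where in the document
    content_type: str             # paragraph, table, list, header, ...
    extraction_confidence: float  # 0.0 to 1.0
    metadata: dict = field(default_factory=dict)  # metadata-model fields

elements = [
    DataElement("Either party may terminate...", "contract.pdf", "sec 9.2",
                "paragraph", 0.97, {"clause": "termination"}),
    DataElement("$1,200.00", "invoice.pdf", "row 3", "table", 0.61),
]

# Provenance is what makes citations traceable: a cited element
# points straight back to its source document and location.
confident = [e for e in elements if e.extraction_confidence >= 0.9]
print([(e.source_document, e.location) for e in confident])
# [('contract.pdf', 'sec 9.2')]
```

Keeping confidence and provenance on every element is what lets you filter out shaky extractions and trace any agent answer back to a specific place in a specific file.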

Metadata Extraction

Beyond the content itself, you often need structured fields extracted from documents: dates, monetary amounts, party names, categories, status values. Metadata models handle this. A metadata model defines the structured fields you want to extract. When applied to a datasource, the platform runs the model against data elements and populates the specified fields. This adds queryable dimensions beyond full-text search — you can filter data elements by extracted date ranges, amounts, or categories. The metadata model catalog provides pre-built models for common extraction patterns, and you can define custom models for domain-specific fields.
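Conceptually, a metadata model maps free text to structured, filterable fields. The regex-based sketch below illustrates only the input/output shape of that mapping; the platform's metadata models are not regexes, and the function here is invented for exposition:

```python
import re
from datetime import date

# Conceptual sketch of metadata extraction: pull a date and a
# monetary amount out of a data element's text. Purely illustrative
# of the field-extraction idea, not how the platform's models work.

def extract_metadata(text: str) -> dict:
    fields: dict = {}
    m = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
    if m:
        fields["date"] = date(int(m[1]), int(m[2]), int(m[3]))
    m = re.search(r"\$([\d,]+\.\d{2})", text)
    if m:
        fields["amount"] = float(m[1].replace(",", ""))
    return fields

meta = extract_metadata("Invoice dated 2024-03-15, total due $1,250.00.")
print(meta)
```

Once such fields exist on data elements, queries gain dimensions beyond full-text search, such as "invoices over $1,000 from Q1."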

Putting It Together

Here is the full pipeline in code: parse a document, create a datasource, upload, trigger ingestion, and search the extracted data elements.
from meibel import MeibelClient
from meibel.models import (
    ConnectorConfig,
    CreateDatasourceRequest,
    DataElementSearchRequest,
)
import os

client = MeibelClient(api_key=os.environ["MEIBEL_API_KEY"])

# Parse a document
with open("contract.pdf", "rb") as f:
    result = client.documents.process_document(file=f, file_name="contract.pdf")
print(f"Parsed: {result}")

# Create a datasource and upload
ds = client.datasources.create_datasource(
    body=CreateDatasourceRequest(
        name="Contracts",
        description="Legal contracts",
        connector=ConnectorConfig(type="managed"),
    )
)
with open("contract.pdf", "rb") as f:
    client.content.upload_content(
        datasource_id=ds.datasource_id, file=f, file_name="contract.pdf"
    )

# Trigger ingestion
client.content.trigger_ingest(datasource_id=ds.datasource_id)

# Search extracted data elements
results = client.data_elements.search_data_elements(
    datasource_id=ds.datasource_id,
    body=DataElementSearchRequest(regex_filter="termination"),
)
for elem in results.items:
    print(f"  {elem.name}: {elem.data_element_id}")
The process_document call runs parsing independently — useful for previewing extraction results before committing to a datasource. The datasource workflow (upload, ingest, search) is the standard path for building a persistent, queryable knowledge base.