
Document Processing Example

Learn how to build a document processing pipeline with Meibel AI that ingests your documents, extracts structured information from them, answers questions about them, and generates summaries.

Overview

This example demonstrates:
  • Ingesting multiple document types
  • Extracting structured information
  • Answering questions about documents
  • Generating summaries

Setup

from meibelai import Meibelai
import os

client = Meibelai(
    api_key_header=os.getenv("MEIBELAI_API_KEY_HEADER")
)
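
Note that os.getenv returns None when the variable is unset, so the client above would be built without a key. A minimal sketch of a fail-fast guard follows; `require_env` is a hypothetical helper written for this example and is not part of the Meibel AI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the setup call could pass api_key_header=require_env("MEIBELAI_API_KEY_HEADER") instead of calling os.getenv directly.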

Step 1: Create a Document Datasource

# Create a datasource for documents
datasource = client.datasources.create(
    name="Company Documents",
    description="Internal company documentation and policies"
)
datasource_id = datasource.id

Step 2: Ingest Documents

# Add a policy document
policy_doc = client.dataelements.create(
    datasource_id=datasource_id,
    name="Employee Handbook",
    content="""
    Employee Handbook - Version 2024
    
    1. Work Hours: Standard hours are 9 AM to 5 PM
    2. Remote Work: Employees can work remotely up to 3 days per week
    3. Time Off: 20 days PTO per year, plus holidays
    4. Benefits: Health, dental, vision, and 401k matching
    """,
    metadata={
        "type": "policy",
        "category": "hr",
        "version": "2024",
        "last_updated": "2024-01-15"
    }
)

# Add more documents
contract_template = client.dataelements.create(
    datasource_id=datasource_id,
    name="Service Agreement Template",
    content="Standard service agreement template content...",
    metadata={
        "type": "template",
        "category": "legal"
    }
)

Step 3: Query Documents

# Ask questions about the documents
response = client.rag.chat(
    messages=[
        {"role": "user", "content": "What is our remote work policy?"}
    ],
    datasource_ids=[datasource_id],
    confidence_threshold=0.8
)

print(f"Answer: {response.choices[0].message.content}")
print(f"Confidence: {response.confidence_score}")
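
This page does not say what the API returns when an answer falls below confidence_threshold, so it may be worth checking the score on the client side as well. The sketch below assumes confidence_score is a float between 0 and 1 (or missing); `select_answer` and `FALLBACK` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Hypothetical fallback text shown when the answer is not trusted.
FALLBACK = "I could not find a confident answer in the documents."


def select_answer(content: str, confidence: Optional[float],
                  threshold: float = 0.8) -> str:
    """Return the answer only when its confidence meets the threshold."""
    if confidence is None or confidence < threshold:
        return FALLBACK
    return content
```

It could be called as select_answer(response.choices[0].message.content, response.confidence_score).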

Step 4: Extract Structured Data

# Extract specific information
extraction_response = client.rag.chat(
    messages=[
        {
            "role": "system", 
            "content": "Extract all numerical values and policies as JSON"
        },
        {
            "role": "user", 
            "content": "List all employee benefits with details"
        }
    ],
    datasource_ids=[datasource_id],
    execution_control={
        "response_format": "json",
        "enable_tracing": True
    }
)
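
The request above asks for JSON, but the returned text still has to be parsed before use. A minimal sketch of defensive parsing follows, assuming the answer arrives as a string in choices[0].message.content (as in Step 3) and may be wrapped in a Markdown code fence; `parse_json_answer` is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code fence marker


def parse_json_answer(text: str) -> Optional[Any]:
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence.
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

A caller might use parse_json_answer(extraction_response.choices[0].message.content) and treat a None result as a signal to retry or log the raw text.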

Step 5: Generate Summaries

# Generate document summaries
summary = client.rag.chat(
    messages=[
        {
            "role": "user", 
            "content": "Provide a concise summary of our employee handbook"
        }
    ],
    datasource_ids=[datasource_id],
    execution_control={
        "max_tokens": 200,
        "temperature": 0.3
    }
)

Advanced Features

Batch Processing

# Process multiple documents
documents = ["doc1.txt", "doc2.pdf", "doc3.docx"]

for doc in documents:
    # read_document is a placeholder, not part of the SDK: implement it
    # to return each file's text based on its type
    content = read_document(doc)
    
    # Add to datasource
    client.dataelements.create(
        datasource_id=datasource_id,
        name=doc,
        content=content,
        metadata={"filename": doc}
    )
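
The loop above calls read_document, which is not defined in this example. One possible starting point, using only the standard library, is sketched below. It handles plain-text files and deliberately rejects other formats, since PDF and DOCX files need a dedicated parser of your choice. The TEXT_SUFFIXES set is an assumption made for illustration.

```python
from pathlib import Path

# Assumed set of extensions that can be read directly as UTF-8 text.
TEXT_SUFFIXES = {".txt", ".md"}


def read_document(path: str) -> str:
    """Return the text of a plain-text file; reject formats that need a parser."""
    suffix = Path(path).suffix.lower()
    if suffix not in TEXT_SUFFIXES:
        # Binary formats such as PDF and DOCX need a parsing library.
        raise NotImplementedError(f"No reader implemented for {suffix!r} files")
    return Path(path).read_text(encoding="utf-8")
```

With this version, the batch loop would raise NotImplementedError on "doc2.pdf" and "doc3.docx" until readers for those formats are added.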

Document Comparison

# Compare two versions of a document (assumes both the 2023 and 2024
# handbooks have been ingested into the datasource)
comparison = client.rag.chat(
    messages=[
        {
            "role": "user",
            "content": "What changed between the 2023 and 2024 employee handbook?"
        }
    ],
    datasource_ids=[datasource_id],
    execution_control={
        "enable_tracing": True
    }
)

Best Practices

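  1. Metadata Usage: Attach descriptive metadata (such as type, category, and version) to each data element to improve retrieval
  2. Chunking: Split large documents into smaller pieces before ingesting them
  3. Version Control: Record each document's version in its metadata, as with the "version" field in Step 2
  4. Regular Updates: Re-ingest documents when they change so that answers reflect current content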
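
The chunking practice above can be sketched as a simple character-based splitter with overlap. This is one common approach, not necessarily how Meibel AI chunks content internally; `chunk_text` and its default sizes are assumptions chosen for illustration.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_chars, overlapping by `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            # The current chunk already reaches the end of the text.
            break
    return chunks
```

Each chunk could then be uploaded with client.dataelements.create as in Step 2, with the chunk's position recorded in the metadata under a key of your choosing.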

Next Steps