Document Processing

The document processing API extracts structured content from uploaded files. You can process documents asynchronously (submit a job, poll for results) or synchronously (block until done). This guide covers both workflows, plus streaming trace events for real-time progress.

Parse a document (async)

Submit a document for asynchronous parsing. The API returns a job ID immediately so your application stays responsive while the server processes the file.

import os
from meibel import MeibelClient

client = MeibelClient(api_key=os.environ["MEIBEL_API_KEY"])

with open("contract.pdf", "rb") as f:
    job = client.documents.parse_document(file=f, file_name="contract.pdf")

print(f"Job submitted: {job.job_id}")

Keep the job_id from the response. You need it to check status, retrieve results, and stream trace events.

Poll for status

Check the processing status of a submitted document job. Poll until the status reaches "completed" or "failed".

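import time

# Reuses `client` and `job` from the previous step.
while True:
    status = client.documents.get_document_status(job_id=job.job_id)
    print(f"Status: {status.status}")

    # "completed" and "failed" are terminal states, so stop polling.
    if status.status == "completed":
        print("Processing finished")
        break
    elif status.status == "failed":
        print(f"Processing failed: {status.error}")
        break

    # Wait before the next status check.
    time.sleep(2)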

A 2-second polling interval is recommended. For long-running jobs, consider using the streaming trace endpoint instead.
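
The loop above runs until the job reaches a terminal state, so a stuck job would keep it polling forever. A minimal sketch of a bounded wait follows. The helper name wait_for_job, its parameters, and the 600-second default timeout are illustrative assumptions and are not part of the SDK. It takes any callable that returns the current status string, so it does not depend on the client.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 600.0,  # assumed default, not an API limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns "completed" or "failed".

    Raises TimeoutError if another wait would run past the deadline.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        # Give up rather than sleeping beyond the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        sleep(interval)
```

With the client from the earlier steps, this could be called as wait_for_job(lambda: client.documents.get_document_status(job_id=job.job_id).status). The sleep and clock parameters exist only so the helper can be exercised without real delays.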

Get results

Once processing is complete, retrieve the extracted content in markdown or structured JSON format.

# Get results as markdown
markdown_result = client.documents.get_document_result(
    job_id=job.job_id,
    format="markdown",
)
print(markdown_result.content)

# Get results as structured JSON
json_result = client.documents.get_document_result(
    job_id=job.job_id,
    format="json",
)
print(json_result.content)

The markdown format returns a clean, readable representation of the document. The json format returns structured data including headings, tables, and extracted metadata.

Process synchronously

For smaller documents where you want the result in a single call, use the synchronous endpoint. It blocks until processing completes and returns the result directly.

with open("invoice.pdf", "rb") as f:
    result = client.documents.process_document(file=f, file_name="invoice.pdf")

print(result.content)

The synchronous endpoint is best for small files (under 10 MB). For larger documents, use the async workflow with polling or trace streaming.
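
The size guidance above can be turned into a small routing check. This is a sketch under stated assumptions: the names SYNC_LIMIT_BYTES, choose_workflow, and choose_workflow_for_file are hypothetical, and "10 MB" is read here as 10 * 1024 * 1024 bytes, which the guide does not specify.

```python
import os

# Assumption: 10 MB means 10 * 1024 * 1024 bytes. Adjust if the API counts decimal megabytes.
SYNC_LIMIT_BYTES = 10 * 1024 * 1024


def choose_workflow(size_bytes: int, limit_bytes: int = SYNC_LIMIT_BYTES) -> str:
    """Return "sync" for files under the limit, otherwise "async"."""
    return "sync" if size_bytes < limit_bytes else "async"


def choose_workflow_for_file(path: str) -> str:
    """Look up the file size on disk and pick a workflow for it."""
    return choose_workflow(os.path.getsize(path))
```

A caller could use the "sync" result to call process_document and the "async" result to call parse_document followed by polling or trace streaming.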

List child documents

Some documents (e.g., archives, multi-part files) produce child documents during processing. List them by job ID.

children = client.documents.list_document_children(job_id=job.job_id)

for child in children:
    print(f"{child.file_name}: {child.status}")

Stream trace events

Stream real-time processing events for a document job. Trace events provide fine-grained progress updates such as page extraction, OCR steps, and content classification.

for event in client.documents.stream_document_trace(job_id=job.job_id):
    print(f"[{event.type}] {event.message}")

Trace events are delivered as Server-Sent Events (SSE). Each event includes a type (e.g., "progress", "page_extracted", "complete") and a human-readable message.
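
The SDK call above presumably handles the SSE framing for you. For readers consuming the stream without the SDK, the following sketch shows how the standard SSE wire format maps to events: "event:" and "data:" fields, with a blank line ending each event. The function name parse_sse and the sample messages are illustrative assumptions, and the actual payload of a trace event may differ.

```python
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event, with "event" and "data" keys."""
    event_type = "message"  # SSE default when no "event:" field is sent
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)


# Hypothetical stream contents, for illustration only.
stream = [
    "event: progress\n",
    "data: Extracting page 1\n",
    "\n",
    ": keep-alive\n",
    "event: complete\n",
    "data: Done\n",
    "\n",
]
events = list(parse_sse(stream))
```

Each yielded dict corresponds to one trace event, with the "event" value matching the type described above.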
