
Overview

Most AI platforms give you a single confidence score on the model’s output. That number tells you how confident the model is in its response — but it says nothing about the quality of the data that response was built on. Meibel is different. Because the platform owns the full stack — document parsing, data element extraction, and model inference — confidence scores are produced at every stage and compose end-to-end. A poorly parsed document reduces confidence all the way through to the final agent response, and you can see exactly where the degradation happened. This means you know not just “how confident is the model” but “how confident should you be in this entire answer, given everything that went into producing it.”

Document Parse Confidence

The first confidence score is produced at the parsing stage: how confident is the system that it correctly read and interpreted the source document? Several factors influence parse confidence:
  • OCR quality — for scanned documents, was the text readable? Blurry scans, handwritten text, and low-resolution images all reduce OCR confidence. A clean, high-resolution scan of typed text scores near 1.0; a faded photocopy of handwritten notes might score 0.4.
  • Structure detection — did the system correctly identify tables, sections, headers, and other structural elements? A well-formatted PDF with clear headings is easier to parse than a flat wall of text with no formatting.
  • Format recognition — did the system handle the file format properly? Standard PDFs and common image formats are well-supported. Unusual or corrupted files may parse with lower confidence.
Parse confidence is per-document. If you upload ten documents to a datasource, each gets its own parse confidence score. One bad scan doesn’t penalize the other nine.
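To make the per-document point concrete, here is a minimal sketch in plain Python; the `parse_confidence` field and the 0.5 floor are illustrative assumptions, not documented SDK fields:

```python
# Hypothetical per-document records; `parse_confidence` is illustrative,
# not a documented SDK field.
documents = [
    {"name": "contract.pdf", "parse_confidence": 0.97},    # clean typed scan
    {"name": "notes_scan.pdf", "parse_confidence": 0.41},  # faded handwriting
    {"name": "invoice.pdf", "parse_confidence": 0.88},
]

# One bad scan is flagged on its own; the other documents pass untouched.
needs_rescan = [d["name"] for d in documents if d["parse_confidence"] < 0.5]
print(needs_rescan)  # ['notes_scan.pdf']
```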

Data Element Extraction Confidence

After parsing, the system breaks the document into data elements — atomic chunks of content. Each extraction produces its own confidence score. Extraction confidence reflects:
  • Boundary accuracy — did the system correctly identify where this data element starts and ends? Clean section breaks are easy; ambiguous paragraph boundaries in flowing text are harder.
  • Content type classification — is this a table, a paragraph, a list, a heading? Misclassifying a table as a paragraph means the structure is lost.
  • Metadata accuracy — for metadata extraction models, how confident is the system in the values it extracted? A clearly formatted date like “January 15, 2025” extracts with high confidence; an ambiguous reference like “next quarter” extracts with low confidence.
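One way to picture an extracted element and its score, as a sketch rather than the platform's actual internal representation (the `DataElement` class and the min-of-signals rule are assumptions made for illustration):

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    content: str
    element_type: str        # e.g. "table", "paragraph", "list", "heading"
    boundary_confidence: float
    type_confidence: float

    @property
    def extraction_confidence(self) -> float:
        # Illustrative rule: the weakest signal caps the element's score.
        return min(self.boundary_confidence, self.type_confidence)

clause = DataElement("Termination requires 30 days notice.", "paragraph", 0.92, 0.98)
print(clause.extraction_confidence)  # 0.92
```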

Model Output Confidence

The final stage is the language model’s confidence in its response. This is informed by the quality of context it received. When an agent searches its datasources for relevant context, it retrieves data elements. If those data elements were extracted with high confidence from well-parsed documents, the model has a solid foundation. If the data elements came from noisy OCR with uncertain boundaries, the model is working with degraded input — and the confidence score reflects this. Model output confidence also accounts for:
  • Context relevance — did the retrieved data elements actually address the user’s question, or were they tangential?
  • Consistency — do the retrieved data elements agree with each other, or do they contain contradictory information?
  • Coverage — did the agent find enough relevant context, or is it extrapolating from sparse evidence?
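As a toy illustration of how degraded upstream input drags down the final signal (this weighted heuristic is an assumption for the example, not Meibel's scoring formula):

```python
# Each retrieved element carries a relevance score and the confidence
# propagated from parsing/extraction. A highly relevant element from a
# noisy OCR source still weakens the overall context quality.
retrieved = [
    {"relevance": 0.9, "propagated_confidence": 0.95},
    {"relevance": 0.8, "propagated_confidence": 0.40},  # noisy OCR source
]

context_quality = sum(
    r["relevance"] * r["propagated_confidence"] for r in retrieved
) / len(retrieved)
print(context_quality)
```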

Score Stacking

Confidence scores compose multiplicatively across the pipeline. This is the key insight: each stage’s confidence attenuates the scores downstream. Consider a concrete example:
Stage                     Confidence   Effective Confidence
Document parse            0.70         0.70
Data element extraction   0.90         0.63 (0.70 × 0.90)
Model output              0.95         0.60 (0.63 × 0.95)
The model is 95% confident in its response — but the effective confidence is only 60% because the source document parsed poorly. Without pipeline-wide scoring, you would see 0.95 and trust the answer. With score stacking, you see 0.60 and know to verify.

When an agent cites a specific data element in its response, the citation carries the propagated confidence for that element. If the agent uses multiple sources, you can see which citations are high-confidence and which are not.
A high model confidence score alone does not mean the answer is trustworthy. Always check the effective (stacked) confidence, which accounts for upstream data quality. A model can be confidently wrong if it was given confidently extracted but poorly parsed input.
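The stacking arithmetic itself is simple enough to sketch in plain Python (a generic illustration, not an SDK call):

```python
from math import prod

def effective_confidence(stage_scores: list[float]) -> float:
    """Stack per-stage confidences multiplicatively across the pipeline."""
    return prod(stage_scores)

# The stages from the table above: parse, extraction, model output.
stages = [0.70, 0.90, 0.95]
print(round(effective_confidence(stages), 2))  # 0.6
```

Because every factor is at most 1.0, effective confidence can never exceed the weakest stage's contribution: a perfect model score cannot repair a bad parse.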

Using Confidence in Practice

Accessing Confidence Scores

The confidence scoring API lets you track scoring jobs and get aggregate summaries:
```python
from meibel import MeibelClient
import os

client = MeibelClient(api_key=os.environ["MEIBEL_API_KEY"])

# Get confidence scoring summary
summary = client.confidence_scoring.get_confidence_scoring_summary()
print(f"Overall confidence: {summary}")

# Check confidence on individual scoring jobs
jobs = client.confidence_scoring.list_confidence_scoring_jobs()
for job in jobs.data:
    print(f"Job {job.job_id}: {job.status}")
```

Chat Response Quality Signals

Agent chat responses include data you can use to assess the quality of each answer:
```python
from meibel.models import ChatMessageRequest

# Assumes `client` from the previous example and an existing agent session.
response = client.agents.sessions.send_chat_message(
    session_id=session.session_id,
    body=ChatMessageRequest(
        user_message="What is the termination clause?",
        include_tool_activity=True,
    ),
)

# Check sources and token usage
print(f"Sources: {response.response.sources}")
print(f"Token usage: {response.token_usage}")
```

Setting Thresholds

How you use confidence scores depends on your use case:
  • Effective confidence above 0.8 — safe for automated decisions in most contexts. The source data was well-parsed and the model is confident.
  • Effective confidence 0.5 to 0.8 — present the answer to the user but flag it for potential review. Something in the pipeline was uncertain.
  • Effective confidence below 0.5 — route to human review. Either the source document was poorly parsed, the extraction was uncertain, or the model lacked sufficient context.
Start with conservative thresholds and relax them as you build confidence in your document quality and agent configuration. It is easier to loosen thresholds than to recover from bad automated decisions.
These thresholds are not universal — a medical application should set them higher than an internal FAQ bot. The key is that pipeline-wide confidence gives you the information to make that call.
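A minimal sketch of routing on these tiers, assuming you already have an effective (stacked) confidence value in hand; the function and tier names are illustrative, and the floors should be tuned per use case:

```python
def route_by_confidence(effective: float,
                        automate_floor: float = 0.8,
                        review_floor: float = 0.5) -> str:
    """Map effective (stacked) confidence to an action tier."""
    if effective >= automate_floor:
        return "automate"          # safe for automated decisions
    if effective >= review_floor:
        return "flag_for_review"   # present, but mark for potential review
    return "human_review"          # route to a human before acting

print(route_by_confidence(0.60))  # flag_for_review
```

Raising `automate_floor` for higher-stakes domains (the medical example above) is a one-line change, which makes it easy to start conservative and relax later.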