
Building a compound AI system for invoice management automation in Databricks: Architecture and TCO considerations

Posted November 3, 2025 · 16 min read

Financial services organizations process millions of invoices monthly, with manual invoice reconciliation taking an average of 9.7 days per invoice and error rates reaching 12%.

For enterprises generating thousands of invoices monthly, these inefficiencies magnify into significant operational costs and risks:

– Vendor relationship damage from delayed payments

– Compliance exposure from manual errors

– Missed revenue and productivity from staff time diverted to manual work 

– Growth constraints from non-scalable processes and fragmented tooling

Industry research indicates that automation is a practical lever for the finance sector. 

According to McKinsey data, automation can help finance teams reach over 90% straight-through processing rates, compared to the current 50% industry average.

Deloitte reports that automated reconciliation reduces errors by 75% and accelerates financial close by 2-4 days. 

That said, traditional automation approaches, such as rules-based systems and simple AI tools, struggle with complex invoice processing cases like overpayments and invoice-to-receipt mismatches.

In these cases, a network of specialized AI agents, controlling every step and catching edge cases, outperforms ‘vanilla automation’. Compound systems are more accurate (66% vs. 55% for single agents) and score higher on reasoning benchmarks (3.6 vs. 3.05).

However, orchestration comes with latency and infrastructure cost challenges. In the same comparison, single agents produced outputs in 61 seconds, whereas compound systems needed 325 seconds. 

To demonstrate how to build and optimize compound AI systems for invoice reconciliation on the Databricks Data Intelligence Platform, we’ll share architectural decisions, cost optimization strategies, and performance outcomes from a production implementation that reduced processing time from days to minutes while maintaining enterprise-grade governance and auditability.

Why Databricks for a compound AI system 

Our multi-agent invoice reconciliation system runs on Databricks for several practical reasons. 

  1. Purpose-built agent tooling. Databricks’ Mosaic AI Agent Framework and Agent Evaluation provide native support for multi-agent orchestration with built-in testing capabilities. This eliminates the complexity of integrating multiple third-party tools and enables systematic evaluation of agent performance across the entire workflow.

  2. Reliable retrieval on unstructured data. Databricks Vector Search is optimized for unstructured content, which is particularly important because most invoices arrive as PDFs. Accurate retrieval was crucial for matching invoices, receipts, and exceptions without relying on brittle heuristics.

  3. Enterprise governance and lineage. Unity Catalog provides attribute-based access control and automatic data lineage tracking across all agents and datasets. For financial services organizations, this built-in governance eliminates the need for custom audit trail implementations.

  4. Unified platform architecture. Rather than stitching together separate tools for data ingestion, model serving, workflow orchestration, and monitoring, Databricks provides these capabilities within a single platform. This reduces integration complexity, minimizes data movement costs, and simplifies troubleshooting across the entire compound AI pipeline.

Compound AI delivers value only when data, orchestration, and governance live in one place. On a unified platform like Databricks, shipping use cases like invoice reconciliation, exception handling, and compliance reporting is faster and has fewer moving parts. The scalability and robust capabilities help turn prototypes into reliable enterprise outcomes. 

Dmitry Sverdlik, CEO, Xenoss

Architecture and cost optimization for compound AI reconciliation

Building compound AI systems requires careful architectural decisions and cost management strategies. 

Each agent in our reconciliation pipeline was designed with specific performance and economic constraints in mind.

Data ingestion

The primary challenge in invoice reconciliation involves processing diverse, high-volume data sources, including invoices, purchase orders, statements, receipts, and vendor communications, all in multiple formats. 

To build a cost-effective ingestion pipeline, the engineering team prioritized:

  • Autoscaling on new arrivals to prevent idle compute from burning the budget.
  • Creating source-faithful, replayable raw copies for audit and replay scenarios.
  • Capturing rich metadata (sender, system of origin, timestamps, checksums).
  • Tolerating schema drift (new columns, attachment types, EDI segments) without outages.
  • Exposing stable data contracts for downstream agent consumption.
  • Preserving lineage and access control that auditors and contractors can navigate.

Data ingestion with the Databricks ecosystem

We built a data ingestion pipeline in Databricks to collect invoice data from multiple sources.

Our invoice ingestion pipeline leverages Databricks Workflows, Auto Loader, and DLT to automatically collect, process, and store data from multiple sources with built-in error handling and schema management.

Workflows run on a 30-minute schedule and fire in response to event triggers (file arrival).

Parallel Workflows tasks poll each data source: Gmail invoice mailboxes, SFTP servers, ERP export APIs, and vendor portals. A coordinating Workflow standardizes error handling, and successful uploads trigger the incremental load.

Auto Loader ingests new objects incrementally into Delta tables, maintains checkpoints, and handles schema inference and evolution automatically.

A Bronze layer keeps a verbatim, defensible record with complete metadata. 

Delta Live Tables (DLT) enforces deduplication and constraints to ensure downstream agents receive clean data without duplicates.
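Under stated assumptions, the Bronze step can be sketched as a small DLT pipeline. The landing path, checksum-based deduplication, and table names below are illustrative rather than the production implementation, and the code assumes a Databricks environment where `spark` is predefined (the `dlt` module is only available inside DLT pipelines):

```python
# Minimal sketch of Bronze ingestion for raw invoice files (hypothetical paths/names)
import dlt
from pyspark.sql import functions as F

LANDING_PATH = "/Volumes/finance/ap/landing/invoices"   # hypothetical landing location

@dlt.table(comment="Verbatim Bronze copy of raw invoice files with ingestion metadata")
def invoices_bronze():
    return (
        spark.readStream.format("cloudFiles")            # Auto Loader
        .option("cloudFiles.format", "binaryFile")       # keep source-faithful bytes for audit/replay
        .load(LANDING_PATH)
        .withColumn("ingested_at", F.current_timestamp())
        .withColumn("checksum", F.sha2(F.col("content"), 256))
    )

@dlt.table(comment="Deduplicated records handed to downstream agents")
@dlt.expect_or_drop("has_checksum", "checksum IS NOT NULL")
def invoices_bronze_dedup():
    # Constraint plus checksum-based dedup so downstream agents never see duplicates
    return dlt.read_stream("invoices_bronze").dropDuplicates(["checksum"])
```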

TCO considerations for the Databricks ingestion setup

Our key TCO consideration was minimizing waste from upstream volatility by stopping DBU churn from failed retries and cutting per-request Model Serving calls on non-actionable payloads.

We were looking for ways to profile cost hot spots (retry storms, reprocessing, unnecessary inference) and redesign the ingestion path to filter inputs early and only escalate clean, schema-vetted data. 

With that in mind, the engineering team implemented a few architectural safeguards.

Adopting a “rescue first, promote later” approach to schema evolution. Unexpected changes in vendor exports and EDI can disrupt ingestion jobs, resulting in a series of failed retries that burn DBUs and add reprocessing costs.

To avoid this, route unknown attributes to the Auto Loader’s rescued data column, and then run a “schema steward” task to inspect and approve the rescued fields. 
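A hedged sketch of that pattern for structured exports (ERP/EDI feeds); the paths, table names, and schema location are placeholders:

```python
# "Rescue first, promote later": ingest with rescue mode, review rescued fields offline
from pyspark.sql import functions as F

exports = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/finance/ap/_schemas/erp_exports")
    .option("cloudFiles.schemaEvolutionMode", "rescue")   # unknown attributes land in _rescued_data
    .load("/Volumes/finance/ap/landing/erp_exports")
)

# Separate "schema steward" job: surface rescued fields for human review before promotion
rescued_fields = (
    spark.read.table("finance.ap.erp_exports_bronze")     # hypothetical Bronze table
    .where(F.col("_rescued_data").isNotNull())
    .select(F.explode(F.map_keys(F.from_json("_rescued_data", "map<string,string>"))).alias("field"))
    .groupBy("field").count().orderBy(F.desc("count"))
)
rescued_fields.show(truncate=False)
```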

To prevent non-invoices from passing down the pipeline, we set up microfilters before handing tasks over to the capture agent: a Workflows task applies MIME allowlists, size thresholds, and filename heuristics to discard logos and signatures and pass on only elements that look like invoices.
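An illustrative version of such a microfilter (the thresholds, MIME types, and name tokens are assumptions, not the production values):

```python
# Cheap pre-filter applied in a Workflows task before any model is called
ALLOWED_MIME = {"application/pdf", "image/tiff", "application/edi-x12", "text/xml"}
MIN_BYTES, MAX_BYTES = 10_000, 25_000_000          # tiny logos/signatures and oversized dumps are skipped
SUSPECT_NAMES = ("logo", "signature", "banner")

def looks_like_invoice(mime_type: str, size_bytes: int, filename: str) -> bool:
    if mime_type not in ALLOWED_MIME:
        return False
    if not (MIN_BYTES <= size_bytes <= MAX_BYTES):
        return False
    return not any(token in filename.lower() for token in SUSPECT_NAMES)

# Example: filter a batch of attachment descriptors collected by the polling tasks
attachments = [
    {"mime": "image/png", "size": 4_200, "name": "email-logo.png"},
    {"mime": "application/pdf", "size": 180_000, "name": "INV-10023.pdf"},
]
invoices = [a for a in attachments if looks_like_invoice(a["mime"], a["size"], a["name"])]
```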

These tweaks created significant compound savings on Model Serving costs, which are calculated per request. 

Business outcomes

The optimized ingestion pipeline delivered measurable improvements across key performance indicators.

Combining time-based scheduling with event-driven processing reduced time-to-post from 9 to 4 days. A robust metadata layer with stable data contracts minimized duplicate records passed to downstream agents, increasing straight-through processing by 12%.

Auto Loader checkpoints that reduce idle compute consumption decreased DBU usage per 1,000 processed records by 27%.

Pre-filtering non-invoice content through MIME validation, file size thresholds, and filename heuristics reduced unnecessary processing overhead for downstream AI models by 40% at current data volumes.

Step 1. Invoice capture

Invoice capture represents the highest-risk component of the reconciliation pipeline. Errors here cascade through all downstream agents, making accuracy, scalability, and reliable deployment practices critical for system performance.

The Capture agent processes invoice documents using specialized OCR and extraction models trained on financial document formats. When confidence scores fall below predefined thresholds (typically 85% for critical fields like amounts and vendor information), the system automatically routes invoices to human reviewers with specific guidance on required validation.

The capture process handles diverse input formats (PDFs, scanned images, photos, and EDI files) through a multi-stage pipeline: document classification, OCR processing, field extraction, and line-item parsing. This multi-modal approach ensures consistent data extraction regardless of how vendors submit their invoices.

Databricks tools supporting the Capture agent

Using MLflow Model Registry, we created an agent that checks ingested invoice data.

Serverless Model Serving provides low-latency document processing that scales automatically with invoice volume while avoiding “always-on” compute costs. The autoscaling endpoints ramp up resources when new invoice batches arrive and scale down during idle periods.

MLflow Model Registry versions every change (OCR parameters, fine-tuned extractors, next-gen models) and allows engineers to promote or revert after accuracy/calibration review, so iteration never jeopardizes operations. MLflow enables cohort-specific models that route invoices to pipelines optimized for specific vendor formats (e.g., non-standard document layouts or complex multi-page invoices). 
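For illustration, alias-based promotion and rollback in MLflow might look like the following; the model name and version numbers are placeholders:

```python
# Promote or revert a capture model by repointing an alias (hypothetical model/versions)
import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")                 # Unity Catalog-backed model registry
client = MlflowClient()
MODEL = "finance.ap.invoice_capture_extractor"           # hypothetical registered model name

# Promote version 7 to "champion" after the accuracy/calibration review passes...
client.set_registered_model_alias(name=MODEL, alias="champion", version="7")

# ...and roll back by pointing the alias at the previously approved version
client.set_registered_model_alias(name=MODEL, alias="champion", version="6")
```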

Delta Live Tables with Expectations reads capture outputs, materializes silver tables, and enforces type, range, semantic, and referential checks. 

Records that pass the data quality check flow straight to Normalization and Matching. Records that fail land in a quarantine table with machine-readable reasons and flagged low-confidence fields, which automatically create human-in-the-loop tasks (e.g., “Low confidence regarding invoice_total”).
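A minimal sketch of this silver/quarantine split with DLT Expectations, assuming hypothetical upstream table and column names emitted by the capture agent:

```python
import dlt
from pyspark.sql import functions as F

RULES = {
    "valid_total": "invoice_total IS NOT NULL AND invoice_total > 0",
    "valid_vendor": "vendor_id IS NOT NULL",
    "confident_extraction": "min_field_confidence >= 0.85",   # assumed confidence column
}

@dlt.table(comment="Capture outputs that passed all quality checks")
@dlt.expect_all_or_drop(RULES)
def invoices_capture_silver():
    return dlt.read_stream("capture_output")                  # hypothetical upstream table

@dlt.table(comment="Quarantined records with machine-readable reasons for HITL review")
def invoices_capture_quarantine():
    failed_any = " OR ".join(f"NOT ({rule})" for rule in RULES.values())
    return (
        dlt.read_stream("capture_output")
        .where(F.expr(failed_any))
        .withColumn(
            "quarantine_reason",
            F.when(F.col("min_field_confidence") < 0.85, F.lit("Low confidence on extracted fields"))
             .otherwise(F.lit("Failed validation rule")),
        )
    )
```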

This architecture delivers a capture layer that stays fast under load, aligns spend with demand, and produces auditable, high-quality inputs for the rest of the reconciliation workflow.

TCO considerations for building an invoice capture agent in Databricks

For data capture, we focused on squeezing down inference spend per document: avoiding unnecessary model calls, cutting re-runs, and keeping GPU/DBU usage predictable under bursty loads.

Monitor budgets with per-endpoint cost attribution. To keep infrastructure costs lean, our engineering team tracked DBU spend, QPS, and latency per serving endpoint, using tags mapped to teams and suppliers. Instant detection of overloaded endpoints prevented multi-day cost overruns.

Set rate limits for OCR endpoints. We added QPS ceilings per user to flatten activity bursts, reduce the financial burden of load tests or agent storms, and keep infrastructure spend predictable. 

Use tiered model routing by directing standard invoice formats to lightweight general models while routing complex or non-standard formats to specialized vendor-specific models (see the sketch below). This reduced per-invoice inference costs because the majority of invoices use “cheap” compute, while high-accuracy endpoints were only called on demand.

Prevent small file writes. Tuning batch sizes and trigger intervals prevents the extractor from creating small files that increase metadata overhead and read I/O for every downstream agent. Larger files reduce DBU consumption and improve query performance.
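The tiered routing described above can be sketched roughly as follows; the endpoint names, the complexity lookup, and the payload shape are assumptions rather than the production setup:

```python
# Route standard layouts to a cheap endpoint, complex vendors to a specialized one
from mlflow.deployments import get_deploy_client

deploy = get_deploy_client("databricks")
COMPLEX_VENDORS = {"vendor_0493", "vendor_1181"}        # e.g., multi-page or non-standard layouts

def extract_invoice(doc: dict) -> dict:
    endpoint = (
        "invoice-capture-specialized"                   # higher-accuracy, higher-cost model
        if doc["vendor_id"] in COMPLEX_VENDORS
        else "invoice-capture-base"                     # lightweight general model
    )
    return deploy.predict(endpoint=endpoint, inputs={"dataframe_records": [doc]})
```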

How AI-enabled invoice capture improved reconciliation outcomes

Cohort-specific models deployed through MLflow significantly improved extraction quality for critical fields (supplier data, dates, totals, and tax information), with validation error rates below 2%.

Setting up data quality checks in DLT Expectations improved confidence calibration, with expected calibration error (ECE) dropping from 0.12 to 0.05.

On a broader scale, an improved invoice capture pipeline helped cut total AP cycle time from 9 to 4 days thanks to serverless autoscaling endpoints, event and time triggers, and instant exception routing. 

Step 2. Data normalization 

The Normalization agent receives structured outputs like invoice headers, line items, confidence scores, and raw vendor identifiers from the Capture stage and transforms them into canonical business entities. 

This process involves standardizing currencies and amounts, applying tax logic, enforcing consistent units of measure, and mapping vendor strings or IDs to unified canonical entities.

Invoice normalization with Databricks 

The architecture of the invoice normalization agent we built in Databricks.

On Databricks, the pipeline runs in Delta Live Tables (DLT), where Expectations enforce quality checks before records move downstream. 

We express business logic in SQL for joins, windowing, aggregates, and invariants, and use PySpark when we need richer programmatic control, like complex conversions or jurisdiction-specific legal lookups.

Tax policy is centralized and governed by user-defined functions (UDFs). It’s a single source of truth that the Normalization agent calls to navigate rate tables, determine whether a jurisdiction is tax-inclusive, and apply the correct rounding mode. Because these UDFs are shared across pipelines, invoice totals are computed consistently regardless of source.
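The production system wraps this logic in governed UDFs; as a simplified stand-in, a shared helper with hypothetical rate and invoice tables illustrates the single-source-of-truth idea (assuming a Databricks context where `spark` is predefined):

```python
# One place where tax math lives: inclusive vs. exclusive handling and rounding
from pyspark.sql import functions as F

rates = spark.read.table("finance.tax.rates")            # jurisdiction, rate, tax_inclusive (hypothetical)
invoices = spark.read.table("finance.ap.invoices_silver")

def with_tax(df, rates_df):
    joined = df.join(F.broadcast(rates_df), on="jurisdiction", how="left")
    gross = (
        F.when(F.col("tax_inclusive"), F.col("net_total"))                     # amount already includes tax
         .otherwise(F.round(F.col("net_total") * (1 + F.col("rate")), 2))      # apply rate, fixed rounding mode
    )
    return joined.withColumn("gross_total", gross)

normalized = with_tax(invoices, rates)
```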

A recurring challenge is vendor identity drift across regions (e.g., “International Business Machines Corporation” vs. “IBM Italia S.p.A.”). VAT/tax IDs are the preferred deterministic keys, but in edge cases, they may be missing or corrupted. 

To increase recall without hard-coding name variants, we add a semantic layer using Mosaic AI Vector Search. The vector index is auto-synced with Delta tables and governed in Unity Catalog, and it can be queried using multiple signals (names, addresses, email domains, bank accounts).

TCO considerations for the Invoice normalization agent in Databricks

When building the agent, we had to watch out for wide joins, repeated passes over the same data, and costly external lookups that ballooned DBUs. 

We took three steps to prevent these issues and slash TCO for data normalization.

Implement incremental normalization. Rather than reprocessing all daily data, the agent only recomputes invoices with changed inputs from reviewer corrections or field updates. This change-aware approach reduces scanned bytes, minimizes downstream cache churn, and prevents Delta log bloat.

Use two-layered vendor validation: deterministic-first, semantic-later. The agent runs deterministic checks (exact matches on tax IDs or stable fields) before expensive semantic searches, since most vendor aliases resolve through simple matching. Vector search is reserved for failed deterministic lookups, with QPS caps and human-in-the-loop fallbacks to prevent repeated expensive queries (see the sketch after this list).

Move expensive checks offline. Keep inline validation narrow (type compliance, required fields, vendor ID checks). Run heavy or low-yield checks in separate daily jobs that write to dedicated tables rather than blocking hourly processes.
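A hedged sketch of the deterministic-first, semantic-later lookup; the endpoint, index, and table names are placeholders, and the response parsing assumes the Vector Search client's default result shape:

```python
# Resolve a vendor alias: exact tax-ID match first, vector search only as a fallback
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
vendor_index = vsc.get_index(
    endpoint_name="vendor-search",                       # hypothetical endpoint
    index_name="finance.vendors.canonical_index",        # hypothetical index
)

def resolve_vendor(tax_id: str | None, vendor_name: str) -> dict | None:
    # 1) Deterministic: an exact tax-ID match resolves the vast majority of aliases
    if tax_id:
        hit = spark.sql(
            "SELECT canonical_id FROM finance.vendors.master WHERE tax_id = :tax_id",
            args={"tax_id": tax_id},
        ).first()
        if hit:
            return {"canonical_id": hit["canonical_id"], "method": "deterministic"}

    # 2) Semantic fallback: only reached when deterministic keys are missing or corrupted
    results = vendor_index.similarity_search(
        query_text=vendor_name, columns=["canonical_id", "name"], num_results=3
    )
    rows = results.get("result", {}).get("data_array", [])
    return {"canonical_id": rows[0][0], "method": "semantic"} if rows else None   # None -> HITL task
```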

How a Normalization agent optimizes invoice reconciliation

Introducing an intelligent normalization agent helped reduce errors and increase straight-through processing (matching with no human oversight) by 12%.

Intelligent vendor aliasing cut false positives by 40% and brought vendor duplicates in master data down to 0.5% of records. Tax discrepancy defects dropped by 55% after the engineering team created a single source of truth for tax rates.

Step 3. Invoice data matching

The matching layer executes company policy deterministically, reacts to late-arriving receipts, and keeps an auditable trail, so most invoices are auto-approved, edge cases are surfaced with context, and only actual variances reach humans.

The Matching agent automates reconciliation by retrieving POs, receipts, and ERP entries. It evaluates every incoming invoice against the company’s policy, including two-way and three/four-way matching.

The Matching agent can yield three outcomes: 

  • Approved
  • Flagged for policy acceptance/review
  • Variance raised for human decision.

Data engineering toolset for invoice matching built with Databricks


On Databricks, policy is encoded as set-based SQL over Silver (normalized) Delta tables, making decisions transparent, scalable, and easy to audit. 

Workflows orchestrate the process in an event-driven way: a job fires only when a normalized invoice arrives in Silver, and listeners monitor receipt updates (since invoices often arrive first), automatically queuing items marked as awaiting receipts.

For real-time context in borderline cases, the platform connects to ERPs via native connectors where available and RPA bridges for legacy systems without APIs. 

This two-way link enables the agent to both retrieve fields needed for reconciliation and attach evidence (e.g., service acceptance documents) to the ERP record. 

As a result, a policy-driven matching process runs on change instead of a timer, minimizing reprocessing and keeping every decision traceable.

Databricks TCO considerations for building a reconciliation matching agent

We wanted to keep matching costs linear and predictable, which is why the engineers decided to compare only what changed today instead of rescanning entire ledgers.

We noticed that the biggest budget leaks came from reprocessing full tables, uneven join keys that cause expensive shuffles, and scoring lots of unlikely record pairs.

Here is how we fixed this problem and built a cost-effective reconciliation matching agent. 

Materialize open-receivable states. We converted window aggregations into O(1) lookups to reduce shuffle volume and executor memory usage. 

Set up an ERP/RPA evidence cache with TTL and batching. ERP and RPA connections are compute-intensive. Caching results reduced repeated reads, and batching kept per-call overhead under control.

Use persistent match bindings. We created an input hash for invoice lines and reused decisions from prior lines unless the input hash changed. When it did, engineers evaluated only the specific line and appended the new version to the existing records. 
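A simplified illustration of such persistent match bindings; table and column names are assumptions, and `spark` is assumed to be predefined:

```python
# Reuse prior match decisions when an invoice line's inputs have not changed
from pyspark.sql import functions as F

lines = spark.read.table("finance.ap.invoice_lines_silver")     # hypothetical normalized lines
prior = spark.read.table("finance.ap.match_decisions")          # line_hash, decision, decided_at

hashed = lines.withColumn(
    "line_hash",
    F.sha2(F.concat_ws("||", "invoice_id", "line_no", "sku", "qty", "unit_price", "po_id"), 256),
)

# Only lines with no prior decision for their hash are sent to the Matching agent;
# everything else reuses the stored outcome instead of being re-scored.
to_match = hashed.join(prior, on="line_hash", how="left_anti")
reused   = hashed.join(prior.select("line_hash", "decision"), on="line_hash", how="inner")
```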

How the Matching agent contributed to higher reconciliation efficiency 

Intelligent matching helped APs spend less time handling exceptions: 10 minutes on average compared to 28 minutes per invoice before the introduction of the new system. 

Infrastructure cost optimization techniques like persistent bindings reduced DBUs per 1,000 invoices by 25%. Evidence caching with TTL brought RPA reads per 1,000 invoices down by 30%.

Step 4. Variance resolution

In a variance workflow, which is policy-consistent and auditable by design, routine discrepancies are resolved automatically, reviewers see only well-contextualized edge cases, and each decision strengthens the system’s future reasoning.

The Variance resolution agent, notified about invoice discrepancies by the Matching agent, classifies the variance, explains the likely root cause, recommends (or executes) the proper fix, and leaves a complete audit trail.

How Databricks tools support an agent for variance resolution 

The data engineering tools we used to build the invoice variance resolution agent in Databricks.

On Databricks, the variance-resolution loop runs inside the Mosaic AI Agent Framework, where granular permissions, preconditions, and a traceable event log enforce policy before any action is taken. When the Matching agent flags a discrepancy, the Variance agent is invoked to investigate.

The agent first classifies the variance type (e.g., a price variance within a discretionary band) and reviews similar prior cases and outcomes, such as adjusted receipts, updated prices, blocked payments, or re-invoicing. It then recommends corrective actions by combining deterministic finance rules with patterns learned from previous resolutions. Low-impact fixes are executed automatically; higher-impact or ambiguous cases are routed for human review.

For human-in-the-loop reviewers, work is conducted in DBSQL/Lakeview dashboards that present each variance with its type, retrieved similar cases, deltas, and the system’s recommended next steps. After a decision is made (e.g., approving a correction or escalating to the buyer), the input is versioned and written back to the agent. 

The agent re-evaluates the outcome and records human choices to strengthen future recommendations, while the framework’s event log preserves an auditable trail end-to-end.

TCO considerations for building AI-enabled variance resolution in Databricks

Invoking high-performance models to address variance issues that could be solved deterministically would drive up TCO while paradoxically reducing resolution accuracy (LLMs are significantly less predictable than simple heuristics). 


That’s why we set up guardrails to make sure the agent only escalates variances to AI when deterministic rules can’t solve the problem. 

We had the agent auto-resolve repeated exceptions. Maintaining a list of recurring variance patterns and their outcomes helped detect similar exceptions and short-circuit them. 

This approach cuts the total number of Vector Search and LLM calls, simplifies the pipelines, and reduces human involvement in HITL validation. 

We adopted tiered reasoning to classify all detected issues. Simple variances are addressed through deterministic policy rules based on historical data. 

Only when these rules fail does an LLM-powered advisor agent step in. This approach conserves LLM calls and tokens, adds a layer of predictability to the system, and enables faster resolution of less complex variances. 
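Under stated assumptions, the tiered dispatch can be sketched as below; the tolerance band, table names, and the advisor endpoint are placeholders:

```python
# Tier 1: deterministic rules -> Tier 2: known patterns -> Tier 3: LLM advisor
from mlflow.deployments import get_deploy_client

PRICE_TOLERANCE = 0.02                          # e.g., 2% price variance within the discretionary band
deploy = get_deploy_client("databricks")

def resolve_variance(variance: dict) -> dict:
    # Tier 1: deterministic policy rules
    if variance["type"] == "price" and variance["delta_pct"] <= PRICE_TOLERANCE:
        return {"action": "auto_accept", "tier": "policy_rule"}

    # Tier 2: recurring patterns resolved before, looked up by signature
    pattern = spark.sql(
        "SELECT resolution FROM finance.ap.variance_patterns WHERE signature = :sig",
        args={"sig": variance["signature"]},
    ).first()
    if pattern:
        return {"action": pattern["resolution"], "tier": "known_pattern"}

    # Tier 3: LLM advisor, invoked only for genuinely novel cases, then routed to a reviewer
    advice = deploy.predict(endpoint="variance-advisor", inputs={"dataframe_records": [variance]})
    return {"action": "route_to_reviewer", "tier": "llm_advisor", "advice": advice}
```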

The Variance resolution agent contributes to higher reconciliation efficiency

1.2 days is the new variance closure time, down from 2 days (a 40% reduction), achieved through combined deterministic and AI-powered reasoning that resolves repeated variances while focusing compute on edge cases. 

47% reduction in cost per variance check resulted from tiered reasoning, QPS limits, and infrastructure optimizations.

12 minutes is the average time APs now spend reviewing exceptions per variance, down from 35 minutes, despite humans remaining part of the HITL pipeline.

Step 5. Invoice posting

In a posting workflow, policy decisions are converted into ERP transactions and scheduled payments consistently, accurately, and on time. Routine postings run automatically, while edge cases carry the necessary evidence for swift review, and every action leaves a clear record.

The Posting agent takes the outcome from matching and variance resolution, then creates the ERP transaction and payment run. 

It calculates due dates, discount windows, payment blocks, and preferred payment cycles based on vendor terms, treasury rules, cutoff times, and the holiday calendar. It also produces remittance details and, on AP request, generates payment files (e.g., XML) for treasury approval.

Databricks toolset for intelligent invoice posting

The Databricks toolset we used to create an intelligent invoice posting agent.

On Databricks, posting is driven by a Model Serving endpoint that packages the deterministic checks and utilities needed before anything enters the ERP: cash-discount eligibility, control validations, remittance preparation, and payment-file generation. 

Each call returns a signed, reproducible validation and parameter record, so posting decisions are traceable and easy to roll back if required.

Workflows orchestrate the process end-to-end. A job triggers as soon as the Matching agent marks an invoice ready to post; schedules define payment-run windows (e.g., daily at 3 PM), and period-close holds pause posting at month/quarter end and resume automatically after close. 

The Posting agent writes outcomes to Gold postings, enabling learning components and analytics to track results without repeatedly calling the ERP.

TCO considerations for building an invoice posting agent in Databricks

Duplicate submissions, posting of low-confidence invoices, and ERP retries rack up infrastructure costs and degrade the agent’s performance. 

The following tweaks helped prevent this expensive rework and keep TCO under control. 

Setting up posting hash verification. Use hashing in the Model Serving endpoint to prevent duplicate postings, ERP reversals, and redundant connector jobs (see the sketch after this list).

Designing a two-lane posting queue for invoices. Critical vendor invoices are processed immediately in micro-batches, while the rest are batched into scheduled payment runs (e.g., 3 PM) that generate a single payment file per batch, reducing posting costs.

Creating an ERP evidence cache. Save answers to repeated status checks (e.g., payment blocks) to reduce API calls and prevent ERP system overload by limiting connections.
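An illustrative idempotency guard for the posting hash described above; the hash fields and the ledger table are assumptions, and the real check runs inside the Model Serving endpoint:

```python
# Skip duplicate postings instead of triggering ERP reversals
import hashlib

def posting_hash(invoice: dict) -> str:
    """Stable fingerprint of everything that defines an ERP posting."""
    key = "|".join(str(invoice[k]) for k in ("vendor_id", "invoice_number", "amount", "currency", "due_date"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def should_post(invoice: dict) -> bool:
    h = posting_hash(invoice)
    seen = spark.sql(
        "SELECT 1 FROM finance.ap.posted_hashes WHERE posting_hash = :h LIMIT 1",   # hypothetical ledger
        args={"h": h},
    ).first()
    return seen is None
```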

How the intelligent invoice posting workflow streamlined reconciliation

The invoice posting agent helps APs capture discounts and cut late-fee incidents by over 60%. Thanks to pre-posting validation, the ERP acceptance rate reached 98% compared to 92% for the pre-automation workflow. 

Since the implementation of automated posting, the total posting time has gone down from 45 to 10 minutes per invoice on average. 

Step 6. Learning and iteration

In a learning workflow, the system monitors itself in production and improves with every cycle. 

The Learning and Iteration agent observes outcomes across components and human-in-the-loop decisions to recommend targeted changes, such as adjusting confidence thresholds, switching models, or refining routing rules. 

The Learning and Iteration agent ingests three types of signals: 

  • Quality: correctness, the need for human overrides
  • Cost and latency: serving costs, DBU, queueing, and processing time
  • Safety: policy violations and unsupported actions. 

Building a Learning and Iteration agent in Databricks

The Databricks architecture for the Learning and Iteration agent.

With Databricks, evaluations are set up in Lakehouse Monitoring for GenAI to measure behavior in real workloads.

The Learning agent queries logs emitted by other agents to quantify drift, check confidence thresholds, validate guardrails, and score category metrics (e.g., price-variance resolution accuracy).
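As an example of what such a scoring query might look like, the sketch below computes one category metric from a hypothetical agent event log; the table and its columns stand in for the real traces:

```python
# Score price-variance resolution accuracy, p95 latency, and spend from agent logs
from pyspark.sql import functions as F

logs = spark.read.table("finance.ap.agent_event_log")    # hypothetical log table

price_variance_report = (
    logs.where(F.col("agent") == "variance_resolution")
        .where(F.col("variance_type") == "price")
        .agg(
            F.avg(F.when(F.col("human_override"), 0.0).otherwise(1.0)).alias("resolution_accuracy"),
            F.expr("percentile(latency_ms, 0.95)").alias("p95_latency_ms"),
            F.sum("dbu_cost").alias("dbu_spend"),
        )
)
price_variance_report.show()
```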

Proposed changes are implemented via MLflow: promising runs are registered, rollouts can be introduced gradually, and any underperforming update can be reverted immediately. This closes the loop, ensuring that each decision informs the next without sacrificing governance or auditability.

Cost reduction mechanisms for the Learning and Iteration agent

The most challenging part of designing the learning agent that closes the loop on the entire system was getting it to make the most of the data it already has before starting new experiments. 

We made a few workflow tweaks that minimized resource consumption and helped capture more insight from the entire system’s performance. 

Right-sized infrastructure per cohort. The system validates lower-cost paths by gradually routing small invoice cohorts (5%) to cheaper stacks. This helps expand successful configurations while maintaining SLAs.

Capped token usage and retrieval costs. We set hard budget caps per agent and cohort, cached vector embeddings to avoid recomputing context during A/B tests, and normalized artifacts to reduce per-experiment costs.

How the Learning and Iteration agent maintains high reconciliation efficiency

Through continuous learning and iteration, agents observe and mimic the decisions of AP reviewers. Since the system was adopted in production, human involvement has gradually gone down by 68%, and average posting speed has improved by 55%.


The takeaway

Compound AI systems deliver quantifiable improvements in multi-step workflows. Our invoice reconciliation implementation produced sustained performance gains, with APs now spending just 5 minutes on average to reconcile an invoice, down from the days-long, largely manual process the system replaced.

This project demonstrated that Databricks offers a comprehensive toolset for building scalable, cost-effective compound AI systems. The platform’s integrated components, from Auto Loader and Delta Live Tables to Model Serving and Workflows, work together seamlessly without requiring complex integrations.

For TCO optimization, workflow orchestration delivered the biggest impact. Fine-tuning batch sizes, trigger intervals, and task coordination reduced both compute waste and processing bottlenecks. 

However, the most reliable cost control came from managing resource consumption directly: QPS caps prevent runaway spending from traffic spikes, while auto-scaling ensures you pay only for resources actually needed.

The key takeaway is that compound AI success depends as much on infrastructure discipline as it does on model performance. Get the orchestration and resource management right, and the AI capabilities can deliver their full potential at predictable costs.