Medallion architecture

Medallion architecture is a data design pattern that organizes data into progressive layers of increasing quality and refinement. Data enters the system in its raw form, passes through cleaning and transformation stages, and emerges ready for business consumption. The three standard layers, named bronze, silver, and gold after Olympic medals, represent this progression from unprocessed to analytics-ready data.

Databricks popularized the term in the 2010s, though the underlying concept of staged data refinement predates the specific naming convention. The pattern applies primarily to data lakehouses where organizations need to balance raw data preservation with curated analytical datasets. Medallion architecture provides a mental model for organizing this progression and establishing clear responsibilities at each stage.

The architecture follows ELT (Extract, Load, Transform) rather than the traditional ETL (Extract, Transform, Load) methodology. Data lands in the bronze layer with minimal transformation, preserving source fidelity. Transformations happen progressively as data moves through subsequent layers, allowing flexibility to reprocess if business logic changes.

The three layers explained

Each layer serves a distinct purpose in the data refinement process. Understanding these purposes prevents the common mistake of treating layers as arbitrary staging areas.

Bronze layer: raw data preservation

The bronze layer captures data exactly as it arrives from source systems. This layer serves as the system of record, preserving the original data for auditing, compliance, and reprocessing needs.

Bronze data typically has minimal schema enforcement. Databricks recommends storing most fields as strings or variant types to protect against upstream schema changes that would otherwise break ingestion. The goal is reliable capture, not data quality.

Common bronze layer characteristics include:

- Append-only writes that preserve historical records
- Partitioning by ingestion timestamp rather than business attributes
- Metadata columns tracking source system, ingestion time, and batch identifiers
- Retention policies that balance storage costs against reprocessing needs
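
To make this concrete, here is a minimal PySpark ingestion sketch illustrating these characteristics. It assumes Delta Lake on Spark; the landing path, table name, and metadata columns are hypothetical, not a definitive implementation.

```python
# Minimal bronze ingestion sketch (hypothetical paths and names).
# Assumes Delta Lake on Spark and that the "bronze" schema exists.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw = (
    spark.read.format("json")
    .option("primitivesAsString", "true")  # keep fields as strings to survive schema drift
    .load("s3://landing-zone/orders/2024-06-01/")  # hypothetical landing path
)

bronze = (
    raw
    .withColumn("_source_system", F.lit("orders_api"))  # ingestion metadata
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_ingest_date", F.current_date())       # partition column
    .withColumn("_batch_id", F.lit("2024-06-01T00"))    # hypothetical batch identifier
)

(
    bronze.write.format("delta")
    .mode("append")               # append-only: never overwrite history
    .partitionBy("_ingest_date")  # partition by ingestion date, not business keys
    .saveAsTable("bronze.orders_raw")
)
```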

Bronze data is not intended for direct consumption by analysts or data scientists. Queries against bronze tables are typically limited to data engineers debugging pipeline issues or investigating data quality problems.

Silver layer: cleaned and conformed data

The silver layer applies cleaning, validation, and conforming transformations that make data usable for analytical workloads. This layer produces an enterprise view of data that serves as the foundation for downstream analysis.

Silver transformations typically include:

- Data type casting and format standardization
- Null handling and default value application
- Deduplication and record matching across sources
- Referential integrity validation
- Business key generation and surrogate key assignment
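
A minimal PySpark sketch of such a silver job, continuing the hypothetical orders example from the bronze section, might look like this. Column names and validation rules are illustrative assumptions.

```python
# Minimal silver transformation sketch (hypothetical tables and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("silver_orders").getOrCreate()

bronze = spark.read.table("bronze.orders_raw")

typed = (
    bronze
    .withColumn("order_id", F.col("order_id").cast("bigint"))  # type casting
    .withColumn("order_ts", F.to_timestamp("order_ts"))        # format standardization
    .withColumn("amount", F.coalesce(F.col("amount").cast("decimal(18,2)"), F.lit(0)))  # null handling
    .filter(F.col("order_id").isNotNull())                     # basic validation
)

# Deduplicate: keep the latest ingested record per business key.
latest = Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())
silver = (
    typed
    .withColumn("_rn", F.row_number().over(latest))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```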

The silver layer often follows third normal form or similar normalized modeling approaches. This normalization enables flexible querying across different analytical use cases without pre-committing to specific aggregation patterns.

Silver data serves data scientists building models, analysts performing ad-hoc exploration, and downstream gold layer transformations. Access controls typically open silver data to a broader audience than bronze while still restricting access to particularly sensitive datasets.

Gold layer: business-ready aggregates

The gold layer contains data optimized for specific business consumption patterns. This layer applies business logic, aggregations, and denormalization that make data immediately useful for reporting, dashboards, and applications.

Gold transformations typically include:

- Business rule application and metric calculation
- Dimensional modeling with fact and dimension tables
- Pre-aggregation at commonly requested granularities
- Join denormalization for query performance
- Slowly changing dimension handling for historical analysis
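
A gold transformation, sketched under the same hypothetical orders example, might denormalize and pre-aggregate like this. The tables, grain, and metrics are illustrative assumptions.

```python
# Minimal gold aggregation sketch (hypothetical tables and metrics).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold_revenue").getOrCreate()

orders = spark.read.table("silver.orders")
customers = spark.read.table("silver.customers")  # hypothetical conformed dimension source

# Denormalizing join plus pre-aggregation at a commonly requested granularity.
daily_revenue = (
    orders.join(customers, "customer_id")                 # join denormalization
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "region", "customer_segment")  # business-ready grain
    .agg(
        F.sum("amount").alias("revenue"),                 # business metric
        F.countDistinct("order_id").alias("order_count"),
    )
)

daily_revenue.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")
```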

Gold tables are often organized by business domain or consuming application. A finance domain might have revenue facts and customer dimensions. A marketing domain might have campaign performance aggregates. This organization aligns data with the teams that consume it.

Gold data serves business users directly through BI tools, dashboards, and reports. Query patterns are predictable, allowing optimization for specific access patterns rather than general-purpose flexibility.

Defining clear layer boundaries

One of the most common medallion implementation failures stems from ambiguous layer definitions. Teams interpret bronze, silver, and gold differently, leading to inconsistent architectures that undermine the pattern’s benefits.

The silver layer ambiguity problem

Silver is the least clearly defined layer in most medallion implementations. It frequently becomes a catch-all for transformations that do not fit neatly into bronze or gold. This ambiguity creates several problems.

Transformation scope creep occurs when teams add business logic to silver that belongs in gold, or defer cleaning to gold that should happen in silver. Over time, the silver layer accumulates complexity without clear organizing principles.

Inconsistent modeling emerges when different data engineers interpret silver differently. Some create highly normalized models. Others create semi-denormalized structures. The resulting inconsistency makes silver data harder to use and maintain.

Performance degradation follows complexity growth. Silver layers that try to do too much become slow and expensive, especially when transformations that should be gold-layer aggregations run against every silver refresh.

Establishing layer contracts

Clear layer contracts prevent ambiguity. Define explicitly what transformations belong in each layer and enforce these definitions through code review and automated validation.

Bronze contracts should specify that data arrives with source schema preserved, ingestion metadata added, and no business logic applied. Any filtering, deduplication, or transformation moves to silver.

Silver contracts should specify that data has been typed, validated, deduplicated, and conformed to enterprise standards. Business calculations, aggregations, and dimensional modeling remain in gold.

Gold contracts should specify that data is ready for consumption by specific business applications or domains. Transformations optimize for query patterns rather than general flexibility.

Document these contracts and review them with data consumers. When disagreements arise about where transformations belong, the contracts provide a reference point for resolution.
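
One way to back these contracts with automated validation is a small check job that fails the pipeline when a table violates its layer's rules. The sketch below tests a hypothetical silver contract (typed business key, no nulls, no duplicates); the table and rules are assumptions.

```python
# Sketch of automated contract checks for a silver table (hypothetical rules).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver_contract_check").getOrCreate()
df = spark.read.table("silver.orders")

# Silver contract: typed business key, no nulls, no duplicates.
null_keys = df.filter(F.col("order_id").isNull()).count()
assert null_keys == 0, f"silver.orders violates contract: {null_keys} null keys"

dupes = df.groupBy("order_id").count().filter(F.col("count") > 1).count()
assert dupes == 0, f"silver.orders violates contract: {dupes} duplicated keys"

assert dict(df.dtypes)["order_id"] == "bigint", "order_id must be typed, not a raw string"
```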

Implementation patterns

Successful medallion implementations share common patterns that address practical challenges.

Physical versus logical separation

Teams must decide whether to separate layers physically (different storage locations, databases, or accounts) or logically (different schemas or naming conventions within shared infrastructure).

Physical separation provides stronger isolation for security and compliance. Access controls apply at the infrastructure level. Cost attribution tracks cleanly to specific layers. Failures in one layer cannot directly impact others.

Logical separation simplifies cross-layer queries and reduces data movement. Development workflows are faster when engineers can query across layers without federated queries or data copying.

Many production implementations use physical separation for the bronze-to-silver boundary (protecting raw data) and logical separation for the silver-to-gold boundary (enabling flexible transformation development).

Partitioning strategies

Effective partitioning dramatically impacts both cost and performance.

Bronze layers are typically partitioned by ingestion date. This approach aligns with append-only write patterns and enables efficient data lifecycle management. Old partitions can be archived or deleted based on retention policies.

Silver layers partition by business attributes that align with common query patterns. For transactional data, partition by transaction date. For customer data, partition by region or customer segment. Choose attributes that appear frequently in WHERE clauses.

Gold layers partition based on consumption patterns. If dashboards filter by month and region, partition by those dimensions. If reports aggregate by product category, include category in the partition scheme.
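
The sketch below, continuing the hypothetical orders example, shows layer-appropriate partitioning with PySpark. Bronze partitioning by ingestion date appeared in the earlier ingestion example, so this one covers silver and gold; table and column names are illustrative.

```python
# Partitioning sketch for silver and gold (hypothetical tables and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioning").getOrCreate()

# Silver: partition by a business attribute that appears in WHERE clauses.
(
    spark.read.table("silver.orders")
    .withColumn("order_date", F.to_date("order_ts"))
    .write.format("delta").mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("silver.orders_by_date")
)

# Gold: partition by the dimensions dashboards actually filter on.
(
    spark.read.table("gold.daily_revenue")
    .withColumn("month", F.date_format("order_date", "yyyy-MM"))
    .write.format("delta").mode("overwrite")
    .partitionBy("month", "region")
    .saveAsTable("gold.daily_revenue_partitioned")
)
```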

Incremental processing

Batch reprocessing of entire layers is expensive and slow. Incremental processing updates only changed records, dramatically reducing compute costs and refresh latency.

Delta Lake, Apache Iceberg, and Apache Hudi provide merge capabilities that enable incremental updates. Change data capture from source systems feeds incremental refreshes. Watermarking tracks which source records have been processed.

Design silver and gold transformations to handle incremental inputs. Aggregations that cannot be incrementally updated (like distinct counts) require careful handling, either through approximate algorithms or periodic full refreshes.
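
With Delta Lake, an incremental silver refresh can be expressed as a MERGE that upserts only changed records. This sketch assumes the delta-spark package and the hypothetical orders tables used above; the watermark literal is a placeholder.

```python
# Incremental upsert sketch using Delta Lake MERGE (hypothetical tables and keys).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental_silver").getOrCreate()

# Only the records changed since the last run (e.g., from CDC or a stored watermark).
updates = spark.read.table("bronze.orders_raw").filter(
    "_ingested_at > '2024-06-01 00:00:00'"  # hypothetical watermark value
)

target = DeltaTable.forName(spark, "silver.orders")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()     # update changed records in place
    .whenNotMatchedInsertAll()  # insert brand-new records
    .execute()
)
```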

Common anti-patterns

Medallion implementations fail for predictable reasons. Recognizing these anti-patterns helps teams avoid common mistakes.

Bronze as the only source of truth

Some teams treat bronze as the definitive source for all downstream queries, forcing consumers to navigate raw, unvalidated data. This approach defeats the purpose of staged refinement.

Bronze should be the source of truth for what was received, not for what is consumed. Silver and gold layers exist precisely to provide cleaner, more usable views. Consumers should rarely query bronze directly.

Gold tables per report

Creating separate gold tables for each report or dashboard leads to an explosion of nearly identical tables. Maintenance becomes impossible as business logic is duplicated across dozens of tables that gradually diverge.

Design gold tables around business domains and common dimensional models, not around specific reports. Multiple reports should share underlying fact and dimension tables, with report-specific logic in the visualization layer rather than the data layer.

Skipping layers for convenience

Engineers sometimes bypass silver, loading cleaned data directly from bronze to gold. This shortcut creates problems when multiple gold tables need the same cleaning logic, duplicating effort and creating inconsistency.

Resist the temptation to skip layers. The staged approach exists to promote reuse and consistency. If cleaning logic appears in gold transformations, refactor it into silver.

Ambiguous ownership

Without clear ownership, layers accumulate technical debt. Nobody feels responsible for optimizing slow queries, cleaning up unused tables, or updating outdated transformations.

Assign explicit ownership for each layer, ideally at the table level. Owners are responsible for quality, performance, and documentation. When tables have no clear owner, they should be candidates for deprecation.

Extending beyond three layers

The standard three-layer model does not fit every use case. Some organizations extend medallion architecture with additional layers.

Pre-bronze staging

High-velocity streaming sources may need a staging layer before bronze. This pre-bronze layer handles initial deduplication, format conversion, and schema detection before data lands in the persistent bronze layer.

Kafka topics, cloud storage buckets, or landing zones serve as pre-bronze staging. Data flows through quickly, with minimal retention. The bronze layer captures the stable, deduplicated version.

Platinum for ML and real-time

Some organizations add a platinum layer for advanced analytics and machine learning. Platinum consumes from gold but optimizes for specific ML requirements: feature engineering, training dataset creation, and model serving.

Platinum also addresses real-time use cases that gold’s batch orientation cannot serve. Streaming aggregations, real-time feature computation, and low-latency serving endpoints live in platinum.

The platinum concept remains less standardized than bronze, silver, and gold. Organizations implementing platinum should define clear contracts that distinguish it from gold.

Medallion architecture for streaming data

Traditional medallion architecture assumes batch processing. Streaming data requires adaptations that maintain medallion principles while supporting continuous data flow.

Streaming bronze ingestion

Streaming bronze replaces batch file ingestion with continuous event consumption. Kafka, Kinesis, or Pulsar feed data into bronze tables through streaming writes.

Delta Lake’s streaming capabilities enable append-only bronze tables that update continuously. Watermarking tracks processing progress. Checkpointing ensures exactly-once semantics despite failures.
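
A minimal Structured Streaming sketch of bronze ingestion from Kafka might look like the following; the brokers, topic, table, and checkpoint location are hypothetical.

```python
# Streaming bronze ingestion sketch: Kafka -> Delta (hypothetical topic and paths).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming_bronze").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
    .option("subscribe", "orders")                     # hypothetical topic
    .load()
)

bronze = (
    events.select(
        F.col("key").cast("string"),
        F.col("value").cast("string"),  # keep the raw payload as a string
        F.col("timestamp").alias("_event_ts"),
    )
    .withColumn("_ingested_at", F.current_timestamp())
)

(
    bronze.writeStream.format("delta")
    .outputMode("append")  # append-only bronze
    .option("checkpointLocation", "s3://checkpoints/bronze_orders")  # failure recovery
    .toTable("bronze.orders_events")
)
```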

Streaming silver transformations

Silver transformations can run as streaming jobs that process bronze changes incrementally. Engines such as Spark Structured Streaming and Flink apply cleaning and validation logic to event streams.

Stateful transformations like deduplication and sessionization require careful state management. State stores grow over time and need compaction. Late-arriving data needs handling through watermarks and allowed lateness windows.
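
The following sketch shows a streaming silver job with a watermark bounding deduplication state. It reads the hypothetical streaming bronze table from the previous example, and the payload schema is an assumption.

```python
# Streaming silver sketch with watermarking and stateful deduplication
# (hypothetical schema; assumes the streaming bronze table above).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming_silver").getOrCreate()

bronze = spark.readStream.table("bronze.orders_events")

silver = (
    bronze
    .withWatermark("_event_ts", "10 minutes")  # bound state; drop very late events
    .dropDuplicates(["key", "_event_ts"])      # stateful dedup within the watermark
    .withColumn("order", F.from_json("value", "order_id BIGINT, amount DOUBLE"))
    .select("order.*", "_event_ts")
)

(
    silver.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://checkpoints/silver_orders")
    .toTable("silver.orders_stream")
)
```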

Micro-batch versus true streaming

True streaming provides the lowest latency but the highest complexity. Micro-batch processing (small batches every few seconds or minutes) offers a middle ground that many organizations find sufficient.

Evaluate latency requirements carefully. If business needs can tolerate minute-level delays, micro-batch simplifies implementation significantly compared to true streaming.
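
In Spark Structured Streaming, this choice is made at the trigger: leaving it unset runs micro-batches back-to-back for lower latency, while a processingTime trigger batches on a fixed cadence. The fragment below reuses the silver streaming DataFrame from the previous sketch and is illustrative only.

```python
# Trigger choice sketch, reusing the `silver` streaming DataFrame defined above.
# With no trigger set, Structured Streaming runs micro-batches back-to-back;
# processingTime batches on a fixed cadence instead.
query = (
    silver.writeStream.format("delta")
    .option("checkpointLocation", "s3://checkpoints/silver_orders_1min")
    .trigger(processingTime="1 minute")  # minute-level latency, simpler operations
    .toTable("silver.orders_stream_1min")
)
```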

Industrial and IoT data considerations

Sensor data, equipment telemetry, and industrial IoT present unique challenges for medallion architecture.

High-frequency time series

Industrial sensors generate data at millisecond intervals. Bronze layers must handle write volumes that dwarf typical transactional systems. Time-series optimized storage formats and partitioning by time windows become essential.

Silver transformations often include downsampling that reduces resolution for historical data while maintaining full resolution for recent periods. Aggregation functions preserve statistical properties while reducing storage and query costs.
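
A downsampling job can be a simple windowed aggregation. The sketch below reduces hypothetical sensor readings to one-minute statistics; the table and column names are assumptions.

```python
# Downsampling sketch: reduce sensor readings to 1-minute statistics
# (hypothetical table and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("downsample").getOrCreate()

readings = spark.read.table("silver.sensor_readings")

downsampled = (
    readings
    .groupBy(
        F.window("reading_ts", "1 minute").alias("w"),  # 1-minute time buckets
        "sensor_id",
    )
    .agg(
        F.avg("value").alias("avg_value"),  # preserve statistical properties
        F.min("value").alias("min_value"),
        F.max("value").alias("max_value"),
        F.count("*").alias("sample_count"),
    )
    .select(
        F.col("w.start").alias("window_start"), "sensor_id",
        "avg_value", "min_value", "max_value", "sample_count",
    )
)

downsampled.write.format("delta").mode("overwrite").saveAsTable("silver.sensor_readings_1m")
```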

Edge versus cloud processing

Some industrial environments require processing at the edge before data reaches cloud-based medallion layers. Limited connectivity, latency requirements, or data volume constraints drive edge processing needs.

Edge bronze layers capture raw sensor data locally. Edge silver layers apply initial cleaning and filtering. Only summarized or anomalous data transmits to cloud infrastructure. This hybrid approach reduces bandwidth costs while maintaining auditability.

Equipment-oriented modeling

Industrial data often organizes around equipment hierarchies rather than transactional entities. Plants contain lines, lines contain machines, machines contain sensors. Medallion layers should reflect these hierarchies.

Gold layers for industrial data typically include equipment state tables, maintenance event facts, and production metric aggregates. These structures support predictive maintenance, quality analysis, and operational optimization use cases.

When medallion architecture fits

Medallion architecture suits organizations with:

- Diverse data sources requiring progressive refinement
- Compliance requirements demanding raw data preservation
- Multiple teams consuming data at different quality levels
- Batch or micro-batch processing patterns

The pattern works well for analytical workloads where latency tolerance allows staged processing. Data warehousing, business intelligence, and machine learning training datasets align naturally with medallion’s progressive refinement model.

When to consider alternatives

Medallion architecture is not universal. Some situations call for different approaches.

Real-time operational systems that cannot tolerate batch latency may need event-driven architectures rather than staged refinement. Medallion can complement these systems but should not replace them for low-latency use cases.

Simple data pipelines with single sources and single consumers may not need three-layer complexity. If data flows from one source to one destination with straightforward transformations, simpler architectures reduce overhead.

Data mesh implementations may organize around domain data products rather than centralized layers. Medallion can exist within individual domains, but the organizational principle shifts from layers to products.

Small teams with limited data engineering capacity may struggle to maintain three-layer architectures. The overhead of defining, implementing, and operating multiple layers requires investment that smaller organizations may not be able to sustain.

Xenoss data pipeline engineering teams help enterprises design and implement medallion architectures that balance raw data preservation with business-ready analytics. From initial layer design through production optimization, our engineers bring experience from Fortune 500 data transformations to your lakehouse implementation.


FAQ

What tools support medallion architecture implementation?

Databricks provides native support through Delta Lake and Unity Catalog. Microsoft Fabric implements medallion architecture through OneLake lakehouses. Snowflake, Apache Spark with Iceberg or Hudi, and cloud-native data lakes (AWS Lake Formation, Azure Synapse, Google BigQuery) all support medallion patterns. The architecture is conceptually platform-agnostic, though specific implementations vary by tooling.

How does medallion architecture work with data mesh?

Medallion architecture can exist within individual domains in a data mesh. Each domain might maintain its own bronze, silver, and gold layers for domain-specific data. The gold layer often produces the data products that domains publish for consumption by other domains. Medallion provides the internal organization principle; data mesh provides the cross-domain ownership and governance model.

Can I skip the silver layer and go directly from bronze to gold?

Technically yes, but this approach creates problems. When multiple gold tables need the same cleaning logic, you duplicate effort and risk inconsistency. The silver layer exists to provide a clean, reusable foundation that multiple gold tables can build upon. Skip silver only when you have a single, simple use case that will never expand.

What is the difference between medallion architecture and data lakehouse?

Data lakehouse is a platform architecture that combines data lake storage with data warehouse capabilities. Medallion architecture is a design pattern for organizing data within a lakehouse. A lakehouse provides the underlying infrastructure (Delta Lake, Iceberg, or Hudi tables with ACID transactions and schema enforcement). Medallion architecture provides the organizational principle (bronze, silver, gold layers) for structuring data within that infrastructure.
