Go for Gold: Towards the Medallion Architecture

What is Medallion Architecture?

Medallion architecture is a data design pattern used to organise data in ‘layers’. The goal is to improve the quality and structure of the data with each successive layer.

These layers are often referred to as Bronze, Silver and Gold layers:

Bronze 🟠 - Landing zone for raw data

Silver ⚪️ - Standardised, deduplicated and validated data

Gold 🟡 - Aggregated, enriched and available for analysis / modelling
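
To make the layers concrete, here is a minimal PySpark sketch of the flow from bronze to gold. It assumes a hypothetical orders feed with order_id, customer_id and amount fields and uses Delta tables; all paths, column names and the choice of format are illustrative rather than prescriptive.

```python
# Minimal sketch of the three medallion layers, assuming a hypothetical
# `orders` feed. Paths, columns and the Delta format are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze 🟠 - land the raw files as-is, append only
bronze = spark.read.json("/lake/raw/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver ⚪️ - standardise types, validate and deduplicate
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold 🟡 - aggregate and enrich, ready for analysis / modelling
gold = silver.groupBy("customer_id").agg(
    F.sum("amount").alias("lifetime_value"),
    F.count("order_id").alias("order_count"),
)
gold.write.format("delta").mode("overwrite").save("/lake/gold/customer_value")
```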

At the crux of the medallion architecture is the concept of a data lakehouse, as the architecture is designed specifically to be used within a lakehouse setting.


What is a Data Lakehouse?

Data lakes came about from a need to store large amounts of unstructured and structured data for analysis, data science and machine learning. Although excellent for storage, they weren’t very good for doing anything with the data - the data needed to be taken through a pipeline into a more traditional database for transformation. At its core, think of a data lake as a collection of folders to keep all kinds of data in.

A data lakehouse seeks to merge the best parts of a data warehouse and a data lake. More specifically, it combines the low-cost storage of a data lake, which can hold many types of data (structured, semi-structured and unstructured), with the superior structure and governance of a data warehouse.

  • A lakehouse has a SQL query engine, a feature that data lakes lack.
  • Metadata layers also feature in the lakehouse - data about the data stored in the data lake.
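
As a rough sketch of what those two features give you, the snippet below registers files already sitting in the lake as a named table (the metadata layer) and then queries it with plain SQL (the query engine). It assumes the hypothetical silver orders table from the earlier sketch and a Spark session with Delta support; names and paths are illustrative.

```python
# Sketch of the lakehouse "SQL on top of files" idea; table name and path
# are assumptions carried over from the earlier example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the files sitting in the lake as a named table in the metastore
# (the metadata layer: schema, location, format - data about the data).
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver_orders
    USING DELTA
    LOCATION '/lake/silver/orders'
""")

# The SQL query engine then lets you query the lake like a warehouse.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM silver_orders
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
    LIMIT 10
""")
top_customers.show()
```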

Layers 🟠 ⚪️ 🟡

The 3 layers are very similar in concept to methods that may already be in use; for example, a BigQuery architecture implements:

raw ⇒[staging]⇒ warehouse ⇒[staging]⇒ analytics/semantic

and this is quite similar to our medallion layers:

bronze ⇒ silver ⇒ gold

Please note that medallion architecture is NOT a replacement for data models such as star schemas. For example, your silver layer may adopt a star schema approach whilst your gold layer may be modelled with the OBT (One Big Table) approach.


Databricks

Databricks is a company with an analytics platform also called ‘Databricks’.

There is some history behind the creation of Databricks and subsequently the inception of the data lakehouse.

Long ago (pre-2005), database systems ran on single computers with some associated storage system. The speed, and the amount of data that could be held, were dependent on the computer itself - Hadoop introduced horizontal scaling of database systems. You could connect together 10 or 50 or 100 regular Intel computers each running Linux and Hadoop.

The two key components of Hadoop were a distributed file system (HDFS) that treated all the disks as one storage system and a programming framework (MapReduce) that allowed you to write a program to run across many CPUs.

Hive was introduced as a SQL abstraction on top of HDFS and MapReduce that allows you to pretend that your data is in normal relational-like tables and to write database jobs in the form of SQL statements. It is essentially a compiler that translates SQL into Java MapReduce code.

Spark was created in 2011 as a faster alternative to Hive. Databricks is essentially Spark, but with a nice GUI on top and many automated, easy-to-use features.


Medallion Vs Kimball (Star Schema)

As mentioned before - medallion architecture is not a replacement for data modelling techniques; instead it is a step before (or a level higher up).

It would be more appropriate to assign a Star Schema data modelling approach to a particular layer, as below:

| Layer | Description | Data Model | Tests |
| --- | --- | --- | --- |
| Bronze 🟠 | Landing zone for raw data | - | |
| Silver ⚪️ | Standardised, deduplicated and validated data | “Agnostic” Modelling | |
| Gold 🟡 | Aggregated, enriched and available for analysis / modelling | Star Schema with Fact Constellation | |

The models above are the proposed data models.
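
As a hedged illustration of the proposed gold-layer model, the sketch below joins a hypothetical fct_orders fact table to dim_customer and dim_date dimensions and aggregates revenue by region and month; every table, column and path name here is an assumption made for the example.

```python
# Sketch of a gold-layer star schema query: a hypothetical fact table joined
# to two dimensions, then aggregated for analysis. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

fct_orders   = spark.read.format("delta").load("/lake/gold/fct_orders")
dim_customer = spark.read.format("delta").load("/lake/gold/dim_customer")
dim_date     = spark.read.format("delta").load("/lake/gold/dim_date")

monthly_revenue = (
    fct_orders
    .join(dim_customer, "customer_key")
    .join(dim_date, "date_key")
    .groupBy("region", "calendar_month")
    .agg(F.sum("amount").alias("revenue"))
)
monthly_revenue.write.format("delta").mode("overwrite").save("/lake/gold/monthly_revenue")
```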


Data Marts & Medallion Architecture

A data mart is a simpler form of data warehouse focused on a single subject or line of business. With a data mart, teams can access data and gain insights faster, because they don’t have to spend time searching within a more complex data warehouse or manually aggregating data from different sources. Examples of data marts may be ‘finances’, ‘customers’ or ‘departments’.

The relationship between medallion architecture and data marts is one of parent-child. That is, medallion architecture sits a level above data marts, describing the overall architecture of each layer, whilst each layer will have its own form of data modelling. Best practice is that the analytical / semantic layer can be modelled using data marts, as shown in the table above.


Advantages & Challenges of Medallion Architecture

Advantages:

  • Simple data model: Medallion architectures are familiar to many who have used dbt or warehouse data staging techniques like source ➡️ stage ➡️ curated.
  • Logical Progression of data improvements: Each stage in a medallion architecture follows a common sense, logical pattern that is easy to follow, and it is easy to see the modular improvements in the data.
  • Reproducibility: All data is stored in raw tables - this means that the entire architecture can be reproduced and recovered.

Disadvantages:

  • Is not a replacement for data modelling: Schemas & Tables must still be modelled within the medallion architecture.
  • Storage: Although storage is cheap, the architecture requires a large amount of it; holding the data in three layers will, in theory, at least triple storage costs.
  • Implies the use of a Lakehouse: Having a data lakehouse is almost a prerequisite to implementing the medallion architecture. If one does not have this already set up, nor intends to, it becomes less relevant.

Data Mesh

Data mesh is a type of data architecture where the data is treated as a ‘product’. This means that each business unit would ‘own’ and control the data related to their function - finance owns the finance data, product team owns product data, HR owns HR data and so on. In this way, it is considered decentralised.

Data Mesh Principles:

Distributed domain-driven architecture

Data is the product

Self-serve data infrastructure

Autonomous data governance

Normally, in an organisation, a central data team will:

  • Ingest the data from all the different business units.
  • Transform the data into a consistent, trustworthy, and useful format. For example, the team could make sure all dates in the system are in a common format (see the sketch after this list).
  • Prepare the data for data consumers.
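
As a small illustration of the date-standardisation example above, here is a minimal plain-Python sketch that normalises a few assumed source formats to ISO 8601; the list of formats is purely hypothetical.

```python
# Normalise a hypothetical mix of source date strings to ISO 8601 dates.
from datetime import datetime

SOURCE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]  # assumed source formats

def to_iso_date(raw: str) -> str:
    """Try each known source format and return the date as YYYY-MM-DD."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

print(to_iso_date("31/01/2024"))  # -> 2024-01-31
```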

There are some issues with this approach, however…

  1. Isolated data team ⇒ The central team is unaware of the motivations and needs of each business unit, simply providing information with no clear context as to what actually needs doing.
  2. Slow to change ⇒ Any change requires the entire pipeline to be reviewed and altered, which takes a long time. Furthermore, even agreeing these changes can be time consuming.
  3. Reduced accuracy ⇒ As a result of 1. (being isolated), a lack of understanding of the business unit means the information will be less accurate; data engineers are unaware of what’s happening on the ground.

The data mesh is all about autonomy, ownership and decentralisation; as a result, the benefits are many:

  1. Democratising data ⇒ Data consumers request access to data products and seek approvals or changes directly from the data owners. As a result, everyone gets faster access to relevant data, and faster access improves business agility.
  2. Flexibility ⇒ Removing central data pipelines reduces operational bottlenecks and technical strain on the system, and means less strain on a central data team.
  3. Improved data discovery ⇒ A central data management framework helps prevent data silos and makes each domain’s data products easier to discover.

dbt Mesh

As seen above, the data mesh architecture is better suited to larger, enterprise organisations with large business units. dbt mesh is meant to make the scaling of dbt in the enterprise setting easier, more robust and less repetitive.

dbt is used commonly throughout small and large organisations, but to date, there hasn’t been a way to effectively manage the dependencies, governance, and workflows between multiple dbt projects.

dbt Mesh helps data teams to work independently and collaboratively; sharing data, code, and best practices without sacrificing security or autonomy.

Who is dbt Mesh for?

The multi-project architecture helps organizations with mature, complex transformation workflows in dbt increase the flexibility and performance of their dbt projects. If you’re already using dbt and your project has started to experience any of the following, you’re likely ready to start exploring this paradigm:

  • The number of models in your project is degrading performance and slowing down development.
  • Teams have developed separate workflows and need to decouple development from each other.
  • Security and governance requirements are increasing and would benefit from increased isolation.

https://docs.getdbt.com/best-practices/how-we-mesh/mesh-1-intro


Medallion Architecture

Using the existing ELT template, we have repurposed it into the medallion architecture; below is the current best practice:


When Should we use this Architecture?

Medallion architecture, whilst similar to the model that we already utilise, does have differences, mostly in that it requires the use of a data lakehouse.

❓Are there large amounts of unstructured & semi-structured data?

Why would we need to focus on a lakehouse if all the data is structured and can simply be dropped into a BigQuery table? Thus, if the data exists in forms other than traditional, structured data, medallion architecture could work well.

❓Do you have a Lakehouse & a separate warehouse?

If yes, medallion architecture is ideal - it would eliminate the need for a separate warehouse, as that work can be done within the lakehouse.

❓Do you store all data (append only)?

If yes, this would allow the lakehouse to recreate all downstream models, which is a huge positive for avoiding catastrophic failures.
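
For illustration, here is a hedged PySpark sketch of that replay: with an append-only bronze table, silver can always be re-derived as the latest record per key, so downstream tables can be rebuilt from scratch. The order_id and updated_at columns and the paths are assumptions.

```python
# Rebuild silver from an append-only bronze history: every load is appended
# to bronze, and silver is re-derived as the latest record per order_id.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.format("delta").load("/lake/bronze/orders")

latest_per_order = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())

silver = (
    bronze
    .withColumn("_rn", F.row_number().over(latest_per_order))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

# Overwriting silver (and any gold tables built from it) is safe, because the
# full history still lives in bronze and can be replayed at any time.
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")
```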


Data Monetisation

Data monetisation is fast becoming important for data teams to show tangible revenue-generating outputs. Doing this with a pre-existing data architecture and model is ideal, as the work required is lessened.

Data can be monetised much like any other service:

Bronze 🟠

APIs are a service that many companies provide, with tiers based on how many calls are allowed per day; Twitter (or X) is one example.

Silver ⚪️

Where there is more standardised, structured data, the data can be ‘sold’ as flat files on various marketplaces.

A marketplace allows for the acquisition of data in traditional star schema or OBT format; this is also more geared towards ad-hoc or less frequent data needs than an API.

Gold 🟡

Finally - the gold layer can provide data in the form of dashboards, where the data can be visualised. Tiers can also be applied to the dashboards, with elements that are locked at lower tiers unlocked at higher ones.

Most forms of data monetisation would be service based: access to the data / dashboard would be paid for on a monthly basis. Looker is ideal for this kind of function, as it can reduce scalability issues by simply restricting access to the data based on who is logged in.


Appendices

https://www.advancinganalytics.co.uk/medallion

https://www.databricks.com/glossary/medallion-architecture

https://medium.com/@chuck.connell.3/databricks-a-history-d8dd12fe9695

https://piethein.medium.com/medallion-architecture-best-practices-for-managing-bronze-silver-and-gold-486de7c90055

https://aws.amazon.com/what-is/data-mesh/

https://docs.getdbt.com/best-practices/how-we-mesh/mesh-1-intro