What is Dataform?
Dataform is Google Cloud’s transformation framework to support the “T” in ELT workloads. It offers serverless orchestration, helps build transformation workflows with just SQL, augments SQL with the programmatic power of Javascript, helps test data with assertions and also allows adding metadata to data such as labels, policy tags and documentation.
It owns the workload from the data lake to the creation of marts. Within these stages, it is possible to test data, document data, tag data etc. And all the created data objects can be run as part of SQL workflow courtesy a simple button click. This can also be orchestrated using either Cloud Workflows + Cloud Scheduler or Cloud Composer.
Figure 1. Dataform within the MDS
Key Features
- Easy business logic governance
- Write SQL SELECT statements without worrying about DDL
- Document your tables and columns
- Test your data using assertions
- Automatic DAG Generation
- Dataform can detect dependencies automatically when the ref() function is used
- Collaborative development with version control
- Connect to a 3rd party git repo (GitHub, Bitbucket or Gitlab)
- Develop your changes on branches, allowing a Git development process (version control & Pull Requests), with development workspaces
- Simple Workflow Orchestration
- Use Release Configurations and Workflow Configurations to orchestrate
- Keep code DRY (don’t repeat yourself)
- Make use of Javascript to reduce the duplication in SQL code
Components of a Dataform Project
SQLX Files
- Primary building block of a Dataform project
- Similar to SQL scripts with a few differences
- No need for DDL, only SELECT statements (Dataform adds DDL)
- Can contain Dataform specific JS codeblocks
- Have the extension .sqlx
- All valid .sql files are also valid .sqlx files - However, this is not recommended for the following reasons
- A .sql file will not accept the config block. Without the config block, it won’t be materialised as a data model. So, you will have to write DDL for it. Dataform already does that through .sqlx
- If the intention is to have a SQL statement such as DELETE which does not create a table, you can still achieve that with a .sqlx file by setting type: “operation” in the config block
- And having both .sqlx and .sql files could lead to confusions. So, best to avoid .sql files
Figure 2. Example of a .sqlx file. Two main parts - The config block and the rest of the business logic in SQL + JS
What kind of entities can you create using the .sqlx file?
In dataform, the .sqlx file extension is just another file extension that can create useful entities within your dataform project. These “entities” are determined by the value for the type: property in the {config} block.
Data Source Declarations
Declare BigQuery data sources that can be used within the pipeline later on
These data sources are typically raw data that has been ingested through data ingestion processes, and also rest within BigQuery as BigQuery tables
config {
type: "declaration",
database: "...",
schema: "...",
name: "..."
} Data Models
- Tables or views that contain data
- Tables can reference data sources or other tables using the
${ref("reference_table")}function - https://cloud.google.com/dataform/docs/tables
// table
config {
type: "table"
}
// view
config {
type: "view"
}
// materialized view
config {
type: "materialized"
}
// incremental table
config {
type: "incremental",
uniqueKey: "employee_id"
} An incremental table is used when we deal with large tables that you don’t want to re-build every time they are run.
Key benefits of using this as summarised by Google Cloud
- Performance optimization - For some kinds of data, such as web logs or analytics data, you might want to only process new records instead of reprocessing the entire table.
- Latency reduction - You can use incremental tables to execute workflows quickly but frequently, reducing the downstream latency of the output tables.
- Daily snapshots - You can configure an incremental table to create daily snapshots of the table data, for example, for longitudinal analysis of user settings stored in a production database.
The uniqueKey property helps ensure there are no duplicates in the data while performing MERGE UPSERT (default behaviour of dataform incremental models)
If using materialized views, link to this document.
Introduction to materialized views | BigQuery | Google Cloud
Assertions
- Used to test data quality
- Runs as part of the pipeline
- https://cloud.google.com/dataform/docs/assertions
config {
type: "assertion"
} Custom SQL Operations
- Custom SQL statements that don’t create a table
- Eg: A DELETE operation
- https://cloud.google.com/dataform/docs/custom-sql
config {
type: "operation"
} NOTE: Operations are not guaranteed to run as dependencies as their usage is usually for running some SQL that is not directly part of the pipeline. Eg: Cleaning up stray tables
Also, there are pre_operations and post_operations (written within a .sqlx data model), that can execute SQL “before” or “after” the data model is materialised in BigQuery. Eg: To grant or block specific users’ access to a data model, you would have this “Access Grants Provisioning” SQL as a post_operation. More detailed info about this and the difference with the type: "operation:" sqlx file can be found in this discussion thread here - https://www.googlecloudcommunity.com/gc/Data-Analytics/post-operations-or-config-type-quot-operations-quot-in-GCP/m-p/611157
SQLX Configuration Blocks
The config block for a data model can have an extensive list of properties, including
- Type
- Schema
- Description
- Column names and descriptions
- BigQuery specific
- Partitioning
- Clustering
- Labels
- Policy Tags
- Assertions
Javascript Files
A detailed guide on JS for Dataform is provided.
Other files
dataform.json- Project-level configurations.package.json- Installed packages within your dataform project
Dataform Version
This section is for those who are interested in understanding Dataform releases and the latest version you should be using.
Dataform release notes | Google Cloud
It is recommended to keep version up-to-date, after clear inspection of the releases page that there will be no breaking changes (or if unavoidable, then a plan to fix them with limited interruption to users).