Our daily_user_activity model was the perfect candidate.
Instead of materializing it as a table, we changed the config to materialized='incremental' and incremental_strategy='merge'.
This tells dbt to use BigQuery’s highly efficient MERGE statement, which can surgically UPDATE or INSERT rows instead of doing a costly full rewrite. We also explicitly configured the target table to be partitioned by activity_date and clustered by user_id.
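To make that concrete, here is roughly the shape of the statement dbt compiles on an incremental run under the merge strategy. The table name and the new_rows alias are placeholders (dbt’s generated SQL uses its own temp relation and internal aliases), and the key and measure columns match the model shown further down:

MERGE `my-project.analytics.daily_user_activity` AS dest
USING new_rows AS src  -- the batch of freshly computed rows from the model
ON src.unique_daily_user_id = dest.unique_daily_user_id
WHEN MATCHED THEN
  -- an existing day/user row is updated in place
  UPDATE SET session_count = src.session_count, event_count = src.event_count
WHEN NOT MATCHED THEN
  -- brand-new day/user combinations are appended
  INSERT (unique_daily_user_id, activity_date, user_id, session_count, event_count)
  VALUES (src.unique_daily_user_id, src.activity_date, src.user_id, src.session_count, src.event_count)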
But the real “aha!” moment came from using the is_incremental() macro.
Inside the model’s SQL, we guarded part of the query with an {% if is_incremental() %} block. Inside that block sits a WHERE clause that is deceptively simple: WHERE activity_date > (SELECT MAX(activity_date) FROM {{ this }}). On the first run (or a full refresh), is_incremental() returns false, the clause isn’t rendered at all, and the model builds its full history.
But on every subsequent run, that clause tells BigQuery, “Hey, don’t you dare scan all 50 terabytes. Just look at the source data that’s newer than what you already have.”
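In other words, on an incremental run the compiled filter looks something like this (project and dataset names here are placeholders), because {{ this }} resolves to the model’s own target table:

-- Incremental run: {{ this }} has been resolved to the existing target table
WHERE activity_date > (
    SELECT MAX(activity_date)
    FROM `my-project.analytics.daily_user_activity`
)
-- First run or --full-refresh: is_incremental() is false, so no WHERE clause
-- is rendered and the model rebuilds its full history.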
Our daily data scan dropped from 50 terabytes to a few hundred gigabytes. Our costs plummeted by over 98%, and our pipelines ran faster.
It was a hard-won, but invaluable, lesson.
The Code That Saved My Budget:
-- unique_key is the column the MERGE matches on; partitioning on date and
-- clustering on user_id keep both the MERGE and user-level lookups cheap
{{
    config(
        materialized='incremental',
        unique_key='unique_daily_user_id',
        incremental_strategy='merge',
        partition_by={
            "field": "activity_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=["user_id"]
    )
}}
SELECT
    FARM_FINGERPRINT(CONCAT(CAST(activity_date AS STRING), CAST(user_id AS STRING))) AS unique_daily_user_id,
    activity_date,
    user_id,
    COUNT(DISTINCT session_id) AS session_count,
    COUNT(*) AS event_count
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
-- This is the crucial part: only scan the partitions you need
WHERE activity_date > (SELECT MAX(activity_date) FROM {{ this }})
{% endif %}

GROUP BY 1, 2, 3
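One operational note: the incremental logic only kicks in once the table already exists, so day to day this runs like any other model. A typical pair of invocations (model name as in the example above) looks like:

# First build, or a deliberate rebuild of the full history
dbt run --select daily_user_activity --full-refresh

# Regular scheduled run: only the new data is scanned and MERGEd in
dbt run --select daily_user_activity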