Duplicate data can sneak into your datasets in many ways: bad joins, ingestion issues, or data entry errors. In BigQuery, deduplicating data is a common and essential task, especially when working with large datasets where performance matters.
In this post, we will explore multiple methods to deduplicate data in BigQuery.
🚩 When Do You Need Deduplication?
Imagine a table orders_raw where each order might appear multiple times:
| order_id | customer_id | order_date | amount |
|---|---|---|---|
| 101 | 1 | 2025-07-01 10:00:00 | 50 |
| 101 | 1 | 2025-07-01 10:00:00 | 50 |
| 102 | 2 | 2025-07-01 11:00:00 | 30 |
You want only one row per order_id, keeping the most recent or largest record (or just any if they’re identical).
🏗️ Method 1: Using a Subquery + ROW_NUMBER()
Before QUALIFY was introduced, you’d typically handle this with a subquery and window function.
WITH ranked_orders AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) AS rn
FROM orders_raw
)
SELECT *
FROM ranked_orders
WHERE rn = 1; Explanation:
- Assigns a row number per
order_id, ordered byorder_date(most recent first). - Keeps only the top-ranked row.
✅ Good for: compatibility across SQL engines, and when chaining multiple steps.
🔄 Method 2: Using GROUP BY with MAX() or ANY_VALUE()
When you’re summarizing data and only care about specific fields, GROUP BY with aggregates works well.
SELECT order_id,
MAX(order_date) AS latest_order_date,
ANY_VALUE(customer_id) AS customer_id,
MAX(amount) AS max_amount
FROM orders_raw
GROUP BY order_id; MAX(order_date): gets the most recent date.ANY_VALUE(): returns any one of the values in the group (use with care).
✅ Good for: fast aggregations, dashboards, or non-critical deduplication.
🧱 Method 3: Using ARRAY_AGG(... ORDER BY ...) with LIMIT 1
If you want to return the entire row of the most recent or largest entry:
SELECT AS VALUE
ARRAY_AGG(o ORDER BY order_date DESC LIMIT 1)[OFFSET(0)]
FROM orders_raw o
GROUP BY order_id; Explanation:
ARRAY_AGG(... LIMIT 1)gathers only the top row (by order date).[OFFSET(0)]extracts the first and only row from the array.
✅ Good for: keeping full rows, more flexible than GROUP BY.
✅ Method 4: Using QUALIFY with ROW_NUMBER() (Most Elegant)
BigQuery’s QUALIFY clause allows filtering after window functions—eliminating the need for subqueries.
SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) = 1; Explanation:
ROW_NUMBER()assigns a rank perorder_id.QUALIFYfilters to keep only the top-ranked row per group.
✅ Pros:
- Clean and expressive.
- No CTE or subquery required.
- Great for readable production SQL.
🔚 Summary: Which Should You Use?
| Method | Best For | Readability | Performance |
|---|---|---|---|
Subquery + ROW_NUMBER | Legacy systems or complex transformations | Medium | High |
GROUP BY + aggregates | Simple summaries | ✅ High | ✅ High |
ARRAY_AGG() + LIMIT | Full row retention with custom logic | Medium | Medium |
QUALIFY + ROW_NUMBER | Clean, modern deduplication on BigQuery | ✅ High | ✅ High |
🧼 Final Thoughts
Deduplicating data is a routine but vital task. BigQuery gives you a flexible toolbox, whether you prefer legacy compatible approaches or modern, concise solutions like QUALIFY.
If you’re using BigQuery, try refactoring old subquery based deduplication into QUALIFY for a cleaner, easier to maintain workflow.