📦 Deduplicating Data in BigQuery

Duplicate data can sneak into your datasets in many ways: bad joins, ingestion issues, or data entry errors. In BigQuery, deduplicating data is a common and essential task, especially when working with large datasets where performance matters.

In this post, we will explore multiple methods to deduplicate data in BigQuery.

🚩 When Do You Need Deduplication?

Imagine a table orders_raw where each order might appear multiple times:

order_id	customer_id	order_date	amount
101	1	2025-07-01 10:00:00	50
101	1	2025-07-01 10:00:00	50
102	2	2025-07-01 11:00:00	30

You want only one row per order_id, keeping the most recent or largest record (or just any if they’re identical).

🏗️ Method 1: Using a Subquery + `ROW_NUMBER()`

Before QUALIFY was introduced, you’d typically handle this with a subquery and window function.

WITH ranked_orders AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) AS rn
  FROM orders_raw
)
SELECT *
FROM ranked_orders
WHERE rn = 1;

Explanation:

Assigns a row number per order_id, ordered by order_date (most recent first).
Keeps only the top-ranked row.

✅ Good for: compatibility across SQL engines, and when chaining multiple steps.

🔄 Method 2: Using `GROUP BY` with `MAX()` or `ANY_VALUE()`

When you’re summarizing data and only care about specific fields, GROUP BY with aggregates works well.

SELECT order_id,
       MAX(order_date) AS latest_order_date,
       ANY_VALUE(customer_id) AS customer_id,
       MAX(amount) AS max_amount
FROM orders_raw
GROUP BY order_id;

MAX(order_date): gets the most recent date.
ANY_VALUE(): returns any one of the values in the group (use with care).

✅ Good for: fast aggregations, dashboards, or non-critical deduplication.

🧱 Method 3: Using `ARRAY_AGG(... ORDER BY ...)` with `LIMIT 1`

If you want to return the entire row of the most recent or largest entry:

SELECT AS VALUE
  ARRAY_AGG(o ORDER BY order_date DESC LIMIT 1)[OFFSET(0)]
FROM orders_raw o
GROUP BY order_id;

Explanation:

ARRAY_AGG(... LIMIT 1) gathers only the top row (by order date).
[OFFSET(0)] extracts the first and only row from the array.

✅ Good for: keeping full rows, more flexible than GROUP BY.

✅ Method 4: Using `QUALIFY` with `ROW_NUMBER()` (Most Elegant)

BigQuery’s QUALIFY clause allows filtering after window functions—eliminating the need for subqueries.

SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) = 1;

Explanation:

ROW_NUMBER() assigns a rank per order_id.
QUALIFY filters to keep only the top-ranked row per group.

✅ Pros:

Clean and expressive.
No CTE or subquery required.
Great for readable production SQL.

🔚 Summary: Which Should You Use?

Method	Best For	Readability	Performance
Subquery + `ROW_NUMBER`	Legacy systems or complex transformations	Medium	High
`GROUP BY` + aggregates	Simple summaries	✅ High	✅ High
`ARRAY_AGG()` + `LIMIT`	Full row retention with custom logic	Medium	Medium
`QUALIFY` + `ROW_NUMBER`	Clean, modern deduplication on BigQuery	✅ High	✅ High

🧼 Final Thoughts

Deduplicating data is a routine but vital task. BigQuery gives you a flexible toolbox, whether you prefer legacy compatible approaches or modern, concise solutions like QUALIFY.

If you’re using BigQuery, try refactoring old subquery based deduplication into QUALIFY for a cleaner, easier to maintain workflow.

🚩 When Do You Need Deduplication?

🏗️ Method 1: Using a Subquery + ROW_NUMBER()

🔄 Method 2: Using GROUP BY with MAX() or ANY_VALUE()

🧱 Method 3: Using ARRAY_AGG(... ORDER BY ...) with LIMIT 1

✅ Method 4: Using QUALIFY with ROW_NUMBER() (Most Elegant)

🔚 Summary: Which Should You Use?

🧼 Final Thoughts

💬 Got a preferred method or tricky deduping scenario? Drop a comment or share your query!

🏗️ Method 1: Using a Subquery + `ROW_NUMBER()`

🔄 Method 2: Using `GROUP BY` with `MAX()` or `ANY_VALUE()`

🧱 Method 3: Using `ARRAY_AGG(... ORDER BY ...)` with `LIMIT 1`

✅ Method 4: Using `QUALIFY` with `ROW_NUMBER()` (Most Elegant)