📦 Deduplicating Data in BigQuery

Duplicate data can sneak into your datasets in many ways: bad joins, ingestion issues, or data entry errors. In BigQuery, deduplicating data is a common and essential task, especially when working with large datasets where performance matters.

In this post, we will explore multiple methods to deduplicate data in BigQuery.


🚩 When Do You Need Deduplication?

Imagine a table orders_raw where each order might appear multiple times:

order_idcustomer_idorder_dateamount
10112025-07-01 10:00:0050
10112025-07-01 10:00:0050
10222025-07-01 11:00:0030

You want only one row per order_id, keeping the most recent or largest record (or just any if they’re identical).


🏗️ Method 1: Using a Subquery + ROW_NUMBER()

Before QUALIFY was introduced, you’d typically handle this with a subquery and window function.

WITH ranked_orders AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) AS rn
  FROM orders_raw
)
SELECT *
FROM ranked_orders
WHERE rn = 1;

Explanation:

  • Assigns a row number per order_id, ordered by order_date (most recent first).
  • Keeps only the top-ranked row.

Good for: compatibility across SQL engines, and when chaining multiple steps.


🔄 Method 2: Using GROUP BY with MAX() or ANY_VALUE()

When you’re summarizing data and only care about specific fields, GROUP BY with aggregates works well.

SELECT order_id,
       MAX(order_date) AS latest_order_date,
       ANY_VALUE(customer_id) AS customer_id,
       MAX(amount) AS max_amount
FROM orders_raw
GROUP BY order_id;
  • MAX(order_date): gets the most recent date.
  • ANY_VALUE(): returns any one of the values in the group (use with care).

Good for: fast aggregations, dashboards, or non-critical deduplication.


🧱 Method 3: Using ARRAY_AGG(... ORDER BY ...) with LIMIT 1

If you want to return the entire row of the most recent or largest entry:

SELECT AS VALUE
  ARRAY_AGG(o ORDER BY order_date DESC LIMIT 1)[OFFSET(0)]
FROM orders_raw o
GROUP BY order_id;

Explanation:

  • ARRAY_AGG(... LIMIT 1) gathers only the top row (by order date).
  • [OFFSET(0)] extracts the first and only row from the array.

Good for: keeping full rows, more flexible than GROUP BY.


✅ Method 4: Using QUALIFY with ROW_NUMBER() (Most Elegant)

BigQuery’s QUALIFY clause allows filtering after window functions—eliminating the need for subqueries.

SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) = 1;

Explanation:

  • ROW_NUMBER() assigns a rank per order_id.
  • QUALIFY filters to keep only the top-ranked row per group.

Pros:

  • Clean and expressive.
  • No CTE or subquery required.
  • Great for readable production SQL.

🔚 Summary: Which Should You Use?

MethodBest ForReadabilityPerformance
Subquery + ROW_NUMBERLegacy systems or complex transformationsMediumHigh
GROUP BY + aggregatesSimple summaries✅ High✅ High
ARRAY_AGG() + LIMITFull row retention with custom logicMediumMedium
QUALIFY + ROW_NUMBERClean, modern deduplication on BigQuery✅ High✅ High

🧼 Final Thoughts

Deduplicating data is a routine but vital task. BigQuery gives you a flexible toolbox, whether you prefer legacy compatible approaches or modern, concise solutions like QUALIFY.

If you’re using BigQuery, try refactoring old subquery based deduplication into QUALIFY for a cleaner, easier to maintain workflow.


💬 Got a preferred method or tricky deduping scenario? Drop a comment or share your query!