I Finally Queried a PDF with SQL, and It Felt Like Magic

I used to look at our Google Cloud Storage buckets with a sense of frustration.

They were full of valuable data—scanned invoices from vendors, product images, customer support transcripts—all locked away in unstructured files like PDFs and JPEGs.

As a data engineer who thinks in SQL, this was the final frontier, a wall between my structured world in BigQuery and the messy reality of the business. I always thought analyzing that data would require some complex, bespoke Python pipeline.

I was wrong. The day I first ran a SQL query that joined a relational table with the contents of a PDF, it felt like I had unlocked a superpower.

The magic comes from combining three pieces of Google Cloud.

First, we use a BigQuery object table. This is a read-only “metadata table” in BigQuery that simply lists the files in a GCS bucket, one row per file. It doesn’t move or parse them; it just makes them addressable with SQL.

Second, we set up a Cloud Function that acts as a wrapper around a Google Cloud AI service, like Document AI for parsing forms and invoices. This function takes a file path as input, calls the AI service, and returns the extracted information as clean JSON.

The third and final piece is a BigQuery Remote Function, which exposes our Cloud Function as a function you can call from SQL. This is what connects the dots. Now, from within a dbt model, I can write SQL that looks something like this: SELECT my_dataset.parse_invoice(uri) FROM my_dataset.my_object_table. Here uri is the object table column that holds each file’s gs:// path.
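To make the Cloud Function side a little more concrete, here is a minimal Python sketch of the request handling, not our production code. BigQuery sends a remote function a JSON body with a "calls" array (one inner list of arguments per row) and expects a "replies" array of the same length back, or an "errorMessage" with a non-200 status on failure. Everything else here is an assumption for illustration: the names handle_remote_call and fake_parse_invoice are hypothetical, the Document AI call is replaced by a stub so the sketch runs anywhere, and the remote function is assumed to be declared as returning STRING.

```python
import json
from typing import Any, Callable, Dict, Tuple


def fake_parse_invoice(gcs_uri: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the real Document AI call.

    A real implementation would send the file at gcs_uri to a processor
    and map the response into these fields. The values here are placeholders.
    """
    return {
        "gcs_uri": gcs_uri,
        "invoice_id": None,
        "total_amount": None,
        "vendor_id": None,
    }


def handle_remote_call(
    body: Dict[str, Any],
    parse: Callable[[str], Dict[str, Any]] = fake_parse_invoice,
) -> Tuple[Dict[str, Any], int]:
    """Turn one BigQuery remote function request into a response.

    body["calls"] holds one argument list per input row; the reply list
    must have the same length and order.
    """
    try:
        # Each call is a list of arguments; ours has a single one, the file URI.
        replies = [json.dumps(parse(call[0])) for call in body["calls"]]
    except Exception as exc:
        # A non-200 status plus errorMessage tells BigQuery the batch failed.
        return {"errorMessage": str(exc)}, 400
    return {"replies": replies}, 200
```

In a deployed function, the HTTP entry point would read the request's JSON body, pass it to handle_remote_call, and return the resulting payload and status code. Keeping the parser injectable also makes the batching logic easy to test without touching any cloud service.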

This is exactly how we automated our accounts payable verification. A dbt model runs daily.

It queries the Object Table to find new invoice PDFs that have landed in the GCS bucket. For each new file, it calls the remote function, which sends the PDF to Document AI and gets back a JSON object with the invoice ID, total amount, and vendor ID.

The dbt model then uses BigQuery’s built-in JSON functions to parse these fields and joins them against our trusted dim_vendors table to check for discrepancies.
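The shape of that final check can be tried locally before any cloud setup. The sketch below is only an illustration under stated assumptions, not the actual dbt model: it uses Python's built-in sqlite3 as a stand-in for BigQuery, registers a tiny JSON_VALUE function that handles only simple '$.field' paths, and treats a "discrepancy" as an invoice whose vendor_id has no match in dim_vendors. The table name parsed_invoices and the function find_discrepancies are hypothetical.

```python
import json
import sqlite3
from typing import List, Optional, Tuple


def json_value(doc: str, path: str) -> Optional[str]:
    """Tiny stand-in for BigQuery's JSON_VALUE; supports only '$.field' paths."""
    field = path[2:] if path.startswith("$.") else path
    value = json.loads(doc).get(field)
    return None if value is None else str(value)


def find_discrepancies(
    parsed_rows: List[Tuple[str, str]],
    vendors: List[Tuple[str, str]],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (uri, invoice_id, vendor_id) for invoices with no known vendor.

    parsed_rows: (uri, json_string) pairs, as the remote function would return.
    vendors: (vendor_id, vendor_name) pairs standing in for dim_vendors.
    """
    conn = sqlite3.connect(":memory:")
    conn.create_function("JSON_VALUE", 2, json_value)
    conn.execute("CREATE TABLE parsed_invoices (uri TEXT, payload TEXT)")
    conn.execute("CREATE TABLE dim_vendors (vendor_id TEXT, vendor_name TEXT)")
    conn.executemany("INSERT INTO parsed_invoices VALUES (?, ?)", parsed_rows)
    conn.executemany("INSERT INTO dim_vendors VALUES (?, ?)", vendors)
    # Extract fields from the JSON payload, then anti-join against the vendors.
    query = """
        SELECT p.uri,
               JSON_VALUE(p.payload, '$.invoice_id') AS invoice_id,
               JSON_VALUE(p.payload, '$.vendor_id')  AS vendor_id
        FROM parsed_invoices AS p
        LEFT JOIN dim_vendors AS v
          ON v.vendor_id = JSON_VALUE(p.payload, '$.vendor_id')
        WHERE v.vendor_id IS NULL
        ORDER BY p.uri
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows
```

The SQL inside the string is the part that carries over in spirit: pull fields out of the JSON, left join to the trusted dimension, and keep the rows that fail to match. In BigQuery the JSON functions are built in, so the helper registered here would not be needed.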

Once the Cloud Function and the remote function are deployed, the entire workflow is managed in a single SQL file, orchestrated by dbt.

We didn’t just query a PDF; we brought an entire unstructured business process into our analytical world.