Optimize Google Cloud BigQuery and Control Cost

How Did It Happen?

I was developing a script to prepare data samples for customers that reached out for consultations. The samples have 100 rows in each file and they are split in 55 locales. My query looks like this

SELECT *
FROM `project.dataset.table`
WHERE
  ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
  AND locale = "US"
LIMIT 100;

The data was stored in "europe-west-4" and the pricing for querying is $6 per TB. So by running the script, I processed

500 TB of data in total
3 TB of data per country on average
$54 per data sample file on average

Very expensive.

The Script that Costed $3,000

The script was written in JavaScript modules.

bq-samples.mjs

import { BigQuery } from "@google-cloud/bigquery";
import { Parser } from "@json2csv/plainjs";
import makeDir from "make-dir";
import { write } from "./write.mjs";
import { locales } from "./locales.mjs";
import { perf } from "./performance.mjs";

const q = (locale, start, end, limit) => `SELECT *
  FROM \`project.dataset.table\`
  WHERE
    ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
    AND locale = "${locale}"
  LIMIT ${limit}`

async function main() {
  const timer = perf()

  const dir = await makeDir('samples')
  const bigquery = new BigQuery()
  const csvParser = new Parser({})

  try {
    const jobs = locales.map((locale) => async () => {
      // get query result from BigQuery
      const [job] = await bigquery.createQueryJob({
        query: q(locale, "2022-12-01", "2023-02-28", 100),
      })
      const [rows] = await job.getQueryResults()

      // parse rows into csv format
      const csv = parse(csvParser, rows)

      // write data into csv files and store in the file system
      await write(csv, dir, locale, "2022-12-01", "2023-02-28", 100)
    })

    await Promise.all(jobs.map((job) => job()))

    console.log(`✨ Done in ${timer.stop()} seconds.`)
  } catch (error) {
    console.error('❌ Failed to create sample file', error)
  }
}

await main()

It generates one sample file in CSV format per locale. The process is straightforward:

Querying the BigQuery table with a local, start date, end date, and limit.
Transforming the query result into CSV format.
Writing the CSV in the file system.
Repeating the process for all locales.

What's The Problem?

It turns out that I did several things wrong in my query. If you look at the pricing model again, you'll notice the cost is only related to how much data you process. So it's clear that my query looked up too much data to produce 100 rows. With this insight, let's optimize the query step by step.

Don't Select *

It's a little counter intuitive. Why does my select statement have anything to do with how much data it processes? Regardless of the columns I select, I should be reading from the same resources and data, right?

It's only true for row-oriented databases.

BigQuery is actually a columnar database. It's column-oriented, meaning the data are structured in columns. BigQuery uses Dremel as its underlying computing engine. When the data is moved from the cold storage to the active storage in Dremel, it stores the data in a tree structure. Each leaf node is a column-oriented "record" in Protobuf format.

In BigQuery, each node is a VM. A query execution propagates from the root server (node) through intermediate servers to the leaf servers to retrieve the selected columns.

We can modify the query to select individual columns:

SELECT
  session_info_1,
  session_info_2,
  session_info_3,
  user_info_1,
  user_info_2,
  user_info_3,
  query_info_1,
  query_info_2,
  query_info_3,
  impression_info_1,
  impression_info_2,
  impression_info_3,
  ts
FROM `project.dataset.table`
WHERE
  ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
  AND locale = "US"
LIMIT 100;

Just by selecting all the columns explicitly, I was able to reduce the processed data from 3.08 TB to 2.94TB. That's a 100 GB reduction.

Use Partitioned Table and Query Only Subsets of Data

Google Cloud recommends us to partition tables by date. It lets us query only a subset of data.

To optimize the query further, we can narrow down the date range in the where statement because the table is partitioned by the "ts" column.

SELECT
  session_info_1,
  session_info_2,
  session_info_3,
  user_info_1,
  user_info_2,
  user_info_3,
  query_info_1,
  query_info_2,
  query_info_3,
  impression_info_1,
  impression_info_2,
  impression_info_3,
  ts
FROM `project.dataset.table`
WHERE
  ts = TIMESTAMP("2022-12-01")
  AND locale = "US"
LIMIT 100;

I narrowed down the date range to one day instead of three months. I was able to cut down the processed data to 37.43 GB. It's just a fraction of the original query.

Use Materialized Query Results in Stages

Another way to reduce costs is to reduce the dataset you're querying from. BigQuery offers destination tables to store query results as smaller datasets. Destination tables come with two forms: temporary and permanent. Because temporary tables has a lifetime and it's not designed to be shared and queried, I created a permanent destination table to materialize the query result:

bq-samples.mjs

const dataset = bigquery.dataset('materialized_dataset')
const materialzedTable = dataset.table('materialized_table')

// ...

const [job] = await bigquery.createQueryJob({
  query: q(locale, '2022-12-01', '2023-02-28', 100),
  destination: materialzedTable,
})

The query results will be stored in the destination table. It'll serve as a reference for future queries. Whenever it's possible to query from the destination table, BigQuery will process the data from the table. It'll greatly reduce the data size we look up.

Final Thoughts

It's a very interesting study to reduce the cost in BigQuery. With only three easy steps:

Don't use *
Use Partitioned Table and Query Only Subsets of Data
Use Materialized Query Results in Stages

I was able to reduce the processed data size from 3 TB to 37.5 GB. It reduces the total cost significantly from $3,000 to $30.

If you're interested in learning more about the BigQuery architecture, here're the references that helped me:

You can read more about BigQuery cost optimizations in the Google Cloud documentation.

Special thanks to Abu Nashir for collaborating with me on the case study and providing valuable insights that helped me understand BigQuery's architecture.

References

Abu Nashir - LinkedIn
A Look at Dremel - Peter Goldsborough
BigQuery - Google Cloud
BigQuery explained: An overview of BigQuery's architecture - Google Cloud
Colossus under the hood: a peek into Google’s scalable storage system - Google Cloud
Column-oriented DBMS - Wikipedia
Comma-separated values - Wikipedia
Creating partitioned tables - Google Cloud
JavaScript modules - MDN
Jupiter evolving: Reflecting on Google’s data center network transformation - Google Cloud
Method: jobs.query - Google Cloud
Protocol_Buffers - Wikipedia
Run interactive and batch query jobs - Google Cloud
Using cached query results - Google Cloud
Writing query results - Google Cloud

Here you have it! Thanks for reading through 🙌 If you find this article useful, please share it to help more people in their engineering journey.

🐦 Feel free to connect with me on twitter!

Ready for the next article? 👉 How to Run A Tech Community in Your Company: An Ex-Principal Engineer’s Guide

Happy coding!

Daw-Chih Liou

Daw-Chih is a software engineer, UX advocate, and creator who is dedicated to Web engineering. His background in Human Centered Computing has led him to work with startups and public companies across North America, Asia, and Europe. He is passionate about meeting business trajectory with user journey and utilizing engineering architecture and performance monitoring to provide optimal user experience.