FinOps for BigQuery
As BigQuery usage surges, FinOps teams can struggle to keep costs under control. In this post, I walk through seven actionable levers, in fair detail, to help you stay in charge.

Recently, a few folks told me that their companies use BigQuery and that, as FinOps practitioners, they want to know how to manage its costs. Many of them have experienced a rapid increase in consumption and have also heard or seen some spooky stories about costs skyrocketing.
To help them, I decided to share my experience, as I have been working with BigQuery for a while. Seven actionable items came to mind that will help you build out your FinOps practice to embrace BigQuery:
- Understand consumption and pricing models
- Choose the right model and estimate your costs
- Set limits and quotas to prevent overconsumption
- Track cost evolution with data and visuals
- Create alerts and notifications to keep awareness
- Use caching and acceleration to reduce compute cost
- Educate your users about best design and usage practices
Let's walk through these.
Understanding Pricing
There is no substitute for understanding what you are doing.
Let me repeat that, because its power is often underestimated – just like costs. An interesting coincidence, by the way. So:
There is no substitute for understanding what you are doing.
If you are an electrician ⚡ your life depends on this. Even others' lives...
As a FinOps practitioner, you are not saving lives, but your knowledge does save cloud dollars. It is your role, even your identity, to have the best knowledge of consumption and pricing models, BigQuery included. This enables you to answer questions, explain concepts to others, educate on best practices, and position yourself as an expert.
BigQuery consumption comes in various shapes and sizes, and the pricing model that is best for you depends heavily on your usage patterns. You will need to understand at least these concepts and distinctions:
- The separation of compute and storage pricing, as well as other services, such as Omni, ML, BI Engine, and Streaming.
- For compute pricing, the distinction between on-demand and capacity-based pricing, and, for the latter, the different BigQuery Editions: Standard, Enterprise, and Enterprise Plus.
- For capacity-based consumption, the concept of slots and the distinction between reservations (of capacity) and commitments (for discounts) are key. Additionally, you need to understand baseline and autoscaling slots.
- For storage pricing, the difference between logical and physical storage, as well as active and long-term storage tiers.
If you already grasp all of the above and can explain it to others at any time, you are in good shape. Otherwise, invest heavily in reading the linked resources above.
Choosing the Best Consumption Model and Estimating Costs
To get to a pricing estimate, you first need to figure out which consumption model fits you best. For enterprise users with many projects, it will probably be a mix of different models based on the characteristics of the various use cases.
Choosing a Model
As a starting point, a high-level comparison of the on-demand and the capacity-based models is available in the workload management documentation pages. Regarding the capacity-based options (BigQuery Editions), it's important to understand how baseline and autoscale work together (doc links in the previous section). Some hints:
- An infrequent, spiky workload is usually best served by on-demand pricing (see the back-of-envelope sketch after this list). The unit cost of on-demand consumption will always be the highest. Still, this way you avoid paying continuously for a reservation (and commitment), which would otherwise result in a lot of unused capacity for this workload pattern.
- A steady, flat workload is best served by a capacity reservation with high baseline coverage and a corresponding commitment. When your workload runs consistently with only slight fluctuation, you can reserve capacity and commit for one or three years, provided it is not expected to drop or change its pattern significantly in that time frame. This brings your unit prices down. With a well-set baseline, you will not experience much unused capacity, nor a thick slice of high-unit-cost consumption.
- A wavy workload will benefit from using autoscaling, most often combined with a baseline capacity reservation and commitment. Cloud workloads reflect the cyclical demand patterns of our businesses, with highs and lows throughout the day (see the callout below for the reasons). Covering the bulk of the traffic with a baseline and allowing autoscale to add capacity during peak times is often the sweet spot. The best way to determine the optimal split is to conduct your own calculations and experiments.
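To make the trade-off tangible, here is a back-of-envelope comparison you could even run as a BigQuery query. The prices are illustrative assumptions only (roughly the US list prices at the time of writing), so always check the current pricing page for your region and Edition.

```sql
-- Back-of-envelope sketch with assumed prices – verify against current pricing.
SELECT
  500 * 6.25        AS on_demand_usd_per_month,   -- ~500 TiB scanned at ~$6.25/TiB
  100 * 730 * 0.06  AS enterprise_slots_usd;      -- 100 baseline slots, 730 h, ~$0.06/slot-hour
```

Which side comes out cheaper depends entirely on how much you scan and how busy your reserved slots would actually be, which is exactly why your own numbers and experiments matter.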
Estimating Your Costs
Once you have understood your needs and chosen the matching model(s), you are ready to estimate your expected cost using the pricing calculator. When you select the BigQuery service in the calculator, there are 16 input parameters that you can tweak to obtain your estimate. These are the key factors (some of them apply only to specific service types or Editions):
- Your service type will define the units you are billed on, e.g., per query for on-demand, and per slots (capacity) for Editions.
- Your region and location will define your unit prices.
- Your Edition will define capacity constraints and features, e.g., the maximum reservation size or the maximum number of reservations in an admin project.
- Your maximum slots and baseline slots (the latter not available in the Standard Edition!), along with autoscale utilisation, will define the volume of your compute consumption.
- Your commitment duration and size (slots committed) will define your discount levels (generally 10% for 1 year and 20% for 3 years).
- For storage, your expected volumes are calculated separately for the combinations of active/long-term and logical/physical options, as well as the write API.
- Finally, streaming insert and read volumes round out the calculator.
Considerations for Enterprise Use
In a larger company, you will also need to consider a centralized, shared commitment model while managing capacity reservations in a distributed manner. This approach helps you achieve a global optimum for the compute unit price. There is also a Slot Estimator to help you find the optimal commitment level. Earlier, I called this the "Sloth Estimator" as it was loading for hours, but this issue has been fixed now. Use it with vigilance and always contrast it with your own analysis to avoid an irreversible overcommitment.
A brief alert: using multiple distributed commitments carries an increased risk of unused commitments, because excess capacity cannot be shared across separate commitments and therefore has a lower chance of being utilised. Also note that both reservations and commitments are location-specific (region or multi-region), so make sure to group your users, projects, and workloads by location in the forecasts that inform your decisions.
Limits and Quotas
When creating a Google Cloud project to onboard your users and workloads to BigQuery, the first thing you want to do – immediately with the project setup – is to configure limits and quotas for consumption. By default, many parameter values are set to 'Unlimited'. This means that your consumption will scale 'infinitely' up to the underlying capacity limits. With massively parallel scaling, this can be dangerous, and an intentional review and configuration of limits and quotas is mandatory.
There are numerous quotas and limits, but the ones you want to focus on first are:
- QueryUsagePerDay to control project-level aggregated usage of all users
- QueryUsagePerUserPerDay to control the usage of each user (or service account) separately within a project (it applies the same limit to every user; no per-user distinction is possible within a project)
To set these parameters, you will need the roles/servicemanagement.quotaAdmin IAM role on the project. If you do not have it, request it from your administrator (often your security team or the project owner).
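To keep an eye on how close users are getting to these daily limits, a query against INFORMATION_SCHEMA.JOBS can help. Below is a minimal sketch; the region-us qualifier is an assumption, so adjust it to your own location.

```sql
-- Sketch: daily scanned (billed) volume per user over the last 7 days.
SELECT
  DATE(creation_time)                              AS usage_date,
  user_email,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY usage_date, user_email
ORDER BY usage_date DESC, tib_billed DESC;
```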
A word of warning. While quotas and limits can prevent cost incidents, they can equally trigger operational incidents if you are not careful. Imagine losing the orders of your favourite customers and the accompanying revenue for your company, because a production system just ran out of its quota or crossed its limit.
What a scene! On-call engineers are triggered in the middle of the night, and you find yourself standing in the hot-spot-light on the next day's postmortem with the unquestionable root cause "in your hand". Even if postmortems are blameless, it is not the most fun thing to stand there dressed in nothing but confidence.
Cost Visibility
Once your users have started consuming BigQuery, you want to watch how their usage and costs evolve. You have various options: (1) the ready-made Cloud Billing Reports, (2) the Billing Data Exports, and (3) adding Technical Data Sources for more granularity. Let's see them one by one.
The Console Billing Reports – Readily Usable
Out of the box, you can get the Cloud Billing Reports, which allow you to track high-level usage across several key dimensions: project, service, SKU, location (region and zone), labels, and organizations/folders. This is a great starting point. Familiarize yourself with the Services and SKUs of BigQuery to understand usage and cost over time.
The Billing Export – Raw Data for More Flexibility
If you seek more flexibility, the Billing Export will provide you with the underlying information of the Billing Console as a raw data table. By setting up the Billing Export, you will get your billing data in BigQuery, providing ultimate flexibility in querying, analyzing, and visualizing your data. Watch out: there are two different usage cost tables available, and the one you need for BigQuery is the detailed one, as opposed to the standard.
Once you have set up the export, you can use BigQuery in three different ways to interact with your data at the query (SQL) level (I will describe two more analytical and visual options in a subsequent section).
- Web browser: type and run SQL queries and get the results in the console
- Command line: use the bq CLI tool from your favorite shell
- Code: use client libraries (C#, Go, Java, Node.js, PHP, Python, Ruby)
Bonuses: The Billing Report has a nice feature. Once you have clicked around the filters and arrived at your desired billing report view, you can get the underlying SQL query by clicking the Generate Query button. This jumps from the Billing Report straight into BigQuery's SQL editor with the SQL statement pre-created for you. You can also use the example queries in the documentation for inspiration, or, if you have already onboarded the FOCUS™ format, use the corresponding view. Enjoy!
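For illustration, here is a minimal sketch of a query against the detailed export; the table name is a placeholder for your own export table.

```sql
-- Sketch: BigQuery cost per project and SKU over the last 30 days.
SELECT
  project.id          AS project_id,
  sku.description     AS sku_description,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-admin-project.billing.gcp_billing_export_resource_v1_XXXXXX`  -- placeholder
WHERE service.description = 'BigQuery'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY project_id, sku_description
ORDER BY total_cost DESC;
```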
Technical Data Sources – For More Granularity
After looking into your Billing Export, you will quickly realize that it is still not granular enough. It lets you dig down to dataset-level cost, but only for storage, and it misses many table- and job-level details.
To augment your billing data, there are two primary technical data sources in BigQuery: the INFORMATION_SCHEMA and the Audit Logs (also see Cloud Audit Logs in general). These provide more granular information that can be used to apportion your cost from the billing data for jobs (compute) and tables (storage).
The INFORMATION_SCHEMA
The most valuable parts of the INFORMATION_SCHEMA from a FinOps perspective are the INFORMATION_SCHEMA.JOBS view, which describes jobs, and the INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE view, which describes storage.
The Audit Logs
Audit Logs can also be used as an alternative source for detailed apportionment. When a job (query, load, export, copy, ML) is completed or cancelled, a JobCompletedEvent is emitted. This, along with the AuthenticationInfo in the Audit Logs, provides all the details required to apportion the cost.
The two sources can also be linked together by the job identifier.
With these additional sources, you can easily identify, for instance, which are the top-consuming projects, jobs, or users (whether human or a service account). Additionally, you can get a detailed breakdown of storage costs at the table level, including the costs of physical and logical storage, as well as the costs of active and long-term storage tiers.
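As a storage-side illustration, the related INFORMATION_SCHEMA.TABLE_STORAGE view (a current snapshot rather than a timeline) gives you a per-table breakdown along exactly these dimensions. A sketch, again assuming the region-us qualifier:

```sql
-- Sketch: top 20 tables by active logical storage, logical vs. physical side by side.
SELECT
  table_schema                                      AS dataset,
  table_name,
  ROUND(active_logical_bytes     / POW(1024, 3), 1) AS active_logical_gib,
  ROUND(long_term_logical_bytes  / POW(1024, 3), 1) AS long_term_logical_gib,
  ROUND(active_physical_bytes    / POW(1024, 3), 1) AS active_physical_gib,
  ROUND(long_term_physical_bytes / POW(1024, 3), 1) AS long_term_physical_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY active_logical_bytes DESC
LIMIT 20;
```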
Analysing and Visualizing Your Cost and Usage Data
To analyse and visualise your cost and usage data, your easy options in the Google ecosystem are Google Sheets and Looker Studio.
Google Sheets
Google Sheets has a feature called Connected Sheets, which allows you to connect to your underlying BigQuery tables by defining a SQL statement for the connection. It also accepts input parameters referenced from cells in your sheets. This allows you to set the query filters (e.g., WHERE conditions), such as start date, end date, or project ID, dynamically. Once the connection is established, you are ready to analyse your data with pivot tables, charts, and more.
Connected Sheets are mighty, but I'd like to share two caveats. One, do not forget that you are paying for the underlying BigQuery queries. Cha-ching! Two, the size of the data that you can fetch is limited, so aggregation and filtering are your best friends when defining your SQL statement for the connection.
Looker Studio
Looker Studio (a distinct product from Looker) provides you with a simplified BI platform experience. Easy to use, but also limited – it is still sufficient for most simple business use cases. It provides semantic-layer-like features: you can connect, mix, and model your data, do aggregations, and create metrics. Then, on top of that, you can build reports with charts, organised into dashboards with a menu system.
Alerting and Notifications
For alerting and notifications, there are two built-in features worth mentioning: budget alerts and cost anomaly alerts. These are not BigQuery-specific features but generic ones that apply to all services consumed in your projects, so they are just as useful for BigQuery.
Budget alerts and cost anomaly alerts are distinct concepts. Both are valuable and related, but they aim to solve different problems.
Budget Alerts
Budget Alerts are based on the premise that there is a configured expectation of how much you plan to spend in a specific future period. If you are about to exceed this amount, or have surpassed it already, it will alert you. It supports email notifications for humans and Pub/Sub notifications for programmatic handling.
Cost Anomalies
Cost Anomalies are designed to detect spikes or deviations from the typical consumption patterns observed in the past. Upon detection, similarly to Budget Alerts, email or Pub/Sub notifications can be sent to human or programmatic recipients.
Rolling Your Own
If you have your own data, you can use it to set up your custom-fitted alerting and notification. As a starting point, even with the standard usage cost table data, you can alert on the project level using a straightforward algorithm, such as comparing the latest period with the previous similar period. Since weekly and daily cyclicality is common, you can compare, for instance, the latest available day's data with the same day of the week last week. Not a fancy machine learning algo, but a very pragmatic one and fast to implement.
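As a sketch of that pragmatic approach, a query along these lines flags projects whose latest reasonably complete day is well above the same weekday last week; the export table name is a placeholder and the thresholds are assumptions.

```sql
-- Sketch: compare the latest (reasonably complete) day with the same weekday last week.
WITH daily AS (
  SELECT
    DATE(usage_start_time) AS usage_date,
    project.id             AS project_id,
    SUM(cost)              AS cost
  FROM `my-admin-project.billing.gcp_billing_export_v1_XXXXXX`  -- placeholder
  GROUP BY usage_date, project_id
)
SELECT
  t.project_id,
  t.cost                               AS latest_day_cost,
  w.cost                               AS same_day_last_week_cost,
  SAFE_DIVIDE(t.cost - w.cost, w.cost) AS relative_change
FROM daily AS t
JOIN daily AS w
  ON  w.project_id = t.project_id
  AND w.usage_date = DATE_SUB(t.usage_date, INTERVAL 7 DAY)
WHERE t.usage_date = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)  -- allow for export delay
  AND SAFE_DIVIDE(t.cost - w.cost, w.cost) > 0.3               -- alert above +30%
ORDER BY relative_change DESC;
```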
With the additional technical data sources, you can augment your alerts and notifications. For example, you can alert individuals about their large queries or send top consumer or query lists to teams and managers to highlight impactful opportunities. To implement these, you will need some engineering skills, but in exchange, they can be fully customized and sent to any channel, whether it is the company chat or the incident management tooling.
Anyhow, we need to remember that alerting is inherently reactive. What's more, all three of the above options share the same limitation: they are all based on billing data that takes one to two days to become complete. This means that your alerts will be equally delayed, and by the time you receive them, you will have already spent the money. Keep this in mind, and make sure that preventive measures are in place with (1) quotas and limits and (2) user education.
BigQuery Caching and BI Engine
Here's my homespun definition of caching:
Caching is remembering the results instead of recalculating them.
When you learned the multiplication table (times table) in school, you were caching. Now, when you need to know what 8×8 is, you retrieve it from memory; you do not need to rack your brain or pull out a piece of paper to calculate it manually.
You are actually saving the cost of computation.
The same applies to BigQuery. When you have a notorious BIG query, say, scanning hundreds of millions of rows to find the handful of rows that match your criteria, or summing up values across a large number of dimensions to produce a single figure, it is better to store the result, so the next time someone asks the same question, you can pull the answer "out of your pocket".
But caching has its own cost, too. So what are the best use cases for it? Queries that require a lot of computation AND run frequently. If you have a dashboard whose front-page report is opened by hundreds of users daily, you can save a significant amount of computation and money: the first user's request triggers the calculations, BI Engine caches the results, and subsequent users enjoy a fast and cheap experience.
A word of warning. What's the main challenge with caching? Knowing when your cached data has become stale and needs a refresh. To solve this, systems apply dedicated logic to invalidate caches (based on events or conditions) or expire them (time-based configuration). In BigQuery and BI Engine, this is handled automatically.
BigQuery Result Cache
BigQuery comes equipped with a 24-hour query result cache at no extra cost. It stores the result of each query run, and if the very same query (the SQL statement matching character for character) is repeated, the result is served from the cache. When the underlying data changes, the cache is automatically invalidated. One of its limitations is that results are only reused across users within the same project and location.
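If you are curious how much the result cache actually helps your users, the cache_hit flag in INFORMATION_SCHEMA.JOBS gives you a quick read. A sketch, with the region-us qualifier again as an assumption:

```sql
-- Sketch: daily result-cache hit ratio in the project over the last 7 days.
SELECT
  DATE(creation_time)                                  AS usage_date,
  COUNTIF(cache_hit)                                   AS cached_queries,
  COUNT(*)                                             AS all_queries,
  ROUND(SAFE_DIVIDE(COUNTIF(cache_hit), COUNT(*)), 2)  AS cache_hit_ratio
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY usage_date
ORDER BY usage_date DESC;
```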
BI Engine
BI Engine is a more capable, paid, in-memory analysis service that accelerates SQL queries for Business Intelligence (BI) workloads, primarily reports and dashboards. Applied on top of your dataset, it can save money and improve performance, as it has more advanced features than BigQuery's built-in result cache:
- It stores parts of the underlying data (table, partition, column segments), not just query results.
- It works more flexibly with queries, allowing for some query parameters to change, such as time range or other filters.
- It refreshes data incrementally, so only those parts of the data get updated that have changed.
- It benefits all users in the same project and region.
You consume BI Engine by creating a BI Engine capacity reservation, which is attached to a project and a region. You can optionally specify the preferred tables to use for acceleration. For frequently visited dashboards, it will likely pay off quickly.
Educate Users about Efficient Usage
How much you pay for BigQuery will largely depend on how your users use it: specifically, (1) how well their data storage is designed, (2) how efficient their queries are, and (3) how well their configurations match business needs.
Equipping users with this knowledge before they build or use anything will save a lot of money and headaches. Consider creating a gatekeeping mechanism to make sure people get trained before they get their BigQuery access. Here are the most impactful topics I suggest covering (different personas can be selectively educated on these aspects):
Data Storage Design
- Native BigQuery storage, rather than federated external tables, is the best choice for active data, for both performance and cost predictability.
- Denormalised schemas with nested or repeated fields (big tables, if you will) are preferred over star-schema and normal forms (*NF). This is because joins can become expensive due to the distributed storage and compute design.
- Partitioning large time-series tables along the time dimension will dramatically reduce data scans, as the partitions that are not part of the query criteria will be skipped when reading (this is also called pruning). Sharding (date-named tables) is a bad alternative and should be avoided.
- Clustering data within partitions on the most frequently filtered and/or joined columns will massively reduce reads inside the partition (see the DDL sketch after this list).
- Partition/Cluster Recommender helps you spot missed opportunities.
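To make the partitioning and clustering bullets concrete, here is a minimal DDL sketch; the project, table, and column names are hypothetical.

```sql
-- Hypothetical events table: partitioned on time, clustered on the most
-- frequently filtered and joined columns.
CREATE TABLE `my-project.analytics.events`
(
  event_ts    TIMESTAMP,
  customer_id STRING,
  event_type  STRING,
  payload     JSON
)
PARTITION BY DATE(event_ts)          -- enables pruning on time filters
CLUSTER BY customer_id, event_type;  -- reduces reads inside each partition
```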
For a more comprehensive overview, take a look at the official docs for storage best practices.
Querying
- Selecting only the required columns and not using a wildcard (SELECT *) ensures that less data is read and processed (see the example query after this list).
- Filtering by the column used for partitioning ensures that partitions not matching the filter criteria will be skipped (pruned).
- Filtering early, before you join or unnest columns, will ensure that only the required data will be passed on to the following processing stages.
- Joining small tables to the right allows broadcast joins (sending the small table to processing slots that process the larger table), which is more efficient than shuffling.
- Materialized views (as opposed to logical views) precompute and periodically store the view, making them a perfect fit for frequent aggregations, filters, and joins.
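Pulling several of these tips together, an efficient query against the hypothetical table from the previous section could look like this:

```sql
-- Explicit columns, a filter on the partitioning column (pruning), and early
-- filtering before aggregation. Names are hypothetical.
SELECT
  customer_id,
  COUNT(*) AS purchases
FROM `my-project.analytics.events`
WHERE event_ts >= TIMESTAMP '2024-01-01'
  AND event_ts <  TIMESTAMP '2024-02-01'  -- prunes partitions outside January
  AND event_type = 'purchase'             -- filters early, before aggregation
GROUP BY customer_id
ORDER BY purchases DESC;
```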
For a more complete overview, check out the full official docs for compute performance best practices.
Configuration
- The Time travel feature enables you to access changed or deleted data within a predefined time window. By default, this feature is on with a 7-day retention and is charged for when you are using physical storage.
- The Partition expiration configuration ensures that obsolete partitions are automatically deleted after expiration.
- You can require a partition filter for all queries of partitioned tables to ensure that every query includes the partition column in its filter condition, thereby enabling pruning (the sketch after this list shows how).
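The same knobs in DDL form, continuing with the hypothetical names from the storage section; treat it as a sketch and adapt the values to your own retention needs.

```sql
-- Sketch: expire old partitions, force partition filters, and shrink time travel
-- (the time travel window matters for cost when you are billed on physical storage).
ALTER TABLE `my-project.analytics.events`
SET OPTIONS (
  partition_expiration_days = 90,    -- drop partitions older than ~90 days
  require_partition_filter  = TRUE   -- queries must filter on the partition column
);

ALTER SCHEMA `my-project.analytics`
SET OPTIONS (max_time_travel_hours = 48);  -- down from the 168-hour (7-day) default
```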
One more idea. Do you have the support of an account manager or account team from your provider? If so, ask your Technical Account Manager (TAM) to invite experts and share best practices in live sessions with your users, allowing them to ask questions. It's a great opportunity: hearing these straight from the source boosts user interest and establishes authority.
As a closure, if you want to know even more about optimization and BigQuery overall, consider these complementary readings:
- Official Docs
- PDF whitepapers:
- The complete BigQuery Explained blog series:
- An overview of BigQuery's architecture
- Storage overview, and how to partition and cluster your data for optimal performance
- How to ingest data into BigQuery so you can analyze it
- How to query your data
- Working with joins, nested & repeated data
- How to run data manipulation statements to add, modify and delete data stored in BigQuery
Wrapping Up
That's a wrap! I hope you learned something and enjoyed it.
The pleasure is all mine!