Beefing Up BigQuery Cost Data with INFORMATION_SCHEMA and Audit Logs

To deepen your BigQuery cost insights beyond raw billing data, this article shows you how to break down usage by jobs and tables using INFORMATION_SCHEMA and Audit Logs. It highlights the most valuable parts of INFORMATION_SCHEMA.JOBS and TABLE_STORAGE_USAGE_TIMELINE, and also compares the approach to Audit Logs.

To augment your billing data with more details, there are two primary technical data sources in BigQuery: the INFORMATION_SCHEMA and the Audit Logs. Compared to the Billing Exports, these provide more granular information that can be used to apportion costs at the job and table levels.

Using the INFORMATION_SCHEMA

Before we dive in, let's clarify what the INFORMATION_SCHEMA is. It is a set of read-only system views that let you query valuable metadata about jobs (queries), tables, datasets, and more. It is very convenient to work with, as you can query and analyze the data with plain SQL, as opposed to more complicated API calls.

The most valuable parts of the INFORMATION_SCHEMA from a FinOps perspective are found in these two views:

  1. For jobs/queries, which drive most of your compute consumption, the INFORMATION_SCHEMA.JOBS view provides detailed insight.
  2. For tables, which drive your storage consumption, the INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE view provides daily usage details over time.

The INFORMATION_SCHEMA.JOBS View

The INFORMATION_SCHEMA.JOBS view, also known as the INFORMATION_SCHEMA.JOBS_BY_PROJECT view, provides the data to break down the cost by job/query, identified by job_id, and it offers various facts and dimensions useful for usage and cost apportionment.

The usage facts for apportionment are:

  • total_bytes_billed for on-demand consumption, which is based on scanned bytes, with a minimum of 10 MB billed per query even if less data was scanned.
  • total_slot_ms for capacity-based consumption, which is based on slot usage measured in time (milliseconds). When estimating costs, reservations and commitments also need to be considered: the baseline must be paid even if it is not used, and the unit price depends on the level of commitment.

Then, to slice and dice the above consumption facts, the most crucial usage dimensions are:

  • projects with project_id,
  • reservations with reservation_id,
  • user or service account with user_email,
  • timeline with start_time and end_time.

If this still does not suffice, even more details are available for more technically inspired analyses:

  • To distinguish job and statement types, there are job_type and statement_type, respectively.
  • To analyze cache hits (e.g., the cache-hit ratio across all jobs), you can use the boolean cache_hit.
  • To understand more about errors, you can check out error_result, which returns null on success or more details about the error in JSON on error.
  • To relate jobs to tables, use referenced_tables or destination_table.
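
Putting these facts and dimensions together, a minimal job-level breakdown could look like the sketch below. The US multi-region qualifier and the January 2024 time window are assumptions; adjust them to your own location and billing period.

    SELECT
      project_id,
      reservation_id,
      user_email,
      SUM(total_bytes_billed) AS bytes_billed,  -- on-demand fact
      SUM(total_slot_ms)      AS slot_ms        -- capacity-based fact
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    -- Filtering on creation_time (the view's partitioning column) limits the scan.
    WHERE creation_time >= TIMESTAMP('2024-01-01')
      AND creation_time <  TIMESTAMP('2024-02-01')
    GROUP BY project_id, reservation_id, user_email
    ORDER BY bytes_billed DESC;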

The TABLE_STORAGE_USAGE_TIMELINE View

The INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE view (different from the INFORMATION_SCHEMA.TABLE_STORAGE view!) provides daily granularity data over time to break down storage costs at the table level, identified by the TABLE_NAME.

There are various storage usage facts, which are measured in MiB, and they all start with the BILLABLE_ prefix by convention:

  • For logical storage (total, active, long-term):
    • BILLABLE_TOTAL_LOGICAL_USAGE
    • BILLABLE_ACTIVE_LOGICAL_USAGE
    • BILLABLE_LONG_TERM_LOGICAL_USAGE
  • For physical storage (total, active, long-term):
    • BILLABLE_TOTAL_PHYSICAL_USAGE
    • BILLABLE_ACTIVE_PHYSICAL_USAGE
    • BILLABLE_LONG_TERM_PHYSICAL_USAGE

Then the key dimensions for grouping and aggregating your consumption are:

  • projects with PROJECT_ID,
  • datasets with TABLE_SCHEMA (practically containing the dataset ID),
  • timeline with USAGE_DATE (as this view provides daily granularity data).
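
A similar sketch for a dataset-level storage breakdown could look like the query below; again, the region qualifier and the date range are assumptions to adapt to your setup.

    SELECT
      PROJECT_ID,
      TABLE_SCHEMA AS dataset_id,
      USAGE_DATE,
      SUM(BILLABLE_TOTAL_LOGICAL_USAGE)     AS total_logical_mib,
      SUM(BILLABLE_ACTIVE_LOGICAL_USAGE)    AS active_logical_mib,
      SUM(BILLABLE_LONG_TERM_LOGICAL_USAGE) AS long_term_logical_mib
    FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE
    WHERE USAGE_DATE >= DATE '2024-01-01'
      AND USAGE_DATE <  DATE '2024-02-01'
    GROUP BY PROJECT_ID, dataset_id, USAGE_DATE;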

Tips & Tricks, Plus Some Limitations to Watch Out For

For enterprise-scale users with multiple projects and regions, additional work is required to obtain a complete view of a payer account. There is no consolidated view that gives you the full picture out of the box, so you will need to consider a billing period, and then:

  1. Identify all project IDs from the billing export.
  2. Iterate through all projects AND regions to collect data from the INFORMATION_SCHEMA.
  3. Finally, consolidate the results.
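
To illustrate step 2, here is a per-project, per-region query template; my-project-a, the region, and the time window are placeholders to iterate over. Since results from different regions generally cannot be combined in a single query, materialize the per-region outputs first and consolidate them afterwards (step 3).

    -- Template for one (project, region) pair; run it for every combination found
    -- in the billing export, then consolidate the materialized results.
    SELECT
      'my-project-a' AS source_project,  -- placeholder project ID
      'us'           AS source_region,   -- placeholder region label
      job_id,
      user_email,
      total_bytes_billed,
      total_slot_ms,
      creation_time
    FROM `my-project-a`.`region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time >= TIMESTAMP('2024-01-01')
      AND creation_time <  TIMESTAMP('2024-02-01');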

To accomplish this, you will need the necessary permissions to query this data. Otherwise, your query will return an incomplete result or an error.

Job and table consumption is measured in various units and scales (e.g., bytes vs. MiB), so I recommend converting them to a common scale. This also makes it easier to match the pricing sheet and pricing calculator units, where the preferred scales are TiB for on-demand consumption (data scanned) and storage, and slot-hours for capacity-based compute consumption. Here is how you can do the conversions in SQL:

  • Bytes to TiB for on-demand consumption: total_bytes_billed / POW(1024, 4).
  • MiB to TiB for storage consumption: BILLABLE_*_USAGE / POW(1024, 2).
  • Milliseconds to hours for capacity-based consumption: SAFE_DIVIDE(total_slot_ms, 1000*60*60).
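
Putting the conversions together, a pricing-sheet-friendly aggregation might look like this sketch (the region qualifier and the time window are, again, assumptions):

    SELECT
      project_id,
      SUM(total_bytes_billed) / POW(1024, 4)          AS on_demand_tib,  -- bytes -> TiB
      SAFE_DIVIDE(SUM(total_slot_ms), 1000 * 60 * 60) AS slot_hours      -- ms -> hours
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time >= TIMESTAMP('2024-01-01')
      AND creation_time <  TIMESTAMP('2024-02-01')
    GROUP BY project_id;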

Scripted queries appear both as a parent job with aggregated statistics and as child jobs carrying the split statistics. To avoid double-counting, exclude the aggregate by filtering out the SCRIPT statement type, i.e., add the condition WHERE statement_type != 'SCRIPT', as in the short example below.
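
For example, a slot-hours total without double-counting might look like this (same assumptions as above):

    SELECT
      SAFE_DIVIDE(SUM(total_slot_ms), 1000 * 60 * 60) AS slot_hours
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time >= TIMESTAMP('2024-01-01')
      AND creation_time <  TIMESTAMP('2024-02-01')
      -- Drop parent script jobs to avoid double-counting; note that this condition may
      -- also exclude jobs with a NULL statement_type, so widen it if you need those.
      AND statement_type != 'SCRIPT';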

The provider uses Pacific Time (PST/PDT) to delimit billing months and days, and daylight saving time also applies. Make sure you account for this when working with UTC (Coordinated Universal Time) or your local time zone. The views use the TIMESTAMP data type, which represents an absolute point in time; to interpret it in a specific time zone, BigQuery provides convenient functions, so the conversion to the "America/Los_Angeles" time zone is easy with one of these:

  • DATE(your_timestamp, "America/Los_Angeles") for a DATE in PST/PDT, and
  • DATETIME(your_timestamp, "America/Los_Angeles") for a DATETIME in PST/PDT.
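
For instance, here is a sketch that groups on-demand usage by the provider's billing day (using creation_time here is one reasonable choice; the region and time window are assumptions):

    SELECT
      DATE(creation_time, "America/Los_Angeles") AS billing_day,   -- PST/PDT billing day
      SUM(total_bytes_billed) / POW(1024, 4)     AS on_demand_tib
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time >= TIMESTAMP('2024-01-01')
      AND creation_time <  TIMESTAMP('2024-02-01')
    GROUP BY billing_day
    ORDER BY billing_day;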

There are also limitations for RLS, ML, and Spark.

For queries over tables with RLS (Row Level Security), BigQuery hides specific statistics, including billable usage.

For BigQuery ML (Machine Learning), on-demand pricing depends on the type of model created, which is not tracked in INFORMATION_SCHEMA.JOBS; the view instead assumes the higher-billed model types.

For Apache Spark, SKU-level reporting is limited, which makes its usage indistinguishable from the BigQuery Enterprise Edition pay-as-you-go SKU.

If any of these limitations affect you, read the official docs for more details.

Using Audit Logs

While I prefer using the INFORMATION_SCHEMA for billing data enrichment, Audit Logs can also be used as an alternative for job/query cost apportionment.

BigQuery provides several types of logs, of which AuditData is the most interesting for FinOps. Other types include BigQueryAuditMetadata and AuditLog; however, those are less relevant for our purposes and are out of scope for this discussion.

When a job – such as query, load, export, copy, or ML – is done (or cancelled), a JobCompletedEvent log is emitted. This event, along with the AuthenticationInfo in the Audit Logs, provides the information required to apportion the cost of jobs. It can also be linked to the INFORMATION_SCHEMA with the matching job identifier.

Here are the most valuable parts from the Audit Log's hierarchical JSON schema:

  • AuthenticationInfo (full official docs for AuthenticationInfo here)
    • principalEmail – the user or service account executing the job
  • JobCompletedEvent (full official docs for BigQuery AuditData here)
    • Job
      • JobName
          • JobId – the job identifier, which can be linked to the job_id in INFORMATION_SCHEMA.JOBS
        • ProjectId – the project identifier
        • location – the region/zone
      • JobStatistics
        • Usage facts:
          • totalBilledBytes – can be used to apportion on-demand spend per job or other dimensions based on data scanned, when multiplied by the cost/unit
          • totalSlotMs – can be used to apportion spend across jobs or other dimensions for capacity-based usage (BigQuery Editions), based on slot usage time, when multiplied by cost/unit (do not forget to account for reservations and commitments, where the baseline needs to be paid even if not used, and the level of commitments determines the unit price)
        • Useful dimensions:
          • reservation – the reservation
          • startTime and endTime – the usage timeline
          • referencedTables and referencedViews – the linked tables/views
        • Useful in calculating unit metrics:
          • queryOutputRowCount – for cost per output row
          • totalLoadOutputBytes – for cost per output byte
      • JobConfiguration
        • It holds job-type-specific low-level details in various objects:
          query, load, extract, tableCopy
      • JobStatus
        • state – holds the state of the job, either PENDING, RUNNING, or DONE
        • error and additionalErrors – if the job did not complete successfully, it describes the error(s)

Regarding the technical implementation, I do not have enough hands-on experience to describe precisely how it is done. Based on the high-level research I have done, the steps involve (1) enabling logging for the desired scope, (2) exporting the logs to BigQuery by creating a sink with the Logs Router, and (3) querying the logs via SQL and building reusable views.
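
If you go this route, step (3) could look roughly like the sketch below. It assumes the legacy AuditData logs were routed into a BigQuery dataset by a Logs Router sink; the project, dataset, and table names are placeholders, and the exact table layout and field paths depend on your sink configuration, so verify them against your own export before relying on this.

    -- Rough sketch only: the table name and field paths are assumptions based on a
    -- typical AuditData export; adjust them to your own sink.
    SELECT
      protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
      COUNT(DISTINCT protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobName.jobId) AS jobs,
      SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes)
        / POW(1024, 4) AS on_demand_tib,
      SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalSlotMs)
        / (1000 * 60 * 60) AS slot_hours
    FROM `my-project.my_audit_dataset.cloudaudit_googleapis_com_data_access_*`
    GROUP BY principal_email;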

However, as I prefer the INFORMATION_SCHEMA, I will let you do your own research, should you choose this path. If you do so, the official Cloud Audit Logs and BigQuery AuditData documentation will be a good companion for your journey.

Now, let's explore why I prefer the other path.

Why I Like the INFORMATION_SCHEMA Better

There are many reasons why I think the INFORMATION_SCHEMA is a better choice, and the most important ones are:

  • It contains more information compared to Audit Logs, especially concerning table (storage) data.
  • It was designed to serve as a metadata repository, as opposed to Audit Logs, which, as the name suggests, are more event-focused and were primarily intended for usage audit purposes.
  • Its storage does not incur additional costs, unlike Audit Logs, whose storage you need to pay for if you export them to BigQuery. Analyses (compute) will still cost money in both cases.
  • It can be queried with simple SQL right out of the box in a tabular format, unlike Audit Logs, for which the native hierarchical structure requires more thinking and unnesting.

So, that's what I wanted to share about enriching your billing data for detailed BigQuery insights and cost and usage apportionment.

I hope you learned something and enjoyed it. It was my pleasure!

If you haven't done it yet, check out my related writing about FinOps for BigQuery:

FinOps for BigQuery
As BigQuery usage surges, FinOps teams are facing a challenge to keep things under control. In this post, I walk you through seven actionable levers in fair detail to help you gain complete control.