Page MenuHomePhabricator

Load Google Search Console data into the Data Lake
Open, HighPublic

Description

Since early 2025, as undetected bot traffic exploded and human traffic started to decline, there has a been a surge in use of Google Search Console (GSC) data. Examples include:

However, it's hard to discover that this data exists and even once you do, getting access and using it is very difficult. The options are (WMF only):

  • manual exploration in GSC's web interface followed by manual DSV data export
    • obviously, this doesn't scale
    • rolling 16 months of data available
  • the Search Console API
    • limited to 50,000 rows per day per search-type, which means that our per-page or per-query data is severely truncated
    • rolling 16 months of data available
  • the exported data in BigQuery:
    • no row limits
    • access requires back-and-forth with one of handful of WMF senior leaders who manage access ad-hoc and maintaining a secret token which is not recommended to be stored on a stat host
    • separate from our normal analytics infrastructure, so it's necessary to use a new package for the querying
    • all data starting 2025-08-27 will be available indefinitely

It would be vastly easier if we set up a job to periodically export the data from BigQuery and load it in bulk into the Data Lake. It would be:

  • accessible through existing roles and permissions
  • discoverable through exisiting tools like DataHub
  • queryable with our existing query engines
  • exportable with our existing ways of working with HDFS

Some notes about the BigQuery data:

  • it's in the form of searchdata_site_impression and searchdata_url_impression tables for wikipedia.org and wikimedia.org
    • the site_impression table is much smaller and can be fully derived from the url_impression table
    • the wiki dimension is only available by parsing the url field of the url_impression table
  • although it's unlikely that we would release the full data publicly, it contains no confidential data (e.g. IP addresses). In essence, it's impression and click counts per query per result URL, with rare queries nulled out.

Event Timeline

While this would be very useful for Movement-Insights, from our perspective it's not top priority (unlike, for example, T418032).

JerryWang-WMF added subscribers: Ahoelzl, JerryWang-WMF.

@Ahoelzl The data grow faster and BigQuery cost is high. Let's bump up the priority for a pipeline to ingest daily data from BigQuery to our own data lake.

I see the following data sets in Big Query:

  • wikipedia.org
    • site
    • url
  • wikimedia.org
    • site
    • url

@nshahquinn-wmf should we make all of them available, and download what is available in Big Query once and then pulled from the API periodically going forward? Thanks!

We definitely want the wikipedia.org data. But we may not want the wikimedia.org data as I haven't seen any interest in the sites under that domain like Commons and Meta-Wiki. There was some SEO work on Commons last year (T400022) but it seems like the regular GSC data was enough for that.

@SCherukuwada, since you set up the wikimedia.org export, should we pull it into the Data Lake? One goal here is to delete the data in BigQuery after importing it to save on storage costs, so I think the question is largely equivalent to "do we really want to keep wikimedia.org data indefinitely?".

Since the site data is just an aggregation of the url data but doesn't provide the wiki dimension, I suggest we just import the site data and then make a better aggregation inside the Data Platform.

We don't need to touch the API here as the complete GSC dataset is exported to BigQuery on an daily basis. Basically, we want to have an ongoing daily job that exports the site dataset and then deletes it from BigQuery (plus doing the one-time catch-up run from the start of the data in BigQuery).

Thanks @nshahquinn-wmf ! Storing wikimedia data in addition should be trivial, see sizes below.
Do we want to keep any data in BigQuery then? E.g. use it as a short term buffer?

Stats

All data sets have data since 2025-08-27.

wikipedia.org

  • site impressions
    • 150.41G rows • 13.52 TB • Last modified Apr 30, 2026, 12:51 AM
  • url impressions
    • 468.2G rows • 78.37 TB • Last modified Apr 30, 2026, 1:28 AM

wikimedia.org

  • site impressions
    • 5.27G rows • 477 GB • Last modified Apr 30, 2026, 1:53 AM
  • url impressions
    • 13.9G rows • 2.51 TB • Last modified Apr 30, 2026, 1:53 AM

Do we want to keep any data in BigQuery then? E.g. use it as a short term buffer?

Yeah, my understanding is that we want to use it as a short-term buffer only, as we're already paying a major BigQuery storage cost which will only increase unless we start purging it. But @JerryWang-WMF, please confirm.

Some more specific thoughts about the import mechanics:

  • Google sometimes goes back and revises past data, which is recorded in the ExportLog table. They state on the support page that this is rare, and it has only affected two days of our data since its start on 2025-08-27 (2026-04-23 and 2026-04-24, both revised on 2026-04-27; I don't think we have a way to know the scale of the revision).
    • We could set up a second-pass job that runs after, say, a month and reimports if necessary (and delete the BigQuery data only after that job has run), but it doesn't seem worth the effort.
    • @mpopov do you have an opinion about this since you tried to export GSC data via the API years ago but ran into problems due to Google's data revisions? The BigQuery bulk export feature didn't exist back then so it's possible that they're now more restrained in revising data.
  • In addition to the BigQuery storage cost, there's also a query cost. When @santhosh tried importing the data in March, it ended up costing a lot (roughly double the storage cost for that month), so we should be sure we assess the cost of the different data extraction options.
    • @santhosh can you share what method you used and how much of the data you ended up extracting that way?
  • There's no reason to import the site_url field as it simply restates the property name from the table name (wikipedia.org or wikimedia.org).
  • Similarly, there are a number of is_[search_appearance_type] fields that are deprecated and can be excluded from the import.
  • If the import costs look to be large, we could explore excluding the following fields, but these do have meaningful data so we would be cutting off some use cases.
    • the non-deprecated is_[search_appearance_type] fields
    • the query field. This is likely to be the largest single-field, so the savings from excluding it might be very significant, but so might be the loss of utility. Looking at the search performance of different articles (i.e. different values of url) would be our main use case, but there could definitely be interest in look at query patterns as well. @santhosh is there any plan to look at queries as part of your project?
  • If resources permit, it would be useful to do some basic transformations on the data during import to make things easier for data users.
      • Change field names to align with the data modeling guidelines:
        • data_datedate
        • impressionsimpression_count
        • clicksclick_count
        • devicedevice_type (not from the modeling guidelines, but just seems like a much more intuitive name to me)
    • Parse the url field to extract the wiki and page name into separate fields.

I just happened to run across this project (WMF only) which recently used the GSC data in BigQuery to gather a large dataset of question-style search terms. From this document explaining how they generated they dataset, they were very conscious of the BigQuery query costs.

So I think it's a good illustration of both why getting this data into the Data Lake would be very valuable and, more crucially, why we really should not exclude the query field as I mentioned above unless the cost considerations become overwhelming.

@santhosh can you share what method you used and how much of the data you ended up extracting that way?

This the code I used to extract google search data from BigQuery:
https://gitlab.wikimedia.org/santhosh/wiki-search-analytics/-/blob/410246c99b703c5bc9f10dd6c6c573faaf7defd2/src/bq_export_local.py

I did only one test of this script - for a single day day and that itself was quite costly. It exported 50GB and aborted with quota issue. So I did not run it again.

@santhosh is there any plan to look at queries as part of your project?

No. Not as part of content observability project. We are interested only in how search impressions and clicks vary for a given article. But it is possibile that search related WMF projects are interested in raw search queries.

I would very much recommend importing and keeping wikimedia.org data - at least for commons. We're likely going to need to have some visibility into web search metrics around commons.

After evaluating various technical download options and considering cost profiles (thanks @nshahquinn-wmf for all the considerations and references!), we suggest the following approach:

Export process

BigQuery Table
      │
      │  (Scheduled Full BQ export - no extra costs)
      ▼
Google Cloud Storage (GCS)
      │
      │  (Hive external table pointing to GCS - download costs)
      ▼
Hive External Source Table
      │
      │  (dbt incremental model — appends only new dates)
      ▼
Hive Target Table

Exported data

Full BigQuery (BQ) table exports for

  • wikipedia.org url impressions
  • wikimedia.org url impressions

wikipedia.org site impressions and wikimedia.org site impressions will be derived from these on the data platform.

We are not considering Export Table logs (retrospective changes) ingestion for now.

Cost considerations
Projecting the BQ tables, i.e. omitting columns, to GCS requires an export query which is billed on the processed data which is probably more expensive than the costs for the extra download data, hence full table export.

One time cost:
Downloading 2.51 TB (wikimedia.org url impressions) + 78.37 TB (wikipedia.org url impressions) = ~85TB (meanwhile) downloaded once at ~$6,350.

Ongoing operations

  • BQ table data will be retained for 1 week
  • GCS export data will be retained for 1 week of data

@nshahquinn-wmf can you confirm that the approach makes sense?
@JerryWang-WMF for cost tracking

Thank you for putting this together! Yes, it makes good sense to me.

We are not considering Export Table logs (retrospective changes) ingestion for now.

I agree; that seems like a ton of extra work and substantial extra storage cost for what is probably a very minor benefit.

In the presumably very rare case that Google does a major data revision, we can just note the issue and, if necessary, not use the affected data, just like we'd do for a data issue in, say, pageviews. In that situation, we'd also still be able to look at Google Search Console to see the post-correction aggregates.

Also, the reasoning for doing a full BigQuery export makes total sense to me. But, if it's possible to reshape the table a bit during the loading to the Hive target table, it would still be nice to do the following:

    • Change field names to align with the data modeling guidelines:
      • data_datedate
      • impressionsimpression_count
      • clicksclick_count
      • devicedevice_type (not from the modeling guidelines, but just seems like a much more intuitive name to me)
  • Parse the url field to extract the wiki and page name into separate fields.