Since early 2025, as undetected bot traffic exploded and human traffic started to decline, there has been a surge in the use of Google Search Console (GSC) data. Examples include:
- T414996: [MI 3] Monitor and investigate movement trends: Movement-Insights uses it to calculate weekly Google Search clicks to Wikipedia, which is one of the most important metrics that we monitor weekly and report biweekly to WMF leadership
- Content observability (WMF only): @santhosh has been building a system to allow us to detect and analyze trends in what users are searching for and reading, which relies on Google Search impressions, clicks, and click-through rates as crucial signals.
However, it's hard to discover that this data exists, and even once you do, getting access to it and using it is very difficult. The options (WMF only) are:
- manual exploration in GSC's web interface followed by manual CSV data export
  - obviously, this doesn't scale
  - rolling 16 months of data available
- the Search Console API
  - limited to 50,000 rows per day per search type, which means that our per-page or per-query data is severely truncated
  - rolling 16 months of data available
- the exported data in BigQuery
  - no row limits
  - access requires back-and-forth with one of a handful of WMF senior leaders who manage access ad hoc, and maintaining a secret token, which is not recommended to be stored on a stat host
  - separate from our normal analytics infrastructure, so it's necessary to use a new package for querying
  - all data starting 2025-08-27 will be available indefinitely
It would be vastly easier if we set up a job to periodically export the data from BigQuery and load it in bulk into the Data Lake. The data would then be:
- accessible through existing roles and permissions
- discoverable through existing tools like DataHub
- queryable with our existing query engines
- exportable with our existing ways of working with HDFS
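
As a rough sketch of what one run of such a job might ask BigQuery for, the following builds a query for a single day of data. The project and dataset names are placeholders, and the `data_date` column is an assumption about the export schema that should be checked against the actual tables; only the table name comes from this task.

```python
from datetime import date

# Placeholder identifiers: the real project and dataset names are not given here.
PROJECT = "example-project"
DATASET = "searchconsole"


def daily_export_query(day: date, table: str = "searchdata_url_impression") -> str:
    """Build a query selecting one day's rows from an exported GSC table."""
    # Assumes the table has a `data_date` DATE column (to be verified).
    return (
        f"SELECT * FROM `{PROJECT}.{DATASET}.{table}` "
        f"WHERE data_date = DATE '{day.isoformat()}'"
    )


print(daily_export_query(date(2025, 8, 27)))
```

Pulling one day at a time would keep each run small and make it straightforward to re-run a single failed day.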
Some notes about the BigQuery data:
- it's in the form of searchdata_site_impression and searchdata_url_impression tables for wikipedia.org and wikimedia.org
- the site_impression table is much smaller and can be fully derived from the url_impression table
- the wiki dimension is only available by parsing the url field of the url_impression table
- although it's unlikely that we would release the full data publicly, it contains no confidential data (e.g. IP addresses). In essence, it's impression and click counts per query per result URL, with rare queries nulled out.
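
To illustrate deriving the wiki dimension from the url field, here is a minimal sketch. It assumes each row has `url` and `clicks` fields, and that mobile URLs (e.g. `en.m.wikipedia.org`) should be folded into their desktop hostnames; both are assumptions to confirm against the real data. The helper names and sample rows are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def wiki_from_url(url):
    """Return the wiki's hostname for a result URL, or None if it has none."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    labels = host.split(".")
    # Drop a mobile "m" label: en.m.wikipedia.org -> en.wikipedia.org
    kept = [
        label for i, label in enumerate(labels)
        if not (label == "m" and 0 < i < len(labels) - 2)
    ]
    return ".".join(kept)


def clicks_by_wiki(rows):
    """Sum clicks per wiki over an iterable of row dicts."""
    totals = Counter()
    for row in rows:
        totals[wiki_from_url(row["url"])] += row["clicks"]
    return totals


# Made-up sample rows, not real data.
sample = [
    {"url": "https://en.wikipedia.org/wiki/Cat", "clicks": 3},
    {"url": "https://en.m.wikipedia.org/wiki/Cat", "clicks": 5},
    {"url": "https://commons.wikimedia.org/wiki/File:Cat.jpg", "clicks": 1},
]
print(clicks_by_wiki(sample))
```

If the wiki were computed once at load time and stored as its own column, downstream queries would not need to repeat the URL parsing.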