Google Search Console data is currently retained in 4 BigQuery tables; see
https://phabricator.wikimedia.org/T420996
Making the data available in the data lake is crucial for both long-term metrics tracking and detailed Google Search and clickthrough analysis.
The data is best exported from BigQuery as daily Parquet exports into a Google Cloud Storage bucket, from which it can be downloaded onto the data platform.
Authentication can be done with a service account and downloaded credentials.
About 85 TB of data has to be downloaded once as a backfill, followed by about 300 GB per day going forward.
The downloaded data ultimately has to be made available as Hive tables in the data lake.
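To get a feel for the volumes mentioned above, here is a rough back-of-the-envelope estimate of transfer times. The 1 Gbit/s sustained throughput and the use of decimal units (1 TB = 10^12 bytes) are assumptions for illustration only; the real rate depends on the network path and on any limits on the Google Cloud side.

```python
def transfer_seconds(num_bytes: float, link_bits_per_second: float) -> float:
    """Seconds needed to move num_bytes at a sustained link throughput."""
    return num_bytes * 8 / link_bits_per_second


# Decimal units are an assumption; the ticket does not say TB vs TiB.
GB = 10**9
TB = 10**12

BACKFILL_BYTES = 85 * TB   # one-off backfill volume from the description
DAILY_BYTES = 300 * GB     # ongoing daily volume from the description

# Hypothetical sustained throughput of 1 Gbit/s, not a measured value.
ASSUMED_LINK_BPS = 10**9

backfill_days = transfer_seconds(BACKFILL_BYTES, ASSUMED_LINK_BPS) / 86400
daily_minutes = transfer_seconds(DAILY_BYTES, ASSUMED_LINK_BPS) / 60

print(f"backfill: {backfill_days:.2f} days, daily load: {daily_minutes:.0f} minutes")
```

Under these assumptions the backfill would take roughly a week of continuous transfer, while a daily load would fit comfortably into a daily job window.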
Requirements
- Access to Google Cloud Storage buckets via provided credentials
- Daily data download mechanism
- Download mechanism has to be configurable for 4 different source bucket paths and destination paths
- Capability to backfill back to 2025-08-27
- Daily Parquet file upload to Hive table location in HDFS
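The requirements above could be sketched roughly as follows: one configuration entry per source, plus a planner that lists the per-day source and destination paths for either a daily run or a backfill. This is only a minimal sketch. The source names, bucket paths, HDFS paths, and the `date=YYYY-MM-DD` directory layout are all hypothetical placeholders, and the actual copy step (Google Cloud Storage download with the service account credentials, then upload to HDFS) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class ExportSource:
    """One of the configurable source/destination pairs (names are hypothetical)."""
    name: str
    gcs_prefix: str    # placeholder, e.g. "gs://example-bucket/source_a"
    hdfs_prefix: str   # placeholder Hive table location in HDFS


def days_between(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def plan_transfers(
    sources: List[ExportSource], start: date, end: date
) -> List[Tuple[str, str]]:
    """Return (source_path, destination_path) pairs for each source and day.

    The date=YYYY-MM-DD layout is an assumption, not the actual export layout.
    """
    plan = []
    for day in days_between(start, end):
        partition = f"date={day.isoformat()}"
        for source in sources:
            plan.append(
                (f"{source.gcs_prefix}/{partition}",
                 f"{source.hdfs_prefix}/{partition}")
            )
    return plan


# Placeholder configuration; the real job would list all 4 sources.
SOURCES = [
    ExportSource("source_a", "gs://example-bucket/source_a", "/example/hive/source_a"),
    ExportSource("source_b", "gs://example-bucket/source_b", "/example/hive/source_b"),
]

BACKFILL_START = date(2025, 8, 27)  # earliest date named in the requirements

# A daily run plans a single day; a backfill plans a range starting at BACKFILL_START.
example_plan = plan_transfers(SOURCES, BACKFILL_START, date(2025, 8, 29))
for src, dst in example_plan:
    print(src, "->", dst)
```

Keeping the planning step separate from the copy step means the same code path serves both the daily run and the backfill, and the plan can be inspected or retried per day and per source.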