Provide a scheduled data download service from Google Cloud Storage
Open, Needs Triage, Public

Description

Google Search Console data is currently retained in four BigQuery tables; see
https://phabricator.wikimedia.org/T420996

Making the data available in the data lake is crucial for both long-term metrics tracking and detailed Google Search and clickthrough analysis.

The data is best exported from BigQuery as daily Parquet exports into a Google Cloud Storage bucket, from which it can be downloaded onto the data platform.
Authentication can be done with a service account and its downloaded credentials.
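
As a rough illustration of the download step, the following is a minimal sketch that assumes the google-cloud-storage Python client is used and that the service account credentials are a downloaded JSON key file. The function names, the key file path, the bucket name, and the prefix layout are hypothetical and not part of this task.

```python
from pathlib import Path


def local_path_for(blob_name: str, prefix: str, dest_dir: str) -> Path:
    """Map an object name under `prefix` to a file path under `dest_dir`."""
    relative = blob_name[len(prefix):].lstrip("/")
    return Path(dest_dir) / relative


def download_prefix(key_file: str, bucket_name: str, prefix: str, dest_dir: str) -> int:
    """Download every object under `prefix`; return the number of files fetched."""
    # Imported here so the path helper above works without the client library.
    from google.cloud import storage

    # Authenticate with the downloaded service account key (assumed JSON format).
    client = storage.Client.from_service_account_json(key_file)
    count = 0
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip zero-byte "directory" placeholder objects
        target = local_path_for(blob.name, prefix, dest_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))
        count += 1
    return count
```

A call might look like `download_prefix("/path/to/key.json", "example-bucket", "table_a/2025-08-27/", "/tmp/gsc")`, where every argument is a placeholder.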

About 85 TB of data has to be downloaded once, followed by about 300 GB per day going forward.
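
To put these volumes in perspective, here is a back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and, purely for illustration, a sustained 1 Gbit/s link for the one-time download; the actual available bandwidth is not stated in this task.

```python
SECONDS_PER_DAY = 86_400

DAILY_BYTES = 300e9      # ~300 GB per day (decimal units assumed)
BACKFILL_BYTES = 85e12   # ~85 TB one-time download (decimal units assumed)


def sustained_rate_mbit(bytes_per_day: float) -> float:
    """Average rate in Mbit/s needed to move `bytes_per_day` within 24 hours."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def transfer_days(total_bytes: float, link_gbit: float) -> float:
    """Days needed to move `total_bytes` over a link of `link_gbit` Gbit/s."""
    return total_bytes * 8 / (link_gbit * 1e9) / SECONDS_PER_DAY


print(f"daily load: {sustained_rate_mbit(DAILY_BYTES):.1f} Mbit/s averaged over 24 h")
print(f"backfill at 1 Gbit/s (assumed): {transfer_days(BACKFILL_BYTES, 1.0):.1f} days")
```

Under these assumptions the daily load averages about 28 Mbit/s, and the one-time download takes roughly 8 days at a sustained 1 Gbit/s.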

The downloaded data ultimately has to be made available as Hive tables in the data lake.

Requirements

  • Access to Google Cloud Storage buckets via provided credentials
  • Daily data download mechanism
  • Download mechanism has to be configurable for 4 different source bucket paths and destination paths
  • Capability to backfill back to 2025-08-27
  • Daily upload of the Parquet files to the Hive table location in HDFS
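
The configuration and backfill requirements above could be sketched as follows. This is only an illustration: the job names, the bucket and HDFS paths, and the per-day directory layout (a date folder in the bucket, a `date=` partition in HDFS) are assumptions, not decisions made in this task.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator

BACKFILL_START = date(2025, 8, 27)  # earliest day required by this task


@dataclass(frozen=True)
class ExportJob:
    name: str
    source_prefix: str  # bucket path (placeholder value)
    dest_path: str      # Hive table location in HDFS (placeholder value)


# One entry per source table; all names and paths here are hypothetical.
JOBS = [
    ExportJob(f"table_{c}", f"gs://example-bucket/table_{c}", f"/example/hive/table_{c}")
    for c in "abcd"
]


def days(start: date, end: date) -> Iterator[date]:
    """Yield every day from `start` to `end`, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def plan(jobs: list[ExportJob], start: date, end: date) -> list[tuple[str, str]]:
    """Return (source, destination) pairs for every job and day in the range."""
    return [
        (f"{job.source_prefix}/{day.isoformat()}/",
         f"{job.dest_path}/date={day.isoformat()}")
        for day in days(start, end)
        for job in jobs
    ]
```

A daily run would call `plan(JOBS, day, day)` for a single day, while a backfill would call `plan(JOBS, BACKFILL_START, end_day)`, so both modes share the same code path.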