Enable wmfdata-py to access MariaDB replicas on the cluster
Open, Needs Triage, Public

Description

As a data scientist, I need wmfdata to access MariaDB replicas when it is used in a notebook executed on the cluster so that I can schedule the notebook as a data pipeline through Airflow.

In the Product Analytics ETL modernization sync-up on 26 June 2023 (notes), we identified two behaviors of the current wmfdata-python MariaDB module that stand in the way:

  • It checks POSIX group membership to determine which .cnf file to read the username and password from.
  • It runs the analytics-mysql executable and parses its output to determine which host and port to connect to.

To make it usable on the cluster:

  • We need a way to specify which .cnf file to use (e.g. if we store the MySQL password on HDFS and need to read it as the analytics-product system user): T340469
  • We need a way to retrieve the host and port information: T340472
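
As a rough illustration of the direction described above, here is a minimal sketch in which the caller supplies the option-file contents and the host and port explicitly, so nothing depends on POSIX groups or on the analytics-mysql executable. This is not the existing wmfdata API: the names `MariaDBConnectionInfo`, `read_client_credentials`, and `build_connection_info` are hypothetical, and the sketch assumes the option file has a plain `[client]` section with `user` and `password` keys.

```python
import configparser
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MariaDBConnectionInfo:
    """Hypothetical container for everything needed to open a connection."""
    host: str
    port: int
    user: str
    password: str


def read_client_credentials(cnf_text: str) -> Tuple[str, str]:
    """Return (user, password) from the [client] section of an option file.

    Takes the file contents as text so the caller decides where they come
    from (local disk, HDFS, a secrets store). Assumption: the file is simple
    enough for configparser; directives such as !include are not handled.
    """
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    client = parser["client"]
    # Option files may wrap values in quotes; strip them if present.
    user = client["user"].strip("'\"")
    password = client["password"].strip("'\"")
    return user, password


def build_connection_info(cnf_text: str, host: str, port: int) -> MariaDBConnectionInfo:
    """Combine explicitly supplied host and port with parsed credentials."""
    user, password = read_client_credentials(cnf_text)
    return MariaDBConnectionInfo(host=host, port=int(port), user=user, password=password)
```

With something like this, a notebook running on the cluster could read the option file from wherever it is stored, pass the text in, and supply the host and port from whatever lookup T340472 settles on. How the credentials are stored and read is the subject of T340469 and is deliberately left out of the sketch.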