wmfdata-r v2 should mainly be a wrapper for wmfdata-py
Open, Low, Public

Description

wmfdata-py is currently the best way to query our production data sources for data analysis. It has excellent support for:

  • Hive via PyHive, whereas wmfdata-r still wraps the Hive CLI and suffers from the same problems wmfdata-py had before the switch (T275233)
  • Spark via PySpark, whereas our configuration prevents us from using sparklyr in R
  • Presto via presto-python-client, whereas RPresto's support for Kerberized setups appears to be ¯\_(ツ)_/¯

A new version of wmfdata-r is needed, one that is just a wrapper for wmfdata-py's database-accessing functions (via reticulate).
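
To make the "just a wrapper" idea concrete, here is a minimal sketch of the forwarding pattern. It is written in Python rather than R; in wmfdata-r the same delegation would go through reticulate. The names `run_query` and `BACKENDS` are hypothetical, and the sketch assumes each wmfdata-py backend module (`wmfdata.hive`, `wmfdata.spark`, `wmfdata.presto`) exposes a `run` function that accepts a query string, which should be checked against the installed wmfdata-py version.

```python
import importlib

# Hypothetical mapping from a short source name to the wmfdata-py module
# that is assumed to expose a `run(query)` function.
BACKENDS = {
    "hive": "wmfdata.hive",
    "spark": "wmfdata.spark",
    "presto": "wmfdata.presto",
}


def run_query(source, query, importer=importlib.import_module):
    """Forward `query` to the backend module for `source` and return its result.

    The wrapper holds no connection or parsing logic of its own; everything
    is delegated to the underlying module's `run` function.
    """
    if source not in BACKENDS:
        raise ValueError(
            f"unknown source {source!r}; expected one of {sorted(BACKENDS)}"
        )
    # Import lazily so that only the requested backend needs to be available.
    module = importer(BACKENDS[source])
    return module.run(query)
```

Because the wrapper only dispatches, fixes to Kerberos handling or client libraries would land in wmfdata-py and reach R users without changes to the wrapper. The `importer` parameter exists only so the dispatch can be exercised without wmfdata-py installed.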

Not only is wmfdata-r badly outdated and limited in its ability to access production data sources, it has also become bloated: miscellaneous functions added over time (e.g. sample size calculations for the χ² test, Wikimedia color palettes) should be factored out into a separate package or dropped entirely.

Event Timeline

While it would be a significant initial investment, long-term maintenance would be minimal, since the majority of the maintenance burden would fall on the underlying Python codebase.

ldelench_wmf moved this task from Triage to Backlog on the Product-Analytics board.

Repo with package skeleton created at https://gitlab.wikimedia.org/repos/product-analytics/wmfdata-r

Nothing actually implemented yet