[[ https://github.com/wikimedia/wmfdata-python | wmfdata-py ]] is currently the best way to query our production data sources for data analysis. It has excellent support for:
- Hive via PyHive, whereas wmfdata-r still wraps the `hive` CLI and suffers from the same problems wmfdata-py had prior to the switch (T275233)
- Spark via PySpark, but our configuration prevents us from using [[ https://spark.rstudio.com/ | sparklyr ]]
- Presto via presto-python-client, but RPresto's support for Kerberos-ized setups appears to be [[ https://github.com/prestodb/RPresto/issues/122 | ¯\_(ツ)_/¯ ]]
A new version of [[ https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/wmf/ | wmfdata-r ]] is needed, one that is just a wrapper for wmfdata-py's database-accessing functions (via [[ https://rstudio.github.io/reticulate/articles/package.html | reticulate ]]).
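
To illustrate how thin such a wrapper could be, here is a minimal Python sketch of a single dispatch function that the R side might call through reticulate. It assumes that wmfdata-py exposes `hive`, `spark`, and `presto` submodules that each provide a `run(sql)` function; `run_query`, `SUPPORTED_ENGINES`, and the `load_module` parameter are hypothetical names introduced only for this example and are not part of either package.

```python
import importlib

# Engines listed in this task; assumed to map to wmfdata submodules of the same name.
SUPPORTED_ENGINES = ("hive", "spark", "presto")


def run_query(sql, engine="presto", load_module=importlib.import_module):
    """Run `sql` on the chosen engine by delegating to wmfdata-py.

    `load_module` is injectable so the dispatch logic can be exercised
    without access to the analytics cluster.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    # Assumption: wmfdata.<engine>.run(sql) executes the query and returns the result.
    backend = load_module(f"wmfdata.{engine}")
    return backend.run(sql)
```

On the R side, reticulate's `import()` could load a module containing a function like this, so the new wmfdata-r would contain no database-access logic of its own and would pick up fixes made in wmfdata-py automatically.
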
Not only is wmfdata-r vastly out of date and limited in its ability to access production data sources, it has also become bloated: miscellaneous functions (e.g. sample size calculations for the χ² test, Wikimedia color palettes) have been added over time that should be factored out into a separate package or dropped entirely.