Background
I'll try to summarize things as best I can. Previously (on stat1002), we (Discovery-Analysis) had a set of scripts ("golden" data retrieval shell scripts, R scripts, Hive queries, and SQL queries) that ran on a daily basis (under Oliver's account and then under my account after they left WMF) and calculated KPIs/metrics for various Discovery teams' dashboards. As of T150915, that codebase uses Analytics' Reportupdater to run the scripts/queries.
The datasets for these metrics currently live in stat1005:/srv/published-datasets/discovery/, from where they are published at https://analytics.wikimedia.org/datasets/discovery/
Running that codebase under a staff account is problematic for several reasons, and we've wanted to switch to a non-staffer solution for some time. This was facilitated by the upgrade from stat1002 (RIP) to stat1005 and led to statistics::discovery, which creates a non-staff "discovery-stats" user to run these scripts.
First, we ran into problems with ownership & access permissions, which were fixed in T172740 & T173333. Then, when testing the umask solution, @Gehel & I noticed that the shell script that queries Hive for Wikidata Query Service request counts was not working correctly:
2017-08-24 18:56:31,070 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-14 end=2017-08-15 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:38,796 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:38,797 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-15 end=2017-08-16 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:46,529 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:46,530 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-16 end=2017-08-17 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
...
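The error text in the log is the message Python raises when len() is called on None. The sketch below shows how a script that produces no output could surface that way. It is an illustration only, not Reportupdater's actual code: parse_script_output and count_rows are hypothetical names, and the assumption is that a caller measures the parsed result without first checking for None.

```python
def parse_script_output(stdout):
    """Hypothetical parser: return a list of rows, or None when there is no output."""
    if not stdout.strip():
        return None
    return [line.split("\t") for line in stdout.strip().splitlines()]


def count_rows(stdout):
    """Count rows, failing with an explicit message instead of a bare TypeError."""
    rows = parse_script_output(stdout)
    if rows is None:
        raise ValueError("script produced no output; check the executing user's access")
    return len(rows)


def describe_len_failure():
    """Reproduce the logged message by calling len() on a None result."""
    try:
        len(parse_script_output(""))
    except TypeError as error:
        return str(error)
```

A guard like the one in count_rows would point at the empty output directly, which is where the investigation ended up anyway.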
Running the query manually (replacing $1 and $2 with actual dates) under my account showed that there wasn't anything wrong with the webrequest table or the query. Then @mforns suggested that perhaps the "discovery-stats" user simply didn't have access to Hive. This was confirmed by @Ottomata. Specifically, "discovery-stats" is not in the "analytics-privatedata-users" group that would give it Hadoop & webrequest access. Changing its primary group (as in T172740) worked only partially; the user also needs to be on the namenodes.
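For reference, a minimal sketch of checking whether a local account is in a given group, using Python's standard pwd and grp modules. The helper names groups_for_user and has_group are made up for illustration. It only inspects the host it runs on, so a positive result on stat1005 says nothing about the namenodes.

```python
import grp
import pwd


def groups_for_user(username):
    """Return the set of group names for a local account.

    Raises KeyError if the account (or its primary group) is unknown on this host.
    """
    account = pwd.getpwnam(username)
    # The primary group comes from the account's passwd entry.
    names = {grp.getgrgid(account.pw_gid).gr_name}
    # Supplementary groups list the user in their member field.
    names.update(g.gr_name for g in grp.getgrall() if username in g.gr_mem)
    return names


def has_group(username, group):
    """Return True if the account exists on this host and belongs to the group."""
    try:
        return group in groups_for_user(username)
    except KeyError:
        return False


if __name__ == "__main__":
    print(has_group("discovery-stats", "analytics-privatedata-users"))
```

Run on each relevant host, this covers both primary and supplementary membership, which matters here because the earlier fix changed the primary group.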
Problem and Solution
- Short-term: Otto proposed replacing "discovery-stats" with the "analytics-search" user, since that user already exists and has access.
- We also need to consider whether analytics-search-users should automatically get analytics-privatedata-users level access, and the potential security implications.
- Long-term: we probably need a more generic system user for querying "analytics-privatedata-users" data such as wmf.webrequest, which contains PII.