Private data access for non-person user that calculates metrics
Open, High, Public

Description

Background

I'm going to try to summarize things as best I can. Previously (on stat1002), we (Discovery-Analysis) had a set of scripts ("golden" data retrieval shell scripts, R scripts, Hive queries, and SQL queries) that ran on a daily basis (under Oliver's account and then under my account after they left WMF) and calculated KPIs/metrics for various Discovery teams' dashboards. As of T150915, that codebase uses Analytics' Reportupdater to run scripts/queries.

The datasets of these metrics are currently available in stat1005:/srv/published-datasets/discovery/, where they are made publicly available through https://analytics.wikimedia.org/datasets/discovery/

Running that codebase under a staff account is problematic for several reasons, and we've wanted to switch to a non-staffer solution for some time. This was facilitated by the upgrade from stat1002 (RIP) to stat1005 and led to statistics::discovery, which creates a non-staff "discovery-stats" user to run these scripts.

First, we ran into problems with ownership & access permissions, which were fixed in T172740 & T173333. Then, when testing the umask solution, @Gehel & I noticed that the shell script that queries Hive for Wikidata Query Service request counts was not working correctly:

2017-08-24 18:56:31,070 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-14 end=2017-08-15 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:38,796 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:38,797 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-15 end=2017-08-16 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:46,529 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:46,530 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-16 end=2017-08-17 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
...

Running the query manually (replacing $1 and $2 with actual dates) under my account showed that there wasn't anything wrong with the webrequest table or the query. Then @mforns suggested that perhaps the "discovery-stats" user simply didn't have access to Hive. This was confirmed by @Ottomata. Specifically, "discovery-stats" is not in the "analytics-privatedata-users" group that would give it Hadoop & webrequest access. Changing its primary group (as in T172740) worked partially, but that user needs to also be on namenodes.
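The membership check described above can be sketched in Python. This is only an illustration, not something from the task: the helper name `user_in_group` is hypothetical, and it inspects the local host's user database, so it would have to be run on each host that matters (including the namenodes, per the comment above).

```python
import grp
import pwd


def user_in_group(user: str, group: str) -> bool:
    """Return True if `user` belongs to `group` on the local host."""
    try:
        user_entry = pwd.getpwnam(user)
        group_entry = grp.getgrnam(group)
    except KeyError:
        # The user or the group does not exist on this host.
        return False
    # Membership comes either from the primary GID or from the
    # group's supplementary member list.
    return (
        user_entry.pw_gid == group_entry.gr_gid
        or user in group_entry.gr_mem
    )


# Names taken from the task description; the result depends on the host.
print(user_in_group("discovery-stats", "analytics-privatedata-users"))
```

A `False` result on a host where the user is expected to have access would match the symptom described here.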

Problem and Solution

mpopov created this task. Thu, Aug 24, 10:05 PM

Change 373689 had a related patch set uploaded (by Bearloga; owner: Ottomata):
[operations/puppet@production] Include discovery-stats user in analytics_cluster::users

https://gerrit.wikimedia.org/r/373689

P.S. I should also add that we currently have several teams without performance metrics from the past 10 days (and counting), so getting this done is pretty important, hence the high priority. On 13 August 2017 I asked Guillaume to fix the permissions on the datasets so that I could run golden/main.sh as myself just to backfill metrics that had been missing since 23 July 2017. We can go back to that running-under-staff-account solution but that's just not sustainable (as discussed at length in T129260), so the switch to a non-person executing these scripts has to be done anyway.

debt added a subscriber: debt. Thu, Aug 24, 11:21 PM

Great write-up of the situation, @mpopov, thanks!

Summary for @chasemp:

Analytics needs a way to:

  • create system users
  • have real users in certain groups be able to sudo to that system user
  • have system users be placed in real user groups, so that posix group permissions can be used to restrict access to data for both real users and system users
  • have system users created in places that users in real groups might not be (e.g. system user runs a cron on a box that a real user doesn't have access to).

That last bullet there is not as important as the first 3 :)

IRC convo with @chasemp, had 2 ideas:

  1. use ACLs. https://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
  2. Create functionality in admin module to merge real users from groups with specified system users into a larger group. e.g.
systemuser_group_merges:
  analytics-privatedata:
    systemusers: [sys_userA, sys_userB]
    groups: [analytics-privatedata-users]

The admin module would then create a new group, analytics-privatedata, containing sys_userA, sys_userB, and all real users in the analytics-privatedata-users group. We'd chgrp the private data in HDFS to this new analytics-privatedata group.
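To make the merge concrete, here is a minimal Python sketch of the membership computation, assuming the config has already been parsed into a dict. The function name `merge_groups` and the real users `alice` and `bob` are made up for illustration; this is not the admin module's actual code.

```python
def merge_groups(merges, group_members):
    """Build merged groups from system users plus members of real groups.

    merges: mapping shaped like the systemuser_group_merges example.
    group_members: mapping of real group name -> list of real users.
    """
    merged = {}
    for new_group, spec in merges.items():
        members = list(spec.get("systemusers", []))
        for group in spec.get("groups", []):
            members.extend(group_members.get(group, []))
        # Drop duplicates while keeping first-seen order.
        merged[new_group] = list(dict.fromkeys(members))
    return merged


merges = {
    "analytics-privatedata": {
        "systemusers": ["sys_userA", "sys_userB"],
        "groups": ["analytics-privatedata-users"],
    }
}
# Hypothetical real users, for illustration only.
group_members = {"analytics-privatedata-users": ["alice", "bob"]}

result = merge_groups(merges, group_members)
print(result)
# {'analytics-privatedata': ['sys_userA', 'sys_userB', 'alice', 'bob']}
```

The real user group stays untouched; only the new merged group would be used for the chgrp in HDFS.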

Oo, one more idea:

  3. Make every user group have a corresponding system user that could be selectively enabled. E.g. analytics-privatedata-users would have an analytics-privatedata system user, which would be realized if you set service_account: true in data.yaml, or something like that.
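A rough Python sketch of this idea, assuming a `service_account` flag on each group entry as floated above. The function name, the dict layout, and the example members are hypothetical; the only naming rule taken from the comment is dropping the "-users" suffix.

```python
def enabled_service_accounts(groups):
    """Map each opted-in user group to the name of its system user."""
    suffix = "-users"
    accounts = {}
    for group_name, spec in groups.items():
        if not spec.get("service_account", False):
            continue
        # Only the "<name>-users" -> "<name>" pattern is described in the
        # thread; groups with other names are skipped in this sketch.
        if group_name.endswith(suffix):
            accounts[group_name] = group_name[: -len(suffix)]
    return accounts


# Hypothetical stand-in for the groups section of data.yaml.
groups = {
    "analytics-privatedata-users": {"members": ["alice"], "service_account": True},
    "analytics-users": {"members": ["bob"]},
}

result = enabled_service_accounts(groups)
print(result)
# {'analytics-privatedata-users': 'analytics-privatedata'}
```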

Change 373952 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] discovery analytics - disable report updater cronjob

https://gerrit.wikimedia.org/r/373952

Change 373952 merged by Gehel:
[operations/puppet@production] discovery analytics - disable report updater cronjob

https://gerrit.wikimedia.org/r/373952

Mentioned in SAL (#wikimedia-operations) [2017-08-25T19:07:07Z] <gehel> kill stuck discovery report-updater process on stat1005 - T174110

Mentioned in SAL (#wikimedia-operations) [2017-08-25T19:09:20Z] <gehel> actually not killing the "stuck discovery report-updater process on stat1005", it is already gone - T174110

jeblad removed a subscriber: jeblad. Fri, Aug 25, 9:21 PM
elukey added a subscriber: elukey. Mon, Aug 28, 3:15 PM
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Ok, met with Chase and Luca, and we decided that Option 2 is the way to go. I'll make a subtask...

Ottomata moved this task from In Progress to Paused on the Analytics-Kanban board. Wed, Aug 30, 3:11 PM
Tbayer added a subscriber: Tbayer. Tue, Sep 19, 8:18 PM