
Private data access for non-person user that calculates metrics
Closed, Resolved · Public · 3 Story Points

Description

Background

I'm going to try to summarize things as best I can. Previously (on stat1002), we (Discovery-Analysis) had a set of scripts ("golden" data retrieval shell scripts, R scripts, Hive queries, and SQL queries) that ran daily (under Oliver's account, and then under my account after they left WMF) and calculated KPIs/metrics for various Discovery teams' dashboards. As of T150915, that codebase uses Analytics' Reportupdater to run the scripts/queries.

The datasets of these metrics are currently stored in stat1005:/srv/published-datasets/discovery/ and are publicly available through https://analytics.wikimedia.org/datasets/discovery/

Running that codebase under a staff account is problematic for several reasons, and we've wanted to switch to a non-staffer solution for some time. This was facilitated by the upgrade from stat1002 (RIP) to stat1005 and led to statistics::discovery, which creates a non-staff "discovery-stats" user to run these scripts.

First, we ran into problems with ownership & access permissions, which were fixed in T172740 & T173333. Then, when testing the umask solution, @Gehel & I noticed that the shell script that queries Hive for Wikidata Query Service request counts was not working correctly:

2017-08-24 18:56:31,070 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-14 end=2017-08-15 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:38,796 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:38,797 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-15 end=2017-08-16 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
2017-08-24 18:56:46,529 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
2017-08-24 18:56:46,530 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 is_funnel=True first_date=2015-10-01 start=2017-08-16 end=2017-08-17 db_key=None sql_template=None script=modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'}>"...
...

Running the query manually (replacing $1 and $2 with actual dates) under my account showed that there wasn't anything wrong with the webrequest table or the query. Then @mforns suggested that perhaps the "discovery-stats" user simply didn't have access to Hive. This was confirmed by @Ottomata. Specifically, "discovery-stats" is not in the "analytics-privatedata-users" group that would give it Hadoop & webrequest access. Changing its primary group (as in T172740) worked partially, but that user needs to also be on namenodes.
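To illustrate why changing the primary group only worked partially, here is a minimal sketch. It assumes, as the comment above suggests, that the group check for Hadoop access happens against the accounts known on the namenodes rather than on the stat host. The account dictionaries and function names are hypothetical and exist only for illustration.

```python
# Hypothetical sketch: the same user name can resolve to different
# groups (or not exist at all) depending on which host does the lookup.

REQUIRED_GROUP = "analytics-privatedata-users"


def groups_for(user, host_accounts):
    """Return the set of groups `user` has on a host, or None if the
    user is not defined on that host."""
    account = host_accounts.get(user)
    if account is None:
        return None
    return {account["primary"], *account.get("supplementary", [])}


def can_read_private_data(user, namenode_accounts):
    """Assumption: access is decided from the namenode's view of the user."""
    groups = groups_for(user, namenode_accounts)
    return groups is not None and REQUIRED_GROUP in groups


# On the stat host the primary group has been changed ...
stat_host = {"discovery-stats": {"primary": REQUIRED_GROUP}}
# ... but the namenode does not know the user at all.
namenode_before = {}
namenode_after = {"discovery-stats": {"primary": REQUIRED_GROUP}}

blocked = can_read_private_data("discovery-stats", namenode_before)
allowed = can_read_private_data("discovery-stats", namenode_after)
```

Under that assumption, the user looks correct on the stat host but is still denied until it also exists, with the right group, on the namenodes.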

Problem and Solution

Event Timeline

mpopov created this task. Aug 24 2017, 10:05 PM
Restricted Application added a subscriber: Aklapper. Aug 24 2017, 10:05 PM

Change 373689 had a related patch set uploaded (by Bearloga; owner: Ottomata):
[operations/puppet@production] Include discovery-stats user in analytics_cluster::users

https://gerrit.wikimedia.org/r/373689

P.S. I should also add that we currently have several teams without performance metrics from the past 10 days (and counting), so getting this done is pretty important — hence the high priority. On 13 August 2017 I asked Guillaume to fix the permissions on the datasets so that I could run golden/main.sh as myself just to backfill metrics that we were missing since 23 July 2017 by that point. We can go back to that running-under-staff-account solution but that's just not sustainable (as discussed at length in T129260), so the switch to a non-person executing these scripts has to be done anyway.

debt added a subscriber: debt. Aug 24 2017, 11:21 PM

Great write-up of the situation, @mpopov, thanks!

Summary for @chasemp:

Analytics needs a way to:

  • create system users
  • have real users in certain groups be able to sudo to that system user
  • have system users be placed in real user groups, so that posix group permissions can be used to restrict access to data for both real users and system users
  • have system users created in places that users in real groups might not be (e.g. system user runs a cron on a box that a real user doesn't have access to).

Restricted Application added a subscriber: jeblad. Aug 25 2017, 2:03 PM

That last bullet there is not as important as the first 3 :)

IRC convo with @chasemp, had 2 ideas:

  1. use ACLs. https://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
  2. Create functionality in admin module to merge real users from groups with specified system users into a larger group. e.g.
systemuser_group_merges:
  analytics-privatedata:
    systemusers: [sys_userA, sys_userB]
    groups: [ analytics-privatedata-users ]

admin module would then create a new group analytics-privatedata including sys_userA, sys_userB, and all real users in the analytics-privatedata-users group. We'd chgrp the privatedata in HDFS to this new analytics-privatedata group.
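A minimal sketch of the merge described above, in Python rather than Puppet. It is not the admin module's actual implementation; `merge_groups`, the `group_members` lookup, and the real user names are hypothetical and only illustrate how the proposed `systemuser_group_merges` config could be expanded.

```python
def merge_groups(merges, group_members):
    """Expand each merge spec into the member list of the new group.

    merges:        mapping shaped like the systemuser_group_merges config
    group_members: mapping of existing group name -> list of real users
    """
    merged = {}
    for new_group, spec in merges.items():
        # System users come first, then the real users of each listed group.
        members = list(spec.get("systemusers", []))
        for group in spec.get("groups", []):
            members.extend(group_members.get(group, []))
        # Drop duplicates while keeping the original order.
        merged[new_group] = list(dict.fromkeys(members))
    return merged


systemuser_group_merges = {
    "analytics-privatedata": {
        "systemusers": ["sys_userA", "sys_userB"],
        "groups": ["analytics-privatedata-users"],
    }
}

# Hypothetical membership of the existing real-user group.
group_members = {"analytics-privatedata-users": ["real_user1", "real_user2"]}

merged = merge_groups(systemuser_group_merges, group_members)
```

With these inputs, `merged["analytics-privatedata"]` holds both system users followed by the real users, which is the group the private data in HDFS would be chgrp'd to.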

Oo, one more idea:

  3. Make every user group have a corresponding system user, that could be selectively enabled. E.g. analytics-privatedata-users would have an analytics-privatedata, that would be realized if in data.yaml you set service_account: true or something like that.
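A rough sketch of this last idea, assuming group definitions are loaded from data.yaml into a dictionary and that the system user name is the group name minus a trailing "-users". The `service_account` key is the placeholder suggested above, and the function name and second group are hypothetical.

```python
USERS_SUFFIX = "-users"


def service_accounts(groups):
    """Return {group name: system user name} for groups that opt in.

    Assumption: a group opts in with `service_account: true`, and the
    system user is named after the group without its "-users" suffix.
    Groups without that suffix keep their name unchanged (also an assumption).
    """
    accounts = {}
    for name, config in groups.items():
        if not config.get("service_account", False):
            continue
        if name.endswith(USERS_SUFFIX):
            accounts[name] = name[: -len(USERS_SUFFIX)]
        else:
            accounts[name] = name
    return accounts


# Hypothetical excerpt of what the parsed data.yaml groups might look like.
groups = {
    "analytics-privatedata-users": {"service_account": True},
    "some-other-users": {},  # did not opt in, so no system user
}

accounts = service_accounts(groups)
```

Only the group that sets the flag gets a system user, here `analytics-privatedata`.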

Change 373952 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] discovery analytics - disable report updater cronjob

https://gerrit.wikimedia.org/r/373952

Change 373952 merged by Gehel:
[operations/puppet@production] discovery analytics - disable report updater cronjob

https://gerrit.wikimedia.org/r/373952

Mentioned in SAL (#wikimedia-operations) [2017-08-25T19:07:07Z] <gehel> kill stuck discovery report-updater process on stat1005 - T174110

Mentioned in SAL (#wikimedia-operations) [2017-08-25T19:09:20Z] <gehel> actually not killing the "stuck discovery report-updater process on stat1005", it is already gone - T174110

jeblad removed a subscriber: jeblad. Aug 25 2017, 9:21 PM
elukey added a subscriber: elukey. Aug 28 2017, 3:15 PM
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Ok, met with Chase and Luca, and we decided that Option 2 is the way to go. I'll make a subtask...

Ottomata moved this task from In Progress to Paused on the Analytics-Kanban board. Aug 30 2017, 3:11 PM
Nuria added a subscriber: Nuria. Nov 30 2017, 5:06 PM

This requires ops discussion, likely to get done next quarter (starting January).

fdans moved this task from Paused to In Progress on the Analytics-Kanban board. Jan 4 2018, 5:38 PM
Ottomata moved this task from In Progress to Paused on the Analytics-Kanban board. Jan 23 2018, 3:42 PM
Restricted Application added a project: Product-Analytics. Apr 19 2018, 12:19 AM
mpopov moved this task from Triage to Tracking on the Product-Analytics board. Apr 23 2018, 11:06 PM
mforns edited projects, added Analytics; removed Analytics-Kanban. May 7 2018, 4:01 PM
Ottomata set the point value for this task to 3.
Ottomata moved this task from Paused to Done on the Analytics-Kanban board.

Hey doods, this is done. The analytics-search user is now in the analytics-privatedata-users group. So, in statistics::discovery, you should be able to replace usages of the discovery-stats user with the analytics-search user, and be able to run jobs that access webrequest data with it.

mpopov closed this task as Resolved. Jun 7 2018, 11:16 PM
mpopov awarded a token.

Awesome!!! Thank you so much, @Ottomata! :D

Change 373689 abandoned by Ottomata:
Include discovery-stats user in analytics_cluster::users

https://gerrit.wikimedia.org/r/373689