
Enable movement metric calculation as analytics-product user
Closed, Resolved (Public)

Description

In T288657 the ownership of the wmf_product database was changed to iflorez:analytics-privatedata-users. Going forward we need to figure out a way to insert data as the analytics-product system user, rather than changing the ownership back and forth between @Mayakp.wiki and @Iflorez.

Option 1: create query/script versions of the repos that have the notebooks and use kerberos-run-command analytics-product to execute the scripts, which run the queries with all the various partition info set correctly (including the MediaWiki History snapshot).

Option 2: be able to execute Jupyter notebooks as the analytics-product user, by creating statistics::product_analytics Puppet code deployed to stat1007 via misc_jobs. This enables us to schedule shared & simple jobs without individual crontabs and without worrying about Kerberos.

We have proceeded with Option 2.


Event Timeline

mpopov updated the task description.
ldelench_wmf triaged this task as High priority.
ldelench_wmf updated Other Assignee, added: Iflorez.

Attempts
test.py:

import wmfdata as wmf

# Smoke test: fetch a few rows via Presto to verify access
results = wmf.presto.run("""
SELECT * FROM jdl.eas
LIMIT 10
""")

print(results)

Creating an executable out of the script

Text file “test” (no extension):

#!/bin/bash

# Make the anaconda-wmf packages (including wmfdata) importable
export PYTHONPATH=/usr/lib/anaconda-wmf/lib/python3.7/site-packages

python3 test.py

Then in terminal:

chmod +x test

# to run:
./test

Tried creating a stacked conda environment and using it as analytics-product:

source conda-create-stacked analytics-product


sudo -u analytics-product PYTHONPATH=/home/bearloga/.conda/envs/analytics-product/lib/python3.7/site-packages python3 test.py

“ImportError: No module named wmfdata”
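A quick diagnostic (not part of the original attempts) to check which site-packages directories the interpreter actually picks up when invoked that way:

import sys

# Print the import search path; the stacked environment's site-packages
# should appear here if PYTHONPATH took effect for analytics-product.
print("\n".join(sys.path))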

Then tried using the anaconda-wmf site-packages as PYTHONPATH:

sudo -u analytics-product PYTHONPATH=/usr/lib/anaconda-wmf/lib/python3.7/site-packages python3 test.py

Got errors:

Traceback (most recent call last):
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/requests_kerberos/kerberos_.py", line 2, in <module>
	import kerberos
ImportError: libcom_err.so.3: cannot open shared object file: No such file or directory


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/prestodb/auth.py", line 72, in set_http_session
	import requests_kerberos
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/requests_kerberos/__init__.py", line 17, in <module>
	from .kerberos_ import HTTPKerberosAuth, REQUIRED, OPTIONAL, DISABLED
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/requests_kerberos/kerberos_.py", line 4, in <module>
	import winkerberos as kerberos
ModuleNotFoundError: No module named 'winkerberos'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "temp/test.py", line 6, in <module>
	""")
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/wmfdata/presto.py", line 48, in run
	cursor = connection.cursor()
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/prestodb/dbapi.py", line 165, in cursor
	request = self._create_request()
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/prestodb/dbapi.py", line 155, in _create_request
	self.request_timeout,
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/prestodb/client.py", line 246, in __init__
	self._auth.set_http_session(self._http_session)
  File "/usr/lib/anaconda-wmf/lib/python3.7/site-packages/prestodb/auth.py", line 74, in set_http_session
	raise RuntimeError('unable to import requests_kerberos')
RuntimeError: unable to import requests_kerberos
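The chained tracebacks reflect the fallback-import pattern inside requests_kerberos, visible in the kerberos_.py lines quoted above: it first tries the POSIX kerberos (pykerberos) binding, which fails here because libcom_err.so.3 cannot be loaded, and then falls back to the Windows-only winkerberos binding, which is not installed. In outline:

try:
    import kerberos  # POSIX binding; fails above: libcom_err.so.3 is missing
except ImportError:
    import winkerberos as kerberos  # Windows binding; not installed on Linux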



sudo -u analytics-product kerberos-run-command analytics-product hdfs dfs -ls /

sudo -u analytics-product kerberos-run-command analytics-product ./test

Still got the same errors.

Need to find out whether Data Engineering uses Python for any ETL jobs.

Mayakp.wiki lowered the priority of this task from High to Medium. Sep 14 2021, 5:05 PM

Takeaways from meeting on Sep 16, 2021. Attendees: Maya, Mikhail, Irene

Plan

  • Add variables where needed in notebooks to automate (month, fiscal year)
  • Create a new git repo in analytics/wmf-product/jobs
  • 1 notebook that does 1 ETL job to insert data into a test table
  • Use the Editing repo for the test: gauge permissions and notebook runs
  • Point person for issues (since Mikhail is OOO Oct 11-22): TBD

Timing

  1. Check-in: 9/24, Irene and Maya look at which ETL job to use for the first test
  2. Week of Sep 27 through Oct 8: Puppet patch
  3. Test in October (after the Puppet patch is merged): calculate September metrics
  4. Consolidate notebooks into this one
  5. Aim to launch by mid-November

Dependencies: ask Data Engineering to change ownership of the metric tables to the analytics-product system user.

Note: keep using this solution until Airflow has ETL for notebooks

Next steps: create the following subtasks to break the work into smaller units:

  • Puppet config for the recurring monthly job & Python environment (& an R library if we want to do visualization as part of the job)
  • Prototype notebook for testing inside analytics/wmf-product/jobs
  • Migration of notebooks from the three GitHub repos to that repo (perhaps with separating the reporting/visualization notebooks from the ETL notebooks as a sub-task)

With regard to your testing above, I have verified @Ottomata's suggestion on T291957 that using the anaconda environment directly does seem to work, without the need for a virtualenv or a conda env.

Here is your test script:

btullis@stat1007:~$ cat test.py 
import wmfdata as wmf

results = wmf.presto.run("""
SELECT * FROM jdl.eas
LIMIT 10
""")

print(results)

Here it is launched using the python interpreter located at: /usr/lib/anaconda-wmf/bin/python

btullis@stat1007:~$ /usr/lib/anaconda-wmf/bin/python ./test.py 
The check for a newer release of wmfdata failed to complete. Consider checking manually.
                         dt                 client_dt  ...                                  bespoke_dimension                                    bespoke_measure
0  2021-07-01T12:47:19.657Z  2021-07-01T12:47:19.020Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
1  2021-07-01T12:59:55.703Z  2021-07-01T12:59:20.540Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
2  2021-07-01T12:52:44.760Z  2021-07-01T12:52:15.291Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
3  2021-07-01T12:21:00.756Z  2021-07-01T12:22:10.440Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
4  2021-07-01T12:59:15.159Z  2021-07-01T12:58:50.331Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
5  2021-07-01T12:46:41.909Z  2021-07-01T12:46:11.775Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
6  2021-07-01T12:08:51.047Z  2021-07-01T12:16:34.271Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
7  2021-07-01T12:01:19.005Z  2021-07-01T12:00:47.465Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
8  2021-07-01T12:08:33.485Z  2021-07-01T12:08:03.509Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....
9  2021-07-01T12:32:48.274Z  2021-07-01T12:32:41.842Z  ...  {'save_failure_message': '0', 'init_mechanism'...  {'bucket': 0.0, 'first_transaction_timing': 0....

[10 rows x 12 columns]

Does this help at all with your migration?

When I run it as the analytics-product user I get a 401 error from Presto, but I think that may be related to permissions on the source table rather than a Python issue.

btullis@stat1007:~$ sudo -u analytics-product kerberos-run-command analytics-product /usr/lib/anaconda-wmf/bin/python ./test.py 2>&1|tail -n 1

prestodb.exceptions.HttpError: error 401: b'<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>\n<title>Error 401 Unauthorized</title>\n</head>\n<body><h2>HTTP ERROR 401</h2>\n<p>Problem accessing /v1/statement. Reason:\n<pre>    Unauthorized</pre></p>\n</body>\n</html>\n'

Actually, I think that this is a Kerberos error. Investigating.

Phewf, I got presto to work on the CLI.

sudo -u analytics-product kerberos-run-command analytics-product presto --catalog analytics_hive --krb5-keytab-path /etc/security/keytabs/analytics-product/analytics-product.keytab --krb5-principal analytics-product/stat1007.eqiad.wmnet@WIKIMEDIA

So, likely we need the same things set for the wmfdata.presto wrapper, namely:

  • Kerberos keytab: /etc/security/keytabs/analytics-product/analytics-product.keytab
  • Kerberos principal: analytics-product/$HOSTNAME@WIKIMEDIA
  • the Presto catalog seems to need to be set ahead of time when making the connection

We should make wmfdata smart about this; ping @nshahquinn-wmf
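For illustration, here is a hypothetical sketch of what those settings might look like through the prestodb Python client. The host, port, and CA bundle below are placeholders, not values from this task, and a valid ticket must already exist (e.g. obtained via kerberos-run-command), since requests_kerberos cannot read keytabs:

import socket

import prestodb

# Kerberos settings mirroring the CLI invocation above.
auth = prestodb.auth.KerberosAuthentication(
    service_name="presto",
    principal="analytics-product/{}@WIKIMEDIA".format(socket.getfqdn()),
    ca_bundle="/etc/ssl/certs/ca-certificates.crt",  # placeholder
)

connection = prestodb.dbapi.connect(
    host="presto-coordinator.example.wmnet",  # placeholder coordinator host
    port=8280,                                # placeholder port
    user="analytics-product",
    catalog="analytics_hive",  # set ahead of time, per the note above
    schema="default",
    http_scheme="https",
    auth=auth,
)
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())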

Ah great, thanks @Ottomata - I was just looking at the same thing with @elukey; he helped me identify the --krb5-principal $(whoami)@WIKIMEDIA part in /usr/local/bin/presto.

I suppose we could have a check in this wrapper to say:

if [ "$(id -u)" -lt 1000 ]; then
  # we are a system user (uid < 1000):
  # set the system-user-specific --krb5-* options
  :
else
  # set the normal-user --krb5-* options
  :
fi

Maybe it's not the cleanest solution, but it might work for the CLI and Python. Maybe not.
Shall I make a follow-up ticket to address this issue anyway?

Oh, nice, I had forgotten presto was a wrapper. Yes, a ticket would be great. I'm not yet sure how to do this with the Python Presto client; I am looking into that. So far, it looks like the requests-kerberos library used by the Python Presto client doesn't have any parameters or support for using keytabs.

I also don't see keytab support in the python kerberos lib used by requests-kerberos, but I could be missing something.
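One workaround (a sketch; the keytab path and principal are those from the CLI invocation above) is to obtain the ticket out of band before the client runs, which is effectively what kerberos-run-command does:

import subprocess

# Obtain a Kerberos ticket from the keytab; requests_kerberos then picks
# it up from the default credential cache.
subprocess.run(
    [
        "kinit",
        "-kt", "/etc/security/keytabs/analytics-product/analytics-product.keytab",
        "analytics-product/stat1007.eqiad.wmnet@WIKIMEDIA",
    ],
    check=True,
)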

Can confirm that using anaconda-wmf works (with the test notebook I'm using as a minimal starting point)!

$ cd /tmp # on stat1008
$ sudo -u analytics-product -g analytics-privatedata-users mkdir bearloga-test
$ cd bearloga-test
$ sudo -u analytics-product cp /home/bearloga/test.ipynb .
$ sudo -u analytics-product -g analytics-privatedata-users /usr/lib/anaconda-wmf/bin/python -m jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute test.ipynb
$ sudo -u analytics-product hive -e "SELECT * FROM wmf_product.test"
wiki_db	year	month	day	hour	posix_user
commonswiki	2021	9	29	19	analytics-product
wikidatawiki	2021	9	29	19	analytics-product
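As an aside, the same execution step could be done programmatically with nbclient, the library that the jupyter nbconvert --execute path uses internally (a sketch, assuming nbclient and nbformat are available in the environment):

import nbformat
from nbclient import NotebookClient

# Execute every cell with no per-cell timeout (mirroring
# --ExecutePreprocessor.timeout=None) and save the executed copy.
nb = nbformat.read("test.ipynb", as_version=4)
NotebookClient(nb, timeout=None).execute()
nbformat.write(nb, "test.executed.ipynb")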

Hi everybody,
not sure if this has already been reported or if this is not the right task, but the systemd timer product-analytics-movement-metrics on stat1007 failed a few days ago due to:

Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ------------------
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: hive.run('''
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: Select * from wmf.abc
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ''')
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ------------------
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ---------------------------------------------------------------------------
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ChildProcessError                         Traceback (most recent call last)
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: <ipython-input-1-e8067dab0696> in <module>
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:       1 hive.run('''
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:       2 Select * from wmf.abc
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ----> 3 ''')
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: 
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: /usr/lib/anaconda-wmf/lib/python3.7/site-packages/wmfdata/hive.py in run(commands, format, engine)
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     134     result = None
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     135     if engine == "cli":
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: --> 136         return run_cli(commands, format)
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     137 
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     138 def load_csv(
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: /usr/lib/anaconda-wmf/lib/python3.7/site-packages/wmfdata/hive.py in run_cli(commands, format, heap_size, use_nice, use_ionice)
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     111             raise ChildProcessError(
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     112                 "The Hive command line client encountered the following "
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: --> 113                 "error:\n{}".format(cleaned_stderr)
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     114             )
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]:     115     finally:
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: ChildProcessError: The Hive command line client encountered the following error:
Dec 09 00:05:46 stat1007 kerberos-run-command[11971]: FAILED: SemanticException [Error 10001]: Line 2:14 Table not found 'abc'

I am going to clear the alert about the timer. I see that the job is configured to run once a month; if this is a problem that has already been solved, disregard what I wrote above :)

Hi @elukey, the error was introduced on purpose to test error logging and error notification (T295733). The job was intended to fail and to generate an email and a log file.
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/wmf-product/jobs/+/refs/heads/master/movement_metrics/notebooks/zzz_error_test.ipynb#20

Change 763552 had a related patch set uploaded (by Bearloga; author: Bearloga):

[analytics/wmf-product/jobs@master] movement_metrics: Remove error notebook and improve docs

https://gerrit.wikimedia.org/r/763552

Change 763552 merged by Bearloga:

[analytics/wmf-product/jobs@master] movement_metrics: Remove error notebook and improve docs

https://gerrit.wikimedia.org/r/763552

Moved subtasks related to "improving" the metrics calculation to a new parent task: T312880
*Inspiration Week reorganization :)

All sub-tasks under this one have been completed. I am changing this to Resolved.