
External referrer & WDQS metrics stopped updating on 2021-02-08
Closed, Resolved · Public

Description

On stat1007:

$ ls -l /srv/published/datasets/discovery/metrics/external_traffic/
total 5548
-rw-rw-r-- 1 analytics-search analytics-search-users 3110291 Feb  9 05:02 referer_data.tsv
-rw-rw-r-- 1 analytics-search analytics-search-users 2565465 Feb  9 05:04 referer_nonbot_data.tsv

$ tail -n 1 /srv/published/datasets/discovery/metrics/external_traffic/referer_data.tsv 
2021-02-08	TRUE	external (search engine)	Daum	mobile web	141338

$ ls -l /srv/published/datasets/discovery/metrics/wdqs/
total 848
-rw-rw-r-- 1 analytics-search analytics-search-users 864510 Feb  9 05:07 basic_usage.tsv

$ tail -n 1 /srv/published/datasets/discovery/metrics/wdqs/basic_usage.tsv 
2021-02-08	/bigdata/ldf	TRUE	FALSE	14
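
For reference, one quick way to spot this kind of staleness (a sketch using GNU date; an up-to-date report's last row should be for the previous day):

$ last=$(tail -n 1 /srv/published/datasets/discovery/metrics/external_traffic/referer_data.tsv | cut -f1)
$ [ "$last" = "$(date -d yesterday +%F)" ] || echo "referer_data.tsv is stale (last date: $last)"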

These files are generated with the help of Reportupdater, run by main.sh, which is scheduled through a kerberos::systemd_timer in the statistics/discovery.pp manifest. That class belongs to the miscellaneous jobs profile (profile::statistics::explorer::misc_jobs).
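
To sanity-check the scheduling side, the timer can be listed on the host (a sketch; the exact unit name is defined in the manifest, so the grep pattern here is a guess):

$ systemctl list-timers --all | grep -i discovery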

Notes: Both reports are generated by non-R scripts that run Hive queries.

These are necessary for the still-used External Traffic (https://discovery.wmflabs.org/external/) and Wikidata Query Service (https://discovery.wmflabs.org/wdqs/) dashboards.


Log of steps taken:

  1. updated the reportupdater submodule to the latest commit (https://gerrit.wikimedia.org/r/c/wikimedia/discovery/golden/+/677244)
  2. sudo -u analytics-search git submodule update in stat1007:/srv/discovery/golden
  3. reset my venv on stat1007 via https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter-SWAP#Resetting_user_virtualenvs
  4. ran pip install -U pid python-dateutil pymysql PyYAML Jinja2 dnspython (pip install -r reportupdater/requirements.txt failed)
  5. Created /srv/discovery/venv and installed reportupdater's dependencies there as analytics-search (see T279443#6994725), so that this process is no longer dependent on me and my venv
  6. Uploaded & merged patch that sets PYTHONPATH in discovery.pp (see 678864, fixed in 678891)
  7. Uploaded & merged patch to fix queries for current version of Hive on analytics clients (679490), expecting reportupdater to backfill the reports from Feb 8 onward since we're still within the 90 day retention window
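
(Quick arithmetic on the retention point in step 7, using GNU date in UTC: 2021-02-08 was still comfortably inside a 90-day window when the fix went in.)

$ echo $(( ( $(date -u -d 2021-04-14 +%s) - $(date -u -d 2021-02-08 +%s) ) / 86400 ))
65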

Event Timeline

Restricted Application added a subscriber: revi. · Apr 6 2021, 3:17 PM
mpopov triaged this task as Unbreak Now! priority. · Apr 6 2021, 3:17 PM
mpopov edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
mpopov updated the task description.

A hypothesis is that I may have reset my venv on stat1007 around that time, and that the libraries reportupdater needs were no longer available to it. I installed the required libraries. Will see if data starts getting backfilled on the next scheduled run; will ping Luca & Marcel if the problem persists.
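
One way to test that hypothesis is to try importing the dependencies as the job's user, outside any personal venv (a sketch; the module names mirror the packages listed in step 4 of the description):

$ # an ImportError here would confirm the libraries are missing from the
$ # environment reportupdater actually runs in
$ sudo -u analytics-search python3 -c 'import pid, dateutil, pymysql, yaml, jinja2, dns'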

We (@Lydia_Pintscher and I) also ran into this issue yesterday when trying to get some usage stats for our reporting.

Asked about this on IRC in #wikimedia-analytics. Luca said:

in theory the venv should be shipped with the repo if needed, but not sure what mforns did in the past for RU

Marcel said:

I never had to use venvs with reportupdater, I avoided using any library within reportupdater that was not already part of the stats machines' environments...

Then Luca:

it would be good if we could deploy the venv (a frozen one basically) as part of the discovery golden scap repo (if it is deployed via scap) and reference it in the timer

if we add a "venv" or similar dir in there with PYTHONPATH it should work in theory

Andrew added:

you might be able to get away without setting PYTHONPATH, if you just launch python out of your venv
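
To make the two suggestions concrete (a sketch; the site-packages path depends on the Python minor version, so that part is a guess, and /srv/discovery/venv is the location proposed below):

$ # Luca's approach: point PYTHONPATH at the deployed venv's packages and
$ # keep launching the system python3
$ PYTHONPATH=/srv/discovery/venv/lib/python3.7/site-packages python3 update_reports.py ...
$ # Andrew's approach: launch the venv's own interpreter, which resolves
$ # packages from the venv without any PYTHONPATH
$ /srv/discovery/venv/bin/python update_reports.py ...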


I like Andrew's suggestion (just not the part where it would use my staff venv), so if I have it right, the steps to fix this would be:

  1. Create analytics-search user-usable venv inside /srv/discovery, installing the required packages for reportupdater in there
  2. Update main.sh to call update_reports.py with /srv/discovery/venv/bin/python (sketched below)

@elukey: Before I do that, does that sound like a good approach? This way no modification to discovery.pp is needed.
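
For step 2, the edit to main.sh would look something like this (hypothetical, since main.sh's actual contents aren't shown in this task; the flags mirror the manual invocation later in this thread):

Before:

python update_reports.py -l info ...

After:

/srv/discovery/venv/bin/python update_reports.py -l info ...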

@mpopov the venv is ok for me! I am not completely sure about the /srv/discovery/venv/bin/python, I'd have used PYTHONPATH (we can add it very easily via puppet in case needed), but if tests are good +1 from my side :)

On stat1007.eqiad.wmnet:

$ cd /srv/discovery
$ sudo -u analytics-search -g analytics-search-users python3 -m venv venv
$ sudo -u analytics-search -g analytics-search-users --preserve-env /srv/discovery/venv/bin/python3 -m pip install -r golden/reportupdater/requirements.txt

(Making sure to have the proxy env variables set prior to installing reportupdater's dependencies:)

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
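
To confirm the dependencies landed in the venv rather than in anyone's user site, a quick check (this should list roughly the packages from requirements.txt):

$ /srv/discovery/venv/bin/python -m pip list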

Change 678864 had a related patch set uploaded (by Bearloga; author: Bearloga):

[operations/puppet@production] statistics::discovery: Set PYTHONPATH

https://gerrit.wikimedia.org/r/678864

@amy_rc: It may also be an issue with the version of python your system user defaults to. For example:

$ sudo -u analytics-search python --version
Python 2.7.16

So when update_reports.py is run with a bare python, the Python 2 interpreter is used instead of Python 3.
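
Two quick checks along those lines (a sketch): what shebang the entry point declares (it only matters if the script is executed directly rather than passed to an interpreter), and what an explicit python3 resolves to for the job user:

$ head -n 1 /srv/discovery/golden/reportupdater/update_reports.py
$ sudo -u analytics-search python3 --version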

I decided to go with @elukey's recommendation of setting PYTHONPATH with Puppet (thank you, Luca!): https://gerrit.wikimedia.org/r/c/operations/puppet/+/678864

Change 678864 merged by Elukey:

[operations/puppet@production] statistics::discovery: Set PYTHONPATH

https://gerrit.wikimedia.org/r/678864

After seeing that the reports were still not updated, I tried to see what was happening by running the update manually:

$ sudo -u analytics-search -g analytics-privatedata-users /usr/bin/nice /usr/bin/ionice /srv/discovery/venv/bin/python /srv/discovery/golden/reportupdater/update_reports.py -l info /srv/discovery/golden/modules/metrics/external_traffic /srv/published/datasets/discovery/metrics/external_traffic
2021-04-14 20:42:53,485 - INFO - Starting execution.
2021-04-14 20:42:54,099 - INFO - Executing "<Report key=referer_data type=script granularity=days lag=0 first_date=2015-10-01 start=2021-02-09 end=2021-02-10 db_key=None hql_template=None sql_template=None script=/srv/discovery/golden/modules/metrics/external_traffic/referer_data explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'} group=analytics-search-users>"...
2021-04-14 20:43:04,282 - ERROR - Report "referer_data" could not be executed because of error: object of type 'NoneType' has no len()
Traceback (most recent call last):
  File "/srv/discovery/golden/reportupdater/reportupdater/executor.py", line 161, in execute_script_report
    report.results = self.normalize_results(report, None, tsv_reader)
  File "/srv/discovery/golden/reportupdater/reportupdater/executor.py", line 198, in normalize_results
    empty_row = [report.start] + [None] * (len(normalized_header) - 1)
TypeError: object of type 'NoneType' has no len()

I then ran the query manually and indeed the query itself was failing, which also explains the traceback above: a failed query means the script emits no output, so reportupdater has no header to normalize and trips over the resulting None. AS date no longer worked due to the new version of Hive on the analytics clients as of February 9, 2021 (see https://lists.wikimedia.org/pipermail/analytics-announce/2021-February/000059.html, T273711), under which date is a reserved keyword. To fix the queries I had to wrap date in backticks and escape those in the scripts:

Before:

#!/bin/bash

hive -S --hiveconf mapred.job.queue.name=nice -e "ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_external_search AS 'org.wikimedia.analytics.refinery.hive.IsExternalSearchUDF';
CREATE TEMPORARY FUNCTION get_engine AS 'org.wikimedia.analytics.refinery.hive.IdentifySearchEngineUDF';
USE wmf;
SELECT
  '$1' AS date,
...

After:

#!/bin/bash

hive -S --hiveconf mapred.job.queue.name=nice -e "ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_external_search AS 'org.wikimedia.analytics.refinery.hive.IsExternalSearchUDF';
CREATE TEMPORARY FUNCTION get_engine AS 'org.wikimedia.analytics.refinery.hive.IdentifySearchEngineUDF';
USE wmf;
SELECT
  '$1' AS \`date\`,
...

I tested the queries manually and yep, (escaped) backticks did the trick!
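
A minimal reproduction of the keyword problem outside the full report scripts (a sketch; the backticks are escaped because they sit inside a double-quoted bash string, exactly as in the scripts):

$ hive -S -e "SELECT '2021-02-08' AS date"       # fails to parse on the upgraded Hive
$ hive -S -e "SELECT '2021-02-08' AS \`date\`"   # parses once the alias is backticked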

Change 679490 had a related patch set uploaded (by Bearloga; author: Bearloga):

[wikimedia/discovery/golden@master] Backtick 'AS date' in Hive queries

https://gerrit.wikimedia.org/r/679490

Change 679490 merged by Bearloga:

[wikimedia/discovery/golden@master] Backtick 'AS date' in Hive queries

https://gerrit.wikimedia.org/r/679490

mpopov removed a project: Patch-For-Review.
mpopov updated the task description.
$ tail -n 1 /srv/published/datasets/discovery/metrics/external_traffic/referer_data.tsv 
2021-04-14	TRUE	external (search engine)	Ecosia	mobile web	407912

$ tail -n 1 /srv/published/datasets/discovery/metrics/wdqs/basic_usage.tsv 
2021-04-14	/bigdata/namespace/wdq/sparql	TRUE	FALSE	1065136

Everything is running without problems again!