
Use Reportupdater for WMCS edits queries
Open, Normal, Public

Description

First step for visualizing WMCS edits data - set up a cron job for the wmcs-edits.py script. Store the datasets in https://analytics.wikimedia.org/datasets/periodic/reports/metrics/ in TSV format via Reportupdater: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 11 2019, 8:27 PM
srishakatux triaged this task as Normal priority. · Sep 11 2019, 8:27 PM
srishakatux updated the task description. · Sep 11 2019, 8:29 PM

@JAllemandou Some updates and questions on the reportupdater:

  • The docs say: “The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future...”. Do I need to follow this?
  • As the docs say that I should put all queries and scripts in a repo, I have one here, along with a config.yaml file: https://github.com/srish/wmcs-edits. The script currently takes a start and an end parameter and outputs a wmcs_edits.tsv file. I am not sure whether this is sufficient for the reportupdater or whether I should generate the output in a different format. Since we are going to rely on the SQL query anyway, maybe I should not spend too much time fixing the script at this point. I am guessing that when we use the SQL query, config.yaml should be updated to include database information.
  • I'm getting set up to write the SQL query that gets the data via geoeditors_daily. For reference, I'm looking at https://github.com/wikimedia/analytics-limn-ee-data/blob/master/ee-migration/daily_edits.sql and trying to run it via beeline -f daily_edits.sql > out.txt, but I'm getting the following error and am not sure what I am doing wrong:
Error: Error while compiling statement: FAILED: ParseException line 3:18 cannot recognize input near 'AS' '{' 'wiki_db' in selection target (state=42000,code=40000)
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 3:18 cannot recognize input near 'AS' '{' 'wiki_db' in selection target
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:241)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:227)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:255)
	at org.apache.hive.beeline.Commands.executeInternal(Commands.java:989)

Hi @srishakatux - Please excuse me again for another delayed answer - I do hope my personal issues are over now :)

@JAllemandou Some updates and questions on the reportupdater:

  • The docs say: “The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future...”. Do I need to follow this?

I don't know if the limitation is still valid or not - I have looked at some example queries (see the repo I link below) and the first column is always date... @mforns, can you tell us more?

About your code: I suggest using a new folder in this repo: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/reportupdater-queries, as it already contains all the queries (or so I think).

The script currently takes a start and an end parameter and outputs a wmcs_edits.tsv file. I am not sure whether this is sufficient for the reportupdater or whether I should generate the output in a different format. Since we are going to rely on the SQL query anyway, maybe I should not spend too much time fixing the script at this point. I am guessing that when we use the SQL query, config.yaml should be updated to include database information.

The script you have written doesn't quite follow the reportupdater convention (it's not far off though :)): the input parameters should be positional, not named, and the TSV output is expected on stdout. As we plan on relying on Hive for the raw data, it's better IMO not to spend too much time on the Python script. There are examples of reports generated using Hive queries in the repo I pasted above (for instance the browser folder). It is interesting to note that using Hive in reportupdater is actually done through scripts; plain SQL files are used only for MySQL queries.
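
For illustration, a minimal script following that convention might look like the sketch below. The table and column names are hypothetical placeholders, and the explicit header line reflects my assumption of what reportupdater expects on stdout:

#!/bin/bash
# Sketch of the reportupdater script convention:
#   $1 = start date (YYYY-MM-DD), $2 = end date (YYYY-MM-DD), passed positionally.
# The report is written to stdout as TSV; nothing is written to a file.
start_date="$1"
end_date="$2"

# Header row first (assumption: reportupdater reads the header from stdout too).
printf 'date\twiki_db\tedit_count\n'

# Placeholder query body - some_db.some_edits_table is hypothetical.
hive -e "
SELECT '${start_date}' AS date, wiki_db, SUM(edit_count) AS edit_count
FROM some_db.some_edits_table
WHERE day >= '${start_date}' AND day < '${end_date}'
GROUP BY wiki_db
" 2> /dev/null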

Error: Error while compiling statement: FAILED: ParseException line 3:18 cannot recognize input near 'AS' '{' 'wiki_db' in selection target (state=42000,code=40000)
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 3:18 cannot recognize input near 'AS' '{' 'wiki_db' in selection target
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:241)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:227)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:255)
	at org.apache.hive.beeline.Commands.executeInternal(Commands.java:989)

There are multiple things here :) The above query is meant to run on MySQL, not Hive, so the syntax and DBs are not aligned. Also, this example uses the explode_by feature of reportupdater, which allows running one query per parameter value read from a file (see https://github.com/wikimedia/analytics-limn-ee-data/blob/master/ee-migration/config.yaml#L19 and https://github.com/wikimedia/analytics-limn-ee-data/blob/master/ee-migration/wiki_dbs.txt).
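For context, the explode_by stanza in such a config looks roughly like this (my paraphrase of the linked example; the report name and start date are hypothetical, and the linked files are the authority on exact keys):

reports:
  daily_edits:
    granularity: days
    starts: 2019-01-01        # hypothetical start date
    explode_by:
      wiki_db: wiki_dbs.txt   # the query runs once per line of this file
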
I suggest using https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/browser as an example starting point: the config file is simple enough, and so are the hive-query scripts - you will just need to reduce them to a single config block and a single script file :)

Also, the code to extract network_origin into the geoeditors_daily Hive table has been merged. This means we should have production data next month. I have a test table you can use to test your script manually: joal.test_geoeditors_daily_network_origin. It contains a single month partition, for 2019-08. Here is an example query:

SELECT
  wiki_db,
  network_origin,
  SUM(edit_count) as edit_count
FROM joal.test_geoeditors_daily_network_origin
WHERE month = '2019-08'
  AND wiki_db IN ('wikidatawiki', 'enwiki')
GROUP BY
  wiki_db,
  network_origin
ORDER BY
  wiki_db,
  network_origin
LIMIT 100;

And its output:

wiki_db	network_origin	edit_count
enwiki	internet	3979331
enwiki	wikimedia_labs	5650
wikidatawiki	internet	2562977
wikidatawiki	wikimedia	63849
wikidatawiki	wikimedia_labs	8618997

I hope my answers are clear enough :)
I'll be working late this evening, please don't hesitate to ping me :)

In T232671#5548529, @srishakatux wrote:
@JAllemandou Some updates and questions on the reportupdater:

  • The docs say: “The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future...”. Do I need to follow this?

I don't know if the limitation is still valid or not - I have looked at some example queries (see the repo I link below) and the first column is always date... @mforns, can you tell us more?

Yes, this limitation is still there. The first column of the results should be the date in YYYY-MM-DD format.

About your code: I suggest using a new folder in this repo: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/reportupdater-queries, as it already contains all the queries (or so I think).

Yes, please, consider using reportupdater-queries. Recently we made an effort to unify all reportupdater jobs there. :]

@srishakatux if you're adapting a MySQL query to Hive, there are some things to watch out for, like:

  • You should use a bash script like this:
#!/bin/bash
hive -e "
<YOUR QUERY HERE>
" 2> /dev/null | grep -v parquet.hadoop

The stderr redirection and the grep are necessary to prevent some Hive output from landing in the resulting report. Adding a Hive client to reportupdater is also in our backlog, but still TODO :/

  • Use {1} {2} ... as placeholders for the parameters passed to the script. The first and second parameters passed to the script are always timestamp_from and timestamp_to, already in YYYY-MM-DD format. Any following parameters are explode_by parameters (if you don't use this feature, just ignore this); they come ordered alphabetically by placeholder name. A sketch combining this with the wrapper above follows.
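
For illustration, here is my reading of how the placeholders and the wrapper combine (a sketch, not a reviewed query; it reuses Joal's test table from earlier in this task, and the header setting is an assumption about what reportupdater expects):

#!/bin/bash
# Sketch only: per the convention above, reportupdater substitutes
# {1} = timestamp_from and {2} = timestamp_to (YYYY-MM-DD) before running.
hive -e "
-- header row: my assumption of what reportupdater expects on stdout
SET hive.cli.print.header=true;
SELECT
  '{1}' AS date,                  -- first column: the start date, as required
  wiki_db,
  network_origin,
  SUM(edit_count) AS edit_count
FROM joal.test_geoeditors_daily_network_origin
WHERE month = SUBSTR('{1}', 1, 7) -- restrict to the month being computed
GROUP BY wiki_db, network_origin
" 2> /dev/null | grep -v parquet.hadoop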

Please, let us know if you have any questions!
Cheers!

Thank you @JAllemandou and @mforns for your helpful reply :)

Getting closer to the desired outcome, I think. I've made slight changes to Joal's script (https://github.com/srish/wmcs-edits/blob/master/wmcs_edits), which gives me the output included here: https://github.com/srish/wmcs-edits/blob/master/output.txt.
The config file I have is here: https://github.com/srish/wmcs-edits/blob/master/config.yaml.

I have two questions:

Once you review the config.yaml and the wmcs_edits (HQL) script, I will send a patch adding it to reportupdater-queries in Gerrit.

JAllemandou added a comment (Edited) · Mon, Oct 14, 7:07 PM

Hi @srishakatux :)

The config file I have is here: https://github.com/srish/wmcs-edits/blob/master/config.yaml.

One comment on the value you picked for lag in the config. Reportupdater will by default run a query with granularity month after the month is done. For instance, it will run the query for the month of 2019-11 on 2019-12-01, with start_date = 2019-11-01 and end_date = 2019-12-01 (inclusive start date, exclusive end date). The lag parameter is the time you add to the default running date, to wait for data availability (or for any other reason). In our case, data is copied from the MySQL databases onto the Hadoop cluster on the 1st of the month, and in past months this has very regularly taken just a few hours. I therefore suggest using a lag of 1 day, also noting that rerunning reportupdater queries is just a matter of removing the file with incorrect data. A config sketch illustrating these fields follows.
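
For concreteness, the relevant report block might look roughly like this (a sketch based on my reading of the reportupdater docs; in particular I'm assuming lag is expressed in seconds, so 86400 = 1 day - worth double-checking):

reports:
  wmcs_edits_monthly_by_wiki:   # must match the script's file name
    type: script
    granularity: months         # run once per finished month
    starts: 2019-08-01          # hypothetical first month to compute
    lag: 86400                  # 1 day after the month ends (assumed seconds)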

I don't think there is any other place where the time parameters are needed. Scheduling is managed by reportupdater, so the query will be run monthly with an updated start date ($1 in the script) and end date ($2).

There is, however, a change needed in the example code, as it does not take advantage of Hive partition pruning (we are correcting this on our side right now; you can track it here: T234283):

CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '$1' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '$2'

should be converted to

month = substr('$1', 1, 7)

as the query is monthly.
Another detail: let's not forget to raise the LIMIT 100 in the query to a number big enough to return all rows (LIMIT 10000 is more than enough :).
Last - let's rename the query to wmcs_edits_monthly_by_wiki, to be very explicit :) A sketch pulling these changes together follows.
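
Putting the suggestions in this thread together, the script body might end up looking roughly like the sketch below (not the reviewed patch; it still points at the test table, whereas the real query should target the production geoeditors data once it lands):

#!/bin/bash
# wmcs_edits_monthly_by_wiki - sketch combining the advice above:
#   * $1 = start_date, $2 = end_date (positional, YYYY-MM-DD)
#   * partition pruning via month = SUBSTR('$1', 1, 7)
#   * LIMIT raised to 10000 so no rows are dropped
hive -e "
SELECT
  '$1' AS date,
  wiki_db,
  network_origin,
  SUM(edit_count) AS edit_count
FROM joal.test_geoeditors_daily_network_origin  -- test table; swap for production
WHERE month = SUBSTR('$1', 1, 7)
GROUP BY wiki_db, network_origin
ORDER BY wiki_db, network_origin
LIMIT 10000;
" 2> /dev/null | grep -v parquet.hadoop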

  • Also, because of the very nature of the output, I'm wondering whether including start_date as the first column makes sense, as it is going to show the same value in every row?

I think this is the limitation of reportupdater we were talking about earlier: you're right that it's redundant, but it's no big deal - the data is small. On the other hand, I kind of like having complete data in files, as it prevents losing it, for instance in case a file is wrongly renamed. Anyway, the date is needed for the system to work, even if the system could probably overcome this limitation and we could functionally do without it :)

Please send a code-review on the gerrit repo when you want.
edited for code correction - Thanks @mforns

Change 543008 had a related patch set uploaded (by Srishakatux; owner: srish):
[analytics/reportupdater-queries@master] Add hive query for wmcs edits

https://gerrit.wikimedia.org/r/543008

Change 543008 merged by Mforns:
[analytics/reportupdater-queries@master] Add hive query for wmcs edits

https://gerrit.wikimedia.org/r/543008

Sorry @srishakatux, I didn't notice in the code review that the report name in the config does not match the file name of the script.
They should match for RU (reportupdater) to work. Could you please fix that in a new patch?
Thanks a lot.

Change 543709 had a related patch set uploaded (by Srishakatux; owner: Mforns):
[analytics/reportupdater-queries@master] Fix report name in wmcs config

https://gerrit.wikimedia.org/r/543709

Change 543709 abandoned by Srishakatux:
Fix report name in wmcs config

https://gerrit.wikimedia.org/r/543709

Change 543719 had a related patch set uploaded (by Srishakatux; owner: srish):
[analytics/reportupdater-queries@master] Fix report name in wmcs config

https://gerrit.wikimedia.org/r/543719

Change 543719 merged by Mforns:
[analytics/reportupdater-queries@master] Fix report name in wmcs config

https://gerrit.wikimedia.org/r/543719