Create reportupdater browser reports that query hive's browser_general table {lama}
Closed, Resolved | Public | 13 Estimated Story Points

Description

To be able to show the traffic breakdown by browser (and OS) in Dashiki, we need report files that are updated weekly.
We agreed we would do this with reportupdater querying hive's table browser_general.
We have to either:

  • reuse limn-analytics-data repo
  • or create a new one

The reportupdater report type will be script. It can be a python script that defines a query, introduces the time parameters (maybe others?) and calls hive. It may also need to format hive's output into reportupdater format.

Later on, we can make reportupdater support a 'hive' report type that handles those things transparently, so the user will only need to write the query.
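
The script report type described above could look roughly like the following Python sketch. It assumes reportupdater passes the start and end dates as the first two command-line arguments and expects tab-separated output with a header row on stdout. The `dt`, `browser_family` and `view_count` column names, the report header and all function names are placeholders for illustration, not taken from the actual repository.

```python
#!/usr/bin/env python3
"""Sketch of a reportupdater 'script' report that queries Hive."""
import subprocess
import sys
from datetime import date

# Hypothetical query: the dt, browser_family and view_count columns are assumptions.
QUERY_TEMPLATE = """
SELECT browser_family, SUM(view_count) AS view_count
FROM browser_general
WHERE dt >= '{start}' AND dt < '{end}'
GROUP BY browser_family
"""


def build_query(start, end):
    """Fill the time parameters into the query, validating them as ISO dates."""
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    if start_day >= end_day:
        raise ValueError("start must be before end")
    return QUERY_TEMPLATE.format(start=start_day.isoformat(), end=end_day.isoformat())


def run_hive(query):
    """Run the query with the hive CLI in silent mode and return its stdout."""
    result = subprocess.run(
        ["hive", "-S", "-e", query],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def format_report(report_date, hive_output):
    """Prefix each tab-separated Hive row with the report date and add a header."""
    lines = ["date\tbrowser_family\tview_count"]
    for row in hive_output.splitlines():
        if row.strip():
            lines.append(f"{report_date}\t{row.strip()}")
    return "\n".join(lines) + "\n"


def main(argv):
    start, end = argv[1], argv[2]
    sys.stdout.write(format_report(start, run_hive(build_query(start, end))))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv)
    else:
        print("usage: report.py START_DATE END_DATE", file=sys.stderr)
```

Keeping the query building and the output formatting in separate functions means both can be checked without a Hive connection; only `run_hive` touches the cluster.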

Event Timeline

Milimetric set the point value for this task to 5.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric triaged this task as Medium priority. Feb 22 2016, 5:53 PM

Change 272635 had a related patch set uploaded (by Mforns):
Add hive queries for the traffic breakdown reports

https://gerrit.wikimedia.org/r/272635

Change 272635 merged by Milimetric:
Add hive queries for the traffic breakdown reports

https://gerrit.wikimedia.org/r/272635

Two things left to do here:

  • I used beeline directly as a shell script instead of the original python scripts, but I'm hard-coding -n milimetric and -u analytics1015 so I can get around the stats user not having access to run Hive queries
  • The reportupdater-queries repository on stat1002 has CRLF line endings!!! And I don't understand why or how to fix it - some kind of git mess resulting from me messing with the repo I think
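
On the first point, one way to avoid hard-coding the user and host would be to read them from the environment and build the beeline call from that. A rough sketch in Python: the environment variable names, the example host and the `build_beeline_command` helper are made up for illustration; only the beeline options `-u`, `-n`, `-f`, `--outputformat` and `--silent` are real flags. It does not address the underlying access issue for the stats user.

```python
import os
import shlex


def build_beeline_command(query_file, env=None):
    """Build the beeline argument list from connection settings in the environment."""
    env = os.environ if env is None else env
    host = env["HIVE_HOST"]                # hypothetical variable name
    user = env["HIVE_USER"]                # hypothetical variable name
    port = env.get("HIVE_PORT", "10000")   # 10000 is HiveServer2's default port
    return [
        "beeline",
        "--silent=true",            # suppress progress and log chatter
        "--outputformat=tsv2",      # tab-separated rows, easy to post-process
        "-u", f"jdbc:hive2://{host}:{port}",
        "-n", user,
        "-f", query_file,
    ]


if __name__ == "__main__":
    # Placeholder values, not the real cluster settings.
    demo_env = {"HIVE_HOST": "hive.example.org", "HIVE_USER": "report-user"}
    print(shlex.join(build_beeline_command("browser.hql", demo_env)))
```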

The queries all ran last night and each took about 1 hour to completely backfill from the week of Feb 21st back to June 2015. That's pretty good!

@Milimetric Looking into the line ending thing.

@Milimetric

I saw your patch removing the CRLF from the .gitignore file.
In addition to that, on stat1002 the file browser/config.yaml also had CRLF line endings, though not in the original repo. So I just removed it and ran reset --hard to HEAD to recreate the file, and this time it was copied correctly. At first I thought it had something to do with git's core.autocrlf config option, but I don't think so any more.

The reportupdater original repository, mysteriously, has also given CRLFs to a random file: test/fixtures/output/writer_test1.tsv (was not so in limn-mobile-data). Will remove that.
On stat1002 the reportupdater repository has 4 files with CRLF: ./reportupdater/executor.py ./reportupdater/writer.py ./test/fixtures/output/writer_test1.tsv ./.gitignore. The test one is legit, the others are not.
Anyway will try to remove them all.
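
For reference, a check like the one above can be scripted. This is a minimal Python sketch, not the method actually used here: the function names are made up, and it treats any file containing the bytes `\r\n` as a CRLF file, which can also match binary files.

```python
import os


def has_crlf(path):
    """Return True if the file contains at least one CRLF byte sequence."""
    with open(path, "rb") as handle:
        return b"\r\n" in handle.read()


def find_crlf_files(root):
    """List files under root (skipping .git) that contain CRLF, relative to root."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune the .git directory in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and has_crlf(path):
                matches.append(os.path.relpath(path, root))
    return sorted(matches)


def convert_to_lf(path):
    """Rewrite the file in place with every CRLF replaced by LF."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(data.replace(b"\r\n", b"\n"))


if __name__ == "__main__":
    for found in find_crlf_files("."):
        print(found)
```

Files that are meant to contain CRLF, such as the test fixture, would need to be left out before calling `convert_to_lf`.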

Change 275470 had a related patch set uploaded (by Mforns):
Remove CRLFs from file

https://gerrit.wikimedia.org/r/275470

Change 275470 merged by Mforns:
Remove CRLFs from file

https://gerrit.wikimedia.org/r/275470

Done.
Now neither reportupdater nor reportupdater-queries has any CRLF files, either in the original repos or on stat1002.
However, I don't know what caused some non-CRLF files to be checked out on stat1002 as CRLF.

Thanks very much for cleaning that. I continue to be overwhelmed by unseen enemies like encoding and line endings :)

This is still not done, just the line endings part was done. It's the task I was looking for this morning. We need to sort out the access issue and clean up the scripts before calling it done. Andrew's looking into it.

In general we shouldn't mark things done if they're not in the Done column.

Change 276758 had a related patch set uploaded (by Mforns):
Add the query folder as the last parameter of scripts

https://gerrit.wikimedia.org/r/276758

Change 276763 had a related patch set uploaded (by Mforns):
Improve the browser queries

https://gerrit.wikimedia.org/r/276763

Change 277215 had a related patch set uploaded (by Mforns):
Make reportupdater support removing columns

https://gerrit.wikimedia.org/r/277215

I finished testing this and it seems to work, so it's ready for review.

I'll list the patches included in this task:
https://gerrit.wikimedia.org/r/#/c/276758/ (reportupdater's required feature 1)
https://gerrit.wikimedia.org/r/#/c/277215/ (reportupdater's required feature 2)
https://gerrit.wikimedia.org/r/#/c/276763/ (new query scripts and pivot script for reportupdater-queries/browser)

By merging them, they will be automatically deployed.
The only thing left to do is remove the old reports in stat1002:/a/reportupdater/output/browser/

I'm going to review and merge this, and we'll consider my plan the back-up now. I'll pause my work beyond the datepicker.

Change 276758 merged by Milimetric:
Add the query folder as the last parameter of scripts

https://gerrit.wikimedia.org/r/276758

Change 277215 merged by Milimetric:
Make reportupdater support removing columns

https://gerrit.wikimedia.org/r/277215

Change 276763 merged by Milimetric:
Improve the browser queries

https://gerrit.wikimedia.org/r/276763

Change 277818 had a related patch set uploaded (by Mforns):
Change mod of browser query scripts to executable

https://gerrit.wikimedia.org/r/277818

Change 277818 merged by Nuria:
Change mod of browser query scripts to executable

https://gerrit.wikimedia.org/r/277818

Change 277826 had a related patch set uploaded (by Mforns):
Rsync browser reports to datasets.wikimedia.org

https://gerrit.wikimedia.org/r/277826

Change 277826 merged by Ottomata:
Rsync browser reports to datasets.wikimedia.org

https://gerrit.wikimedia.org/r/277826

Change 277987 had a related patch set uploaded (by Mforns):
Correct destination dir in browser reports rsync

https://gerrit.wikimedia.org/r/277987

Change 277987 merged by Ottomata:
Correct destination dir in browser reports rsync

https://gerrit.wikimedia.org/r/277987

Milimetric changed the point value for this task from 5 to 13. Mar 17 2016, 4:29 PM