Page MenuHomePhabricator

Migrate 'Editing' Key Product Metrics tables to analytics/wmf-product/jobs
Closed, ResolvedPublic

Description

Migrate specifically ETL notebooks from the following repos:

Scope of this task:

Tables:

  • editor_month (currently neilpquinn)
  • new_editor (currently cchen)
  • Superset: repo_active_editors (will be renamed active editors) (currently cchen)

Related notebook:
https://github.com/wikimedia-research/Editing-movement-metrics
to analytics/wmf-product/jobs using the instructions found here: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/wmf-product/jobs/+/refs/heads/master/movement_metrics/

(Visualization notebooks should be separate from ETL notebooks.)

FYI: Readers-movement-metrics : done in T291956

Remaining Tasks:
1) Notification (email/Slack) re: soon to come changes to neilpquinn.editor_month - @Mayakp.wiki Opened T305992 for this

  • 2) Backfill new_editors - @Iflorez T287420 (related: Also: Irene is the owner that’s inserted data for the past few months, can we change the owner of new_editors to analytics-product and give analytics-product jurisdiction over changes on that data moving forward? Or does Irene remain the only one who can make changes? (FYI: T285503) Potential future Idea to fix the above: backfill > make a copy > delete the table with mixed upload owners > create new table from saved copy)
  • 3) Create product Repo_active_editors ETL @Mayakp.wiki

Event Timeline

Mayakp.wiki triaged this task as Medium priority.Nov 9 2021, 6:25 PM
Iflorez renamed this task from Migrate 'Editing' Key Product Metrics notebooks to analytics/wmf-product/jobs to Migrate 'Editing' Key Product Metrics tables to analytics/wmf-product/jobs.Feb 1 2022, 12:14 AM
Iflorez updated the task description. (Show Details)
Iflorez updated the task description. (Show Details)

Change 766196 had a related patch set uploaded (by Mayakpwiki; author: Mayakpwiki):

[analytics/wmf-product/jobs@master] movement_metrics: Migrate editing tables and ETL - editor_month

https://gerrit.wikimedia.org/r/766196

Forgot to post this earlier - Notes from meeting on Mar 2, 2022
Migrate Editing ETL Patch : 766196/
Review gerrit patch comments and modify code
Review Gerrit patch
Maya will review comments. Make changes and submit new patch to patchset
Pending :
Editing: active_editors (repo_active_editors)
Readers: content_interactions (repo_diversity_interactions)
Gerrit works differently than Github, and doesn't have the concept of ‘branches’. Patches are queued up in a pipeline. Once the first patch is merged only then can you review and merge the second patch
Decision 1: Submit active_editors table in current patch 766196/ ; so Mikhail can review them all together.
Decision 2: Submit a new patch for the Readers content_interactions table. How to submit that patch and make it work parallely (i.e. like a new branch) and not be dependent on earlier patch will be explained by Mikhail in another meeting which Maya will set up next week.
For .gitignore
echo .DS_Store >> ~/.gitignore_global
git config --global core.excludesfile ~/.gitignore_global

Next steps :

  • Timeline on patch merge

Prefer to merge next month (job run Apr 7)
MP would like to do a manual CI
Changes to Feb metrics run, if any
Nope, none as no changes to this cycle
Notify users of deprecated tables

Work on this goal was pushed back because we have prioritized investigation into pageviews and the data loss.
@mpopov has agreed to help with a lot of the action items listed above. We will reconvene this week.

Notes from meeting on Mar 29, 2022
Editing ETL -

  • Test case: for Snapshot not available.
    • Schedule test nb for future month and see if failure notification is triggered.
    • Created T305106

Status -

  • Patch is (almost) ready to be merged to run on April 7th.
  • Action item assigned to Maya : Add table comments and send to Mikhail : editor_month, new_editors, active_editors
    • Done . sent via Slack

Decisions:
Next month, for the calculation of March metrics @Iflorez to do the following-

  1. Manually Run notebook 01a :
    • This will update neilpquinn.editor_month
  2. Do Not run notebook 01b :
    • wmf_product.new_editors will be updated by the ETL automation
  • We want to ensure that our ETL automation does not impact data loads to metrics tables in cchen db.
  • We will plan to deprecate neilpquinn.editor_month, cchen.repo_active_editors, cchen.repo_diversity_interactions in the following month (before May 7th i.e April metrics) once backfill is complete T287420 and superset dashboards use the new datasets in wmf_product T287284

Verification

Note: This verification was done on April 5th, after the 2022-03 snapshot of mediawiki_history became available (April 3rd).

$ git clone https://gerrit.wikimedia.org/r/analytics/wmf-product/jobs
Cloning into 'jobs'...
$ cd jobs
$ git-review -d 766196
Downloading refs/changes/96/766196/3 from gerrit
Switched to branch "review/unknown/766196"

$ kinit

$ hive -f movement_metrics/hive/create_new_editors.hql --database bearloga
OK
$ hive -f movement_metrics/hive/create_editor_month.hql --database bearloga
OK
$ hive -f movement_metrics/hive/create_active_editors.hql --database bearloga
OK
$ hive -f movement_metrics/hive/create_global_markets_pageviews.hql --database bearloga
OK
$ hive -f movement_metrics/hive/create_pageviews_corrected.hql --database bearloga
OK

Then to modify notebooks & queries for testing and run them:

find movement_metrics/notebooks/supplementary -type f -name '*.hql' | xargs sed -i 's/wmf_product/bearloga/g'
find movement_metrics/notebooks -type f -name '*.ipynb' | xargs sed -i 's/wmf_product/bearloga/g'

notebooks_dir="movement_metrics/notebooks"

for notebook in $notebooks_dir/*.ipynb
do
  echo "executing $notebook"
  python -m jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute $notebook
done

Issues found (and then fixed via patch set 4):

  • missing import pandas as pd in editing_01_editor_month.ipynb
  • "No such file or directory: 'supplementary/queries/update_new_editors.hql'" in editing_02_new_editor.ipynb – indeed, the file should use "new_editors" as that is the name of the table
  • bad path in editing_03_active_editors.ipynb ("/user/hive/warehouse/bearloga.db/active editors" instead of "/user/hive/warehouse/bearloga.db/active_editors")

After opening update_new_editors.hql also remembered that it needed reformatting to use CTEs, so that's in patch set 4 too.

Final Verification

$ git stash
$ git stash clear
$ git checkout master
$ git reset --hard origin/master
$ git-review -d 766196
Downloading refs/changes/96/766196/4 from gerrit
Branch review/unknown/766196 already exists - reusing
Switched to branch "review/unknown/766196"

$ rm $notebooks_dir/*.nbconvert.ipynb

$ for notebook in $notebooks_dir/editing*.ipynb
$ do
$   python -m jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute $notebook
$ done
[NbConvertApp] Converting notebook movement_metrics/notebooks/editing_01_editor_month.ipynb to notebook
[NbConvertApp] Writing 5681 bytes to movement_metrics/notebooks/editing_01_editor_month.nbconvert.ipynb
[NbConvertApp] Converting notebook movement_metrics/notebooks/editing_02_new_editor.ipynb to notebook
[NbConvertApp] Writing 5616 bytes to movement_metrics/notebooks/editing_02_new_editor.nbconvert.ipynb
[NbConvertApp] Converting notebook movement_metrics/notebooks/editing_03_active_editors.ipynb to notebook
[NbConvertApp] Writing 10436 bytes to movement_metrics/notebooks/editing_03_active_editors.nbconvert.ipynb

Change 766196 merged by Bearloga:

[analytics/wmf-product/jobs@master] movement_metrics: Migration and cleanup

https://gerrit.wikimedia.org/r/766196

Mayakp.wiki added a subscriber: cchen.

Job ran successfully on April 7, UTC 00:00
QA the following tables -

  • wmf_product.editor_month
  • wmf_product.new_editors
  • wmf_product.active_editors
    • will be done after cchen.repo_active_editors is updated with 2022-03 snapshot

Next steps:

  • inform @Iflorez (Editing metrics) and @cchen (cchen.new_editors) for dependencies
    • done via Slack

[] plan to deprecate neilpquinn.editor_month Opened T305992 for this

Completed QA of all the tables that were in scope for this ETL automation. Data for the latest snapshot looks good.
Next Up : Back fill data into the tables T287420

Huge thanks to @mpopov for making this happen!! And to @Iflorez for all your support!