
Update Topic Datasets for Pageviews & Edits
Open, Medium, Public

Description

In T281922, our goal was to build prototypes of topic datasets to address some specific use cases for content data. Based on feedback and interest after Connie & Jennifer shared the dataset prototypes in the Product People Managers meeting, we want to continue updating the datasets for further exploration.

  • Update datasets monthly, as needed
  • Identify & document dependencies & process for updates
  • Decide on next steps for updates or automation

Event Timeline

kzimmerman updated the task description.
kzimmerman removed a project: Epic.
ldelench_wmf triaged this task as Medium priority.
ldelench_wmf moved this task from Triage to Current Quarter on the Product-Analytics board.

Documenting the process etc. from Research's side for updating the topic datasets:

  • Current update process (~1 hour of my time; ~24 hours start-to-finish):
    • Jupyter Notebook (stat1007) that has a data pipeline for gathering all of the current pages and their outlinks (what the model uses for making predictions). That data (~8GB) is then exported to HDFS and moved to the stat machine for further processing. This takes just a few minutes of my time to update to the new snapshot and run, but requires ~1 hour to complete. (See sketch 1 below the list.)
    • [Optional] Jupyter Notebook (stat1007) for splitting the data into training/val/test sets, gathering labels, and retraining the model. I do this less frequently (~once per year) because the ground-truth data is pretty stable/extensive, the model architecture is pretty resilient to hyperparameter choices, and the underlying vocabulary doesn't shift that much, so I don't see a large benefit from retraining more frequently. When I do retrain, it again takes little of my time but needs a few hours of compute to finish. (See sketch 2 below the list.)
    • Jupyter Notebook (stat1007) that applies a trained model to the data to get predictions for all articles. Very little of my time, but this is the slowest part -- maybe 6 hours to complete. (See sketch 3 below the list.)
    • Jupyter Notebook (stat1007) that takes the model predictions and uploads them to Hive for ease of use. Pretty quick -- a few minutes of my time and maybe 20 minutes to complete. (See sketch 4 below the list.)
  • Thoughts on automating:
    • I put a bunch of details about the model in this phab comment: T287527#7247069
    • I'd let ML Platform indicate what they think is best, but hopefully the Jupyter Notebooks (or the code within them) could be incorporated easily into their infrastructure at some point. See T287056 for more details.
    • Airflow for scheduling + papermill for running the notebooks might also be an option. (See sketch 5 below the list.)
    • On top of having the data available on a regular schedule, there'd be some other changes that I think should be made:
      • Improved monitoring: throughout the process I have some basic descriptive statistics to make sure coverage hasn't shifted dramatically between runs and that results seem reasonable. In an automated pipeline, this should be incorporated more heavily (e.g., ensuring precision/recall remain high), especially if the model is retrained more frequently, which wouldn't be a bad idea. (See sketch 6 below the list.)
      • Retrain model more consistently: to keep things simple, I generally only retrain the model about once a year. Presumably an automated pipeline would do this more frequently or at least on a regular cadence.
      • Better structure to data: I've just been creating a new table each time I do this, with the snapshot in the table name so it's clear where the data came from. It would presumably be much easier for end users to have a table with a static name that is simply updated with new snapshots and retains, e.g., the previous six snapshots, akin to how many Hive tables function. (Sketch 4 below illustrates this as well.)
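
Sketch 1 (outlink gathering): a minimal, hedged PySpark version of what the first notebook's query could look like. The Data Lake table names (wmf_raw.mediawiki_page, wmf_raw.mediawiki_pagelinks), their columns, the snapshot/wiki values, and the output path are assumptions for illustration, not the notebook's actual code.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("topic-outlinks")
         .enableHiveSupport()
         .getOrCreate())

SNAPSHOT = "2021-08"  # hypothetical monthly snapshot
WIKI_DB = "enwiki"    # hypothetical wiki

# One row per (article, outlink) pair for main-namespace pages;
# table and column names are assumed, not verified here.
outlinks = spark.sql(f"""
    SELECT p.page_id,
           p.page_title,
           pl.pl_title AS outlink_title
    FROM wmf_raw.mediawiki_page p
    JOIN wmf_raw.mediawiki_pagelinks pl
      ON p.page_id = pl.pl_from
     AND p.snapshot = pl.snapshot
     AND p.wiki_db = pl.wiki_db
    WHERE p.snapshot = '{SNAPSHOT}'
      AND p.wiki_db = '{WIKI_DB}'
      AND p.page_namespace = 0
      AND pl.pl_namespace = 0
""")

# Export to HDFS for the downstream prediction notebook (hypothetical path).
outlinks.write.mode("overwrite").parquet(
    f"/user/example_user/topic_pipeline/outlinks_{SNAPSHOT}"
)
```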
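
Sketch 2 (occasional retraining): a train/val/test split plus retraining, assuming a fastText-style supervised classifier over outlink tokens (the actual model details are in T287527#7247069). File names, split ratios, and hyperparameters are placeholders.

```python
import random
import fasttext

# labeled_articles.txt: one "__label__Topic outlink outlink ..." line per
# article (assumed format).
with open("labeled_articles.txt") as f:
    lines = f.readlines()

random.seed(0)
random.shuffle(lines)
n = len(lines)
splits = {
    "train": lines[: int(0.8 * n)],
    "val": lines[int(0.8 * n): int(0.9 * n)],
    "test": lines[int(0.9 * n):],
}
for name, rows in splits.items():
    with open(f"{name}.txt", "w") as out:
        out.writelines(rows)

# Placeholder hyperparameters; the bullet above notes the model is fairly
# robust to these choices.
model = fasttext.train_supervised(
    input="train.txt", lr=0.1, epoch=25, wordNgrams=1, loss="ova"
)
print(model.test("val.txt"))  # (num examples, precision@1, recall@1)
model.save_model("model.bin")
```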
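
Sketch 3 (bulk prediction): applying a trained model to every article, again assuming a fastText-style model and a TSV of page_id plus space-separated outlink tokens. Paths, the label count, and the threshold are illustrative.

```python
import fasttext

model = fasttext.load_model("model.bin")  # hypothetical path

def predict_topics(outlink_tokens, k=5, threshold=0.5):
    """Return up to k topic labels scoring above threshold for one article."""
    labels, probs = model.predict(" ".join(outlink_tokens), k=k, threshold=threshold)
    return [(label.replace("__label__", ""), float(p))
            for label, p in zip(labels, probs)]

# articles.tsv: "page_id<TAB>outlink outlink ..." (assumed format).
with open("articles.tsv") as src, open("predictions.tsv", "w") as out:
    for line in src:
        page_id, outlink_str = line.rstrip("\n").split("\t")
        for topic, prob in predict_topics(outlink_str.split()):
            out.write(f"{page_id}\t{topic}\t{prob:.3f}\n")
```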
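
Sketch 4 (loading to Hive, plus the "static table name" idea from the last bullet): insert each run into one snapshot-partitioned table and keep only the most recent six partitions. The database/table names, paths, schema, and retention count are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("topic-load")
         .enableHiveSupport()
         .getOrCreate())

SNAPSHOT = "2021-08"  # hypothetical monthly snapshot

# Read the predictions produced in the previous step (assumed TSV layout).
preds = (spark.read
         .option("sep", "\t")
         .schema("page_id BIGINT, topic STRING, probability DOUBLE")
         .csv(f"/user/example_user/topic_pipeline/predictions_{SNAPSHOT}.tsv"))
preds.createOrReplaceTempView("new_preds")

# One static table, partitioned by snapshot (placeholder database/table name).
spark.sql("""
    CREATE TABLE IF NOT EXISTS example_db.article_topic_predictions (
        page_id BIGINT,
        topic STRING,
        probability DOUBLE
    )
    PARTITIONED BY (snapshot STRING)
    STORED AS PARQUET
""")

spark.sql(f"""
    INSERT OVERWRITE TABLE example_db.article_topic_predictions
    PARTITION (snapshot = '{SNAPSHOT}')
    SELECT page_id, topic, probability FROM new_preds
""")

# Retain only the six most recent snapshots (hypothetical retention policy).
partitions = spark.sql(
    "SHOW PARTITIONS example_db.article_topic_predictions").collect()
snapshots = sorted(row[0].split("=", 1)[1] for row in partitions)
for old in snapshots[:-6]:
    spark.sql(
        f"ALTER TABLE example_db.article_topic_predictions "
        f"DROP IF EXISTS PARTITION (snapshot = '{old}')"
    )
```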
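
Sketch 5 (scheduling): what the Airflow + papermill option could look like, using the papermill provider's PapermillOperator to run the existing notebooks in order once a month. The DAG id, notebook paths, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="topic_dataset_refresh",
    schedule_interval="@monthly",
    start_date=datetime(2021, 9, 1),
    catchup=False,
) as dag:
    gather = PapermillOperator(
        task_id="gather_outlinks",
        input_nb="/notebooks/01_gather_outlinks.ipynb",
        output_nb="/runs/01_gather_outlinks_{{ ds }}.ipynb",
        parameters={"snapshot": "{{ ds }}"},  # papermill-injected parameter
    )
    predict = PapermillOperator(
        task_id="predict_topics",
        input_nb="/notebooks/02_predict_topics.ipynb",
        output_nb="/runs/02_predict_topics_{{ ds }}.ipynb",
    )
    load = PapermillOperator(
        task_id="load_to_hive",
        input_nb="/notebooks/03_load_to_hive.ipynb",
        output_nb="/runs/03_load_to_hive_{{ ds }}.ipynb",
    )

    gather >> predict >> load
```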
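
Sketch 6 (monitoring): one example of the kind of check that could gate an automated run, comparing prediction coverage against the previous snapshot and failing loudly on a large shift. The threshold is illustrative; in practice this would sit alongside precision/recall checks on a held-out set.

```python
def coverage(predictions, all_page_ids):
    """Fraction of articles that received at least one topic prediction.

    predictions: iterable of (page_id, topic, probability) tuples
    all_page_ids: collection of every article's page_id for the snapshot
    """
    predicted = {page_id for page_id, _topic, _prob in predictions}
    all_ids = set(all_page_ids)
    return len(predicted & all_ids) / len(all_ids)

def check_coverage_shift(new_cov, old_cov, max_drop=0.02):
    """Fail the run if coverage dropped by more than max_drop (absolute)."""
    if old_cov - new_cov > max_drop:
        raise ValueError(
            f"Coverage dropped from {old_cov:.3f} to {new_cov:.3f}; "
            "investigate before publishing this snapshot."
        )
```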

Moving this to the backlog, as we have not been able to prioritize engineering work on the topic dataset.

@kzimmerman: Per emails from Sep 18 and Oct 20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!