Establish processes for running the dataset pipeline
Open, Low, Public

Description

Some things to work out:

  • Where on stat1008 should the canonical repo live, or does that matter? (Various engineers have so far run the pipeline from their home directories, which doesn't seem like a problem so long as the repo is up to date.)
  • How often should we re-run the pipeline for an existing dataset? Re-running is needed because page titles get renamed (see T274195, for example) and we currently use page titles instead of page IDs for querying datasets. Should we automate the process via a cronjob?
  • What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.
  • [your questions / concerns here]

Event Timeline

We should answer these questions and document them on https://wikitech.wikimedia.org/wiki/Add_Link

This also relates to plans for scaling the service to new wikis.

Some thoughts from my side.

Where on stat1008 should the canonical repo live, or does that matter?

  • Not sure there needs to be a dedicated location, since the data will be moved to published datasets, but I don't have a good sense of what the recommended setup for such a pipeline would be.

How often should we re-run the pipeline for an existing dataset? Should we automate the process via a cronjob?

  • Less often than once a month, since we are using the dumps for training the pipeline and these are generated on a monthly basis.
  • More often than once a year, say every 3 months, though this is based on speculation/intuition. While we will always generate the link recommendations for the current version of a page, the underlying data used for training will change as new links (and new pages) are added or removed. I don't think we have a good understanding of what the rate of change is, or of if and how much it would impact model performance. This is an interesting research question for the future.
  • One option would be to monitor user-feedback on the link-recommendations and check whether the rate of rejection of links increases with time.

What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.

  • We could make a query on a sample page (or sample text) where we would expect some links to be recommended. Requiring a specific output might be tricky since the page itself and the model can change, and thus so can its output. However, we could require at least one recommended link in order to make sure that the datasets/model work (a rough sketch follows below).
  • The evaluation of the backtesting data provides numbers for precision and recall on a test dataset with existing links. These should match (at least roughly, say within 10 or 20%) what we expect from the previous analysis done before deployment (for example here).
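
A minimal sketch of such a check could look like the following; the service URL, endpoint path, sample page, and the "links" field in the response are placeholders/assumptions that would need to be adapted to the actual service API and a suitable sample page:

    #!/bin/bash
    # Hypothetical smoke test: request recommendations for a sample page and
    # require at least one recommended link back. The endpoint and response
    # shape are assumptions -- adjust to the actual service API.
    set -euo pipefail
    SERVICE_URL="http://localhost:8000/v1/linkrecommendations/wikipedia/cswiki/Praha"  # placeholder
    count=$(curl -sf "$SERVICE_URL" | jq '.links | length')
    if [ "${count:-0}" -ge 1 ]; then
        echo "OK: ${count} link(s) recommended"
    else
        echo "FAIL: no links recommended" >&2
        exit 1
    fi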

Where on stat1008 should the canonical repo live, or does that matter?

I think we should consolidate on always running ./run-pipeline.sh and the other scripts from @MGerlach's repo, which is at /home/mgerlach/REPOS/mwaddlink-gerrit. (@MGerlach, there are three other repos with mwaddlink in the title in that directory; maybe we should rename those or put them into a subdirectory so it's more obvious that this is the canonical location?)

I've updated the wikitech docs with the repo location.
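
For reference, a pipeline run from that checkout would then look roughly like the sketch below; the wiki argument is a hypothetical example, so check the README for the exact invocation and flags:

    # Run from the canonical checkout on stat1008 (sketch only).
    cd /home/mgerlach/REPOS/mwaddlink-gerrit
    git pull                  # make sure the checkout is up to date first
    ./run-pipeline.sh cswiki  # argument form is a guess; see the README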

How often should we re-run the pipeline for an existing dataset? Should we automate the process via a cronjob?

Why don't we go with every 4 months? @MMiller_WMF, does this sound OK to you?
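
If we do automate it, a crontab entry along the lines of the sketch below would give a four-month cadence; the wiki argument, log path, and exact run-pipeline.sh invocation are placeholders to be checked against the README:

    # Hypothetical crontab entry: run at 02:00 on the 1st of every 4th month
    # (January, May, September) and append the output to a log file.
    0 2 1 */4 * cd /home/mgerlach/REPOS/mwaddlink-gerrit && ./run-pipeline.sh cswiki >> "$HOME/mwaddlink-cron.log" 2>&1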

What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.

That sounds good. Is this something that the Research or Growth team should own? Maybe when we are refreshing the models/datasets, we can post some of this data in the task to get sign-off from at least one person in Research and one from Growth? cc @MMiller_WMF & @MGerlach

@MGerlach is transitioning maintenance of the link recommendation algorithm from Research to ML-team.

@calbon and I will be the ML-team contacts on this. Thanks @MGerlach!

Sounds good. @kevinbazira, do you need any information from the Growth team on this? https://wikitech.wikimedia.org/wiki/Add_Link hopefully describes what we are doing now, and the README has further documentation, though it could use some improvement (T290067: Improve the documentation in the README of the mwaddlink-repo).

Will your team also maintain the existing service deployment to Kubernetes? If so, T278083: Define SLIs/SLOs for link recommendation service is probably one for us to discuss as well.

I'm going to remove this from the Growth team's work under the assumption that Machine-Learning-Team will handle it from here on out. Please let us know if you have questions!

How often should we re-run the pipeline for an existing dataset?

I noticed today that cswiki's dataset was last generated in October 2021.

Should we re-run the pipelines every 6 months, for example? @MGerlach, what do you think makes sense?

@kostajh I agree that we should re-run the pipelines after some time. If possible, updating after 6 months seems reasonable (though I don't have any quantitative insight into how quickly the model gets outdated). But almost 1.5 years since October 2021 seems too long.

That sounds good; I made T327212: Establish process for periodically refreshing link recommendation models to discuss further.

@kevinbazira: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!