
Integrate the model training and the deployment of "Add a link" to new Wikipedias exiting the Incubator
Open, Needs Triage, Public

Description

New Wikipedias are created from time to time. These Wikipedias might benefit from "Add a link".

The model for "Add a link" has to be trained at the time of deployment, so that newcomers have a few tasks to work on from the start, and more as the wiki expands.

Event Timeline

kostajh moved this task from Inbox to Triaged on the Growth-Team board.
kostajh subscribed.

Is there a precedent for process around this? Does something similar happen with ORES, for example?

Seems hard to train a link model for a wiki with near-zero content.

Is there a precedent for process around this? Does something similar happen with ORES, for example?

ORES is limited to largish wikis. It would have the same problem (no training data) plus it also takes significant volunteer effort to train, which for a brand new wiki can be better used elsewhere.

Seems hard to train a link model for a wiki with near-zero content.

How much content does an incubator wiki typically have?

Seems hard to train a link model for a wiki with near-zero content.

How much content does an incubator wiki typically have?

For the last two: guw.wp, created last March, now has 755 articles; kcg.wp, created yesterday, has about 875 articles.

We have a lot of wikis with fewer than 1,000 articles listed as active wikis. What is the minimum needed to train the models? We can skip all Wikipedias with fewer than [number] articles, but then we will have to find a way to monitor them so that they get the model and the tool when the time comes.

Seems hard to train a link model for a wiki with near-zero content.

How much content does an incubator wiki typically have?

For the last two: guw.wp, created last March, now has 755 articles; kcg.wp, created yesterday, has about 875 articles.

We have a lot of wikis with fewer than 1,000 articles listed as active wikis. What is the minimum needed to train the models?

pinging @MGerlach @kevinbazira about the last question.

@kostajh I don't think there is a well-defined minimum. In principle, you can train on anything, though fewer articles mean less training data. The question is then whether that is enough for the model to learn meaningful patterns. I honestly don't have a well-informed answer for that. We should try in any case for these wikis. We could track the performance of the backtesting evaluation for wikis of different sizes and check whether there is a significant drop when the number of articles becomes too small.

A few small wikis had models trained in round 4. Adyghe Wikipedia (ady) has 491 articles; Akan Wikipedia (ak) has 590 articles. They are small Wikipedias, so we can use them as a check. How can we do this?

If you would like to check wiki models before they are deployed, I think the backtesting evaluation can be used for this.

I usually add the backtesting numbers on each task, for example: T304548#7937440. Good indicators are a precision of around 75% (or more), and the recall should not drop below 20%.
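To make those rules of thumb concrete, here is a minimal sketch of screening a model's backtesting numbers against them. The threshold values come from the comment above; the function name, data format, and example numbers are made up for illustration, not real backtesting results.

```python
def passes_backtesting(precision, recall,
                       min_precision=0.75, min_recall=0.20):
    """Rule-of-thumb check from the comment above: precision around
    75% or more, and recall not dropping below 20%."""
    return precision >= min_precision and recall >= min_recall


# Placeholder numbers for illustration only.
print(passes_backtesting(0.78, 0.25))  # both thresholds met -> True
print(passes_backtesting(0.80, 0.15))  # recall too low -> False
```

Running such a check over the round-4 backtesting results for small wikis like ady and ak would show whether wiki size alone pushes the metrics below the targets.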

If you would like to check wiki models before they are deployed, I think the backtesting evaluation can be used for this.

I usually add the backtesting numbers on each task, for example: T304548#7937440. Good indicators are a precision of around 75% (or more), and the recall should not drop below 20%.

There's also usually a gap in time between when the datasets are published (so they can be queried on https://api.wikimedia.org/service/linkrecommendation/apidocs/) and when we start caching recommendations based on those datasets in MediaWiki. So you could use https://api.wikimedia.org/service/linkrecommendation/apidocs/ with the ady and ak wikis to see how well they perform now.
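A minimal sketch of querying the service for one article on a small wiki. The exact endpoint path and parameters are an assumption here and should be confirmed against the apidocs linked above; `recommendation_url` and the example title are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.wikimedia.org/service/linkrecommendation"


def recommendation_url(lang, title):
    """Build a request URL for one article.

    The path below is an assumption; confirm the real route
    against the apidocs before using it.
    """
    return (f"{API_BASE}/v1/linkrecommendations/wikipedia/"
            f"{lang}/{urllib.parse.quote(title)}")


def fetch_recommendations(lang, title):
    """Fetch link recommendations for one article (network call)."""
    with urllib.request.urlopen(recommendation_url(lang, title)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # e.g. fetch_recommendations("ady", "<some article title>")
    print(recommendation_url("ady", "Example"))
```

Sampling a handful of articles from ady and ak this way would give a quick, informal sense of recommendation quality before the MediaWiki caching step happens.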