
Add a link engineering: Maintenance script for retrieving, caching, and updating search index
Closed, Resolved · Public

Description

This could be a meta task with some discrete parts that engineers could work on in parallel.

At a high level (see https://wikitech.wikimedia.org/wiki/Add_Link and related docs for more detail), we need a maintenance script that (see the pseudocode sketch after this list):

  • iterates over ORES topics; for each topic:
    • get a random list of articles that don't have link recommendations (use an ES query)
    • for each article that doesn't have a link recommendation:
      • query the link recommendation service
      • cache the results in the GrowthExperiments MySQL table
      • fire an event to indicate the result of querying the service; the search pipeline handles updating the index
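
A minimal sketch of that loop in Python-style pseudocode, with the collaborators passed in as callables. The real implementation is a PHP maintenance script (refreshLinkRecommendations.php); every name below is illustrative, not the actual API:

```python
def refresh_link_recommendations(topics, search_missing, fetch_recs,
                                 store_recs, emit_event,
                                 tasks_per_topic=500):
    """Illustrative sketch only; all names are placeholders.

    For each ORES topic: find articles without recommendations (ES
    query), call the link recommendation service, cache the result in
    the GrowthExperiments MySQL table, and fire an event so the search
    pipeline updates the index.
    """
    for topic in topics:
        for article in search_missing(topic, limit=tasks_per_topic):
            recs = fetch_recs(article)   # link recommendation service
            store_recs(article, recs)    # GrowthExperiments MySQL table
            emit_event(article, recs)    # search pipeline updates the index
```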

Additionally there are a couple of parameters we should think about:

  • configurable number of tasks to query for, and a defined allocation of how many link tasks we want per topic (the default is 500)
  • The query/cache/search index update code should be modular enough to reuse in a job, because we may want to do some of this same work on page edit (refreshing recommendations) or deletion (purging the cache); see the sketch after this list.
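
For the modularity point, one rough shape (hypothetical names; the production code is PHP) is a single reusable unit that the maintenance script, an on-edit job, and an on-deletion job can all call:

```python
class LinkRecommendationUpdater:
    """Hypothetical sketch: one place for the query/cache/index logic."""

    def __init__(self, service, cache, indexer):
        self.service = service  # link recommendation service client
        self.cache = cache      # GrowthExperiments MySQL table wrapper
        self.indexer = indexer  # search index update / event emitter

    def refresh(self, page_id):
        """Fetch, cache, and index recommendations for one page;
        usable from the maintenance script or an on-edit job."""
        recs = self.service.fetch(page_id)
        self.cache.store(page_id, recs)
        self.indexer.notify(page_id)

    def purge(self, page_id):
        """Drop cached recommendations, e.g. from an on-deletion job."""
        self.cache.delete(page_id)
        self.indexer.notify(page_id)
```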

After selecting the random set of articles for a topic, the following additional rules apply (listed as attribute: initial setting). Some apply before calling the link recommendation service; others apply after calling the service but before saving the recommendations to the database. A filtering sketch follows the list.

  • Suggested links: 1) Must have at least 2 suggestions (configurable) per article with a probability score over X (X should be configurable per wiki). 2) We will display a maximum of 10 suggestions per article (this should be configurable). If the service provides more than 10, we should save the extras in the database in case we end up being unable to locate the phrases in the article text for a particular recommendation. Configuration should be allowed via NewcomerTasks.json.
  • Existing links: The idea is to filter out already well-linked articles. This is probably handled well enough by the link recommendation service, but let's double check on this.
  • Protection status: Exclude articles with any protection.
  • Categories to include/exclude: None, but make configurable via NewcomerTasks.json.
  • Templates to include/exclude: None, but make configurable via NewcomerTasks.json.
  • Time since last edit: 1 day (configurable via NewcomerTasks.json).
  • Time since last suggested-links edit: Do not use an article if the previous edit was a link recommendations edit *or* a revert of a link recommendations edit (see footnote [1]).
  • Article word count (max/min): None (configurable via NewcomerTasks.json).
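
A sketch of the "Suggested links" rule as filtering logic (Python for illustration; the dict shape and constant names are assumptions, and the real values would come from NewcomerTasks.json):

```python
MIN_SUGGESTIONS = 2          # minimum per article, above the score X
MAX_DISPLAYED = 10           # shown to the user; extras are still stored
PROBABILITY_THRESHOLD = 0.5  # "X", configurable per wiki

def filter_recommendations(suggestions):
    """Return (to_store, to_display), or None to skip the article.

    Everything above the threshold is stored so we have spares if a
    phrase can't be located in the article text, but at most
    MAX_DISPLAYED suggestions are shown.
    """
    good = [s for s in suggestions if s["score"] >= PROBABILITY_THRESHOLD]
    if len(good) < MIN_SUGGESTIONS:
        return None
    good.sort(key=lambda s: s["score"], reverse=True)
    return good, good[:MAX_DISPLAYED]
```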

[1] We want to avoid these two scenarios (though maybe there are better ways to do so):

  • User A gets 10 link suggestions on a long article. Adds most of them. Then the article goes back in the queue, gets its suggestions regenerated, and User B chooses it from the queue. User B adds 10 more suggestions. If that keeps happening, the article could get overlinked.
  • User A gets 10 link suggestions and adds them all. Then they get reverted. Then the article’s suggestions are regenerated and it goes back in the queue, where User B adds those same 10 suggestions again. Then it gets reverted again.

Related Objects

Event Timeline


I think we can get started on this (possibly breaking it out into a couple of subtasks). Since we don't have a production endpoint yet, we can work with a fake LinkRecommendationService provider for responses.

@kostajh said the first step here is to break out subtasks, likely this week.

kostajh updated the task description.

From the requirements table above:

Existing links: The idea is to filter out already well-linked articles. This is probably handled well enough by the link recommendation service, but let's double check on this.

And from the footnote in the task description:

User A gets 10 link suggestions on a long article. Adds most of them. Then the article goes back in the queue, gets its suggestions regenerated, and User B chooses it from the queue. User B adds 10 more suggestions. If that keeps happening, the article could get overlinked.

@MGerlach as I understand it, we do not need to worry about a potential problem with overlinking in either of these scenarios. If I send the link recommendation service an article that is already saturated with links, it's not going to give us more links -- it is clever enough to know if there are already enough links in the article, right?

If so, I think it would probably make sense to accept an increased risk of getting 0 results back (or fewer than N results, where N is the minimum number of recommendations we will save for an article, currently defined as 4) from the link recommendation service, rather than adding more logic in our calling code to check for link saturation or to exclude articles altogether if they had link recommendations added in the past.

Change 640582 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] [WIP] Maintenance script for updating link recommendations

https://gerrit.wikimedia.org/r/640582

@Urbanecm -- one of the specifications on this task is: The idea is to filter out already well-linked articles. This is probably handled well enough by the link recommendation service, but let's double check on this.

Could you add any notes or suggestions you have on how to identify whether an article is already well-linked? I think you experimented with "links/bytes ratio" in some of your work. Could you also link to the task where you did that work?

I am assuming these need to be per-wiki-configurable although it's not 100% clear from the description: number of tasks per topic, number of links per task.

I am also assuming "Categories to include/exclude" and "Templates to include/exclude" will not be specific to link recommendations but instead a generic list of articles to avoid for any task (such as articles under deletion). For categories we do have such a global list already (although it is empty).

@Urbanecm -- one of the specifications on this task is: The idea is to filter out already well-linked articles. This is probably handled well enough by the link recommendation service, but let's double check on this.

Could you add any notes or suggestions you have on how to identify whether an article is already well-linked? I think you experimented with "links/bytes ratio" in some of your work. Could you also link to the task where you did that work?

So, I basically used featured articles on cswiki to derive a baseline bytes-per-link ratio, rounded it up, and considered everything with more bytes per link as underlinked. However, my implementation was pretty naive, and the configuration process took some time (calculate the FA ratio and then verify it works as intended).
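
A sketch of that heuristic (Python for illustration; the baseline number below is made up, the real one was derived from cswiki featured articles):

```python
def is_underlinked(page_bytes, link_count, baseline_bytes_per_link):
    """More bytes per link than the featured-article baseline means
    the page is considered underlinked."""
    if link_count == 0:
        return page_bytes > 0  # no links at all on a non-empty page
    return page_bytes / link_count > baseline_bytes_per_link

# Illustrative numbers only: a 12,000-byte page with 10 links has
# 1,200 bytes/link, above a hypothetical 500 bytes/link baseline.
print(is_underlinked(12000, 10, 500))  # True
```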

Stuff I noticed when building that tool:

  • Some articles aren't really text-based, so https://cs.wikipedia.org/wiki/Tabulka_sf%C3%A9rick%C3%BDch_harmonick%C3%BDch_funkc%C3%AD would be suggested even though it doesn't have any text (this article would probably be deleted if created now, but... it's not the only one)
  • Some articles have a large portion of text that's basically unlinkable (a section describing a book's plot, a section describing a game's rules, etc.).
  • Disambiguation pages should never be recommended (that's true for some of our other features, as well)

The tool itself is available at https://articles-needing-links.toolforge.org/, https://github.com/wikimedia/labs-tools-articles-needing-links has the source.

Change 640582 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Maintenance script for updating link recommendations

https://gerrit.wikimedia.org/r/640582

The main work is done. We'll need to create a cronjob for this, and probably consider how to handle the fact that some wikis don't have link recommendations (a dblist? or just make the cronjob a no-op for those wikis?), so moving this back to Ready for Development for that work.

There are three possible approaches:

  • use a dblist (we don't want to make dblists available to PHP for performance reasons, but it could be used to define the set of wikis to run the cronjob for)
  • use a configuration variable for enabling link recommendations
  • just check whether there is a link-recommendation task type

One of the last two needs to be done anyway to enable the user-facing features, so the dblist seems like pointless overhead. Using the task type configuration means there is no way to selectively enable via a cookie or similar mechanism, and it makes it hard to track which wikis the feature is enabled on. So a configuration setting seems like the way to go.
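
In sketch form (the flag name is hypothetical; the real flag is a MediaWiki configuration variable read by the PHP maintenance script):

```python
def run_cronjob(config, topics, refresh_topic):
    """Gate the run on a per-wiki configuration flag, so the cronjob
    is a cheap no-op on wikis without link recommendations."""
    if not config.get("GELinkRecommendationsEnabled", False):
        return
    for topic in topics:
        refresh_topic(topic)
```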

[...] So a configuration setting seems like the way to go.

Agreed!

Change 655862 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Configuration flag for disabling link recommendations

https://gerrit.wikimedia.org/r/655862

Change 655863 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] [no-op] GrowthExperiments: Disable link recommendations

https://gerrit.wikimedia.org/r/655863

Change 655865 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Add GrowthExperiments maintenance script

https://gerrit.wikimedia.org/r/655865

Change 655862 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Configuration flag for disabling link recommendations

https://gerrit.wikimedia.org/r/655862

Change 655863 merged by jenkins-bot:
[operations/mediawiki-config@master] [no-op] GrowthExperiments: Disable link recommendations

https://gerrit.wikimedia.org/r/655863

Mentioned in SAL (#wikimedia-operations) [2021-01-20T00:30:20Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:655863|(no-op) GrowthExperiments: Disable link recommendations (T261408)]] (duration: 01m 05s)

@MMiller_WMF, @MGerlach generated some data about the number of link recommendations returned depending on how the score threshold is adjusted. Could you have a look at https://meta.wikimedia.org/wiki/Research:Link_recommendation_model_for_add-a-link_structured_task#Second_set_of_results_(2020-12) and let us know whether you think we should change the default score threshold (0.5) we will use for getting link recommendations for our target wikis?

Thanks for generating this info, @MGerlach! It seems to me that 0.5 is a threshold that basically gets us 70-80% precision with plenty of recall for all our wikis. That basically puts us in the range we want to be in. I think the only circumstances that would cause us to change that threshold are:

  • We are not able to generate enough link recommendations such that all the topics have plenty of articles.
  • We find that users apply bad judgment to the link recommendations, and therefore we need to turn up the precision to yield more good edits.

Thanks for generating this info, @MGerlach! It seems to me that 0.5 is a threshold that basically gets us 70-80% precision with plenty of recall for all our wikis. That basically puts us in the range we want to be in. I think the only circumstances that would cause us to change that threshold are:

  • We are not able to generate enough link recommendations such that all the topics have plenty of articles.

I made an estimate of the number of articles for which we could generate at least, say, 5 link recommendations depending on the choice of threshold (same section, a bit below), based on a subset of 500 randomly chosen articles in each language. For example, with a threshold of 0.5 for arwiki we would get 5 or more recommendations for ~10% of the articles. Extrapolating to the total number of content articles (~1M, see Wikistats), this would mean there are ~100k articles for which the algorithm generates 5 or more link recommendations.

In contrast, for bnwiki the same choice yields a much smaller number of articles. First, the total number of articles is much smaller (~100k, see Wikistats). Second, at threshold=0.5 the fraction of articles with at least 5 recommendations is smaller (~0.02). This means we could only get 5 or more recommendations for roughly 2,000 articles. Decreasing the threshold to 0.4 or 0.3 might cost some precision but would increase the number of articles with 5 or more recommendations by a factor of 10 or more, so we would end up with >10k articles. While this is not so much of an issue for the other, larger wikis, for smaller wikis such as bnwiki we might want to be less conservative so that we have enough articles with recommendations.
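
The extrapolation above as a quick calculation, using the approximate figures quoted in this comment:

```python
def eligible_articles(total_content_articles, fraction_with_5_plus):
    return total_content_articles * fraction_with_5_plus

print(eligible_articles(1_000_000, 0.10))  # arwiki at 0.5: ~100k articles
print(eligible_articles(100_000, 0.02))    # bnwiki at 0.5: ~2k articles
```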

Thus, the threshold of 0.5 seems reasonable for larger wikis. However, for smaller wikis (~100k articles or less) this choice might lead to a very small number of articles for which we can generate 5 or more recommendations.

Thus, the threshold of 0.5 seems reasonable for larger wikis. However, for smaller wikis (~100k articles or less) this choice might lead to a very small number of articles for which we can generate 5 or more recommendations.

@MGerlach @kostajh @Tgr -- are we storing all link recommendations at any threshold, and then choosing to display only the ones that have scores above the threshold? Or is the threshold applied upstream, before we start storing the recommendations? The reason I ask is that I'm wondering how easy it will be to adjust the thresholds for various wikis -- will it be a quick backport? Or would the whole script have to be re-run?

It's happening before storage. Right now, the only effect a configuration change would have is that, as old tasks are consumed, new ones are created with the new threshold. For more instantaneous updates we need a way to invalidate old tasks; this is something we were planning to leave out of the initial deployment (it's not too hard, it just didn't seem immediately necessary).

Thus, the threshold of 0.5 seems reasonable for larger wikis. However, for smaller wikis (~100k articles or less) this choice might lead to a very small number of articles for which we can generate 5 or more recommendations.

@MGerlach @kostajh @Tgr -- are we storing all link recommendations at any threshold, and then choosing to display only the ones that have scores above the threshold? Or is the threshold applied upstream, before we start storing the recommendations? The reason I ask is that I'm wondering how easy it will be to adjust the thresholds for various wikis -- will it be a quick backport? Or would the whole script have to be re-run?

We are storing recommendations for a fixed threshold (i.e. all recommendations for which the probability is above the threshold). The problem with keeping all recommendations (together with their probability values) is that different recommendations might overlap. As a result, accepting one recommendation could invalidate another one if their anchor texts overlap (partially). Currently, we prioritize longer anchors, provided their probability exceeds the threshold (i.e. we start with longer anchors and accept each one if it is above the threshold; after that, its text is blocked for any further link recommendations). If we wanted to store all recommendations and adjust the threshold later, we would have to include additional checks to make sure the different recommendations do not interfere with each other. This could be done, but requires work.
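
A sketch of that greedy selection (Python for illustration; the real logic lives in mwaddlink, and the candidate tuple layout is an assumption):

```python
def select_anchors(candidates, threshold):
    """Greedy pass: longest anchors first; keep a candidate only if
    it clears the threshold and doesn't overlap an accepted anchor.
    Candidates are (start, end, score, target) spans over the text."""
    accepted = []
    for start, end, score, target in sorted(
            candidates, key=lambda c: c[1] - c[0], reverse=True):
        if score < threshold:
            continue
        if any(start < a_end and a_start < end
               for a_start, a_end, _, _ in accepted):
            continue  # overlapping anchor text is already claimed
        accepted.append((start, end, score, target))
    return accepted
```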

I just found a draft comment in phab that I forgot to submit (sorry!), which was a proposal from @RHo to exclude disambiguation pages from getting selected for indexing. If that's something we still want to do, I think that should be a separate task, and we can discuss when to do it.

I just found a draft comment in phab that I forgot to submit (sorry!), which was a proposal from @RHo to exclude disambiguation pages from getting selected for indexing. If that's something we still want to do, I think that should be a separate task, and we can discuss when to do it.

+1 to filing as a separate task for later (leftovers candidate?)

You mean that links should not be recommended on disambiguation pages, right? (As opposed to not recommending links to disambiguation pages.) Good catch, that should probably be done before deployment; links on disambiguation pages can be disruptive. It seems like something that should be done in mwaddlink (although if we want to do it in the maintenance script, it's simple).

You mean that links should not be recommended on disambiguation pages, right? (As opposed to not recommending links to disambiguation pages.) Good catch, that should probably be done before deployment; links on disambiguation pages can be disruptive. It seems like something that should be done in mwaddlink (although if we want to do it in the maintenance script, it's simple).

Oh I was referring to not linking to disambiguation pages per this Community Wishlist item: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2021/Editing/Warn_when_linking_to_disambiguation_pages

Is this a case of "why not both" to exclude links to and from disambiguation pages?

On some wikis disambiguation bots use the links on the disambiguation page as possible targets, so adding more links confuses them.

Not linking to disambiguation pages is theoretically important, as such links will have to be manually fixed by someone (that would make a decent structured task, btw), but in practice it doesn't seem like it would come up: since links to disambiguation pages are almost nonexistent, the algorithm will never learn to suggest links to them in the first place, a few edge cases aside (like when a normal page has recently been turned into a disambiguation page). Also, I think adding this feature to mwaddlink was already discussed somewhere with @MGerlach, although I can't find it now.
If we are concerned about the edge cases, filtering out disambiguation pages from link targets on the PHP backend side is also fairly straightforward; see the sketch below.
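
For example (hypothetical names; the real check would live in the PHP backend, e.g. against the page's disambiguation page prop):

```python
def drop_disambiguation_targets(recommendations, is_disambiguation):
    """Filter out recommendations whose link target is a
    disambiguation page; is_disambiguation is a lookup callable."""
    return [r for r in recommendations
            if not is_disambiguation(r["target"])]
```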

Change 665332 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add topic parameter to refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665332

Change 665333 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix LinkBatch logic in refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665333

Change 665334 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix change tag handling in refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665334

Regarding thresholds: let's stick with 0.5 for now and do the necessary manual script runs to adjust thresholds later, after we've collected some data and can see the counts of available articles per topic. @MGerlach @kostajh @Tgr how does that sound?

Regarding disambiguation pages: I think it's most important that we don't have disambiguation pages in the feed as articles that need links added. I have a memory that those might be excluded upstream in the algorithm. Is that right, @MGerlach? If not, I think we need to prioritize that for the first version. We also don't want to suggest links to disambiguation pages, but I agree that the algorithm would be unlikely to suggest them.

Change 665332 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add topic parameter to refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665332

Change 665333 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix LinkBatch logic in refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665333

Change 665334 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix change tag handling in refreshLinkRecommendations.php

https://gerrit.wikimedia.org/r/665334

Tgr added a subscriber: Etonkovidova.

Done; probably not QA-able, but I'll leave it to @Etonkovidova whether she wants to.
Not running in Beta yet, that will require a puppet change. Let's track that in T274198: Beta wiki configuration for add link project.

Change 655865 merged by Jcrespo:
[operations/puppet@production] Add GrowthExperiments maintenance script

https://gerrit.wikimedia.org/r/655865

Change 673631 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Update GrowthExperiments cronjob parameters

https://gerrit.wikimedia.org/r/673631

Change 673631 merged by RLazarus:
[operations/puppet@production] Update GrowthExperiments cronjob parameters

https://gerrit.wikimedia.org/r/673631

Change 675172 had a related patch set uploaded (by Kosta Harlan; author: MGerlach):
[research/mwaddlink@main] Add filter for recommended links

https://gerrit.wikimedia.org/r/675172

Change 675172 merged by jenkins-bot:
[research/mwaddlink@main] Add filter for recommended links

https://gerrit.wikimedia.org/r/675172

Change 679282 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] Change minimum links per task to 2

https://gerrit.wikimedia.org/r/679282

Per chat, @MMiller_WMF proposes to change the minimum suggestions per article from 4 to 2.

Change 679282 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Change minimum links per task to 2

https://gerrit.wikimedia.org/r/679282