
Add a Link: add "do not link" rule for country names (Q6256) on English Wikipedia
Closed, ResolvedPublic3 Estimated Story Points

Description

User story & summary:

As an English Wikipedian, I want to customize the "Add a link" task to better align with the Manual of Style (MOS) guidelines and the specific norms of my wiki.

Primary concern: The current link suggestion algorithm frequently recommends linking country names (Q6256), which violates English Wikipedia's MOS:OL. I want the ability to prevent these suggestions, to reduce patroller burden and improve compliance with community standards.

Background & research:

The English Wikipedia community has identified compliance with MOS:OL as a key blocker to expanding the "Add a link" feature to more editors. Specifically, continued non-compliant link suggestions have raised concerns, as they:

  • Contravene longstanding community consensus.
  • Create unnecessary maintenance work.
  • Are an issue that could be addressed through software improvements.

Without addressing this issue, community pushback may limit the broader rollout of the feature.

Reference: Discussion from Wikipedia talk:Growth Team features

Implementation ideas:

Goal: Block country (Q6256) link suggestions at enwiki, as these account for the majority of MOS:OL violations in the current system.

Potential Approaches & (initial/imperfect) Effort Estimates

  1. Add country (Q6256) to the Link Suggestion algorithm's Hard-coded rules for not linking, and then retrain just the enwiki model.
    • Effort: Medium
    • Pros: Quick to implement; allows for testing impact on enwiki.
    • Cons: A temporary fix that doesn’t address broader customization needs and potentially introduces issues as we retrain other language models.
  2. Make the Link Suggestion algorithm's Hard-coded rules for not linking wiki-specific, so we can modify the list on a per-wiki basis.
    • Effort: Large
    • Pros: More flexibility for different language editions.
    • Cons: Requires additional development and maintenance.
  3. Allow communities to configure "do not link" rules via Community Configuration.

Acceptance Criteria:

Investigate and pursue approach 1.
Create subtasks if needed.
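Approach 1 amounts to dropping anchor entries whose link target is an instance (P31) of an excluded Wikidata class before the anchor dictionary is built. A minimal sketch of that idea; the function and data layouts are illustrative, not the actual mwaddlink code:

```python
# Hypothetical sketch of approach 1: filter anchor->target entries whose
# target is an instance of an excluded Wikidata class. Not the real pipeline
# code; data structures are illustrative.

EXCLUDED_QIDS = {"Q6256"}  # country; Q5107 (continent) was added later

def filter_anchors(anchors, instance_of, excluded=EXCLUDED_QIDS):
    """anchors: anchor text -> set of target titles.
    instance_of: target title -> set of Wikidata QIDs (P31 values)."""
    filtered = {}
    for anchor, targets in anchors.items():
        kept = {t for t in targets if not (instance_of.get(t, set()) & excluded)}
        if kept:  # drop the anchor entirely if no targets remain
            filtered[anchor] = kept
    return filtered
```

Because the anchor dictionary is what the model trains on and suggests from, removing a target here means it can never be suggested downstream.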

Event Timeline

KStoller-WMF triaged this task as Medium priority.
KStoller-WMF moved this task from Inbox to Needs Discussion on the Growth-Team board.
KStoller-WMF set the point value for this task to 3.

Change #1149652 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[research/mwaddlink@main] pipeline: exclude country name anchors in filter step

https://gerrit.wikimedia.org/r/1149652

@KStoller-WMF During this work, should we also take the opportunity to exclude Q5107 "continent" as well, per https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking#What_generally_should_not_be_linked, which includes "geographic features" as the second item in the list? That might prevent newcomer edits that link articles like "South America" and then get reverted (an example on enwiki).

Change #1149652 merged by jenkins-bot:

[research/mwaddlink@main] pipeline: exclude country name anchors in filter step

https://gerrit.wikimedia.org/r/1149652

@KStoller-WMF During this work, should we also take the opportunity to exclude Q5107 "continent" as well, per https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking#What_generally_should_not_be_linked, which includes "geographic features" as the second item in the list? That might prevent newcomer edits that link articles like "South America" and then get reverted (an example on enwiki).

Good catch and good idea. This wasn't specifically mentioned in enwiki discussions, but I imagine only because these suggestions aren't as common.
If it's easy enough to do, then let's do it!

Change #1152756 had a related patch set uploaded (by Michael Große; author: Michael Große):

[research/mwaddlink@main] pipeline: exclude continent name anchors in filter step

https://gerrit.wikimedia.org/r/1152756

Change #1152756 merged by jenkins-bot:

[research/mwaddlink@main] pipeline: exclude continent name anchors in filter step

https://gerrit.wikimedia.org/r/1152756

The .run-pipeline.sh script finally completed; these are the backtest results:

 N | threshold |     N | micro_precision    | micro_recall
 0 |       0.0 | 10000 | 0.5394674835061263 | 0.5941095060980884
 1 |       0.1 | 10000 | 0.6279730534805921 | 0.5926390450653057
 2 |       0.2 | 10000 | 0.6764304143269085 | 0.5782804255687224
 3 |       0.3 | 10000 | 0.7290285384837129 | 0.5468817576334227
 4 |       0.4 | 10000 | 0.7718913894974441 | 0.5028544243577545
 5 |       0.5 | 10000 | 0.8073190135242642 | 0.43885140979069365
 6 |       0.6 | 10000 | 0.8437405731523379 | 0.36286974571873376
 7 |       0.7 | 10000 | 0.8715345568137446 | 0.2895692786715101
 8 |       0.8 | 10000 | 0.9025999223903765 | 0.20117626708181977
 9 |       0.9 | 10000 | 0.9448860625331214 | 0.07710603701781699
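For reference, micro-averaged precision/recall at a threshold pools all scored candidates across test pages before counting. A minimal sketch under that assumption (the data layout is hypothetical, not the actual backtest code):

```python
# Illustrative sketch: micro-averaged precision/recall at a score threshold,
# pooling candidates from all pages into one list. Not the real evaluation code.

def micro_precision_recall(examples, threshold):
    """examples: iterable of (score, is_actual_link) pairs pooled over pages."""
    tp = fp = fn = 0
    for score, is_link in examples:
        if score >= threshold:
            if is_link:
                tp += 1
            else:
                fp += 1
        elif is_link:
            fn += 1  # a real link the model scored below the threshold
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Raising the threshold trades recall for precision, which is the pattern visible in the table above.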

At the 0.5 threshold these are slightly lower than the ones from Results_round-1 (0.81 | 0.45 | TRUE), but maybe still acceptable. cc @KStoller-WMF. The last boolean seems to refer to a manual validation, but I can't find related docs.

We're in the process of verifying that the datasets can be loaded into the service, but are experiencing some issues doing this locally, as we have in the past, due to the large size of the datasets we're working with. I'll get back to confirm the datasets are loadable ASAP.

I don't think those numbers are different enough to be concerned.

Is the "threshold" the same as the "Minimum required link score" we expose via Special:CommunityConfiguration/GrowthSuggestedEdits?
I know when we initially released to pilot wikis, we used ".5" as the threshold, but that we actually default to ".6" now (and let communities adjust further).

I don't think those numbers are different enough to be concerned.

Is the "threshold" the same as the "Minimum required link score" we expose via Special:CommunityConfiguration/GrowthSuggestedEdits?
I know when we initially released to pilot wikis, we used ".5" as the threshold, but that we actually default to ".6" now (and let communities adjust further).

I think that's correct.

Also: I'm running into a mysql error when trying to load the datasets, so I can't fully validate these can be loaded:

runuser@3dc1d659c973:/srv/app$ python3 load-datasets.py --wiki-id enwiki --path data/
== Initializing ==
   [general] Ensuring checksum table exists...[OK]
   [general] Ensuring model table exists...[OK]
   [enwiki] Ensuring anchors table exists...[OK]
   [enwiki] Ensuring redirects table exists...[OK]
   [enwiki] Ensuring pageids table exists...[OK]
   [enwiki] Ensuring w2vfiltered table exists...[OK]
   [enwiki] Ensuring model table exists...[OK]
   Beginning process to load datasets for enwiki
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for enwiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
     Deleting all values from lr_enwiki_anchors...[OK]
     Inserting content into lr_enwiki_anchors...Traceback (most recent call last):
  File "load-datasets.py", line 451, in <module>
    main()
  File "load-datasets.py", line 447, in main
    run(args)
  File "load-datasets.py", line 383, in run
    cursor.execute(line)
  File "/opt/lib/python/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/opt/lib/python/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/opt/lib/python/site-packages/MySQLdb/connections.py", line 259, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1206, 'The total number of locks exceeds the lock table size')
runuser@3dc1d659c973:/srv/app$

I'm going to try setting innodb_buffer_pool_size to whatever it is set to in the k8s service, but the result may be the same due to hardware limitations. How risky would it be to try to publish and load the datasets without this check, @Urbanecm_WMF? If the loading does not succeed, we would need to restore the previously published datasets for consistency, but that's it?

I managed to load the English Wikipedia model locally after applying the following patch:

urbanecm@wmf3345 mwaddlink % git diff
diff --git a/load-datasets.py b/load-datasets.py
index 9d6670b..f00af99 100644
--- a/load-datasets.py
+++ b/load-datasets.py
@@ -225,7 +225,7 @@ def run(args: argparse.Namespace):
         for wiki_id in wiki_ids:
             datasets_to_import = []
             # Start a transaction for each wiki. COMMIT happens after all datasets for the wiki have been updated.
-            mysql_connection.begin()
+            # mysql_connection.begin()
             local_dataset_directory = "%s/%s" % (args.path, wiki_id)
             if args.download:
                 print(
@@ -342,6 +342,7 @@ def run(args: argparse.Namespace):
 
             with mysql_connection.cursor() as cursor:
                 for dataset in datasets_to_import:
+                    mysql_connection.begin()
                     print("  ", "Processing dataset: %s" % dataset)
                     if dataset == "model":
                         print("    ", "Inserting link model...", end="", flush=True)
@@ -407,6 +408,8 @@ def run(args: argparse.Namespace):
                             ),
                         )
                     print(cli_ok_status)
+                    print("  ", "Committing %s..." % dataset, end="", flush=True)
+                    mysql_connection.commit()
 
                 print("  ", "Committing...", end="", flush=True)
                 mysql_connection.commit()
urbanecm@wmf3345 mwaddlink %

This patch is not mergeable, as it risks the model getting loaded half-way (the currently deployed code ensures the model gets loaded in full or not at all by wrapping the whole load in one very large transaction [millions of rows]), but it gets the job done on the local host.
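The same trade-off can be pushed further than one commit per dataset: committing every N statements keeps the InnoDB lock table small at the cost of losing all-or-nothing semantics. A sketch of that idea, using sqlite3 purely as a stand-in for the MySQL connection (names are illustrative, not the actual load-datasets.py code):

```python
import sqlite3

# Sketch of batched commits as an alternative to one huge transaction:
# commit every batch_size statements so locks are released periodically.
# Illustrative only; not the actual load-datasets.py code.

def load_rows(connection, statements, batch_size=10_000):
    """Execute INSERT statements, committing every batch_size of them."""
    cursor = connection.cursor()
    for i, statement in enumerate(statements, start=1):
        cursor.execute(statement)
        if i % batch_size == 0:
            connection.commit()  # release locks accumulated so far
    connection.commit()  # commit the final partial batch
```

As noted above, any such partial-commit scheme risks a half-loaded model on failure, so it would need the restore-previous-datasets fallback discussed earlier.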

I'll generate a couple of suggestions using the new model, so that they can be verified.

I've loaded the datasets locally and tried querying a page that's known to have a suggestion for Nigeria. This is the result:

urbanecm@wmf3345 ~ % curl -s 'http://localhost:8000/v1/linkrecommendations/wikipedia/en/Igwe_of_Nnewi_kingdom?threshold=0.5&max_recommendations=15' | jq .
{
  "links": [
    {
      "context_after": ", the gran",
      "context_before": "lement of ",
      "link_index": 0,
      "link_target": "Mmaku",
      "link_text": "Mmaku",
      "match_index": 0,
      "score": 0.6924687623977661,
      "wikitext_offset": 1074
    },
    {
      "context_after": ", Umudim a",
      "context_before": "re Obi in ",
      "link_index": 1,
      "link_target": "Uruagu",
      "link_text": "Uruagu",
      "match_index": 0,
      "score": 0.5273683071136475,
      "wikitext_offset": 4488
    },
    {
      "context_after": ". He was a",
      "context_before": "Region of ",
      "link_index": 2,
      "link_target": "Nigeria",
      "link_text": "Nigeria",
      "match_index": 1,
      "score": 0.5563704967498779,
      "wikitext_offset": 5315
    },
    {
      "context_after": ". The item",
      "context_before": " parts of ",
      "link_index": 3,
      "link_target": "Igboland",
      "link_text": "Igboland",
      "match_index": 0,
      "score": 0.6961286067962646,
      "wikitext_offset": 8557
    }
  ],
  "links_count": 4,
  "meta": {
    "application_version": "ec36482",
    "dataset_checksums": {
      "anchors": "cca229b23224c5934d0b9a75c9077a73e9b06ff7cd0f5a573435244c80f91b2c",
      "model": "115d888913e2202614056ed432637c400ecc3631d4cf9ac1041e26bf075ba039",
      "pageids": "beef5e9d9385786d1760fe645f1f0128eedd0e3b78f7b03e1ad3bea0047d65dd",
      "redirects": "f22ab3723fdee3129653981fa5180d13848359aa463503d0b505dfdde56396a9",
      "w2vfiltered": "0f4d575f6d37b4944089873b6db52c1e5f235c54c24c65761686be41f59f5f08"
    },
    "format_version": 1
  },
  "page_title": "Igwe of Nnewi kingdom",
  "pageid": 47833360,
  "revid": 1229683488
}
urbanecm@wmf3345 ~ %

The Nigeria suggestion is present (albeit with a smaller score). That doesn't seem to be correct.

I've loaded the datasets locally and tried querying a page that's known to have a suggestion for Nigeria. This is the result:

[...]

The Nigeria suggestion is present (albeit with a smaller score). That doesn't seem to be correct.

I don't understand how the training works in detail and only have a very superficial understanding of ML in general, but didn't we merely apply this to the training data, as opposed to as a filter during inference? Reducing the score of these suggestions would then be what I would expect as the outcome. And if there are no better suggestions, then it might still show up?

I don't understand how the training works in detail and only have a very superficial understanding of ML in general, but didn't we merely apply this to the training data, as opposed to as a filter during inference? Reducing the score of these suggestions would then be what I would expect as the outcome. And if there are no better suggestions, then it might still show up?

Basically, the suggestion system is based on "what other links already exist". The system first collects all existing links and, for each, determines the label text (what the link goes from) and the target title (where the link goes). Then, for each possible label text(*), we check whether it matches one of the known label texts. If it does, then we examine what links are going from that label and make a suggestion based on that.

The filtering we are talking about alters the dataset of "all existing links". Specifically, it removes all link targets that are an instance of one or more Wikidata QIDs. Since the link suggestion depends on the pre-existing links, removing a link target means ensuring that link target can never get suggested, because the Add Link machinery does not know that such a link exists.

TL;DR: Yes, we apply the filtering to the training dataset, but since we are pretending nothing links to Nigeria, the algorithm has no reason at all to suggest any link to Nigeria.

Hope this clarifies!

(*) This is determined as n-grams of length 1 to 10: each word in the article is considered alone, with the immediately following word, with the immediately following 2 words and so on (up to "10 words in total").
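The n-gram generation in the footnote can be sketched as follows; this is purely illustrative (the real pipeline tokenizes wikitext rather than splitting plain strings, and the anchor-dictionary lookup is a database, not a set):

```python
# Sketch of the footnote's candidate generation: every run of 1 to 10
# consecutive words is checked against the anchor dictionary. Illustrative
# only; not the actual mwaddlink code.

def candidate_ngrams(words, max_len=10):
    """Yield all n-grams of 1..max_len consecutive words."""
    for start in range(len(words)):
        for length in range(1, min(max_len, len(words) - start) + 1):
            yield " ".join(words[start:start + length])

def candidate_links(sentence, anchor_db, max_len=10):
    """Return the n-grams that match a known anchor (link label)."""
    return [ng for ng in candidate_ngrams(sentence.split(), max_len)
            if ng in anchor_db]
```

Any n-gram not present in the anchor dictionary is discarded immediately, which is why removing a target from that dictionary removes it from all suggestions.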

There's a high chance I ran the pipeline script from the wrong branch, one that did not include the newly excluded country and continent entities 😿, which would explain why we see Nigeria suggested. I have already re-triggered a training run, making sure the code change is used. Apologies for that.

That makes me think of some improvements, like re-training from a shared directory, improving the terminal prompt, or doing some checks in the sh script, e.g.: check the branch, check against the last trained commit.

I don't understand how the training works in detail and only have a very superficial understanding of ML in general, but didn't we merely add this part of the training data, as opposed to as a filter during inference? Reducing the score of these suggestions would then be what I would expect as the outcome. And if there are no better suggestions, then it might still show up?

Basically, the suggestion system is based on "what other links already exist". The system first collects all existing links and, for each, determines the label text (what the link goes from) and the target title (where the link goes). Then, for each possible label text(*), we check whether it matches one of the known label texts. If it does, then we examine what links are going from that label and make a suggestion based on that.

The filtering we are talking about alters the dataset of "all existing links". Specifically, it removes all link targets that are an instance of one or more Wikidata QIDs. Since the link suggestion depends on the pre-existing links, removing a link target means ensuring that link target can never get suggested, because the Add Link machinery does not know that such a link exists.

TL;DR: Yes, we apply the filtering to the training dataset, but since we are pretending nothing links to Nigeria, the algorithm has no reason at all to suggest any link to Nigeria.

Hope this clarifies!

(*) This is determined as n-grams of length 1 to 10: each word in the article is considered alone, with the immediately following word, with the immediately following 2 words and so on (up to "10 words in total").

Thank you, this actually does help a lot! A follow-up if you're up for it:

If it does, then we examine what links are going from that label and make a suggestion based on that.

Could you expand on that? How do we get from "text that matches at least one known label text" to "confidence score for that suggestion"? Is that the actual ML part?

If it does, then we examine what links are going from that label and make a suggestion based on that.

Could you expand on that? How do we get from "text that matches at least one known label text" to "confidence score for that suggestion"? Is that the actual ML part?

Sure! That part uses a binary classifier model, which is trained using both good links (positive examples) and missing/potential links (as in, links that are not present, used as negative examples). During the training pipeline (after we compute the anchor database [the map from link label to link targets I described above]), we:

  1. Generate a set of sentences based on the Wikipedia content (we do not use all possible sentences, just a subset).
  2. For each sentence, we generate all candidate links we could possibly suggest. We do this by generating all possible n-grams and checking whether they are present in the anchor database. If they are, we consider them to be a candidate link. If the n-gram is not present in the anchor database, we discard it.
  3. Every candidate link is labelled as a positive example (correct link) or a negative example (a candidate link that should not be recommended). This is done by looking at the already-existing links in the sentence. For example, let's consider "The Wikimedia Foundation employs software engineers" as a sentence (including the links). In that case, "Wikimedia Foundation -> Wikimedia Foundation" and "software engineers -> software engineering" would be positive examples (as those links actually exist) and "software -> software" or "engineering -> engineer" would be negative examples (as such links do not exist in this sentence [but may exist in other sentences]). This is used as the source of truth when training the model.
  4. For every candidate, we already know the anchor (link label) and the target title. In this step, we compute features the model then uses for making predictions. Specifically, we make use of the following features:
    • N-gram size: the number of tokens (words) in the anchor
    • Frequency: how many times the anchor-target pair is present in the dictionary
    • Ambiguity: how many distinct target pages are generated for the same n-gram
    • Levenshtein-distance between the anchor and the target title
    • Wiki2Vec distance between the source page and the target page (Wiki2Vec distance is a similarity metric that compares two articles using their content)
    • Kurtosis: the kurtosis of the shape of the distribution of candidate links for a given anchor in the anchor-dictionary
  5. We fit the model using the above-mentioned metrics and label data. This results in a model file, which then can be used to make predictions about arbitrary candidate links, provided you have the same features computed about the candidates you want to score.

When actually making suggestions, we use the anchor database to generate a set of candidate links (that match the label). For each candidate link, we compute the features I listed above, which we feed into the model, which then provides the prediction score.
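A few of the listed features can be sketched like this (Wiki2Vec distance and kurtosis omitted for brevity); the anchor-dictionary layout and function names are illustrative assumptions, not the actual mwaddlink code, and in the real system these feature vectors are fed to the trained binary classifier:

```python
# Sketch of the per-candidate feature computation described above.
# Illustrative only; omits the w2v and kurtosis features.

def levenshtein(a, b):
    """Classic DP edit distance between the anchor and the target title."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidate_features(anchor, target, anchor_db):
    """anchor_db: anchor text -> {target title: link count} (the anchor dictionary)."""
    targets = anchor_db.get(anchor, {})
    return {
        "ngram": len(anchor.split()),         # tokens in the anchor
        "frequency": targets.get(target, 0),  # how often this anchor->target pair occurs
        "ambiguity": len(targets),            # distinct targets for this anchor
        "levenshtein": levenshtein(anchor, target),
    }
```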

More details can be seen in the paper about Add Link:

Martin Gerlach, Marshall Miller, Rita Ho, Kosta Harlan, and Djellel Difallah. 2021. Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. In Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management (CIKM '21). Association for Computing Machinery, New York, NY, USA, 3818–3827. https://doi.org/10.1145/3459637.3481939

Does this clarify?

With the new version of the dataset, the Igwe_of_Nnewi_kingdom article no longer suggests Nigeria as a link:

urbanecm@wmf3345 mwaddlink % curl -s 'http://localhost:8000/v1/linkrecommendations/wikipedia/en/Igwe_of_Nnewi_kingdom?threshold=0.5&max_recommendations=15' | jq .
{
  "links": [
    {
      "context_after": ", the gran",
      "context_before": "lement of ",
      "link_index": 0,
      "link_target": "Mmaku",
      "link_text": "Mmaku",
      "match_index": 0,
      "score": 0.6924687623977661,
      "wikitext_offset": 1074
    },
    {
      "context_after": ", Umudim a",
      "context_before": "re Obi in ",
      "link_index": 1,
      "link_target": "Uruagu",
      "link_text": "Uruagu",
      "match_index": 0,
      "score": 0.5161203742027283,
      "wikitext_offset": 4488
    },
    {
      "context_after": ". The item",
      "context_before": " parts of ",
      "link_index": 2,
      "link_target": "Igboland",
      "link_text": "Igboland",
      "match_index": 0,
      "score": 0.7052578330039978,
      "wikitext_offset": 8557
    }
  ],
  "links_count": 3,
  "meta": {
    "application_version": "ec36482",
    "dataset_checksums": {
      "anchors": "f6c9d88ad9e55b7c0c3d15f29298447fa590d847724553bce87ee53e8aa670e8",
      "model": "af64a2773f50c6e2a075968e1cf3abb588baa31f029a7b973beee387727483f5",
      "pageids": "f96e0c8edd99c773e3477d2c1f2262d73375387e5e4ecc9a8abf39a7ca83dd2d",
      "redirects": "cab35247950b5d2ca8c825d173204f3c348d255d37be7701ec144fee97f51ff5",
      "w2vfiltered": "5063f0bc1c1b1958934df73b6acd0bfa7824e53935d20da34d09703bb8d06227"
    },
    "format_version": 1
  },
  "page_title": "Igwe of Nnewi kingdom",
  "pageid": 47833360,
  "revid": 1229683488
}
urbanecm@wmf3345 mwaddlink %

This is a good sign! I spot checked several other articles that had country links recommended, and the new version of the models does not do that. Let's move this forward!

Mentioned in SAL (#wikimedia-operations) [2025-06-27T08:56:32Z] <urbanecm> Publish new version of Add Link datasets for enwiki (T386867)

New dataset was successfully published:

[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$ curl https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/enwiki/enwiki.w2vfiltered.sqlite.checksum
3e6efce824ea791ed300822ab8cdf32898c334f48e9c117f066284f66f815f64  ../../data/enwiki/enwiki.w2vfiltered.sqlite.gz
[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$ cat data/enwiki/enwiki.w2vfiltered.sqlite.checksum
3e6efce824ea791ed300822ab8cdf32898c334f48e9c117f066284f66f815f64  ../../data/enwiki/enwiki.w2vfiltered.sqlite.gz
[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$

[...]

  1. For every candidate, we already know the anchor (link label) and the target title. In this step, we compute features the model then uses for making predictions. Specifically, we make use of the following features:
    • N-gram size: the number of tokens (words) in the anchor
    • Frequency: how many times the anchor-target pair is present in the dictionary
    • Ambiguity: how many distinct target pages are generated for the same n-gram
    • Levenshtein-distance between the anchor and the target title
    • Wiki2Vec distance between the source page and the target page (Wiki2Vec distance is a similarity metric that compares two articles using their content)
    • Kurtosis: the kurtosis of the shape of the distribution of candidate links for a given anchor in the anchor-dictionary
  2. We fit the model using the above-mentioned metrics and label data. This results in a model file, which then can be used to make predictions about arbitrary candidate links, provided you have the same features computed about the candidates you want to score.

When actually making suggestions, we use the anchor database to generate a set of candidate links (that match the label). For each candidate link, we compute the features I listed above, which we feed into the model, which then provides the prediction score.

More details can be seen in the paper about Add Link:

Martin Gerlach, Marshall Miller, Rita Ho, Kosta Harlan, and Djellel Difallah. 2021. Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. In Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management (CIKM '21). Association for Computing Machinery, New York, NY, USA, 3818–3827. https://doi.org/10.1145/3459637.3481939

Does this clarify?

It clarifies a lot, thank you! I bookmarked this task for future reference about this^^

With the new version of the dataset, the Igwe_of_Nnewi_kingdom article no longer suggests Nigeria as a link:

[...]

This is a good sign! I spot checked several other articles that had country links recommended, and the new version of the models does not do that. Let's move this forward!

New dataset was successfully published:

[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$ curl https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/enwiki/enwiki.w2vfiltered.sqlite.checksum
3e6efce824ea791ed300822ab8cdf32898c334f48e9c117f066284f66f815f64  ../../data/enwiki/enwiki.w2vfiltered.sqlite.gz
[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$ cat data/enwiki/enwiki.w2vfiltered.sqlite.checksum
3e6efce824ea791ed300822ab8cdf32898c334f48e9c117f066284f66f815f64  ../../data/enwiki/enwiki.w2vfiltered.sqlite.gz
[urbanecm@stat1008 /home/sgimeno/research-mwaddlink]$

Great progress! I'm looking forward to this work being done so that we can build on top of it in enwiki.

One note: If I remember correctly, after the last time we ran the revalidate script, we noticed a sharp spike in dangling search index records that seemed to roughly match up with the running of the revalidate script. We should keep an eye on that this time around. Maybe we need to run the fixLinkRecommendationData script after we're done with the revalidate script.

Mentioned in SAL (#wikimedia-operations) [2025-06-27T22:43:00Z] <urbanecm> Start GrowthExperiments:revalidateLinkRecommendations for enwiki (T386867)

Started revalidation:

[urbanecm@deploy1003 ~]$ mwscript-k8s --comment=T386867-revalidate --file enwiki-new-checksum.txt -f GrowthExperiments:revalidateLinkRecommendations -- --wiki=enwiki --verbose --exceptDatasetChecksums enwiki-new-checksum.txt --deleteNullRecommendations
⏳ Starting GrowthExperiments:revalidateLinkRecommendations on Kubernetes as job mw-script.eqiad.5sxxmelx ...
🚀 Job is running.
📜 Streaming logs:
Revalidating link recommendations:
  fetching task batch starting with page 0
  Eaton's Corrasable Bond is outdated, regenerating... success
  HMS Essex is outdated, regenerating... success
  M&M's is outdated, regenerating... success
  HMS Devonshire is outdated, regenerating... success
  Gregory La Cava is outdated, regenerating... All of the links in the recommendation have been pruned
  Decompression illness is outdated, regenerating... success
  WALR-FM is outdated, regenerating... success
  1977 in LGBTQ rights is outdated, regenerating... success
  Salt water aspiration syndrome is outdated, regenerating... success
  List of rocket-powered aircraft is outdated, regenerating... success
  Optus Television is outdated, regenerating... success
  scheduling deleting null recommendation for page ID 80184988... done.
  scheduling deleting null recommendation for page ID 80196018... done.
  scheduling deleting null recommendation for page ID 80247453... done.
  scheduling deleting null recommendation for page ID 80271584... done.
Done; replaced 10869, discarded 2540, null recommendations deleted 160503
[urbanecm@deploy1003 ~]$

Seems we're done?

  scheduling deleting null recommendation for page ID 80184988... done.
  scheduling deleting null recommendation for page ID 80196018... done.
  scheduling deleting null recommendation for page ID 80247453... done.
  scheduling deleting null recommendation for page ID 80271584... done.
Done; replaced 10869, discarded 2540, null recommendations deleted 160503
[urbanecm@deploy1003 ~]$

Seems we're done?

🙌

One note: If I remember correctly, after the last time we ran the revalidate script, we noticed a sharp spike in dangling search index records that seemed to roughly match up with the running of the revalidate script. We should keep an eye on that this time around. Maybe we need to run the fixLinkRecommendationData script after we're done with the revalidate script.

image.png (623×600 px, 50 KB)

(source)

We seem to have gained some dangling records, but it does not seem too bad. Though, I notice that the order of magnitude of discarded recommendations seems to match the order of magnitude of dangling database records. Maybe our mechanism of deleting those in that revalidation script is broken? Though this would seem to be a bug outside the overall scope of this task.

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:12:57Z] <urbanecm> Start GrowthExperiments:fixLinkRecommendationData --wiki=enwiki --db-table --force (T386867)

Let's fix this:

DB records

[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:fixLinkRecommendationData -- --wiki=enwiki --db-table --search-index --dry-run
⏳ Starting GrowthExperiments:fixLinkRecommendationData on Kubernetes as job mw-script.eqiad.mh1p0f6d ...
🚀 Job is running.
📜 Streaming logs:
Total number of OK search index entries: 41450
 (results in multiple topics counted multiple times)
Total number of dangling search-index entries: 463
Total number of OK db-table entries: 13720
Total number of dangling db-table entries: 1875
[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:fixLinkRecommendationData -- --wiki=enwiki --db-table --force
⏳ Starting GrowthExperiments:fixLinkRecommendationData on Kubernetes as job mw-script.eqiad.s3189c3g ...
🚀 Job is running.
📜 Streaming logs:
Total number of OK db-table entries: 13720
Total number of dangling db-table entries: 1875
[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:fixLinkRecommendationData -- --wiki=enwiki --db-table --search-index --dry-run
⏳ Starting GrowthExperiments:fixLinkRecommendationData on Kubernetes as job mw-script.eqiad.h9ytzgnz ...
🚀 Job is running.
📜 Streaming logs:
Total number of OK search index entries: 41450
 (results in multiple topics counted multiple times)
Total number of dangling search-index entries: 463
Total number of OK db-table entries: 13720
Total number of dangling db-table entries: 0
[urbanecm@deploy1003 ~]$

Search index

[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:fixLinkRecommendationData -- --wiki=enwiki --search-index
⏳ Starting GrowthExperiments:fixLinkRecommendationData on Kubernetes as job mw-script.eqiad.izev7j6a ...
🚀 Job is running.
📜 Streaming logs:
Total number of OK search index entries: 41450
 (results in multiple topics counted multiple times)
Total number of dangling search-index entries: 463
[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:fixLinkRecommendationData -- --wiki=enwiki --search-index --dry-run
⏳ Starting GrowthExperiments:fixLinkRecommendationData on Kubernetes as job mw-script.eqiad.r9zlbj8w ...
⏳ Waiting for the container to start...
🚀 Job is running.
📜 Streaming logs:
Total number of OK search index entries: 41074
 (results in multiple topics counted multiple times)
Total number of dangling search-index entries: 1
[urbanecm@deploy1003 ~]$

We should be good to go again.

Etonkovidova subscribed.

Checked on enwiki wmf.7 with api.wikimedia.org/service/linkrecommendation and with user variant 'link-recommendation' - the recommended links do not include countries.