
Add a link: prioritize suggestions of underlinked articles
Open, High, Public

Description

From watching how "add a link" is used on the wikis, community members have suggested that we direct newcomers' energies toward the articles that need links the most. To that end, we want to figure out a way to prioritize suggesting articles that appear underlinked, based on the ratio of links in the article to what would be expected.

We do not have exact logic to determine how "underlinked" an article is. That logic would be part of this task.

We also know there may be technical challenges depending on how this is implemented. We should talk about pros and cons of different implementations. Depending on the logic, perhaps adjustments could be made via community configuration.

Event Timeline

@Urbanecm_WMF -- I believe you did work to automatically identify underlinked articles. Could you please remind us?

MMiller_WMF renamed this task from "Add a link: prioritize suggestions underlinked articles" to "Add a link: prioritize suggestions of underlinked articles". Feb 7 2022, 6:16 AM

(sorry, wrong task)

@Urbanecm_WMF -- I believe you did work to automatically identify underlinked articles. Could you please remind us?

Hello, sure thing. I created https://articles-needing-links.toolforge.org (source code), which allows users to add the {{wikify}} template to articles that have fewer links than would be appropriate. The tool uses a "bytes per link" approach: it divides the article's length in bytes by the number of links it has, and if that number exceeds a configured threshold, it reports the article to the user as underlinked. For Czech Wikipedia, the threshold was set to 1500 bytes per link (estimated by calculating the average bytes per link for Czech featured articles and rounding up).

In technical terms, the tool uses the following SQL query:

select
    page_id,
    page_title,
    page_len/count(*) as bytes_per_link
from pagelinks
join page on page_id=pl_from
-- only articles that are long enough
where page_len>{minimum_len} and page_namespace=0
-- not already tagged with {{wikify}}
and page_id not in (select tl_from from templatelinks where tl_title="{wikify_template}")
-- no disambiguation pages
and page_id not in (select pp_page from page_props where pp_propname="disambiguation")
-- not a list
and page_title not rlike "{list_regex}"
-- report pages with more bytes per link than the configured threshold
group by page_id having bytes_per_link>{threshold}

When using the tool to stimulate the old, unstructured "add a link" task, the approach seemed to perform sufficiently well (at least from what I recall).

Hope this helps -- happy to answer more questions about the tool if you have any.

Article length and number of links are both included in the search index, so maybe this could be done via a custom CirrusSearch scoring method, which we would use when creating candidates for task generation. It has the same risk all non-random sorts do: over time users would process the good tasks, and we would end up with a sort in which the most underlinked articles are all unsuitable for some reason, so every round of the task generation script would have to churn through them and take forever. Some sort of exclusion list in MediaWiki could fix that, I suppose.

Identifying underlinked articles by calculating bytes per link sounds like a good approach. A few comments/caveats:

  • Using the pagelinks table to get the number of links in an article can lead to a highly inflated link count. From what I understand, the pagelinks table keeps track of all links, including those that are transcluded from, e.g., templates and are not explicitly visible in the wikitext as "links in the text". One recent study compared the number of links in enwiki from the html-dump (475M) to the wikitext-dump (171M), a difference of a factor of ~3 (link to pdf). While I still think the pagelinks table is a good first proxy, it might introduce a bias, as the use of templates (and their transcluded links) is probably not equally distributed across different types of articles.
  • The threshold for bytes_per_link that defines underlinking is probably different for each wiki. It would probably make sense to derive that value from percentiles of the distribution; for example, one selects the 10% of articles with the lowest bytes_per_link scores. In enwiki, this corresponds to a threshold value of 47 bytes_per_link (based on the links contained in the wikitext). A small sketch of this percentile approach follows this list.

Screenshot from 2022-02-25 16-42-58.png (313×450 px, 27 KB)

  • Using the bytes-per-link measure, longer articles are more likely to be considered underlinked. At least for enwiki there seems to be a strong correlation between bytes_per_link and article length (C=0.53). This is probably because most links appear in the lead section of an article (or, more generally, near the beginning), so longer articles have fewer links on average relative to their size (and thus larger values of bytes_per_link). This is probably fine, since it serves the purpose of avoiding overlinking, but I wanted to mention that it might introduce a bias as well.
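
To make the percentile idea above concrete, here is a minimal sketch of deriving a per-wiki threshold from the observed distribution of bytes_per_link rather than hard-coding one. It is written in Python with made-up sample values; it uses the upper tail of the distribution (more bytes per link = fewer links, following the tool's convention above), and the cutoff percentile is purely illustrative.

# Sketch: derive a per-wiki "underlinked" threshold from the observed
# distribution of bytes_per_link instead of hard-coding a single value.
# Sample values and the cutoff percentile are made up for illustration.
import numpy as np

# bytes_per_link (page_len / number of links) for a sample of articles,
# e.g. taken from the SQL query earlier in this task
bytes_per_link = np.array([38.0, 47.0, 72.0, 95.0, 120.0, 310.0, 640.0, 1800.0])

# Treat the upper tail of the distribution as "underlinked"
# (more bytes per link = fewer links relative to the article's size).
cutoff_percentile = 90
threshold = np.percentile(bytes_per_link, cutoff_percentile)

underlinked = bytes_per_link > threshold
print(f"threshold = {threshold:.0f} bytes per link")
print(f"{underlinked.sum()} of {len(bytes_per_link)} sample articles flagged as underlinked")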

T301648: Allow filtering of WhatLinksHere to remove links from templates was #6 on the Community Tech wishlist, so there's a good chance the pagelinks table will be extended within a year or so to track which links come from a template.

@mewoph and I talked about this in our 1:1 last week. One idea would be to have bytes_per_link_threshold and minimum_length as values managed via community configuration (with sensible defaults); refreshLinkRecommendations.php would then add a step to its candidate-evaluation process that filters candidates using the logic of the SQL query in T301096#7690092. But instead of a SQL query fetching all pages, we'd just check each candidate title.

It would be nice if Special:EditGrowthConfig could provide some dynamic feedback about representative articles matched by the configured conditions (bytes_per_link above bytes_per_link_threshold and length above minimum_length), so that it is easier for maintainers of community configuration to know which values to input for the threshold and minimum length.
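
For illustration, here is a minimal sketch of the per-candidate check described above, written in Python rather than the PHP it would actually live in. The parameter names mirror the proposed community-configuration keys; the default values (1500 bytes per link from the Czech tool, a 1000-byte minimum length) are assumptions, not decided defaults.

# Illustrative only: the real check would live in refreshLinkRecommendations.php
# and would fetch page_len and the link count from MediaWiki itself.
def is_underlinked(page_len: int, link_count: int,
                   bytes_per_link_threshold: float = 1500,
                   minimum_length: int = 1000) -> bool:
    """Return True if a candidate article should count as underlinked."""
    if page_len < minimum_length:
        # Too short to judge link density reliably.
        return False
    if link_count == 0:
        # No links at all: trivially underlinked.
        return True
    return page_len / link_count > bytes_per_link_threshold

# A 12,000-byte article with 5 links has 2,400 bytes per link -> underlinked.
print(is_underlinked(page_len=12_000, link_count=5))   # True
print(is_underlinked(page_len=12_000, link_count=30))  # False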

kostajh lowered the priority of this task from High to Medium. Mar 14 2022, 8:11 PM

I have moved this task from the "add a link iteration 2" epic to the "improvements" epic. That's because this improvement will not be a blocker for deploying to more wikis. But it is still a near-term priority and belongs on our sprint board.

Copying a comment made on Slack:

Do we want to prioritize underlinked articles or limit tasks to them? (The latter would potentially mean fewer tasks; e.g. hiwiki only has 7K tasks currently, so it might be worth checking how many tasks such a filter would result in. Of course, with per-wiki configuration, wikis with fewer tasks can always just disable this.)

If we strictly want to prioritize, I can think of two ways:

  • Create a custom CirrusSearch sort, which should be able to use a simple mathematical formula with the number of links and the number of bytes as parameters.
  • When a good task candidate has been found, calculate some sort of underlinkedness score. (This could be more complex, e.g. could look at the Parsoid HTML to filter out links from templates.) Use the score as the weight for the recommendation.link weighted tag. Change HasRecommendationFeature to handle weights (if it needs to be changed, not sure about that).

If we are fine with filtering, then T301096#7773354 could work, but might slow the refresh script down (lots of false positive task candidates, assuming most pages are not underlinked). Pushing the condition into the CirrusSearch query for finding task candidates (either as a filter or as a sort) would be better IMO.

kostajh raised the priority of this task from Medium to High. May 24 2022, 4:24 PM

Per discussion with @MMiller_WMF now, we want to prioritize rather than filter.

I am not exactly sure how we will combine prioritization with the random sorting that we have, though.

  • Create a custom CirrusSearch sort, which should be able to use a simple mathematical formula with the number of links and the number of bytes as parameters.

That sounds like a good plan, assuming that we don't need to ask the search team for any of their time to get this working.

That sounds like a good plan, assuming that we don't need to ask the search team for any of their time to get this working.

I imagine we'd want a +1 from them just to make sure we aren't doing anything stupid with the custom ElasticSearch query fragment the sort is adding, but I think that's a very lightweight commitment.

Change 801010 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [WIP] Add rescore method for sorting by underlinkedness

https://gerrit.wikimedia.org/r/801010

MediaWiki core defines the outgoing_links field as SearchIndexField::INDEX_TYPE_KEYWORD, which CirrusSearch translates as type: text, analyzer: keyword. Unfortunately, scripts cannot access 'text' fields, regardless of the analyzer, so while a custom rescore function would work with a type: keyword field, it doesn't work with this one. There is a flag for changing that, but I doubt it's feasible to enable. And without scripting, the ElasticSearch sort and rescore logic doesn't have anything resembling array size (number of outgoing links). Not sure how to recover from that.

It seemed to me that such a task was not created, but here it is! :)

Can we, regardless of filtering, prioritize articles with a certain template?

Can we, regardless of filtering, prioritize articles with a certain template?

We can, via the boost-templates CirrusSearch feature, but that would be another task as the mechanism is different (the only overlap would be that both options would require changing our current random sort method to some real sort that doesn't discard the weights from the search query).

To recap the situation: GrowthExperiments is finding link recommendation tasks for users with a hasrecommendation:link query (potentially with other filters like articletopic mixed in), with sort=random to avoid edit conflicts. Link recommendations are useful to get new editors engaged and increase retention, but not so useful to improve an established article (it's hard, both for an algorithm and a new editor, to judge which links would be relevant); the community is much more favorable to link-recommendation-based edits being done on new / underdeveloped articles. Link frequency is a reasonable approximation for how well-developed an article is, so we'd like to weight search results by that. (And hopefully still avoid edit conflicts; not sure how that would work. But that's a secondary concern for now.)

I thought the best approach would be a custom rescore profile in CirrusSearch, using a boost function like doc['text_bytes'] / doc['outgoing_link'].length. That turns out not to work: outgoing_links is a field with type text, and text fields aren't exposed to scripts; and without scripting, there doesn't seem to be any way to use the length of a field (ie. array size) in a rescore query.

As far as I can see the options are:

  1. Maybe there is something I missed, and there is a way to access text fields in scripts, or to access array size without scripting in a rescore query. That would be great.
  2. The mapping for outgoing_links could be set to fielddata=true, which does allow scripts to access it. But it would probably have an adverse effect on memory usage, and thus performance.
  3. A new outgoing_link_count field could be added to the search index.
  4. We could calculate the rescore factor (article length divided by link count, or something similar) outside Cirrus, and then import the scores (maybe as weights for the recommendation.link/exists weighted tags, and then create a variant of HasRecommendationFeature that would take tag weights into account). But that would probably be both more effort (needs an import mechanism) and less flexible (any changes to the rescore logic would require reimporting the data).
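
To illustrate option 4: the weight calculation itself would be simple, and the import mechanism is the hard part. A rough sketch in Python, assuming the weight needs to be squashed into some bounded integer range; the 1-1000 range and the 4000-bytes-per-link cap are invented for the example and are not the actual weighted-tags contract.

# Sketch for option 4: compute an "underlinkedness" weight outside Cirrus.
# The bounded range and the cap are illustrative assumptions only.
def underlinkedness_weight(text_bytes: int, link_count: int,
                           cap_bytes_per_link: float = 4000,
                           max_weight: int = 1000) -> int:
    """Map bytes-per-link onto a bounded weight: more bytes per link -> higher weight."""
    bytes_per_link = text_bytes / max(link_count, 1)
    fraction = min(bytes_per_link / cap_bytes_per_link, 1.0)
    return max(1, round(fraction * max_weight))

print(underlinkedness_weight(text_bytes=20_000, link_count=10))   # 500
print(underlinkedness_weight(text_bytes=20_000, link_count=200))  # 25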

Accessing the array count in real time without a specific mapping will be expensive, if even possible. Once indexed, the array no longer exists, and the only way to find out would be to decode the JSON blob that contains the entire document, or walk the position lists and count the gaps. Not really plausible to do while scoring.

A new subfield of outgoing_link would probably be the best bet. Much like how the text.word_count field tokenizes the input into words and counts the number of tokens, the same functionality could be used with the keyword analyzer to count the number of entries in the outgoing_link field. The patch for this should be quite easy; deploying it to production will require reindexing all wikis. Is the intent to use this across the fleet, or is there a subset of wikis that could be done? A reindex of all wikis is sadly a bit error-prone and takes more than a week to process (and has annoying interactions with various maintenance activities), but a few target wikis would be easy enough to do in a couple of days.

Change 816353 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [DNM] Use params._source for rescoring

https://gerrit.wikimedia.org/r/816353

Thanks a lot for the help @EBernhardson!

Tests for the prioritized search results are available at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing
The first column contains the first 100 results from the current (randomized) search logic, the second column from the new prioritized logic. The sheets called <wiki>_with_topic use a search with a topic filter (video games) enabled. (Topic filtering will affect search rankings, so this is significantly different - it is possible that prioritization will perform well with topics but poorly without topics, or vice versa.)
Let me know if a different format or content would be more useful.

The script used to make these files:

#!/bin/bash
# Compare the current (random) and the prioritized ("underlinked first") search
# results for a set of wikis and write them side by side into per-wiki CSV files.

WIKIS="arwiki bnwiki cswiki eswiki frwiki huwiki viwiki"
CIRRUS_SERVER="https://search.svc.eqiad.wmnet:9243"

# Turn a CirrusSearch response into a one-column CSV of spreadsheet hyperlinks,
# with $1 as the column header and $2 as the wiki database name (e.g. "cswiki").
convert_response_to_csv() {
    jq --raw-output \
        --arg column_header "$1" \
        --arg wiki_lang "${2:0:-4}" \
        '[$column_header], (.hits.hits[]._source.title | { "title": ., "titlee": . | gsub(" "; "_") | @uri } | ["=HYPERLINK(\"https://\($wiki_lang).wikipedia.org/wiki/\(.titlee)\",\"\(.title)\")"]) | @csv'
}

for WIKI in $WIKIS; do
    for WITH_TOPIC in _with_topic ''; do
        # "Normal" column: the same query with the rescore clause stripped out.
        curl -s "${CIRRUS_SERVER}/${WIKI}_content/_search?pretty" \
                -H 'Content-Type: application/json' \
                -d @<(jq '. | del(.rescore)' addlink_rescore_underlinked${WITH_TOPIC}.json) \
            | convert_response_to_csv 'Normal' "$WIKI" \
            > tmp_addlink_plain${WITH_TOPIC}.csv
        # "Prioritized" column: the full query including the underlinkedness rescore.
        curl -s "${CIRRUS_SERVER}/${WIKI}_content/_search?pretty" \
                -H 'Content-Type: application/json' \
                -d @addlink_rescore_underlinked${WITH_TOPIC}.json \
            | convert_response_to_csv 'Prioritized' "$WIKI" \
            > tmp_addlink_rescore_underlinked${WITH_TOPIC}.csv
        # Merge the two columns into one CSV per wiki (and per topic variant).
        paste -d, tmp_addlink_plain${WITH_TOPIC}.csv tmp_addlink_rescore_underlinked${WITH_TOPIC}.csv > ${WIKI}${WITH_TOPIC}.csv
    done
done

rm tmp_addlink_{plain,rescore_underlinked}{,_with_topic}.csv


(removed duplicate comment)

FWIW on huwiki the search seems to perform reasonably well. Assuming performance is decent in other languages as well, I think there are two other potential issues to think about, but we can do that once the new scoring is in production behind a feature flag:

  • Will it result in too many edit conflicts? Previously we used random sorting, so Add Link edits got dispersed over the whole task pool. With underlinked articles sorted on top, edits will cluster around the tasks with the least-linked articles, so collisions could become more frequent. We can probably just add a random factor to the score if that becomes a problem, so I'm not too worried about this.
  • Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links? We could increase the model score threshold, but that would slow down the request and possibly cause a timeout.

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

FWIW on huwiki the search seems to perform reasonably well. Assuming performance is decent in other languages as well, I think there are two other potential issues to think about, but we can do that once the new scoring is in production behind a feature flag:

  • Will it result in too many edit conflicts? Previously we used random sorting, so Add Link edits got dispersed over the whole task pool. With underlinked articles sorted on top, edits will cluster around the tasks with the least-linked articles, so collisions could become more frequent. We can probably just add a random factor to the score if that becomes a problem, so I'm not too worried about this.

I think we'd just see an increase in "No suggestion found" messages shown to users, rather than edit conflicts. Once a user does a link recommendation edit, the cache entry with the link recommendation metadata is removed, so a subsequent visit by another user to that page would trigger the "No suggestion found".

  • Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links?

That sounds like a follow-up task to me but I defer to @KStoller-WMF.

We could increase the model score threshold, but that would slow down the request and possibly cause a timeout.

Do you mean, instead of using the cached link recommendation metadata, we'd issue a new request to the service with different parameters with a higher score threshold?

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

@KStoller-WMF @MShilova_WMF @Trizek-WMF who should do the spot-checking? The ambassadors, QA / @Etonkovidova, Growth engineers, some combination of those? See also T301096#8100507 for more context on that file.

Do you mean, instead of using the cached link recommendation metadata, we'd issue a new request to the service with different parameters with a higher score threshold?

I was thinking of calculating the length/links ratio in LinkRecommendationUpdater and making the score threshold depend on that.
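
Purely as an illustration of that idea (the constants and the linear interpolation are invented for the example and do not reflect how LinkRecommendationUpdater is configured today):

# Illustration only: raise the model score threshold for strongly underlinked
# articles so that only the most relevant links are suggested there.
# All constants are invented for the example.
def score_threshold(bytes_per_link: float,
                    base_threshold: float = 0.5,
                    strict_threshold: float = 0.8,
                    underlinked_at: float = 1500) -> float:
    """Interpolate between a relaxed and a strict threshold based on link density."""
    # 0.0 = not underlinked at all, 1.0 = at or beyond the underlinked cutoff
    underlinkedness = min(bytes_per_link / underlinked_at, 1.0)
    return base_threshold + (strict_threshold - base_threshold) * underlinkedness

print(score_threshold(bytes_per_link=3000))  # 0.8  (clearly underlinked -> stricter)
print(score_threshold(bytes_per_link=150))   # 0.53 (well linked -> relaxed)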

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

@KStoller-WMF @MShilova_WMF @Trizek-WMF who should do the spot-checking? The ambassadors, QA / @Etonkovidova, Growth engineers, some combination of those? See also T301096#8100507 for more context on that file.

Ambassadors are currently working on comparing the search samples. I just created T314299: Add a link: check prioritized suggestions of underlinked articles for clarity.

Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links?

I added T314343 to the "Growth: "add a link" structured task 3.0 Epic".

@Tgr, a question from @Dyolf77_WMF: are ar.wiki prioritized articles reviewed (unflagged)?

@Tgr, a question from @Dyolf77_WMF: are ar.wiki prioritized articles reviewed (unflagged)?

Not necessarily; neither the sorted nor the unsorted query takes FlaggedRevs status into account in any way.

Growth Ambassadors completed an initial review of the quality of Add a link suggestions, and we found that the majority of the prioritized article suggestions were better: about 70% of the Prioritized suggestions were better than the unprioritized (Normal) article suggestions. To be clear, this was a manual review, so there is some subjectivity and the results are likely not statistically significant.

The main difference between "Normal" and "Prioritized" (besides the ratio of links to text) is that Prioritized suggestions are more likely to be longer articles (and potentially, more often, poorer-quality ones). There is some concern that adding links to articles that are already low-quality might be seen as a very low-impact edit. However, there is also the perspective that we are bringing more attention to articles that need it, so perhaps that's an OK tradeoff.

@Trizek-WMF - What else would you add to this summary?
@Tgr - you mentioned that it might be a fair amount of work to complete this task / reindex search results. Do you have a rough estimate of what it would take so we can decide if the impact is worth that effort?

@Tgr - you mentioned that it might be a fair amount of work to complete this task / reindex search results. Do you have a rough estimate of what it would take so we can decide if the impact is worth that effort?

It will be a fair amount of time (search index schema changes take a month or two), not necessarily a fair amount of work. The minimal version of the patch is almost done (IIRC the only thing left is exposing the new settings in Special:EditGrowthConfig); the open question is what extra changes we will need. Do we want some kind of beta period where only some people see it? How much will we have to fine-tune the weight function?

Optimistically, there is 2-3 days' worth of work left, plus the Search team has to add a new index (which I *think* isn't a lot of work, but I'm not sure).

@Trizek-WMF - What else would you add to this summary?

Now that we heard from everyone, I closed the subtask. In T314299#8142535 I wrote this summary:

Overall, the new priority setup is better, even if it suggests longer articles with fewer links. These longer articles are often low-quality articles, such as unreviewed machine-translated articles, lists of TV series episodes, or movie plot summaries. Some of them are unchanged bot creations.

Suggesting this low-quality content may be problematic, as we would be encouraging newcomers to add links to articles that don't need them (synopses) or that require more urgent work (reviewing translations, adding citations to walls of text, etc.).

@KStoller-WMF, @Tgr, my advice would be to refine the new model so it identifies articles that are underlinked walls of text with no images; these characteristics are typical of translated articles and lists of episodes. Excluding lists would be nice too.

my advice would be to refine the new model so it identifies articles that are underlinked walls of text with no images; these characteristics are typical of translated articles and lists of episodes. Excluding lists would be nice too.

Thank you so much for the evaluation and summary! I agree we might want to consider refining the model slightly.
But I am worried about the scope of this task increasing too much. It seems like some of the concerns mentioned could be mitigated if communities customized their "Add links between articles" configuration, right?

Current "Add links between articles" options to configure that might help:

  • Articles containing categories defined here will not be displayed to users as tasks for this type of task.
  • Landing page to learn more about the link recommendation task type.
  • The list of section names where no link should be recommended.

@Tgr - Or are there any relatively straightforward ways to proceed with @Trizek-WMF's suggestion that I'm missing? Is it possible to exclude lists?
(My concern is twofold: we don't want this task to keep growing in scope, and excluding "underlinked walls of text with no images" seems somewhat contradictory to what this task is all about.)