Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links
Closed, ResolvedPublic

Description

One of the comments in the manual evaluation of the link recommendation in T278864#6974599 mentioned that some of the link recommendations appear in sections that typically don't have links, e.g. the "Sources" section, in which links were suggested inside of citations. We should avoid suggesting links in these sections. Examples can be found in https://docs.google.com/spreadsheets/d/1RH5mMC1oTwc_DxE-R6ToEv1cver4Aoh21Ur9XuYIG9c/edit#gid=0

At the moment, we identify elements of raw text from the wikitext of an article using mwparserfromhell. Some of the content in sections such as "Sources" is then likely parsed as raw text and, in turn, used as potential anchor-text for links.

In principle, we can identify specific sections from the wikitext ("== <section-title> ==") and exclude them from being considered by the link recommendation. The challenge will be to identify the relevant section-titles across different languages/wikis. For this, we could potentially use prior work from Research on section-alignment across languages.

Event Timeline

Can we do something silly (but also quick and effective), like ignore the last section (maybe last two sections) when parsing the page, on the assumption that these sections may be sources/references sections?

@kostajh -- I appreciate the creativity, but I actually don't think this issue is common or severe enough that we should do a stopgap that could have unintended consequences.

In any case, this should be super rare, right? The algorithm only looks for links in top-level wikitext and the sources section is almost always a bullet list or a <references> tag.

@MGerlach, I'm moving this to the post-release backlog in terms of what Growth team is working on, but please feel free to work on it in the interim.

One example of a link recommendation in a section that should not get it - https://phabricator.wikimedia.org/T277812.

kostajh triaged this task as Medium priority.Apr 26 2021, 7:55 PM

In any case, this should be super rare, right? The algorithm only looks for links in top-level wikitext and the sources section is almost always a bullet list or a <references> tag.

Unfortunately, due to a quirk of mwparserfromhell, bullet lists are considered top-level wikitext, so this is actually not rare at all.

One option would be to add a heuristic that identifies text belonging to a list via a regex on the wikicode. For example, we could capture all text between "#" and "\n" (other symbols are used for lists too):

r'#(.*?)\n'

When we iterate over the text-nodes provided by mwparserfromhell, we could check whether the text comes from a list; and in that case not add any links at all.

While lists are also used outside of external-references sections, this would be a conservative measure that would probably be relatively simple to implement, since it won't require anything language-specific (such as identifying specific section titles).
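A minimal sketch of such a check, using only the standard library. The names `LIST_LINE`, `is_list_line` and `non_list_lines` are hypothetical. Unlike the pattern above, this one is anchored at the start of a line and also counts `*`, `;` and `:` as list markers; that is an assumption for illustration, not what the service does.

```python
import re
from typing import List

# Assumption: a list item is a line starting with one or more of the
# wikitext list markers *, #, ; or : (hypothetical choice for this sketch).
LIST_LINE = re.compile(r"^[*#;:]+")


def is_list_line(line: str) -> bool:
    """Return True if the line looks like a wikitext list item."""
    return LIST_LINE.match(line) is not None


def non_list_lines(wikitext: str) -> List[str]:
    """Keep only the lines that are not list items."""
    return [line for line in wikitext.splitlines() if not is_list_line(line)]


sample = "Intro text.\n* First source\n# Second source\nClosing text."
print(non_list_lines(sample))  # → ['Intro text.', 'Closing text.']
```

Anchoring at the line start keeps a marker character in the middle of a sentence (for example "C#") from being treated as a list.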

In principle, we can identify specific sections from the wikitext ("== <section-title> ==") and exclude them from being considered by the link recommendation. The challenge will be to identify the relevant section-titles across different languages/wikis. For this, we could potentially use prior work from Research on section-alignment across languages.

I remember this research project. It can be very useful for identifying sections during a configuration process, but some sections still have to be manually added to the corpus. Some wikis have strict norms regarding how the bottom of an article should be organized, but within a wiki, the application of the norm can be incomplete. Other wikis have fewer norms.

For instance, in French you can find:

  • Notes
  • Références
  • Notes et références
  • Bibliographie
  • Webographie
  • Articles connexes
  • Liens externes
  • Voir aussi
  • Annexes
  • ...

In Spanish:

  • Notas
  • Referencias
  • Enlaces externos
  • Véase también
  • Bibliografía
  • Fuentes
  • Libros
  • Periódicos y publicaciones
  • En línea
  • ...

All examples can be used in the singular or plural, they can be followed by an adjective ("Bibliografía utilizada"), and they sometimes have sub-sections. Given this diversity, maybe we should have an input section at Special:EditGrowthConfig that would list the titles most often used at a given wiki?

Section titles would have to be detected by the recommendation service, not the MediaWiki extension. We don't have any way currently to integrate community configuration with that.
The SDaW project will include tagging sections with Wikidata IDs some day so maybe that would be a solution, but it's not happening any time soon.

We might be able to identify lists in the mwparserfromhell AST (someone already made an attempt).
Given that mwparser is a highly-relied-upon tool maybe it would be worth looking for options (a small grant, maybe?) for fixing this bug. Although once there are easily available Parsoid dumps, maybe it will lose significance.

I remember this research project. It can be very useful for identifying sections during a configuration process, but some sections still have to be manually added to the corpus. Some wikis have strict norms regarding how the bottom of an article should be organized, but within a wiki, the application of the norm can be incomplete. Other wikis have fewer norms.

Agreed, this tool would be useful for identifying the corresponding sections in other languages. However, I think it currently supports only a handful of languages. The Research Team plans to productionize this model this year to support many more languages, though this will only be done in Q3/Q4 (based on what I know). This makes it a very good long-term solution. If we want a short-term fix, we might want to consider other heuristics.

kostajh moved this task from Inbox to Upcoming Work on the Growth-Team board.

My suggestion is to go with the simple option proposed by @MGerlach, but that we use the mwparserfromhell is_li function that @Tgr pointed out rather than the regex.

We've decided to prioritize this task as part of "add a link" iteration 2.

MMiller_WMF raised the priority of this task from Medium to High.Feb 7 2022, 6:17 AM

We should keep an eye on how this affects the task pool size, and/or the time needed to populate the pool. In theory it would take longer to generate enough links, since we're excluding areas of potential text to link from.

Walking the article is very cheap compared to testing the words on the appropriate text nodes, so the processing time for a single article wouldn't increase. The number of links per article might decrease somewhat; some previously valid task candidates might turn invalid because of that, and going through those invalid candidates would slow down the task generation cronjob. I doubt the effect would be significant though. Candidates only need two links to be valid; articles that are sufficiently well-developed to have references / see also sections probably have plenty of anchors for two links.

Actually, I'll put this one back as I didn't make much headway on it and am away next week.

Revisiting this task after some time I see the following options forward in addressing at least parts of it.

Sections.
We can use mwparserfromhell's get_sections to get the different sections (and their titles). The main challenge would be translating a list of section titles to avoid into other languages. This is probably infeasible to do manually. However, automatic section translation, i.e. translating the name of a section from one language to another, is currently being expanded to more languages with promising results (T293511). The API is planned but not ready. Some example file dumps with ready-to-use translations exist for a couple of language pairs.
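To illustrate the section-based idea, here is a rough standard-library sketch that splits wikitext on heading lines and drops sections whose title is on an exclusion list. It is not the service's code: `split_sections`, `linkable_text` and the `HEADING` pattern are hypothetical, and a real implementation would more likely rely on mwparserfromhell's get_sections as mentioned above.

```python
import re
from typing import Iterator, List, Set, Tuple

# Matches a wikitext heading line such as "== References ==".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def split_sections(wikitext: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) pairs; the lead section has an empty title."""
    title, body = "", []
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            yield title, "\n".join(body)
            title, body = match.group(2), []
        else:
            body.append(line)
    yield title, "\n".join(body)


def linkable_text(wikitext: str, excluded: Set[str]) -> List[str]:
    """Return the bodies of all sections whose title is not excluded."""
    return [body for title, body in split_sections(wikitext) if title not in excluded]


sample = "Lead.\n== History ==\nText.\n== Sources ==\n* A book"
print(linkable_text(sample, {"Sources"}))  # → ['Lead.', 'Text.']
```

This flat split ignores nesting, so a sub-section under an excluded heading would still be processed unless its own title is listed too.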

Lists in wikitext.
It seems that the problem occurs mainly in the context of lists. Thus we could narrow it down to avoiding text nodes that appear as part of lists. This is not straightforward to do with wikitext. Even if our heuristic for identifying lists only works approximately, it will probably be OK to be more conservative and remove potential text for adding links. There is some example code (which I haven't tested) to identify lists and their content. We could try a simple heuristic using the filter_tags method.
For example, for the following wikitext:

==Points of interest==
Some normal text.\n
Now comes a list:
*Richard Howe House (future site)
* Boston Store
# other thing
; another

we could use

import mwparserfromhell
# Print every tag node found in the wikitext together with its tag name.
for node in mwparserfromhell.parse(wikitext).filter_tags():
    print(node, node.tag)

which yields all lists:

* li
* li
# li
; dt

For each of these lists, we could identify the corresponding content in the wikitext. It seems that list items typically close with "\n" (see the wikitext for an example article containing a list)

import re
# Capture the text between a list marker (#, * or ;) and the next newline.
pattern = r'[#\*;](.*?)\n'
patterns_match = re.findall(pattern, wikitext)

These snippets could be removed from the wikitext before getting all potential text-nodes as candidates for new links (code)
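As a worked example, the removal step could look roughly like the following on the sample wikitext from this comment. The `LIST_ITEM` pattern is a hypothetical variant: it is anchored at the start of a line and additionally treats `:` as a marker, because the unanchored pattern also fires on marker characters in the middle of a sentence.

```python
import re

# Assumption: anchor at line start and also treat ":" as a list marker.
LIST_ITEM = re.compile(r"^[*#;:].*(?:\n|$)", flags=re.MULTILINE)

wikitext = (
    "==Points of interest==\n"
    "Some normal text.\n"
    "Now comes a list:\n"
    "*Richard Howe House (future site)\n"
    "* Boston Store\n"
    "# other thing\n"
    "; another\n"
)

# Drop every list line before text nodes are collected as link candidates.
stripped = LIST_ITEM.sub("", wikitext)
print(stripped)

# The unanchored pattern matches a marker character anywhere in a line:
print(re.findall(r"[#\*;](.*?)\n", "Use C# here\n"))  # → [' here']
```

After the substitution only the heading and the two prose lines remain, so no anchor text from the list items would reach the link model.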

Lists in HTML
Identifying lists in wikitext might not be straightforward. An alternative could be to use the HTML version of the article instead of the wikitext. We wouldn't need to rewrite the training; we would only iterate through the HTML text when doing the prediction to generate new links. In this case, extracting lists is fairly easy: <li> text </li>. For example, for this article you can compare the wikitext and the HTML. The caveat is that we would also need to rewrite the procedure for identifying the snippets of raw text of the article where we want to try new links (code). I haven't worked with parsing the HTML of Wikipedia articles, so I don't know how much effort that would be. Maybe it is easier than working with wikitext? With the availability of the new Enterprise HTML dumps, we will be developing a Python library for easy parsing of HTML dumps in the coming months, so this will become easier: https://phabricator.wikimedia.org/T302237
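To give a feel for the HTML route, here is a small sketch built on Python's standard html.parser module that collects only the text outside of <li> elements. The class name `NonListText` and the sample markup are made up for illustration; real article HTML would need considerably more handling (templates, references, tables).

```python
from html.parser import HTMLParser
from typing import List


class NonListText(HTMLParser):
    """Collect text that is not nested inside any <li> element."""

    def __init__(self) -> None:
        super().__init__()
        self.li_depth = 0  # how many <li> elements are currently open
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1

    def handle_data(self, data):
        # Only keep non-empty text found outside of list items.
        if self.li_depth == 0 and data.strip():
            self.chunks.append(data.strip())


parser = NonListText()
parser.feed("<p>Some normal text.</p><ul><li>Boston Store</li></ul><p>More text.</p>")
parser.close()
print(parser.chunks)  # → ['Some normal text.', 'More text.']
```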

Thanks for the review, @MGerlach. I think dealing with lists in wikitext is the most straightforward for now.

We talked about using Parsoid HTML dumps in T259035: Add a link: Parsing challenges (input and output), but ended up not doing it–mostly because the lack of availability made it a non-starter for discussion. (Side note, T302237 is restricted, is that the correct task?)

Actually, before going down this path I wanted to ask @Trizek-WMF and @MMiller_WMF what you think of the sections approach (item 1 in T279519#7728518) and letting communities have control over this. The implementation would be something like:

  • Communities use Special:EditGrowthConfig to specify section headings that should block link suggestions, e.g. "Do not offer link suggestions in sections with these titles, e.g. References, Sources: {input field}".
  • when the linkrecommendation service is generating recommendations, it loads the "excluded section headings" configuration from the wiki, and skips generating suggestions when it is iterating over text within one of those sections.

The downside is that communities have to manually specify which section titles ("See also", "References", "Sources" etc) in their language. But the counter-argument is that it's not a lot of work, nothing is broken if communities don't do this step, and also provides more fine-grained control instead of relying on machine translation which can be imperfect.

@kostajh, your approach makes sense to me. Communities that need to exclude these sections will certainly find, or ask us for, the configuration page. As we use the configuration page as our community basecamp, it makes sense to use it.

Could we allow some advanced configuration? For instance, instead of listing "notes, note, notes and references...", you would just have something like "note*". We could get some false positives, i.e. sections that would have benefited from some links, but I think that is perfectly fine, as we have plenty of articles to take care of. :)

Yes, I think we could support a wildcard configuration like that, although it might make things more complicated. Let me look at it some and see what is straightforward to do.
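For illustration only, a wildcard check along these lines can be sketched with the standard fnmatch module. The helper `is_excluded` is hypothetical, and the thread does not say whether wildcard support was actually implemented; the case-insensitive comparison here is done by casefolding both the title and the pattern.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_excluded(title: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive match of a section title against shell-style wildcards."""
    normalized = title.strip().casefold()
    return any(fnmatchcase(normalized, p.strip().casefold()) for p in patterns)


print(is_excluded("Notes et références", ["note*"]))  # → True
print(is_excluded("Histoire", ["note*", "bibliographie"]))  # → False
print(is_excluded("Notebooks", ["note*"]))  # → True (a false positive)
```

Note that fnmatch also gives `?` and `[...]` a special meaning, so section titles containing those characters would need escaping in the configured patterns.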

We talked about using Parsoid HTML dumps in T259035: Add a link: Parsing challenges (input and output), but ended up not doing it–mostly because the lack of availability made it a non-starter for discussion.

Yes, I remember. The situation has changed insofar as HTML dumps are now publicly available (and also on the stat machines). Thus, in principle, the lack of availability is no longer an issue. There would still be the problem of parsing text consistently from the HTML, with which we have less experience than with wikitext.

(Side note, T302237 is restricted, is that the correct task?)

Yes, this is the correct task: an Outreachy project to build a Python library for working with HTML dumps (similar to mwparserfromhell for wikitext), which should address the above issue. Sorry, I did not realize the task is restricted until the application period starts on March 25.

Change 769413 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] [WIP] Exclude certain sections from link suggestion generation

https://gerrit.wikimedia.org/r/769413

Change 769529 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] [WIP] Allow excluding sections from link generation

https://gerrit.wikimedia.org/r/769529

Change 769413 abandoned by Kosta Harlan:

[research/mwaddlink@main] [WIP] Exclude certain sections from link suggestion generation

Reason:

In favor of I2c96bcc77e50af91666ccc7031babe54fb2260f9

https://gerrit.wikimedia.org/r/769413

Change 769529 merged by jenkins-bot:

[research/mwaddlink@main] Allow excluding sections from link generation

https://gerrit.wikimedia.org/r/769529

Change 771823 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/771823

Change 771823 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/771823

{'type': 'TypeError', 'description': "'NoneType' object is not iterable", 'trace': ['  File "/opt/lib/python/site-packages/flask/app.py", line 2073, in wsgi_app\n    response = self.full_dispatch_request()\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1518, in full_dispatch_request\n    rv = self.handle_user_exception(e)\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1516, in full_dispatch_request\n    rv = self.dispatch_request()\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1502, in dispatch_request\n    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)\n', '  File "/srv/app/app.py", line 231, in query\n    data["sections_to_exclude"] = list(sections_to_exclude)\n']}

seen when deploying to staging.

Change 771828 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] app: Only cast sections_to_exclude to a list when not in request context

https://gerrit.wikimedia.org/r/771828

Change 771828 merged by jenkins-bot:

[research/mwaddlink@main] app: Make sections_to_exclude a list when not set

https://gerrit.wikimedia.org/r/771828
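The TypeError above is what Python raises for `list(None)`. A defensive normalization along the following lines avoids it; this is only a sketch, `normalize_sections` is a hypothetical helper rather than the code from the patch, and the limit of 25 is borrowed from the `[:25]` slice that appears in the service's IndexError traceback in this thread.

```python
from typing import Iterable, List, Optional

MAX_SECTIONS = 25  # assumption: mirrors the [:25] slice seen in the service traceback


def normalize_sections(value: Optional[Iterable[str]], limit: int = MAX_SECTIONS) -> List[str]:
    """Turn a missing or arbitrary iterable value into a bounded list."""
    # "value or []" replaces None (or an empty container) with an empty list,
    # so list() never receives None.
    return list(value or [])[:limit]


print(normalize_sections(None))  # → []
print(normalize_sections(("Sources", "References")))  # → ['Sources', 'References']
```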

Change 771856 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/771856

Change 771856 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/771856

Bah, new error after deploying:

{'type': 'IndexError', 'description': 'list index out of range', 'trace': ['  File "/opt/lib/python/site-packages/flask/app.py", line 2073, in wsgi_app\n    response = self.full_dispatch_request()\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1518, in full_dispatch_request\n    rv = self.handle_user_exception(e)\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1516, in full_dispatch_request\n    rv = self.dispatch_request()\n', '  File "/opt/lib/python/site-packages/flask/app.py", line 1502, in dispatch_request\n    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)\n', '  File "/srv/app/app.py", line 252, in query\n    sections_to_exclude=data["sections_to_exclude"][:25],\n', '  File "/srv/app/src/query.py", line 54, in run\n    sections_to_exclude=sections_to_exclude,\n', '  File "/srv/app/src/scripts/utils.py", line 317, in process_page\n    not isinstance(section.nodes[0], mwparserfromhell.nodes.heading.Heading)\n', '  File "/opt/lib/python/site-packages/mwparserfromhell/smart_list.py", line 284, in __getitem__\n    return self._render()[key]\n']}

Change 771887 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Rollback to known good version

https://gerrit.wikimedia.org/r/771887

Change 771887 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Rollback to known good version

https://gerrit.wikimedia.org/r/771887

Change 771964 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] process_page: Handle pages where section is empty

https://gerrit.wikimedia.org/r/771964

The logs in logstash unfortunately don't tell us which articles this was failing for. So I looked on the mwmaint server where refreshLinkRecommendations is running and found:

$ cat /var/log/mediawiki/mediawiki_job_growthexperiments-refreshLinkRecommendations-s2/syslog.log | grep Error
checking candidate Nelsonia_(rodzaj_ssaka)... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 11:32:56 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[17507]: plwiki:      checking candidate Hrabia_Coventry... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 11:48:47 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[17507]: plwiki:      checking candidate Władcy_Etiopii... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 11:52:12 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[17507]: plwiki:      checking candidate Ministrowie_obrony_(Południowa_Afryka)... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 12:14:50 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[3082]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_parkurové_skákání_jednotlivci... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 12:14:54 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[3082]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_drezúra_jednotlivci... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 12:15:02 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[3082]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_drezúra_družstva... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 12:16:08 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[3082]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_parkurové_skákání_družstva... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 12:40:18 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[3082]: plwiki:      checking candidate Hrabia_Coventry... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 13:16:44 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[32032]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_drezúra_družstva... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 13:17:19 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[32032]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_parkurové_skákání_jednotlivci... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 13:17:20 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[32032]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_drezúra_jednotlivci... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 13:17:37 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[32032]: cswiki:      checking candidate Seznam_olympijských_medailistů_v_jezdectví_–_parkurové_skákání_družstva... There was a problem during the HTTP request: 500 Internal Server Error
Mar 18 13:58:41 mwmaint1002 mediawiki_job_growthexperiments-refreshLinkRecommendations-s2[32032]: plwiki:      checking candidate Władcy_Etiopii...

Looking at these articles, we can see that there is no lead section; instead the article begins with a second level section heading.

The patch above adds a simple check (and test) for these types of articles.
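The failure mode can be reproduced without the parser: indexing `nodes[0]` on an empty node list raises IndexError. The sketch below shows one way to guard against it. `Heading` is a stand-in for mwparserfromhell.nodes.heading.Heading, `starts_without_heading` is a hypothetical helper, and treating an empty section as "nothing to process" is an assumption, since the thread does not show what the merged check does.

```python
from typing import Sequence


class Heading:
    """Stand-in for mwparserfromhell.nodes.heading.Heading (illustration only)."""


def starts_without_heading(nodes: Sequence[object]) -> bool:
    """Return True for a non-empty section whose first node is not a heading."""
    # Checking for emptiness first avoids the IndexError on nodes[0].
    return bool(nodes) and not isinstance(nodes[0], Heading)


print(starts_without_heading([]))  # → False (empty lead section)
print(starts_without_heading(["Lead text"]))  # → True
print(starts_without_heading([Heading(), "Body"]))  # → False
```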

Change 771964 merged by jenkins-bot:

[research/mwaddlink@main] process_page: Handle pages where section is empty

https://gerrit.wikimedia.org/r/771964

Redeployed the service after the patch above was merged.

Checked on several wikis (cswiki, frwiki, ruwiki, plwiki, and arwiki): no addlink suggestions were present in the excluded sections of the tested articles.

Change 787543 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[research/mwaddlink@main] Make section exclusion case insensitive

https://gerrit.wikimedia.org/r/787543

Change 787543 merged by jenkins-bot:

[research/mwaddlink@main] Make section exclusion case insensitive

https://gerrit.wikimedia.org/r/787543

Change 788404 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/788404

Change 788404 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/788404