Add a link: algorithm improvements: Define filter for not linking specific article types
Open, High, Public

Description

Which types of article should not be recommended as links by the link-recommendation algorithm?
For example, we only recommend links to articles that are in the main namespace of the same Wikipedia and that are not redirect-pages. In different places we made specifications as to what types of articles we should not link to, such as

A substantial fraction of these can be filtered by looking up statements of the corresponding Wikidata-item, most notably the value for the instance-of property. For example, we can identify the article Statistics (disambiguation) as a disambiguation page (without parsing the title) from the fact that the corresponding Wikidata-item (Q2333935) specifies that it is an "instance of" the Wikidata-item "Wikimedia disambiguation page" (Q4167410).
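The lookup described above can be sketched as follows. This is only an illustration, not the code used in the training pipeline: the function names are hypothetical, and the example entity is a hand-written stand-in reduced to the single claim mentioned in the description. It assumes the standard Wikidata entity JSON layout, where the "instance of" values sit under the `P31` claims.

```python
from typing import Any, Dict, Set

INSTANCE_OF = "P31"  # Wikidata property "instance of"
DISAMBIGUATION_PAGE = "Q4167410"  # "Wikimedia disambiguation page"


def instance_of_qids(entity: Dict[str, Any]) -> Set[str]:
    """Collect the QIDs listed as "instance of" values in an entity's claims."""
    qids: Set[str] = set()
    for statement in entity.get("claims", {}).get(INSTANCE_OF, []):
        # Statements with "novalue"/"somevalue" snaks carry no datavalue.
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is not None:
            qids.add(datavalue["value"]["id"])
    return qids


def is_filtered(entity: Dict[str, Any], blocked_types: Set[str]) -> bool:
    """Return True if the entity is an instance of any blocked type."""
    return bool(instance_of_qids(entity) & blocked_types)


# Hand-written stand-in for the item of "Statistics (disambiguation)",
# reduced to the one claim mentioned in the task description.
example_entity = {
    "id": "Q2333935",
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q4167410"},
                    },
                }
            }
        ]
    },
}

print(is_filtered(example_entity, {DISAMBIGUATION_PAGE}))
```

Note that the title is never parsed here; the decision rests entirely on the Wikidata statement.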

Currently, we filter an article from the set of candidate-links if it is an instance-of the following items:

Other candidates are:

  • century (Q578)
  • calendar date (Q205892)
  • point in time with respect to recurrent timeframe (Q14795564)
  • ...

Aim:

  • Identify other types of articles we do not want to link to, based on linking conventions or feedback
  • Identify the corresponding Wikidata item
  • Add the list of Wikidata items to the filter in the repository and retrain the model (code)

Event Timeline

As a potential guide, I counted how often a given article-type appears as a link. For example, the most common article-type for links in enwiki is the Wikidata-item Q5 (human), i.e. biographies, with more than 38M occurrences. In contrast, there are only 639 links to articles that are an instance of "century" (Q578), which puts that type at rank 9068.

https://docs.google.com/spreadsheets/d/1Q2sSDgX1_QlOOzxxRjdMHUaDkd50HWiVjH4MYXh7VWw/edit#gid=1556337929
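A count of this kind could be produced roughly as sketched below. The function name and the toy data are illustrative assumptions, not the script behind the spreadsheet; the sketch assumes a list of link targets and a lookup table from article title to its "instance of" QIDs are already available.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def rank_link_types(
    link_targets: Iterable[str],
    types_by_title: Dict[str, Set[str]],
) -> List[Tuple[str, int]]:
    """Count how often each article-type occurs as a link target, most common first."""
    counts: Counter = Counter()
    for title in link_targets:
        # An article with several "instance of" values counts once per type.
        for qid in types_by_title.get(title, ()):
            counts[qid] += 1
    return counts.most_common()


# Toy data for illustration only (four links, three distinct targets).
links = ["Ada Lovelace", "Alan Turing", "Ada Lovelace", "20th century"]
types_by_title = {
    "Ada Lovelace": {"Q5"},
    "Alan Turing": {"Q5"},
    "20th century": {"Q578"},
}

ranking = rank_link_types(links, types_by_title)
for rank, (qid, n_links) in enumerate(ranking, start=1):
    print(rank, qid, n_links)
```

Types with a low rank and few links, such as "century" in the enwiki numbers above, are rarely linked in practice, which makes them natural candidates for the filter.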

I dug a bit deeper into the wikidata-ontology to identify article-types related to the calendar such as years or centuries.

The following Wikidata-items characterize Wikipedia-articles related to calendar events; that is, the example-article is an instance of the article-type.

If we wanted to filter links to articles related to calendar events, these would be the relevant items to filter.

@MGerlach -- thank you for creating this task. Could you imagine these types of filters being configured downstream at the wiki-level? For instance, if one community did want to link to dates, and another did not? Or does it need to happen upstream at model training?

cc @Urbanecm_WMF regarding T274520: Move Growth configuration to on-wiki JSON file

@MMiller_WMF : at the moment, this filter is defined globally for all wikis in the training pipeline (it is set manually here). We take a simple approach: links matching the filter are removed from the set of candidate-links before training, so they are never considered. I think we could move this filter further downstream if needed. This would likely require some engineering work, since we would need to create and move an additional table to do the filter lookups when generating the recommendations (e.g. is the recommended link an instance of century, Q578?). The advantage is that each wiki could then define its own custom filter in a straightforward way, simply by providing a list of entity-types that it doesn't want to link to (e.g. extending or shortening the list of "calendar-related articles" in T279434#6977310).
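To make the downstream option more concrete, here is a rough sketch of what a per-wiki filter applied at recommendation time might look like. Everything in it is hypothetical: the config keys `add` and `remove`, the `link_target` field, and the lookup table are assumptions for illustration and do not describe the existing mwaddlink code or the Growth configuration.

```python
from typing import Dict, Iterable, List, Set

# Types that the thread reports as filtered for all wikis.
GLOBAL_FILTER = frozenset({
    "Q4167410",   # Wikimedia disambiguation page
    "Q13406463",  # Wikimedia list article
})


def effective_filter(wiki_config: Dict[str, Iterable[str]]) -> Set[str]:
    """Combine the global filter with a wiki's own additions and removals."""
    added = set(wiki_config.get("add", []))
    removed = set(wiki_config.get("remove", []))
    return (set(GLOBAL_FILTER) | added) - removed


def filter_recommendations(
    recommendations: List[Dict[str, str]],
    types_by_title: Dict[str, Set[str]],
    blocked: Set[str],
) -> List[Dict[str, str]]:
    """Drop recommendations whose target is an instance of a blocked type."""
    return [
        rec
        for rec in recommendations
        # Targets missing from the lookup table are kept (assumed design choice).
        if not (types_by_title.get(rec["link_target"], set()) & blocked)
    ]


# Hypothetical lookup table shipped alongside the model.
types_by_title = {"20. století": {"Q578"}}  # century

recommendations = [
    {"anchor": "20. století", "link_target": "20. století"},
    {"anchor": "Edinburgh", "link_target": "Edinburgh"},
]

# One wiki opts out of century links, another keeps the global default.
strict = filter_recommendations(
    recommendations, types_by_title, effective_filter({"add": ["Q578"]})
)
default = filter_recommendations(
    recommendations, types_by_title, effective_filter({})
)
print([rec["link_target"] for rec in strict])
print([rec["link_target"] for rec in default])
```

The trade-off is the one described above: the model is trained once, but the type lookup table has to be available wherever recommendations are generated.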

All dates of the year like "12 May" have to be filtered. They are subclasses of Q14795564.
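If some date items are only connected to Q14795564 through "subclass of" (P279) rather than a direct "instance of" statement, a filter that checks direct types would miss them. One possible way to handle that, sketched under the assumption that a table of direct subclass edges is available, is to expand the filter with all transitive subclasses first. The child QIDs below are placeholders, not real Wikidata items.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def expand_with_subclasses(
    roots: Iterable[str],
    direct_subclasses: Dict[str, List[str]],
) -> Set[str]:
    """Return the roots plus every item reachable through subclass edges."""
    seen: Set[str] = set(roots)
    queue = deque(seen)
    while queue:
        qid = queue.popleft()
        for child in direct_subclasses.get(qid, ()):
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                queue.append(child)
    return seen


# Placeholder hierarchy: the two child identifiers are made up for illustration.
direct_subclasses = {
    "Q14795564": ["Q_PLACEHOLDER_DAY_OF_YEAR"],
    "Q_PLACEHOLDER_DAY_OF_YEAR": ["Q_PLACEHOLDER_DAY_IN_MAY"],
}

expanded = expand_with_subclasses({"Q14795564"}, direct_subclasses)
print(sorted(expanded))
```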

@geraki thanks for this catch.

@kostajh if you think dates are an issue, I can submit a patch to update the list of instance-of types in the code that filters the anchor-dictionary. However, the change would only take effect in the API after we retrain the models, so we might want to defer it.

For reference, we currently filter articles that are instances of the following QIDs:

  • Wikimedia disambiguation page (Q4167410)
  • Wikimedia list article (Q13406463)
  • Wikimedia pages related to time/dates:

Regarding the time/date-pages, we would want to expand this list to include the following qids:

cc @MMiller_WMF, should we have another pass at updating the time/date pages?

Some examples of suggested articles on days of the year:

  1. এরোমাঙ্গা সেনসেই: suggesting ১০ নভেম্বর (November 10), ৯ এপ্রিল (April 9), ২৫ জুন (June 25)
  2. মার্গারিটা সালাস: suggesting ৩০ নভেম্বর (November 30), ৭ নভেম্বর (November 7)
  3. ডালিয়া গ্রাইবস্কেইট: suggesting ১ মার্চ (March 1)
  4. ইপ্সিতা পাটি: suggesting ১৮ জুন (June 18)

Figure from example 2:


I can provide more examples if needed.

Thanks for these examples. All of them are instances of "point in time with respect to recurrent timeframe" (Q14795564). Thus, updating the filter for the anchor-dictionary as suggested above would remove those recommendations from the updated model.

@MMiller_WMF it sounds like we should update the list of filters in the mwaddlink tool and update the datasets. That process took about a week of processing time last time around. It will then take some weeks for the older recommendations to cycle out of the MediaWiki cache, or we could write a script to remove them and generate new ones. Let me know how you'd like to proceed, please.

Hello @MGerlach, some suggestions are for dates in cswiki. Here are some examples:

  • in "Let Korean Air 858", "27. července" was suggested to link to "27. červenec" (an article about a date), "12. listopadu" was suggested to link to "12. listopad" (an article about a date) and "18. listopadu" was suggested to link to "18. listopad"
  • in "Letiště Edinburgh", "60. let" was suggested to link to "1960-1969" (an article about a decade) and "20. století" was suggested to link to "20. století" (an article about a century)
  • in "Eduard Pagáč", "8. května" was suggested to link to "8. květen" (an article about a date) and "12. března" was suggested to link to "12. březen" (an article about a date).

Could those be excluded, please? Thanks!

@Urbanecm_WMF thank you for pointing out those examples. All cases you mention should be excluded by the update suggested in T279434#7093628, since the Wikidata-items of the articles recommended as links are all instances of i) point in time with respect to recurrent timeframe (Q14795564), ii) decade (Q39911), or iii) century (Q578). However, the update has not yet been deployed.
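As an illustration of how the updated filter would act on these cases, here is a minimal sketch. It assumes the anchor-dictionary maps each anchor text to its candidate link targets with a count; that structure, the counts, and the function name are assumptions for illustration rather than the actual mwaddlink code. The type assigned to each example article follows the descriptions in the report (date, decade, century).

```python
from typing import Dict, Set

BLOCKED_TYPES = {
    "Q14795564",  # point in time with respect to recurrent timeframe
    "Q39911",     # decade
    "Q578",       # century
}


def filter_anchor_dictionary(
    anchors: Dict[str, Dict[str, int]],
    types_by_title: Dict[str, Set[str]],
    blocked: Set[str],
) -> Dict[str, Dict[str, int]]:
    """Remove candidate links whose target is an instance of a blocked type."""
    filtered: Dict[str, Dict[str, int]] = {}
    for anchor, candidates in anchors.items():
        kept = {
            title: count
            for title, count in candidates.items()
            if not (types_by_title.get(title, set()) & blocked)
        }
        if kept:  # anchors left without any candidate are dropped entirely
            filtered[anchor] = kept
    return filtered


# Anchors and targets from the cswiki report; the counts are made up.
anchors = {
    "27. července": {"27. červenec": 12},
    "60. let": {"1960-1969": 4},
    "20. století": {"20. století": 30},
    "Edinburgh": {"Edinburgh": 25},
}
types_by_title = {
    "27. červenec": {"Q14795564"},
    "1960-1969": {"Q39911"},
    "20. století": {"Q578"},
}

filtered = filter_anchor_dictionary(anchors, types_by_title, BLOCKED_TYPES)
print(list(filtered))
```

Because the candidates are removed before training, the model never sees them, which is why the change only shows up after the models are retrained and the datasets are rebuilt.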

@MMiller_WMF I think the next step here is to make sure we have a consolidated list of all the Wikidata items we want to exclude, then retrain the models and rebuild datasets for all the wikis. Last time around, the rebuild process took a lot longer than we hoped due to various technical snags; in theory it should be smoother this time around. But it will still take several days to a week due to the time needed for retraining and rebuilding datasets for each wiki.

Because this blocks us deploying to more wikis (T284481), we want to prioritize it. @kostajh -- is this something that Growth engineers do, or that @MGerlach does?

@MMiller_WMF I will update the algorithm to include the suggested changes and re-run the model for the pilot wikis this week. Work on other wikis will be captured in T284481.

Another Wikidata item to consider for rejection is any article defined as "events in a specific year or time period" (Q18340514). These links are tricky to use, even for an experienced user. Suppose you have an article about a given train station: should the construction date link to the city or the railway?

Based on the suggestions in T279434#7143095, T279434#7093628, and T284666, I am adding the following entity-types to the filter (all links that are an instance of these entities are removed from the set of candidates):

I am starting to re-run the models and update the datasets. Let me know if you want to add further items or remove some from this list.

@MGerlach sounds good to me, but please don't run ./publish-datasets.sh just yet, I'd like to be able to do some sample queries for @MMiller_WMF and our ambassadors before we publish the new datasets.

Change 699395 had a related patch set uploaded (by Kosta Harlan; author: MGerlach):

[research/mwaddlink@main] Update the filter for removing candidate links

https://gerrit.wikimedia.org/r/699395