[medium] Implement article matching algorithm for Wikipedia
Open, Needs Triage, Public

Description

Overview

Finding topically similar articles on Wikipedia is useful for both researchers and editors. For example, given Tim Cook as an input, one might consider Marissa Mayer to be a good topically similar article because both individuals were CEOs of major technology platforms. In a process known as matching, this ability to identify similar articles that differ in some key way (e.g., gender, article quality) is potentially very useful to both researchers and editors.

For researchers, if you want to isolate the impact of, e.g., someone's gender on the quality of their article, you might build a sample of 10,000 articles about men and then, for each article, seek to find an article about someone who is similarly notable but identifies as a woman. In this case, Marissa Mayer may be a good match for Tim Cook. For editors, if you want to find a good example of a high-quality article about a similar topic in order to, e.g., identify potential sections or templates to add, Steve Jobs might be a better match, as he was also a CEO of Apple and has a substantially longer article.

This specific matching challenge has been carefully studied in Controlled Analyses of Social Biases in Wikipedia Bios by Field et al., who then designed an approach for doing this on Wikipedia using the category system.
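To make the idea of category-based matching concrete, here is a minimal sketch. It is not Field et al.'s algorithm: it swaps in a plain IDF-weighted cosine similarity over category sets, and the category lists and the `women` candidate pool below are made-up toy data, not pulled from Wikipedia.

```python
import math
from collections import Counter

# Toy data (hypothetical): article title -> set of category names.
CATEGORIES = {
    "Tim Cook": {"Apple Inc. executives", "American technology chief executives", "Living people"},
    "Steve Jobs": {"Apple Inc. executives", "American technology chief executives", "American inventors"},
    "Marissa Mayer": {"Yahoo! people", "American technology chief executives", "Living people"},
    "Ada Lovelace": {"English mathematicians", "Women computer scientists"},
}


def idf_weights(categories):
    """Weight each category by log(N / document frequency) so rare categories count more."""
    n_articles = len(categories)
    doc_freq = Counter(cat for cats in categories.values() for cat in cats)
    return {cat: math.log(n_articles / count) for cat, count in doc_freq.items()}


def weighted_cosine(cats_a, cats_b, weights):
    """Cosine similarity between two category sets under the given weights."""
    shared = sum(weights[c] ** 2 for c in cats_a & cats_b)
    norm_a = math.sqrt(sum(weights[c] ** 2 for c in cats_a))
    norm_b = math.sqrt(sum(weights[c] ** 2 for c in cats_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return shared / (norm_a * norm_b)


def best_match(target, candidates, categories, weights):
    """Return (title, score) of the candidate most similar to the target article."""
    scored = [
        (weighted_cosine(categories[target], categories[c], weights), c)
        for c in candidates
        if c != target
    ]
    if not scored:
        return None, 0.0
    score, title = max(scored)
    return title, score


weights = idf_weights(CATEGORIES)
# Hypothetical comparison pool: articles that differ from the target on the attribute of interest.
women = ["Marissa Mayer", "Ada Lovelace"]
match, score = best_match("Tim Cook", women, CATEGORIES, weights)
print(match, round(score, 3))
```

Restricting the candidate pool to articles that differ on the attribute of interest (here, gender) is what turns a similarity search into matching; the similarity function itself could be replaced without changing that structure.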

Task

This research will implement Field et al.'s approach with the goal of building an API for it such that it may easily be used by researchers and editors in their work.

Rationale

If the approach proves to generate reasonable results, it could be worth further investment of resources to build a fully-functional API that could be used by editors. Showcasing a fully-functional pipeline would allow researchers to more easily incorporate it into their work.

Recommended Skills

  • Python: the pipeline and any API would be built in Python
    • Bonus: experience with Flask (API) and SQLite (potential database backend) Python libraries will help with showcasing the tool. Same with JavaScript for the front-end, though this can easily be learned or adapted from elsewhere.
  • Algorithms / data structures for understanding the algorithm described in Field et al.

Acceptance Criteria

  • The output of this task will be a PAWS notebook showcasing the pipeline for processing category data, the algorithm, and a few examples of outputs.
  • The stretch goal (optional) would take this PAWS notebook and extend it into an interactive web API hosted on Toolforge.
  • NOTE: it is not expected that this tool can be implemented for all of English Wikipedia given its size, but smaller languages like Simple English Wikipedia should work.

Process

  • If you are interested in this task and it is not assigned to anyone, you may begin work on it. Please leave a comment on the task and tag @Isaac so that he is aware.
  • If you have made some progress on the task (draft code implementing algorithm in PAWS notebook) and would like to continue, share a link to your current draft and let @Isaac know so that he can assign the task to you and help you to plot out the next steps.
  • Generally, @Isaac will be able to answer any questions about the task and will try to respond quickly when clarification is necessary, but response times may be slow if help is needed for more general debugging, etc.

Resources

Event Timeline

@Sdkb thanks for pointing this out! I wasn't aware of the template but it certainly is excellent motivation for this work and also could serve as some good evaluation data for any approach that is developed. This task likely won't move particularly quickly but I'm hoping in the next e.g., 6 months to get someone to pick this up. If there's promising progress, happy to brainstorm how it might be used to generate suggestions for the inspiration template.

One potentially similar thing that occurs to me is the related articles extension. Are you familiar with that, and if so, what is the connection/difference?

Yep, good point! The related articles extension uses search indices to find related articles, which largely means it is finding Wikipedia articles that share a lot of words in common. It works quite well in some use cases (thanks for the reminder -- I hadn't dug into it yet well enough to know how useful it might be). For example, here's a random article (en:Pam Cameron) for a politician in Northern Ireland and you can see the Related Articles at the bottom are quite relevant. They could be potential exemplars too, in that they're also about Northern Ireland politicians and so are a good match for article structure. Unfortunately, none of those three related articles is significantly higher quality, so they don't meet that criterion as an exemplar. This is often going to be the case because a much longer article will have way more content and so almost by definition will overlap less with a small article according to the RelatedArticles extension. The API call being used by RelatedArticles in this case is essentially this: https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=morelike:Pam_Cameron&gsrnamespace=0&gsrqiprofile=classic_noboostlinks
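For anyone who wants to try this from a notebook, a rough sketch of building that request and reading titles out of the result might look like the following. The `format` and `gsrlimit` parameters are my additions to the URL above, the helper names are made up for illustration, and `SAMPLE_RESPONSE` is a hand-written stand-in with placeholder titles rather than real API output.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def morelike_params(title, limit=3):
    """Query parameters mirroring the RelatedArticles-style call quoted above."""
    return {
        "action": "query",
        "format": "json",  # added here so the response can be parsed as JSON
        "generator": "search",
        "gsrsearch": f"morelike:{title}",
        "gsrnamespace": 0,
        "gsrqiprofile": "classic_noboostlinks",
        "gsrlimit": limit,  # added here; caps the number of results
    }


def morelike_url(title, limit=3):
    """Full request URL for the given seed article."""
    return f"{API_ENDPOINT}?{urlencode(morelike_params(title, limit))}"


def titles_from_response(data):
    """Extract result titles in rank order from a generator=search JSON response.

    Assumes the default JSON layout, where query.pages is a dict keyed by
    page ID and each page carries an 'index' field with its search rank.
    """
    pages = data.get("query", {}).get("pages", {})
    ranked = sorted(pages.values(), key=lambda page: page.get("index", 0))
    return [page["title"] for page in ranked]


# Hand-written stand-in for a response (placeholder titles, not real results).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "101": {"pageid": 101, "title": "Placeholder article B", "index": 2},
            "102": {"pageid": 102, "title": "Placeholder article A", "index": 1},
        }
    }
}

url = morelike_url("Pam_Cameron")
print(url)
print(titles_from_response(SAMPLE_RESPONSE))
```

Fetching `url` with any HTTP client and passing the decoded JSON to `titles_from_response` should give the related titles in rank order, assuming the response layout described in the docstring.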

The morelike backend that RelatedArticles uses also allows you to filter by templates, so you can kinda hack a more-targeted exemplar articles API call by requiring the results to also have the Featured Article template or Good Article template on them: https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=morelikethis:Pam_Cameron%20hastemplate:featured_article|good_article&gsrnamespace=0&gsrqiprofile=classic_noboostlinks. The results here are less relevant but e.g., en:Helen_McEnteen might work as an exemplar. In general, probably this approach would work sometimes (with some additional filters) but I don't know if it's a full solution.
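If it helps to experiment with different template filters, the search string in that URL could be assembled with something like the sketch below. The function names are made up for illustration, and the `hastemplate:a|b` form simply copies the syntax from the URL above; I haven't checked how the search backend treats other combinations.

```python
from urllib.parse import quote, urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def exemplar_search_string(title, templates=()):
    """Build 'morelikethis:<title>' with an optional pipe-joined hastemplate filter."""
    search = f"morelikethis:{title}"
    if templates:
        names = "|".join(name.replace(" ", "_") for name in templates)
        search += f" hastemplate:{names}"
    return search


def exemplar_url(title, templates=()):
    """Full request URL; quote_via=quote encodes spaces as %20 rather than '+'."""
    params = {
        "action": "query",
        "generator": "search",
        "gsrsearch": exemplar_search_string(title, templates),
        "gsrnamespace": 0,
        "gsrqiprofile": "classic_noboostlinks",
    }
    return f"{API_ENDPOINT}?{urlencode(params, quote_via=quote)}"


QUALITY_TEMPLATES = ["featured article", "good article"]
search = exemplar_search_string("Pam_Cameron", QUALITY_TEMPLATES)
url = exemplar_url("Pam_Cameron", QUALITY_TEMPLATES)
print(search)
print(url)
```

Keeping the template list as a parameter makes the limitation discussed in this thread visible: the filter is only as good as the quality templates a given wiki happens to use.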

The reason for developing a new tool aimed at this problem then is several-fold:

  • More flexibility in defining what quality level an exemplar article should reach: with morelike, you're restricted to quality levels that can be tied to templates which is a big constraint and hard to scale across languages. With a more custom approach, I could e.g., restrict to all articles that have a featured article, good article, or recommended article badge on Wikidata as a broader set of high-quality articles.
  • More flexibility in restricting what types of articles are considered related. RelatedArticles will return anything with high text overlap but we might want to take a more structured approach that restricts exemplars to sharing specific categories or Wikidata item properties.
  • Opportunities for doing this across languages -- RelatedArticles uses text similarity so e.g., English articles would never seem similar to articles in other languages. Using categories or other features that can be tied to Wikidata items to define similarity would allow us to find similar articles in other languages for example that might also be potential exemplars.

In reality, the final solution might be a mixture of RelatedArticles and some custom filters to ensure the right sort of similarity between articles and that higher quality examples are returned.
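As a rough illustration of what that mixture could look like, the sketch below takes candidates in relevance order (e.g., from a morelike query) and keeps only those that carry a high-quality badge and share at least one category with the source article. Everything here is hypothetical: the `Candidate` structure, the badge labels (real data would use Wikidata badge item IDs from the sitelinks), and the toy candidates.

```python
from dataclasses import dataclass

# Hypothetical badge labels standing in for Wikidata sitelink badge item IDs.
HIGH_QUALITY_BADGES = frozenset({"featured", "good", "recommended"})


@dataclass(frozen=True)
class Candidate:
    title: str
    categories: frozenset
    badges: frozenset = frozenset()


def pick_exemplars(source_categories, candidates, min_shared=1,
                   quality_badges=HIGH_QUALITY_BADGES):
    """Filter relevance-ordered candidates by quality badge and category overlap."""
    kept = []
    for cand in candidates:
        if not (cand.badges & quality_badges):
            continue  # no high-quality badge on this sitelink
        if len(cand.categories & source_categories) < min_shared:
            continue  # not structurally similar enough to serve as an exemplar
        kept.append(cand.title)
    return kept


# Toy inputs (made up): a source article's categories and three candidates.
source = frozenset({"Northern Ireland politicians", "Living people"})
candidates = [
    Candidate("Candidate A", frozenset({"Northern Ireland politicians"})),
    Candidate("Candidate B", frozenset({"Irish novelists"}), frozenset({"featured"})),
    Candidate("Candidate C", frozenset({"Northern Ireland politicians"}), frozenset({"good"})),
]
exemplars = pick_exemplars(source, candidates)
print(exemplars)
```

Because both filters rely on categories and badges rather than article text, the same check could in principle be applied to candidates from other language editions, provided their categories are mapped to shared Wikidata items first.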