Identify and prepare a data-set for Fair Ranking Track at TREC
Open, Medium, Public


The Wikimedia Foundation's Research team has partnered with the Fair Ranking Track at TREC (a long-standing text retrieval benchmarking conference) for 2021. As a partner, our role is to work with the track organizers to identify and provide a Wikimedia-related dataset and a specific question for the participants of the track to work/compete on. (For example, in 2019, the track partnered with Semantic Scholar from the Allen Institute for Artificial Intelligence in designing a competition for fair ranking of scholarly paper abstracts.)

Who runs the competition?
The National Institute of Standards and Technology (NIST); the organizers are the same as in 2019. (check

NIST's interests are in how we measure and audit systems in terms of fairness.

This dataset will focus on English Wikipedia WikiProjects and on building fairly ranked lists of articles relevant to a WikiProject. For example, for WikiProject Jazz, which articles are relevant, and how do you rank them in a way that fairly represents different gender identities and geographic regions? This initial challenge will focus on English Wikipedia only so that effort can be focused on the chosen fairness aspects as opposed to the challenges of working with multilingual data. Future challenges may expand to other languages, but English Wikipedia is one of only a few that use the PageAssessments extension, which greatly simplifies the process of identifying WikiProjects and which articles are tagged as relevant (along with any quality / importance ratings).

@Isaac will act as the coordinator and point of contact on WMF's end.

Event Timeline

leila triaged this task as High priority. Nov 15 2019, 7:11 PM
leila created this task.
leila changed the task status from Open to Stalled. Dec 9 2019, 9:19 PM
leila lowered the priority of this task from High to Medium.
leila edited projects, added Research-Backlog; removed Research.

Currently stalled, as the track may keep the previous year's dataset.

Isaac edited projects, added Research; removed Research-Backlog.
Isaac added a subscriber: Isaac.

Assigning this to myself as it's clear that it's active now. Looks like TREC is interested in a task around building WikiProject worklists while taking into account equity aspects of the articles that show up in the list. I'll continue to meet with the organizers to help shape the task and will provide dataset support over the next few months.


leila changed the task status from Stalled to Open. Jan 4 2021, 7:30 PM
leila awarded a token.

Weekly update:
Discussed how to incorporate some notion of work-required into the rankings. If a ranking model has to perform best for stub articles, it greatly increases the difficulty of the task and highlights a meaningful difference between Wikimedia and most ranking systems. Our expected approach is to use the inverse of an article's quality score (either ORES or a language-agnostic approach such as Lewoniewski et al.'s). More complex approaches such as matching each article with the most similar featured-class article (e.g., building on this approach) and estimating how much work would be required to reach it would be really fun to explore, but our metric for work-required will be shared with participants and so it needs to be fairly reasonable but can be rough.
Other work for this task (gender/geography information as fairness criteria; articles associated with each WikiProject) continues to be developed as part of the ethical recommender and list-building work.
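
As a rough illustration of the "inverse of quality" idea in this update: the sketch below assumes the quality score is already normalized to [0, 1] and reads "inverse" as `1 - quality`. Both the function name and that particular form are assumptions for illustration, not the metric that was settled on.

```python
def work_required(quality: float) -> float:
    """Rough work-needed proxy: high for stubs, low for featured articles.

    Assumes `quality` is normalized to [0, 1]; values outside that range
    are clipped. The `1 - quality` form is one reading of "inverse".
    """
    clipped = min(1.0, max(0.0, quality))
    return 1.0 - clipped


# Hypothetical quality scores for three articles (illustrative values only).
scores = {
    "Stub-like article": 0.1,
    "Mid-quality article": 0.5,
    "Featured-like article": 0.95,
}

# Articles that need the most work come first.
ranked = sorted(scores, key=lambda title: work_required(scores[title]), reverse=True)
```

Because the proxy is a simple monotone transform of the quality score, any quality model that outputs a bounded score could be swapped in without changing the ranking logic.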

Weekly updates:

  • I built a language-agnostic quality model for English to use for understanding the distribution of "work needed" across a given WikiProject's articles. It combines features from the ORES articlequality model (lang-agnostic and enwiki), which is itself based on Morten's actionable quality model, and from Lewoniewski et al.: log page length, # refs / page length, # images, # level 2/3 headings. It predicts a quality score between 0 and 1, which you could split into six equal-sized classes for Stub, Start, C, B, GA, and FA -- e.g., 0 to 0.167 = Stub. I evaluated it against all the ORES model predictions from last month. Ideally I will eventually pull actual ground-truth data from Wikipedians' assessments, but ORES is a good proxy for now (and does not require gathering historical article revisions to generate the data). It achieved a Pearson correlation (linear relationship) of 0.896 with the ORES predictions, which I consider good enough to proceed (screenshot below of scatterplot of y-axis ORES true vs. x-axis language-agnostic predicted quality scores -- in practice, the predicted scores would be bounded to [0,1]). With this model, you can easily see that, e.g., articles in WikiProject Military History are generally much higher quality than articles in WikiProject Cities. I developed this model despite the existence of the ORES model for two reasons:
    • It is very fast to make bulk predictions with the model because it uses just a few simple features derived from MediaWiki tables or wikitext, and prediction requires only a simple linear regression with no package dependencies -- i.e., I can take the input data for an article, transform it, and multiply by the learned coefficients to get a predicted score.
    • While I only need this for English right now, the hope is that the learned coefficients and approach can eventually be extended, as a rough but simple proxy for quality, to other languages that don't have the same article quality data available.
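
The "transform the data and multiply by learned coefficients" step described above might look roughly like the following dependency-free sketch. The coefficient and intercept values are made up for illustration (the learned values are not given in this task), the natural log is an assumption, and all function names are hypothetical.

```python
import math

# Hypothetical coefficients for illustration only; not the learned values.
COEFFICIENTS = {
    "log_page_length": 0.08,
    "refs_per_length": 20.0,
    "num_images": 0.01,
    "num_headings": 0.02,
}
INTERCEPT = -0.4

CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]


def extract_features(page_length, num_refs, num_images, num_headings):
    """Transform raw counts into the four model features."""
    length = max(page_length, 1)  # guard against log(0) and division by zero
    return {
        "log_page_length": math.log(length),  # natural log is an assumption
        "refs_per_length": num_refs / length,
        "num_images": num_images,
        "num_headings": num_headings,  # level 2/3 headings
    }


def predict_quality(features):
    """Linear combination of the features, bounded to [0, 1]."""
    raw = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return min(1.0, max(0.0, raw))


def quality_class(score):
    """Split [0, 1] into six equal-width bins (0 to 0.167 = Stub, and so on)."""
    index = min(int(score * len(CLASSES)), len(CLASSES) - 1)
    return CLASSES[index]


def pearson(xs, ys):
    """Pearson correlation, e.g. between predicted scores and ORES scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A call such as `quality_class(predict_quality(extract_features(50000, 100, 10, 20)))` then gives a class label for one article; since only arithmetic is involved, scoring a large batch of articles stays cheap.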

Weekly updates:

  • Discussed specifics of what an easy and hard task would be for the challenge and how to measure the fairness of the resulting lists. Making decisions but no "homework" required of me this week.

Weekly updates:

  • Discussed specifics of what metric would be used to measure the quality of a submission. This mainly relates to whether you take a single static result or many iterations of the same query, which allows different articles to occupy the top spots and shares the "exposure" more widely. As part of this, I will look into what proportion of recommendations are skipped by Wikipedians. I think I should be able to extract this from the Newcomer Homepage EventLogging but won't know until I try...
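
To make the static-versus-repeated-rankings distinction concrete, here is a small sketch that averages each article's exposure over several rankings of the same query. The DCG-style position discount `1 / log2(rank + 1)` and the function name are assumptions for illustration; this is not the metric chosen for the track.

```python
import math
from collections import defaultdict


def mean_exposure(rankings):
    """Average position-discounted exposure per article across rankings.

    Uses a discount of 1 / log2(rank + 1), with rank starting at 1.
    The discount is an assumed stand-in, not the track's metric.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, article in enumerate(ranking, start=1):
            totals[article] += 1.0 / math.log2(rank + 1)
    return {article: total / len(rankings) for article, total in totals.items()}


# One static ranking repeated three times vs. three rotated rankings.
static = [["A", "B", "C"]] * 3
rotated = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

static_exposure = mean_exposure(static)
rotated_exposure = mean_exposure(rotated)
```

With the static list, article A always receives the top-position exposure; with the rotated lists, all three articles end up with the same average exposure, which is the "sharing" effect described above.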

Weekly updates:

  • Discussed specifics of how long result lists should be for the different tasks and worked on task explanation doc that will be shared with participants to frame and describe the task.

Weekly updates:

  • Moving on to the process of choosing WikiProjects for training data. Ideally these will be WikiProjects with a mixture of attributes so the fairness criteria can be reasonably applied. Luckily, the spreadsheet I put together earlier of all WikiProjects, # of articles, and details on biographies and geography will help with this process. I would like to add a measure of activity -- probably something around # of articles tagged or rated for quality in the last, e.g., 3 months -- to help identify the projects most likely to have high coverage.

Weekly update:

  • Chose initial set of ~50 WikiProjects for training data. Now identifying what keywords we would provide as "queries" for each WikiProject -- e.g., WikiProject Agriculture would be associated with: agriculture, crops, livestock, forests, farming.
  • Came up with an ad-hoc way of assessing WikiProject activity (as a proxy for likely completeness). I look at how many annotations (new articles tagged, new quality assessments, or new importance assessments) were made by a given WikiProject in the last 90 days. No clear threshold between active/inactive (for example, a project might have few recent annotations but still have excellent coverage if their topic is not one that often has new articles and they did much of the tagging work years ago) but it's a good gut-check. Data:
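
As a rough illustration of the 90-day activity check described above (not the actual query that was used), the sketch below counts annotations per WikiProject within a time window. The input shape, a list of `(project, timestamp)` pairs, and the function name are assumptions; the example records are invented.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_annotation_counts(annotations, now, window_days=90):
    """Count annotations per WikiProject made within the last `window_days`.

    `annotations` is an iterable of (project, timestamp) pairs, where an
    annotation is a new article tag, quality assessment, or importance
    assessment. This input format is hypothetical.
    """
    cutoff = now - timedelta(days=window_days)
    return Counter(project for project, ts in annotations if ts >= cutoff)


# Invented example records, for illustration only.
now = datetime(2021, 3, 1)
annotations = [
    ("WikiProject Agriculture", datetime(2021, 2, 20)),
    ("WikiProject Agriculture", datetime(2021, 1, 5)),
    ("WikiProject Agriculture", datetime(2020, 6, 1)),  # outside the window
    ("WikiProject Jazz", datetime(2020, 11, 15)),       # outside the window
]
counts = recent_annotation_counts(annotations, now)
```

Projects with no recent annotations simply do not appear in the result (a `Counter` returns 0 for them), which fits the caveat that a low count is a gut-check rather than a hard active/inactive threshold.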

Weekly update:

  • Completed keyword generation and realized that we'll likely want to allow for manual generation of keywords as well because many projects are complicated to describe via search keywords but have a well-defined scope.
  • Put together simple code for extracting articles that can be used for generating the dataset that is presented to participants: