
Co-organize Fair Ranking Track at TREC 2021
Closed, ResolvedPublic

Description

What?
Wikimedia Foundation's Research team has partnered with the Fair Ranking Track at TREC (a long-standing text retrieval benchmarking conference) for 2021. As a partner, our role is to work with the track organizers to identify and provide a Wikimedia-related dataset and a specific question for the participants of the track to work/compete on. (For example, in 2019, the track partnered with Semantic Scholar from the Allen Institute for Artificial Intelligence to design a competition for fair ranking of scholarly paper abstracts.)

Who runs the competition?
The National Institute of Standards and Technology (NIST) runs the competition, and the organizers are the same as in 2019 (see https://fair-trec.github.io).

NIST's interest is in how we measure and audit systems in terms of fairness.

Summary
This dataset will focus on English Wikipedia WikiProjects and building fairly ranked lists of articles relevant to a WikiProject. For example, for WikiProject Jazz: which articles are relevant, and how do you rank them in a way that fairly represents different gender identities and geographic regions? This initial challenge will focus on English Wikipedia only so that effort can be focused on the chosen fairness aspects as opposed to the challenges of working with multilingual data. Future challenges may expand to other languages, but English Wikipedia is one of only a few wikis that use the PageAssessments extension, which greatly simplifies the process of identifying WikiProjects and which articles are tagged as relevant (and any quality / importance ratings).

Coordinator
@Isaac will act as the coordinator and point of contact on WMF's end.

Event Timeline

leila triaged this task as High priority.Nov 15 2019, 7:11 PM
leila created this task.
leila changed the task status from Open to Stalled.Dec 9 2019, 9:19 PM
leila lowered the priority of this task from High to Medium.
leila edited projects, added Research-Freezer; removed Research.

Currently stalled, as the track may keep the previous year's dataset.

Isaac edited projects, added Research; removed Research-Freezer.
Isaac subscribed.

Assigning this to myself as it's clear that it's active now. Looks like TREC is interested in a task around building WikiProject worklists while taking into account equity aspects of the articles that show up in the list. I'll continue to meet with the organizers to help shape the task and will provide dataset support over the next few months.

See: https://fair-trec.github.io/

leila changed the task status from Stalled to Open.Jan 4 2021, 7:30 PM
leila awarded a token.

Weekly update:
Discussed how to incorporate some notion of work-required into the rankings. If a ranking model has to perform best for stub articles, it greatly increases the difficulty of the task and highlights a meaningful difference between Wikimedia and most ranking systems. Our expected approach is to use the inverse of an article's quality score (either ORES or a language-agnostic approach such as Lewoniewski et al.'s). More complex approaches such as matching each article with the most similar featured-class article (e.g., building on this approach) and estimating how much work would be required to reach it would be really fun to explore, but our metric for work-required will be shared with participants and so it needs to be fairly reasonable but can be rough.
Other work for this task (gender/geography information as fairness criteria; articles associated with each WikiProject) continues to be developed as part of the ethical recommender and list-building work.
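
A minimal sketch of the simple version of the work-required metric described above, assuming a quality score in [0, 1] from ORES or a language-agnostic model (the function name and clamping are mine, not the final metric shared with participants):

```
def work_required(quality_score):
    """Rough work-required metric: the complement of a predicted quality
    score in [0, 1]. A stub near 0 needs the most work; a featured-class
    article near 1 needs the least. Illustrative only."""
    quality_score = max(0.0, min(1.0, quality_score))  # guard against out-of-range predictions
    return 1.0 - quality_score
```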

Weekly updates:

  • I built a language-agnostic quality model for English to use for understanding the distribution of "work needed" on a given WikiProject's articles. It's a combination of features from the ORES articlequality model (lang-agnostic and enwiki), which is itself based on Morten's actionable quality model, and Lewoniewski et al.: log page length, # refs / page length, # images, # level 2/3 headings. It predicts a quality score between 0 and 1, which you could split into six equal-sized classes for Stub, Start, C, B, GA, and FA -- e.g., 0 to 0.167 = Stub (a rough sketch of the model follows below). I evaluated it against all the ORES model predictions from last month. Ideally I will eventually pull actual groundtruth data from Wikipedians' assessments, but ORES is a good proxy for now (and does not require gathering historical article revisions to generate the data). It achieved a Pearson correlation (linear relationship) with the ORES predictions of 0.896, which I consider good enough to proceed (screenshot below of a scatterplot of y-axis ORES true vs. x-axis language-agnostic predicted quality scores -- in practice, the predicted scores would be bound to [0,1]). With this model, you can easily see that, e.g., articles in WikiProject Military History are generally much higher quality than articles in WikiProject Cities. I developed this model despite the existence of the ORES model for two reasons:
    • It is very fast to make bulk predictions with the model because it uses just a few simple features derived from MediaWiki tables or wikitext, and prediction just requires a simple linear regression with no package dependencies -- i.e., I can take the input data for an article, transform it, and multiply by the learned coefficients to get a predicted score.
    • While I only need this for English right now, the hope is that the learned coefficients and approach can eventually be extended, as a rough but simple proxy for quality, to other languages that don't have the same article-quality data available.

Screen Shot 2021-02-19 at 1.31.00 PM.png (408×604 px, 67 KB)
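
A rough sketch of the model's shape. The coefficients and intercept here are placeholders (the real values come from the linear regression fit against ORES predictions described above), so this shows the structure of the model rather than the learned parameters:

```
import math

# Placeholder coefficients: the real values come from a linear regression
# fit against ORES articlequality predictions; these numbers are
# illustrative only.
COEFS = {
    "log_page_length": 0.10,
    "refs_per_length": 0.20,
    "num_images": 0.05,
    "num_headings": 0.03,
}
INTERCEPT = -0.50

def predict_quality(page_length_bytes, num_refs, num_images, num_headings):
    """Language-agnostic quality score in [0, 1] from a few simple features."""
    features = {
        "log_page_length": math.log(max(page_length_bytes, 1)),
        "refs_per_length": num_refs / max(page_length_bytes, 1),
        "num_images": num_images,
        "num_headings": num_headings,  # level 2/3 headings
    }
    score = INTERCEPT + sum(COEFS[k] * v for k, v in features.items())
    return min(max(score, 0.0), 1.0)  # bound predictions to [0, 1]

# Map the continuous score onto six equal-width classes, e.g. 0-0.167 = Stub.
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

def quality_class(score):
    return CLASSES[min(int(score * 6), 5)]
```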

Weekly updates:

  • Discussed specifics of what an easy and hard task would be for the challenge and how to measure the fairness of the resulting lists. Making decisions but no "homework" required of me this week.

Weekly updates:

  • Discussed specifics of what metric would be used to measure the quality of a submission. This mainly relates to whether you take a single static result or many iterations of the same query to allow for different articles to occupy the top spots and more widely share the "exposure". As part of this, I will look into what proportion of recommendations are skipped by Wikipedians. I think I should be able to extract this from the Newcomer Homepage Eventlogging but won't know until I try...
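
To make the "exposure" idea concrete, a small sketch of averaging per-article exposure over repeated rankings of the same query; the geometric position discount is my assumption for illustration, not the track's official metric:

```
from collections import defaultdict

def expected_exposure(rankings, gamma=0.5):
    """Average per-article exposure over repeated rankings of one query.

    `rankings` is a list of ranked lists of article IDs; exposure at rank i
    uses a geometric position discount gamma**i (an assumed discount, not
    the official metric).
    """
    exposure = defaultdict(float)
    for ranking in rankings:
        for i, article in enumerate(ranking):
            exposure[article] += gamma ** i
    n = len(rankings)
    return {article: e / n for article, e in exposure.items()}

# A single static ranking concentrates exposure on the same top articles;
# rotating the rankings across iterations spreads it more evenly.
```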

Weekly updates:

  • Discussed specifics of how long result lists should be for the different tasks and worked on task explanation doc that will be shared with participants to frame and describe the task.

Weekly updates:

  • Moving on to the process of choosing WikiProjects for training data -- ideally WikiProjects with a mixture of attributes so the fairness criteria can be reasonably applied. Luckily, the spreadsheet I put together earlier of all WikiProjects, # of articles, and details on biographies and geography will help with this process. I would like to add a measure of activity -- probably something around # of articles tagged or rated for quality in the last e.g., 3 months -- to help identify the projects most likely to have high coverage.

Weekly update:

  • Chose initial set of ~50 WikiProjects for training data. Now identifying what keywords we would provide as "queries" for each WikiProject -- e.g., WikiProject Agriculture would be associated with: agriculture, crops, livestock, forests, farming.
  • Came up with an ad-hoc way of assessing WikiProject activity (as a proxy for likely completeness): I look at how many annotations (new articles tagged, new quality assessments, or new importance assessments) were made by a given WikiProject in the last 90 days. There's no clear threshold between active/inactive (for example, a project might have few recent annotations but still have excellent coverage if its topic is not one that often has new articles and they did much of the tagging work years ago), but it's a good gut check. Data: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/list-building/enwiki_wikiproject_activity_2021_04_06.tsv
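
The 90-day counting itself is straightforward; a sketch with a hypothetical input format (in practice the annotations come from PageAssessments data):

```
from collections import Counter
from datetime import datetime, timedelta

def recent_activity(annotations, days=90, as_of=None):
    """Count annotations (article tagged, quality assessed, importance
    assessed) per WikiProject within the last `days` days.

    `annotations` is an iterable of (wikiproject, timestamp) pairs; this
    input format is hypothetical and just illustrates the counting.
    """
    as_of = as_of or datetime.now()
    cutoff = as_of - timedelta(days=days)
    counts = Counter()
    for project, ts in annotations:
        if ts >= cutoff:
            counts[project] += 1
    return counts
```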

Weekly update:

  • Completed keyword generation and realized that we'll likely want to allow for manual generation of keywords as well because many projects are complicated to describe via search keywords but have a well-defined scope.
  • Put together simple code for extracting articles that can be used for generating the dataset that is presented to participants: https://public.paws.wmcloud.org/55703823/Processing%20Text%20Dumps.ipynb
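
The actual extraction code is in the linked PAWS notebook; for context, a rough sketch of the general shape of streaming pages out of a bzipped MediaWiki XML dump without loading it into memory (not the notebook's code):

```
import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, wikitext) pairs for main-namespace pages in a
    bzipped MediaWiki XML dump. Illustrative sketch only."""
    with bz2.open(dump_path, "rb") as f:
        title, ns = None, None
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "ns":
                ns = elem.text
            elif tag == "text" and ns == "0":  # main (article) namespace only
                yield title, elem.text or ""
            elif tag == "page":
                elem.clear()  # free memory for processed pages
```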

Weekly update:

  • Nothing concrete, but discussion around how to prepare to get an initial dataset out to participants and to do the assessments that will be necessary in the summer when we build the test set -- i.e., for a WikiProject of our creation, is any given Wikipedia article relevant to its scope? NOTE: we won't actually create the WikiProject and tag articles -- this will just be an external dataset for labeling.

Isaac, thank you for your update. (I remain very excited about this work.:) Beyond the research angle, I suggest you check about the question of which articles to potentially exclude with Legal and Security as well. Even though all the data you will use is already public, they may have recommendations for the type of data to exclude to avoid potential issues.

I suggest you check about the question of which articles to potentially exclude with Legal and Security as well

Sounds good -- I'll try to check with them next week.

Weekly update:

  • Checked with Legal; based on their advice, we will not filter articles, but we will include URLs to articles as a better form of attribution and make sure to discuss both the limitations of structured data and the completeness of the data (specific to Wikidata as a source of some of our fairness constructs).
  • Preparing training data set -- list of articles associated w/ each of our chosen WikiProjects, predicted quality, associated continents
  • Official description / dataset should go out shortly

@Isaac thank you for the update and follow up with Legal.

Weekly updates:

  • Generated dataset of links to/from articles to assist in any graph-based approaches to list-building.

Weekly updates:

  • No team meeting this week, but I generated evaluation metadata to be used in the task
  • Remaining support from me should be mostly peripheral for at least the next few months
Isaac renamed this task from Identify and prepare a data-set for Fair Ranking Track at TREC to Co-organize Fair Ranking Track at TREC.May 21 2021, 2:15 PM
Isaac updated the task description. (Show Details)

Weekly updates:

  • Nothing from me this week -- team's focus is on preparing the evaluation metrics and a baseline system which aren't pieces I'm responsible for

Weekly updates:

  • Discussed fairness "goals" for a given result. This is necessary for computing the fairness aspect of the performance. For example, if 90% of the biographies in a WikiProject are about men, what is the expectation for the percentage of biographies in the results list that are about men? Is it 81% (current distribution on enwiki)? ~50% (ideal world where all is equal w/r/t gender)? Something in between 90% and 50%? We're leaning towards the latter (e.g., halfway between 90% and 50%), not because it's actually where we think the goal is, but because it provides a mixture of feasibility (the model can only work with the existing articles) and a strong push towards equity. There will be some exceptions of course -- e.g., WikiProject Women Scientists wouldn't be expected to be "fair" w/r/t gender.
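
As a worked example of the "halfway" option: a WikiProject at 90% men / 10% women would get a target of 70% / 30%. A small sketch (the midpoint rule is the option we're leaning towards, not a final decision):

```
def fairness_target(current, ideal=None):
    """Target group distribution for a results list: halfway between the
    WikiProject's current distribution and an equal ('ideal') split.

    e.g. fairness_target({"men": 0.9, "women": 0.1}) -> {"men": 0.7, "women": 0.3}
    """
    if ideal is None:
        ideal = {g: 1.0 / len(current) for g in current}  # equal shares
    return {g: (current[g] + ideal[g]) / 2.0 for g in current}
```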

Weekly updates:

  • I'm behind on helping come up with WikiProjects for the evaluation phase of the challenge but will work on that early next week. Otherwise, the rest of the team is working on the code for evaluating models -- i.e., how to compute the relevance/fairness scores for a given ranking.

Weekly updates:

  • Built evaluation set of 50 WikiProjects with co-organizers. These are mostly new WikiProjects so there's no chance of participants training their systems on them. The assessors will determine what articles would fit into these WikiProjects based on the submissions (they'll only evaluate articles people rank as being relevant).
  • Prepared validation script for submissions (making sure they're in the right format etc.)
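
For illustration, a very rough sketch of the kind of checks such a validation script performs. The field names here ("id", "ranking") are hypothetical; the real script follows the track's submission format:

```
import json

def validate_submission(path, expected_query_ids):
    """Return a list of format problems found in a submission file.

    Assumes (hypothetically) a JSON-lines file where each line has a query
    'id' and a 'ranking' list of page IDs.
    """
    errors = []
    seen = set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {lineno}: not valid JSON")
                continue
            if "id" not in record or "ranking" not in record:
                errors.append(f"line {lineno}: missing 'id' or 'ranking'")
                continue
            if not isinstance(record["ranking"], list):
                errors.append(f"line {lineno}: 'ranking' is not a list")
            seen.add(record["id"])
    missing = set(expected_query_ids) - seen
    if missing:
        errors.append(f"missing queries: {sorted(missing)}")
    return errors
```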

Weekly update:

  • 24 submissions from 4 teams came in. I'm told this is fewer than in previous years, but we did have some new teams. I don't know how this compares to the rest of TREC.
  • The number of unique document+query pairs was too many to get fully assessed (~700,000), so I've been supporting the group in determining a priority for which document+query pairs to evaluate first, so that we have sufficient groundtruth to accurately judge the results.
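
One generic, pooling-style way to build such a priority (a sketch, not necessarily the exact scheme we settled on): score each query+document pair by how many runs return the document near the top, breaking ties by the best rank it achieves:

```
from collections import defaultdict

def prioritize_pairs(runs, depth=20):
    """Order (query_id, doc_id) pairs for assessment.

    `runs` maps run_id -> {query_id: ranked list of doc_ids}. Pairs returned
    within the top `depth` ranks by more runs come first; ties are broken by
    the best rank observed. A generic pooling-style heuristic.
    """
    votes = defaultdict(int)
    best_rank = defaultdict(lambda: float("inf"))
    for run in runs.values():
        for qid, ranking in run.items():
            for rank, doc in enumerate(ranking[:depth], start=1):
                votes[(qid, doc)] += 1
                best_rank[(qid, doc)] = min(best_rank[(qid, doc)], rank)
    return sorted(votes, key=lambda pair: (-votes[pair], best_rank[pair]))
```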

Weekly updates:

  • Priority queue designed for assessments, and that data is now coming in. Just waiting at this point. The rest of the month will be work mainly by others on the team to evaluate the submissions. Then in late September / early October I'll be busier again, helping the team write up the findings and consider next steps

Weekly updates: discussing whether we will submit a proposal to run this track again next year. Thoughts:

Time
  • This first year required weekly, hour-long meetings over a period of several months, along with occasional additional work time developing datasets etc. This is obviously time that could always be spent elsewhere.
  • Next year, the time commitment would very likely be far less. Though we would hope to tweak the project, which would require new datasets, much of the groundwork is there.
Benefits
  • Participation was relatively low this year at 4 teams submitting final systems. We will be discussing how we might increase participation (greater outreach, lower barrier to participation, etc.), but this of course is a concern.
  • One of the main benefits of TREC is that we have assessor time that allows us to generate new, high-quality labeled datasets. For this year, we used that to label new WikiProjects. This helped ensure that no one was accidentally training on the test set, but it didn't generate new data that was particularly novel. Thinking about how to better use assessor time to generate groundtruth datasets that don't currently exist and would help open up new research is one consideration for next year.
  • Per the other organizers, another reason they continue to organize this track is that TREC is unique in scaffolding projects that can bring new students into the fairness / information-retrieval landscape. In a related sense, I am reaching, and bringing Wikimedia challenges to, students I likely wouldn't otherwise connect with because IR is not my field of study.
  • The track was well-aligned with some other internal goals that were prioritized around list-building and prioritization, so this work did generally contribute to my other projects in useful ways. Depending on the tweaks for next year, that would likely remain true.
  • The addition of three external collaborators brings in new perspectives around prioritization / recommendation / fairness that are always useful to hear. We've discussed bringing in some additional collaborators too, which would further add to these benefits (and hopefully reduce workload).
Potential Tweaks
  • This year, we focused solely on English Wikipedia for simplicity. I'd like to see us consider expanding this to other language editions to promote multilingual or language-agnostic approaches to these problems.
  • The organizers are interested in how to model content such as musical albums or gender-oriented health issues that are clearly "gendered" but wouldn't be explicitly labeled as such in Wikidata. Exploring that problem with the hope of being better able to e.g., support campaigns that want to focus on women's health issues instead of just biographies of women would be useful.
  • Thinking about showcasing other datasets at Wikimedia -- e.g., adding a reader aspect to bring in clickstream data

Thank you for your work on this project and for documenting your learnings. Please see below.

Benefits
  • Participation was relatively low this year at 4 teams submitting final systems. We will be discussing how we might increase participation (greater outreach, lower barrier to participation, etc.), but this of course is a concern.

A few questions:

  • How does the participation compare to the previous years for the same track?
  • What is the peak they have seen in this track and do the past/current organizers (including yourself:) have insights why that has happened?
  • Is there a possible correlation between number of participants and the pandemic still being around? (For this track, did people use to meet in person at some point in the year?)

My understanding, and correct me if I'm wrong, is that the time investments for data-sets in TREC pay their dividends in the long run while they can create short-term spikes as well. This is because you effectively have released a standard data-set that researchers will start using and referring to, and that's where a good part of the impact can lie.

How does the participation compare to the previous years for the same track?

Fewer teams, though I'm told two of the teams this year are new.

What is the peak they have seen in this track and do the past/current organizers (including yourself:) have insights why that has happened?
Is there a possible correlation between number of participants and the pandemic still being around? (For this track, did people use to meet in person at some point in the year?)

Not sure, but I can ask. The pandemic presumably has a lot to do with it. The TREC conference was in-person in prior years, and labs presumably worked together in person on it too.

My understanding, and correct me if I'm wrong, is that the time investments for data-sets in TREC pay their dividends in the long run while they can create short-term spikes as well. This is because you effectively have released a standard data-set that researchers will start using and referring to, and that's where a good part of the impact can lie.

Yep, that's the hope. Time will tell.

Weekly updates:

  • Usual meeting was cancelled for logistical reasons
  • I sent along my feedback on whether to submit a proposal for next year and we'll presumably discuss next week

Weekly updates:

  • Discussion of the ideas I raised above with the team.
    • Multilingual / non-English component. Team's thoughts:
      • Multilingual retrieval is also quite hard and might detract from the fairness side of the challenge
      • The assessors we have available via NIST are only guaranteed to know English so this could be a challenge
      • We could perhaps still work in a multilingual aspect, though, by either focusing on English for the challenge while releasing corresponding data for many languages, or having the assessors assess something that is not language-dependent, such as a Wikidata-related piece
      • Interest in the challenge of which articles should be translated between languages (from a fairness standpoint) too, so that might be a way into expanding beyond English as well.
    • Lasting value of dataset:
      • Remains to be seen, but there is definite value in having a baseline that people continue to use for years. Agreed that having assessors focus on creating something new of value is more important than preventing "training on the test set", which is essentially what we focused on this round.
    • Bringing in more teams:
      • Didn't have much time to discuss this, but the feeling is that making it easier to work with the data and to focus on the fairness aspect will help with this. Having more demo systems could be beneficial here.
  • Will be working on a proposal in the next week, which won't be explicit on the changes but will likely list some of the options for updates to our approach for next year.

Weekly update: proposal submitted with these ideas so will wait for feedback from organizers.

Isaac renamed this task from Co-organize Fair Ranking Track at TREC to Co-organize Fair Ranking Track at TREC 2021.Oct 13 2021, 6:54 PM

Update: renamed to scope this to TREC 2021 (will create a new task for TREC 2022). Leaving open until we submit the notebook paper describing the challenge, results, etc., which essentially completes TREC 2021.

Weekly update:

  • Added large section on limitations of data / challenge to TREC notebook

Weekly update: did some cleaning up of notebook paper which will be submitted shortly.

For future context, here's the overview of TREC 2021 including counts of participating teams over the years: https://trec.nist.gov/act_part/conference/papers/overview_30.pdf

Weekly update: I attended the TREC conference this week and gave a presentation on WikiProjects to give participants more context for their work. Saw presentations by student groups -- the approaches were generally two-stage: compute pure relevance rankings of articles and then rerank per some fairness criteria. Many used explicit fairness criteria that were available or inferred. That has never felt like it would work in a Wikimedia context, where we almost never know even close to all the fairness criteria that we'd actually want to respect. A recommendation for next year to help address this was to increase the number of fairness criteria so the explicit approaches start to get overly complicated and push participants towards more general solutions. One group did use a basic semantic diversity approach -- i.e., embed the articles and then aim for a diversity of embeddings in the final ranking. That feels more global but would need some work to focus a little bit more, as linguistic diversity is not the same as representativeness.
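
For context, a simplified sketch of that two-stage pattern (relevance ranking followed by a greedy fairness rerank toward a target group distribution); this is illustrative, not any team's actual system:

```
def greedy_fair_rerank(scored_articles, groups, target, k=50):
    """Greedy rerank toward a target group distribution.

    `scored_articles` is a list of (article_id, relevance_score), `groups`
    maps article_id -> group label, `target` maps group -> desired share.
    At each position, pick the highest-relevance remaining article from the
    group that is currently most under-represented relative to `target`.
    """
    remaining = sorted(scored_articles, key=lambda x: -x[1])  # stage 1: relevance order
    counts = {g: 0 for g in target}
    result = []
    while remaining and len(result) < k:
        n = len(result) + 1
        # Deficit of each group relative to its target share at the next position.
        deficits = {g: target[g] * n - counts[g] for g in target}
        pick = None
        for g in sorted(deficits, key=deficits.get, reverse=True):
            pick = next((a for a in remaining if groups.get(a[0]) == g), None)
            if pick is not None:
                break
        if pick is None:
            pick = remaining[0]  # no remaining article matches any target group
        remaining.remove(pick)
        grp = groups.get(pick[0])
        counts[grp] = counts.get(grp, 0) + 1
        result.append(pick[0])
    return result
```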

Closing this out -- final report published here: https://trec.nist.gov/pubs/trec30/trec2021.html