[Request] List of articles without references AND highly visible
Closed, Declined (Public)

Description

  • Context.

The #1Lib1Ref (One Librarian, One Reference) campaign encourages librarians, researchers, and volunteers worldwide to improve the reliability of Wikipedia by adding citations to articles that lack references. One of the most impactful ways to contribute is by focusing on highly visible articles that lack proper citations.

  • Description.

I'm looking for a list of the top 100 Wikipedia articles (ordered by visibility) that meet two key criteria:

  1. They lack references (i.e., they have no references).
  2. They are highly visible (i.e., they receive significant traffic and are frequently viewed by readers).
  • Expected Deliverable.

Actionable code that users can run themselves (ideally a Python notebook, and even better, a tool).
If this is not possible, the resulting data frames for the following languages (en, es, fr, id, sr, hr, pl, ro) are also good for now.
The information delivered last time is desirable (revision_timestamp, page_id, page_title, revision_id, page_length, num_refs, num_wikilinks, num_categories, num_media, num_headings, num_views).
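To make the two criteria concrete, here is a minimal sketch of the selection step, assuming each article is already available as a row with the column names listed above. The function name `top_unreferenced`, the `top_n` parameter, and the sample rows are hypothetical and only for illustration; this is not the code from the earlier delivery.

```python
from typing import Dict, List


def top_unreferenced(rows: List[Dict], top_n: int = 100) -> List[Dict]:
    """Return the top_n most-viewed rows whose num_refs is zero.

    Assumes every row carries 'num_refs' and 'num_views' keys,
    matching the column names requested in this task.
    """
    # Criterion 1: keep only articles with no references at all.
    unreferenced = [row for row in rows if row["num_refs"] == 0]
    # Criterion 2: order by visibility (page views), highest first.
    ranked = sorted(unreferenced, key=lambda row: row["num_views"], reverse=True)
    return ranked[:top_n]


# Made-up sample rows, purely illustrative (not real Wikipedia data).
sample = [
    {"page_id": 1, "page_title": "A", "num_refs": 0, "num_views": 50},
    {"page_id": 2, "page_title": "B", "num_refs": 3, "num_views": 900},
    {"page_id": 3, "page_title": "C", "num_refs": 0, "num_views": 700},
    {"page_id": 4, "page_title": "D", "num_refs": 0, "num_views": 120},
]

result = top_unreferenced(sample, top_n=2)
for row in result:
    print(row["page_title"], row["num_views"])
```

With a pandas data frame `df` holding the same columns, the equivalent would presumably be `df[df["num_refs"] == 0].sort_values("num_views", ascending=False).head(100)`, run once per language.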

  • Estimated Effort.

I believe that since this has already been delivered once, it shouldn't take more than a week.

  • Priority. It should be available before the campaign starts in May.

I need this task resolved in:

  • 1 month.
  • 3 months.
  • 6 months.
  • Whenever you get to it :-)
  • Other. Do you have any other questions or comments?

For use by WMF Research team; please leave everything below as it is:

  1. Does the request serve one of the existing Research team's audiences? If yes, choose the primary audience. (1 of 4)
  2. What is the type of work requested?
  3. What is the impact of responding to this request?
    • Support a technology or policy need of one or more WM projects
    • Advance the understanding of the WM projects.
    • Something else. If you choose this option, please explain briefly the impact below.

Event Timeline

SEgt-WMF triaged this task as Medium priority.
leila raised the priority of this task from Medium to Needs Triage. Mar 14 2025, 9:24 PM
leila added a subscriber: Isaac.

Leaving the task as Needs Triage until Isaac gets a chance to review. @Isaac to determine prioritization and assignment as relevant. Thanks.

A year ago, I adapted the notebook we used for the ICWSM paper for this task: https://gitlab.wikimedia.org/paragon/miscellanea/-/blob/main/notebooks/recent-revisions.ipynb
I have added notes to provide context and highlighted hardcoded values that need to be modified or (ideally) parameterized.

Thanks @Pablo! I checked in with @SEgt-WMF and am summarizing here what we discussed:

  • Research won't be getting to this in Q3 because we currently have a number of other, higher priority projects wrapping up.
  • At the same time, given that this is somewhat of a recurring request and there might be other tweaks that you decide you want to incorporate, I think it would be best if we could help you all run the analysis as opposed to us continuing to pass static snapshots over (the "actionable code" you mention in the description). I think we should be pretty close to that with Pablo's code, which already has almost everything you need. The main challenge will be updating a few parameters and making sure that you can run the code on the cluster. There are some additional features there (# of images, categories, etc.) that may or may not be relevant to your needs too.
  • I'm recommending that you make an attempt to run the notebook yourselves. If you run into questions / issues, there are some simpler means of getting support outside of this prioritization process. For general issues with running Jupyter notebooks on the cluster, our internal #working-with-data channel would be a good resource. If you have larger questions, you can also book office hours with either Research (Pablo presumably) or Product Analytics (who have a wealth of experience in running notebooks and thinking through data challenges).
  • I'm going to decline this task but please don't hesitate to reach out with further questions. Thanks for your willingness to take a stab at running this code and good luck! I'd also encourage you to share the updated code in this task once you get there just for completeness.

Thank you so much @Pablo and @Isaac - I will share the updated code in this task once I get there!