
Analyze community authored functions that build Wikipedia infoboxes and more
Open, High, Public

Description

Brief summary

The Abstract Wikipedia initiative will make it possible to generate Wikipedia articles by combining community authored programming functions on a "wiki of functions" with the data and lexicographic knowledge (dictionaries, grammar, etc.) on Wikidata.

Today, community authored programming functions are reused across language editions of Wikipedia largely by copying and pasting. For example, someone who wants to calculate a person's age for a biography in their native language may first need to go to English Wikipedia, find the community authored function that calculates ages there, and then copy and paste it to their non-English Wikipedia. This process is error prone, leads to code duplication, and, worse, improvements to a function on one language edition may never make their way to the others.

Wouldn't it be easier if all of these functions were instead available centrally and people didn't have to go through this manual process?

This Outreachy task is about an important first step: finding the different community authored functions that are out there and helping to prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.

Skills required

We think the successful Outreachy intern will know Python or potentially Scala or R.

The project

You will write code to:

  1. Fetch the different community authored functions (also known as "modules" and "templates") on the wikis, and determine how widely each is used in articles and how many pageviews each one receives.
  2. Analyze the similarity between community authored functions hosted across different projects (i.e., looking for redundant or very similar code). This likely requires wiring up some open source packages, meaning the Python (or Scala or R) code will probably need to import libraries that are good at measuring code similarity. There are several potential approaches, and that is part of the fun and the challenge.
  3. If there's enough time, determine whether there are segments of code that can be turned into pure functions in the wiki of functions, an interesting problem domain in computer science. This would likely require wiring up a Lua programming language interpreter (probably also using MediaWiki, the wiki software Wikimedia maintains) and some degree of manual spot checking to verify that candidates for pure functions are correctly identified.
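As a minimal illustration of the similarity step in item 2 (one of the several potential approaches, not a prescribed method), Python's standard-library difflib can score how alike two module sources are. The two Lua snippets below are invented for the example:

```python
from difflib import SequenceMatcher

def similarity(source_a: str, source_b: str) -> float:
    """Return a 0..1 similarity ratio between two code strings."""
    return SequenceMatcher(None, source_a, source_b).ratio()

# Two hypothetical Lua modules that compute an age from two years.
module_a = "local p = {}\nfunction p.age(birth, now)\n  return now - birth\nend\nreturn p\n"
module_b = "local p = {}\nfunction p.age(born, today)\n  return today - born\nend\nreturn p\n"

print(similarity(module_a, module_a))  # identical sources score 1.0
print(similarity(module_a, module_b))  # near-duplicates score close to 1.0
```

A character-level ratio like this is only a starting point; token-based or AST-based comparison would be more robust against renamed variables, which is exactly the kind of trade-off the project would explore.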

During this project, you get to write open source code, and you'll publish your methodology and a report that will become a subpage of the Abstract Wikipedia project page, which will be shared with community volunteers to aid in prioritization of things to turn into functions. The report can also potentially form the basis for a publishable research paper.

Possible mentor(s)

@DVrandecic, founder of Wikidata and project lead of Abstract Wikipedia at the Wikimedia Foundation
@dr0ptp4kt, Engineering Director at the Wikimedia Foundation

Microtask

To show your interest, we encourage you to try to solve this problem:

  1. Write a script that fetches all of the source code on English Wikipedia in the "Module:" namespace. Hint: we have APIs that will make your life easier, and you can find good examples of how to call them by using your favorite search engine or looking at some of the tools on Toolforge. Please try to limit the number of page or API fetches to one per second. We also have dumps with the same content that can be analyzed offline.
  2. Generate a summary report that includes interesting statistics like number of modules, a histogram of file sizes, and so on.
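One possible shape for such a script is sketched below. It is illustrative, not a reference solution: it assumes the action API's allpages generator combined with prop=info (which reports each page's length in bytes), uses 828 as the Module: namespace number, and buckets sizes into an arbitrary 1 KiB histogram.

```python
import json
import time
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def fetch_module_lengths():
    """Yield (title, length-in-bytes) for pages in the Module: namespace (828)."""
    params = {"action": "query", "format": "json", "generator": "allpages",
              "gapnamespace": 828, "gaplimit": "max", "prop": "info"}
    while True:
        data = json.load(urlopen(API + "?" + urlencode(params)))
        for page in data.get("query", {}).get("pages", {}).values():
            yield page["title"], page["length"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carry the continuation token forward
        time.sleep(1)                    # stay at roughly one request per second

def size_histogram(lengths, bucket=1024):
    """Count pages per size bucket (default: 1 KiB buckets)."""
    return Counter(length // bucket for length in lengths)

# Usage (network access required):
#   pairs = list(fetch_module_lengths())
#   print("modules:", len(pairs))
#   print("histogram:", sorted(size_histogram(l for _, l in pairs).items()))
```

Separating the pure summarizing step (size_histogram) from the fetching step keeps the statistics easy to test offline against the dumps mentioned in the hint.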

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 23 2020, 6:36 PM
dr0ptp4kt renamed this task from Analyze user-created modules that generate Wikipedia infoboxes and more to Analyze community authored functions that generate Wikipedia infoboxes and more. · Sep 23 2020, 6:37 PM
dr0ptp4kt updated the task description. (Show Details)
dr0ptp4kt changed the visibility from "Public (No Login Required)" to "Custom Policy". · Sep 23 2020, 6:40 PM
dr0ptp4kt updated the task description. (Show Details) · Sep 23 2020, 6:42 PM
srishakatux changed the visibility from "Custom Policy" to "Outreachy Mentors (Project)".
DVrandecic updated the task description. (Show Details) · Sep 24 2020, 9:43 PM
DVrandecic updated the task description. (Show Details) · Sep 24 2020, 9:45 PM
dr0ptp4kt renamed this task from Analyze community authored functions that generate Wikipedia infoboxes and more to Analyze community authored functions that build Wikipedia infoboxes and more. · Sep 28 2020, 11:46 AM
dr0ptp4kt updated the task description. (Show Details) · Sep 28 2020, 12:03 PM
srishakatux changed the visibility from "Outreachy Mentors (Project)" to "Public (No Login Required)". · Oct 7 2020, 6:19 PM
Udoka_Ugo added a subscriber: Udoka_Ugo. (Edited) · Oct 8 2020, 10:14 AM

Hi, I am Udokaku Ugochukwu, an Outreachy applicant for the December cohort, and I would like to contribute to this project. I have started the microtask for this project.
May I take up this issue?

Hi @Udoka_Ugo. Feel free to post your solution to the microtask to GitHub. Thanks!

Udoka_Ugo added a comment. (Edited) · Oct 8 2020, 6:52 PM

Okay, thanks @dr0ptp4kt. Please can I get a link to the repository? I think I am almost done with the microtask.
Should I drop the link to the repository here?

Regarding the microtask there is no existing repository, as you are supposed to write a standalone script and summary report on your own.

Udoka_Ugo added a comment. (Edited) · Oct 9 2020, 1:44 PM

Okay, thanks Aklapper.
I have done the microtask and pushed it to a repo I created on my GitHub account.
How will it be reviewed, please?
Also, I would like to know what's next after the microtasks. Thank you

Please provide links, otherwise nobody will be able to find things to review. Thanks. :)
For the general process and steps, please see https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps

Okay, thanks.
Here is the link to my microtask: https://github.com/UdokaVrede/Wikimedia-Microtask
Please let me know of any modifications I would be required to make to this report.
Thank you

Hi @Udoka_Ugo, you're going in the correct direction!

The opensearch API returns an incomplete set of pages, so you should try a different API such as https://www.mediawiki.org/wiki/API:Allpages . Note also that you'll need to use continuation (also known as "pagination") to get all of the results, as each API response can only return a subset of the page names. Try making a request using https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=allpages&apnamespace=828&aplimit=max and then, after getting the first result set, set apcontinue to the value found inside the "continue" object of that result set; notice how you can keep repeating that pattern to get more and more records, in batches of up to 500. That's the API Sandbox, but you should be able to adapt the examples into your Python script.
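The continuation pattern described above can be sketched as a short loop. This is illustrative only; it mirrors the Sandbox parameters (list=allpages, apnamespace=828, aplimit=max) and adds the one-request-per-second delay requested in the microtask:

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def next_params(params, response):
    """Merge the server's continuation token into the next request's parameters.

    Returns None when the response carries no "continue" object, i.e. when
    the final batch has been received.
    """
    if "continue" not in response:
        return None
    return {**params, **response["continue"]}

def all_module_titles():
    """Collect every page title in the Module: namespace, batch by batch."""
    params = {"action": "query", "format": "json", "list": "allpages",
              "apnamespace": 828, "aplimit": "max"}
    titles = []
    while params is not None:
        response = json.load(urlopen(API + "?" + urlencode(params)))
        titles.extend(page["title"] for page in response["query"]["allpages"])
        params = next_params(params, response)
        time.sleep(1)  # be polite: roughly one request per second
    return titles
```

Keeping the token-merging logic in its own small function (next_params) makes the continuation behavior easy to unit test without touching the network.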

I recommend labeling the axes on the matplotlib report, and it's worth considering whether there are additional interesting statistics you might want to generate, at least if you have time for more.

Udoka_Ugo added a comment. (Edited) · Oct 12 2020, 8:06 PM

Alright, thanks, I'll do that. Sorry for the late reply; I broke a toe and had to see a doctor.

tanny411 added a subscriber: tanny411. (Edited) · Oct 14 2020, 8:28 AM

Hi, I am an Outreachy applicant and interested in joining this project. I went through the task and will get started with it right away.
Just need a little clarification: are we all going to solve the same task, or are there others I have to look at?
Thanks :)

Edit: Ran into a few questions. I cannot figure out what pageviews means. Is it simply the number of times a page was visited? If so, how can I find out how and when those functions were used, to create a sort of usage statistics?

Hi @tanny411, the Microtask is the same for all. You'll want to start with the Microtask first, and we'll select the intern based on the Microtask entries.

Regarding "pageviews", yes, this refers to the number of times a page has been read. There are several ways to obtain pageview information (cf. https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews ), but that's not necessary for the Microtask.
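As one pointer (again, not needed for the Microtask), the AQS documentation linked above describes a REST endpoint for per-article pageview counts. A small sketch of building such a request URL, assuming the per-article path structure shown in that documentation (project, access method, agent type, article, granularity, date range):

```python
from urllib.parse import quote

AQS = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(article, start, end, project="en.wikipedia",
                    access="all-access", agent="all-agents", granularity="daily"):
    """Build an AQS per-article pageviews URL; start/end are YYYYMMDD strings."""
    # quote() protects titles containing slashes or non-ASCII characters
    return "/".join([AQS, project, access, agent, quote(article, safe=""),
                     granularity, start, end])

print(per_article_url("Douglas_Adams", "20201001", "20201031"))
```

A GET request to the resulting URL returns JSON with one entry per day; summing the "views" fields would give a total for the period.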

Thanks for checking in and please do email me directly at abaso _at_ wikimedia [.] org (trying to avoid spambots here with this email string) or comment here in case of any question.

Thanks @dr0ptp4kt. I was working with the revisions API, where I wanted to get the content for all the pages using a generator, but the API doesn't seem to return revision content for most pages.
Also, I wanted to get only the latest revision content, but that seems to be possible only for single-page queries. Could I get a little help here?

gengh added a subscriber: gengh. · Dec 16 2020, 1:46 PM
DVrandecic triaged this task as High priority. · Dec 16 2020, 6:05 PM
DVrandecic moved this task from To triage to Phase γ on the Abstract Wikipedia board.