Analyze community authored functions that build Wikipedia infoboxes and more
Closed, Resolved, Public

Description

Brief summary

The Abstract Wikipedia initiative will make it possible to generate Wikipedia articles with a combination of community authored programming functions on a "wiki of functions" and the data and lexicographic (dictionary, grammar, etc.) knowledge on Wikidata.

Today the way community authored programming functions are used on different language editions of Wikipedia involves a lot of copying and pasting. If someone wants to calculate the age of someone for a biography in their native language, they may need to first go to English Wikipedia, for example, find the community authored programming function that calculates ages in English, then copy and paste it to their non-English Wikipedia. This process is error-prone and leads to code duplication. Worse, improvements to functions on one language edition may never make their way to other language editions.

Wouldn't it be easier if all of these functions were instead available centrally and people didn't have to go through this manual process?

This Outreachy task is about an important first step: finding the different community authored functions that are out there and helping to prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.

Skills required

We think the successful Outreachy intern will know Python or potentially Scala or R.

The project

You will write code to:

  1. Fetch the different community authored functions (also known as "modules" and "templates") on the wikis, and determine their usage in articles and how many pageviews use each community authored function.
  2. Analyze the similarity between community authored functions hosted across different projects (i.e., looking for redundant or very similar code). This likely requires wiring up some open source packages, meaning the Python (or Scala or R) code probably needs to import some libraries that are good at measuring code similarity. But there are several potential approaches and this is part of the fun and the challenge.
  3. If there's enough time, determine whether there are segments of code that can be turned into pure functions in the wiki of functions, which is an interesting problem domain in computer science. This would likely require wiring up a Lua programming language interpreter (probably also using the wiki software Wikimedia maintains, MediaWiki) and some degree of manual spot checking to verify that candidates for pure functions are correctly identified.
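As a rough illustration of step 2, the following is a minimal sketch of one possible similarity measure, using only Python's standard `difflib` module. It is not the project's chosen method. The Lua snippets, the `normalize_lua` helper, and the variable names are made up for this example, and the comment stripping is deliberately naive.

```python
import difflib
import re


def normalize_lua(source: str) -> list[str]:
    """Drop Lua line comments and blank lines so cosmetic differences matter less.

    Naive on purpose: it ignores block comments and "--" inside string literals.
    """
    lines = []
    for line in source.splitlines():
        line = re.sub(r"--.*$", "", line).strip()
        if line:
            lines.append(line)
    return lines


def similarity(source_a: str, source_b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized lines are identical."""
    matcher = difflib.SequenceMatcher(
        None, normalize_lua(source_a), normalize_lua(source_b), autojunk=False
    )
    return matcher.ratio()


# Hypothetical module bodies standing in for one function copied between two wikis.
enwiki_module = """
local p = {}
-- Compute age in years
function p.age(birth, today)
    return today - birth
end
return p
"""

dewiki_module = """
local p = {}
-- Alter in Jahren berechnen
function p.age(birth, today)
    return today - birth
end
return p
"""

unrelated_module = """
local q = {}
function q.hello()
    return "Hello"
end
return q
"""

print(similarity(enwiki_module, dewiki_module))
print(similarity(enwiki_module, unrelated_module))
```

A line-based ratio like this only catches near-verbatim copies. Modules with renamed variables or translated identifiers would probably need token-level or syntax-tree comparison instead.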

During this project, you get to write open source code, and you'll publish your methodology and a report that will become a subpage of the Abstract Wikipedia project page, which will be shared with community volunteers to aid in prioritization of things to turn into functions. The report can also potentially form the basis for a publishable research paper.

Possible mentor(s)

@DVrandecic, founder of Wikidata and project lead of Abstract Wikipedia at the Wikimedia Foundation
@dr0ptp4kt, Engineering Director at the Wikimedia Foundation

Microtask

To show your interest, we encourage you to try to solve this problem:

  1. Write a script that fetches all of the source code on English Wikipedia in the "Module:" namespace. Hint: we have APIs that will make your life easier and you can find good examples of how to call the APIs by using your favorite search engine or looking at some of the tools on Toolforge. Please try to limit the number of page or API fetches to one per second. We also have dumps with the same content that can be analyzed offline.
  2. Generate a summary report that includes interesting statistics like number of modules, a histogram of file sizes, and so on.
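For the summary report in step 2, a minimal sketch of the kind of statistics that could be computed is shown below, using only the standard library. The module titles and byte sizes are made-up sample data, and `size_histogram` and `summarize` are hypothetical helper names; real sizes would come from the fetched module sources.

```python
import statistics
from collections import Counter


def size_histogram(sizes: list[int]) -> dict[str, int]:
    """Bucket byte sizes into power-of-two ranges such as '1024-2047'."""
    buckets: Counter = Counter()
    for size in sizes:
        if size == 0:
            buckets[(0, 0)] += 1
            continue
        low = 1 << (size.bit_length() - 1)  # largest power of two <= size
        buckets[(low, 2 * low - 1)] += 1
    return {f"{low}-{high}": count for (low, high), count in sorted(buckets.items())}


def summarize(sizes_by_title: dict[str, int]) -> dict:
    """Collect a few headline numbers about a set of modules."""
    sizes = list(sizes_by_title.values())
    return {
        "module_count": len(sizes),
        "total_bytes": sum(sizes),
        "median_bytes": statistics.median(sizes),
        "largest": max(sizes_by_title, key=sizes_by_title.get),
        "histogram": size_histogram(sizes),
    }


# Made-up sizes for illustration only.
sample = {"Module:A": 120, "Module:B": 1500, "Module:C": 1900, "Module:D": 40000}
report = summarize(sample)
for key, value in report.items():
    print(key, value)
```

Power-of-two buckets are one choice among many; module sizes tend to be heavily skewed, so a logarithmic scale usually reads better than equal-width bins.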

Event Timeline

dr0ptp4kt renamed this task from Analyze user-created modules that generate Wikipedia infoboxes and more to Analyze community authored functions that generate Wikipedia infoboxes and more. (Sep 23 2020, 6:37 PM)
dr0ptp4kt updated the task description.
dr0ptp4kt changed the visibility from "Public (No Login Required)" to "Custom Policy". (Sep 23 2020, 6:40 PM)
dr0ptp4kt renamed this task from Analyze community authored functions that generate Wikipedia infoboxes and more to Analyze community authored functions that build Wikipedia infoboxes and more. (Sep 28 2020, 11:46 AM)
srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". (Oct 7 2020, 6:19 PM)

Hi, I am Udokaku Ugochukwu, an Outreachy applicant for the December cohort. I would like to contribute to this project. I have started the microtask for this project.
May I take up this issue?

Hi @Udoka_Ugo. Feel free to post your solution to the microtask to GitHub. Thanks!

Okay, thanks @dr0ptp4kt. Please can I get a link to the repository? I think I am almost done with the microtask.
Do I drop the link to the repository here?

Regarding the microtask there is no existing repository, as you are supposed to write a standalone script and summary report on your own.

OK, thanks Aklapper.
I have done the microtask and pushed it to a repo I created on my GitHub account.
How will it be reviewed, please?
Also, I would like to know what's next after the microtask. Thank you

Please provide links, otherwise nobody will be able to find things to review. Thanks. :)
For the general process and steps, please see https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps

Okay thanks
Here is the link to my microtask, https://github.com/UdokaVrede/Wikimedia-Microtask
Please let me know of any modifications I would be required to make in this report.
Thank you

Hi @Udoka_Ugo, you're going in the correct direction!

The opensearch API returns an incomplete set of pages, so you should try a different API such as https://www.mediawiki.org/wiki/API:Allpages . Note also that you'll need to use continuation (also known as "pagination") to get all of the results, as each API response can only return a subset of the page names. Try making a request using https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=allpages&apnamespace=828&aplimit=max and then, after getting the first result set, set apcontinue to the apcontinue value found inside the "continue" field of that result set. Repeat that pattern to get more and more records, in batches of 500 records, until a response no longer includes a "continue" field. That's the API Sandbox, but you should be able to adapt the examples into your Python script.
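The continuation pattern described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: `iter_module_titles`, `fetch_json`, and `fetch_json_over_http` are hypothetical names, the User-Agent string is a placeholder, and the one-second pause follows the rate limit requested in the microtask.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_URL = "https://en.wikipedia.org/w/api.php"


def iter_module_titles(
    fetch_json: Callable[[dict], dict], delay_seconds: float = 1.0
) -> Iterator[str]:
    """Yield every page title in namespace 828 (Module:), following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": "828",
        "aplimit": "max",
    }
    continuation: dict = {}
    while True:
        data = fetch_json({**params, **continuation})
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # no "continue" field means this was the last batch
        # Merge the whole "continue" object (it carries apcontinue) into the next request.
        continuation = data["continue"]
        time.sleep(delay_seconds)  # stay at roughly one request per second


def fetch_json_over_http(params: dict) -> dict:
    """Perform one API request with the standard library only."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "module-survey-example/0.1 (placeholder contact)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `list(iter_module_titles(fetch_json_over_http))` would walk the whole namespace over the network. Passing the fetch function in as a parameter keeps the loop easy to try out against canned responses first.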

I recommend labeling the axes on the matplotlib report, and it's worth considering if there are additional interesting statistics you might want to generate (at least if you have time to do more statistics).

Alright, thanks, I'll do that. Sorry for the late reply, I got a broken toe and had to visit the medics.

Hi, I am an Outreachy applicant and interested in joining this project. I went through the task and I will get started with it right away.
Just need a little clarification: are we all going to solve the same task, or are there others I have to look at?
Thanks :)

Edit: Ran into a few questions. I cannot figure out what pageviews means. Is it simply a count of the number of times this page was visited? If so, how can I find how and when those functions were used, to create some sort of usage statistics?

Hi @tanny411, the Microtask is the same for all. You'll want to start with the Microtask first, and we'll select the intern based on the Microtask entries.

Regarding "pageviews", yes, this refers to the number of times a page has been read. There are several ways to obtain pageview information (cf. https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews ), but that's not necessary for the Microtask.
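Although not needed for the Microtask, here is a minimal sketch of how per-article pageview counts might be requested and totalled. It assumes the per-article endpoint of the Wikimedia pageviews REST API and a response containing an `items` list with a `views` field per entry. The helper names `pageviews_url` and `total_views` are hypothetical, and the example makes no network call itself.

```python
import urllib.parse

PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(
    title: str, start: str, end: str, project: str = "en.wikipedia.org"
) -> str:
    """Build a daily per-article pageviews URL; start and end are YYYYMMDD strings."""
    article = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{article}/daily/{start}/{end}"


def total_views(response: dict) -> int:
    """Sum the 'views' field over all items in a decoded JSON response."""
    return sum(item["views"] for item in response.get("items", []))


print(pageviews_url("Albert Einstein", "20200901", "20200930"))
```

To estimate how many pageviews involve a given module, the totals for every article that uses that module would then have to be added up, which is a separate lookup not shown here.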

Thanks for checking in and please do email me directly at abaso _at_ wikimedia [.] org (trying to avoid spambots here with this email string) or comment here in case of any question.

Thanks @dr0ptp4kt. I was working with the revisions API, where I wanted to get content for all the pages using a generator. But the API doesn't seem to return revision content for most pages.
Plus I wanted to get only the latest revision content, but that seems to be possible only for single-page queries. A little help here.
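One possible way to approach this, sketched under assumptions: combine `generator=allpages` with `prop=revisions` and keep following the "continue" field, since a single response may carry content for only some of the generated pages. When no revision limit is given, the API returns the latest revision of each page. The function name `fetch_latest_sources` and the injected `fetch_json` callable are hypothetical; `formatversion=2` is used so that pages arrive as a list and slot text sits under `content`.

```python
import time
from typing import Callable


def fetch_latest_sources(
    fetch_json: Callable[[dict], dict], delay_seconds: float = 1.0
) -> dict[str, str]:
    """Map each Module: page title to the text of its latest revision."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapnamespace": "828",
        "gaplimit": "50",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }
    sources: dict[str, str] = {}
    continuation: dict = {}
    while True:
        data = fetch_json({**params, **continuation})
        for page in data.get("query", {}).get("pages", []):
            revisions = page.get("revisions")
            if revisions:  # pages without content here may be filled in by a later response
                sources[page["title"]] = revisions[0]["slots"]["main"]["content"]
        if "continue" not in data:
            break
        continuation = data["continue"]
        time.sleep(delay_seconds)  # roughly one request per second
    return sources
```

The `fetch_json` argument is any function that takes the request parameters and returns the decoded JSON response, so the loop can be exercised with canned data before pointing it at the live API.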

Is everything in this project task planned for Outreachy (Round 21) completed? If yes, please consider closing this and other related tasks as resolved. If bits and pieces are remaining, you could consider creating a new task and moving them there.

@dr0ptp4kt @LostEnchanter
I am closing this task. Some subtasks are pending (not priority), should we move these to a new task? Plus additional subtasks will be created for project improvement as well.

@Aklapper do you think it would make more sense to get a separate board for those tasks and put them in Stalled mode?

If not, I think it would be sufficient to just link to the tasks from the meta page for things that are outstanding, but mark them as Declined for now to signal that they're not getting active work. I do hope some day they can get some additional work done on them, to be sure! But also gotta be realistic with folks' current school and work assignments.

@dr0ptp4kt Not sure I know enough context here, HTH: "Declined" status means "this shall never be worked on", which does not seem to apply here; "not getting active work" is signaled by no task assignee being set; "stalled" status means the task currently cannot be acted on by anyone (staff and volunteers) for a reason that is not lack of resources in some team. I generally recommend allowing a team to [not] have some tasks on their workboard while also allowing everyone else to see tasks about a specific codebase (which is unrelated to some team), see T271292: Phab PM: Document why not to only use team project tags but also codebase project tags.