//(Microtask regarding a possible Outreachy internship in data analysis with the WMF reading team)//
Download the dataset described here:
https://lists.wikimedia.org/pipermail/wiki-research-l/2016-April/005129.html
It consists of a list of all sections from all pages on the English Wikipedia (255MB packed / 1.4GB unpacked).
[x] Generate the list of the 100 most frequent section titles, each with its total number of occurrences in the dataset. Disregard leading and trailing spaces (" ") in a section title, e.g. treat "Early life" and "Early life " as identical.
* --> [[https://github.com/zareenfarooqui/wmf-outreachy-microtask/blob/master/WMF%20-%20Outreachy%20Microtask.ipynb| code + results]]
[x] Bonus task: Generate the list of the 100 section titles that are used in the largest number of articles, each with the percentage of articles that contain such a section. Caveat: Not all pages are articles. For example, there are also talk pages and help pages (see https://en.wikipedia.org/wiki/Wikipedia:Namespace ). For the purposes of this bonus exercise, let's make the simplifying assumption that all entries in the dataset with no "/" and no "." in the page title correspond to articles, and conversely that all page titles containing a "/" or "." do not correspond to an article.
* --> [[https://github.com/zareenfarooqui/wmf-outreachy-microtask/blob/master/WMF%20-%20Outreachy%20Microtask.ipynb| code + results]]
[ ] Extended task (after completion of the above): Exact calculation of the 100 section titles that are used in the largest number of articles, for five large Wikipedias, by regenerating the dataset on PAWS instead of relying on the above simplification.
(Background: This came up in the context of the Related Articles feature, where we wanted to know how many Wikipedia articles already contain a "See also" section, because it fulfills a somewhat similar function.)
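
For orientation, the two counting tasks above could be sketched roughly as follows. This is only an illustration, not the code from the linked notebook: the function names are hypothetical, and it assumes the dataset has already been parsed into (page title, section title) pairs, since the file layout is not described here. It also assumes that every article of interest appears in the dataset at least once, so the percentage is taken over the articles seen in the data.

```python
from collections import Counter


def is_article(page_title):
    # Simplifying assumption from the bonus task: a page is an article
    # exactly when its title contains neither "/" nor ".".
    return "/" not in page_title and "." not in page_title


def top_section_titles(rows, n=100):
    # rows: iterable of (page_title, section_title) pairs (assumed shape).
    # Leading and trailing spaces are disregarded, as the task requires.
    counts = Counter(section.strip(" ") for _, section in rows)
    return counts.most_common(n)


def top_titles_by_article_share(rows, n=100):
    # Count each section title at most once per article, then express the
    # count as a percentage of all articles seen in the data.
    articles = set()
    seen = set()
    for page, section in rows:
        if not is_article(page):
            continue
        articles.add(page)
        seen.add((page, section.strip(" ")))
    if not articles:
        return []
    per_title = Counter(title for _, title in seen)
    total = len(articles)
    return [(title, 100.0 * c / total) for title, c in per_title.most_common(n)]


# Tiny made-up sample, only to show the intended behaviour.
sample = [
    ("Alan Turing", "Early life"),
    ("Alan Turing", "Early life "),
    ("Alan Turing", "See also"),
    ("Ada Lovelace", "See also"),
    ("Talk:Foo/Archive 1", "See also"),
    ("Example.com", "History"),
]
print(top_section_titles(sample, 2))
print(top_titles_by_article_share(sample, 2))
```

Holding every (page, section) pair in a set is simple but memory-hungry for a 1.4GB file; if the rows happen to be grouped by page, deduplicating one page at a time would likely be cheaper.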