
Outreachy Round 22 Microtask : Complete PAWS notebook tutorial
Closed, Resolved · Public

Description

This task is related to: https://phabricator.wikimedia.org/T276270

Overview

For this task, you've been given an outline for a notebook-based tutorial: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Outreachy%20Summer%202021/Wikipedia_Edit_Tags.ipynb.

Using your knowledge of technical writing best practices and data analysis, complete the tutorial. Your audience is people who are new to Wikimedia projects; they may require additional context to understand what the tasks are and why they are useful.

You are encouraged to add context and experiment with the structure of the document.

Set-up

  • Make sure that you can log in to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub
  • Using this notebook as a starting point, create your own notebook (see these instructions for forking the notebook to start with) and complete the functions / analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.
  • As you have questions, feel free to add comments to this task (and please don't hesitate to answer other applicants' questions if you can help).
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to your mentor with the link to your public PAWS notebook. We will try to make time to give this feedback once to anyone who would like it.
  • When you feel you are happy with your notebook, you should include the public link in your final Outreachy project application as a recorded contribution. We encourage you to record contributions as you go as well to track progress.

Additional resources

For more resources about technical writing for Wikimedia projects see:

For more resources about research and data on Wikimedia projects see:

Event Timeline

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)".
srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 29 2021, 4:18 PM

Hi @Isaac and @srodlund, I am Palak and I am applying for Outreachy this year with Wikimedia.
I would like to work on this microtask.

Hello @Isaac and @srodlund,
My name is Doan. I'm an Outreachy applicant this year.
I'm interested in the Wikimedia projects and I would like to start with https://phabricator.wikimedia.org/T276270.
Hope to receive guidance from you!

Hi and welcome everyone! Please see https://www.mediawiki.org/wiki/Outreachy/Participants and https://www.mediawiki.org/wiki/New_Developers#Communication - thanks! :)

@Aklapper agreed but is there something in particular you think is covered there that is being missed or was that a general reminder? For context, I think it's quite fine to indicate interest in the task here (#7 in the Participants page above) as some have been doing and also would emphasize point #6 about being specific when asking questions as this will help us (and other participants) in providing support. I will note that some of the suggestions in those documents do not apply -- notably, no one should assign this task to themselves as it's open for all applicants at this stage.

Ah, that was a general reminder, sorry!

Ah, that was a general reminder, sorry!

No worries -- thanks for clarifying!

Hi @Isaac,
I just started with the task:
https://public.paws.wmcloud.org/User:Palak199/Wikipedia_Edit_Tags.ipynb
Here, in In[18] I tried to get the data with that particular tag, but I am not sure if it is the right way.
Would you please guide me on it?

I tried to get the data with that particular tag, but I am not sure if it is the right way. Would you please guide me on it?

@Palak199 can you ask a more specific question? In particular, because that public link will update as you continue to code, copying in code snippets where relevant is also useful.

@Isaac yes, sure!
So I was working on this TODO: Loop through TAG_DESC_DUMP_FN and identify the ctd_id associated with `mobile edit`.
So far I have understood up to this point:

import gzip  # necessary for decompressing dump file into text format
import mwxml  # necessary for processing Wikipedia XML history dumps
LANGUAGE = 'simplewiki'
SITENAME = LANGUAGE.replace('wiki', '.wikipedia')
DUMP_DIR = "/public/dumps/public/{0}/latest/".format(LANGUAGE)
TAG_DESC_DUMP_FN = '{0}-latest-change_tag_def.sql.gz'.format(LANGUAGE)
# TODO: Loop through TAG_DESC_DUMP_FN and identify the ctd_id associated with `mobile edit`
# For additional tags and extended descriptions on Simple English Wikipedia, see: https://simple.wikipedia.org/wiki/Special:Tags
# The Python gzip library will allow you to decompress the file for reading: https://docs.python.org/3/library/gzip.html#gzip.open

!zcat "{DUMP_DIR}{TAG_DESC_DUMP_FN}" | head -n 200 | sed 's/(/\n &/g'| grep "mobile"

This is the corresponding output.

(5,'mobile edit',0,205701),
(6,'mobile web edit',0,198656),
(26,'mobile app edit',0,6087),
(107,'advanced mobile edit',0,9954),

Is this the right way?
Also, I am not sure how to get only the IDs. In the next TODO we have to use these IDs to filter data from another table.

Hey @Palak199 thanks for the additional details. A few thoughts that hopefully help:

  • It's the mobile edit tag in particular that I want you to focus on -- i.e. only tag ID #5 so you can ignore mobile web edit, mobile app edit, and advanced mobile edit. If you want to understand the differences between these, you can look at the descriptions on Simple Wikipedia's tag page: https://simple.wikipedia.org/wiki/Special:Tags
  • The revision IDs associated with any particular tag can be found in the file whose name is stored in the TAG_DUMP_FN parameter. You'll have to write some basic code to loop through that file and extract the revision IDs associated with the mobile edit tag. You can see details of that file earlier in the notebook to help you understand how this file is formatted.
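As a purely illustrative sketch (not the required solution), the change_tag_def rows pasted earlier in this thread -- tuples like (5,'mobile edit',0,205701) -- can be matched with a small regex once the gzipped SQL has been read as text. The assumed column order (ctd_id, ctd_name, ctd_user_defined, ctd_count) is taken from that pasted output; confirm it against the CREATE TABLE statement in your own dump.

```python
import re

def find_tag_id(sql_text, tag_name):
    """Return the ctd_id for tag_name, or None if not found.

    Assumes rows shaped like (ctd_id,'ctd_name',ctd_user_defined,ctd_count),
    as in the output pasted above.
    """
    row = re.compile(r"\((\d+),'([^']*)',\d+,\d+\)")
    for ctd_id, name in row.findall(sql_text):
        if name == tag_name:
            return int(ctd_id)
    return None

# In the notebook you would stream the dump rather than use a literal string:
#   with gzip.open(DUMP_DIR + TAG_DESC_DUMP_FN, mode='rt', errors='replace') as f:
#       sql_text = f.read()
sample = "(5,'mobile edit',0,205701),(6,'mobile web edit',0,198656)"
print(find_tag_id(sample, 'mobile edit'))  # 5
```

The same idea (scan INSERT rows, keep the column you need) carries over to extracting revision IDs from the change_tag dump.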

Hey all, I'm Christalee and I also plan to work on this task! I've started going through the notebook and when I reach this cell, my code takes a very long time to run:

# TODO: Loop through the history dump and record how many mobile vs. non-mobile edits were made in each year

Can I get an estimate of how long an optimized solution should take? Or suggestions on how to approach this problem? I'm willing to post what I have so far if that would help. Thanks!

Can I get an estimate of how long an optimized solution should take? Or suggestions on how to approach this problem? I'm willing to post what I have so far if that would help.

Welcome @Christalee_b -- a few thoughts:

  • Looping through the history dump for simplewiki should take around 30 minutes.
  • It'll go a good bit faster as well (~20 minutes) if you only process pages in the article namespace (0). Every page on Wikipedia is associated with a namespace and this information is surfaced by the mwxml library so you can filter on it. This will discard pages with very long edit histories like some talk pages.
  • If it's still taking a long time, then there's probably something in your code that is slowing down the processing. My suggestion for figuring out what's happening is to calculate how long it takes to process e.g., the first 500 or 1000 pages. Then go through the code and comment out parts and rerun and compare the time. This should help you quickly identify what part of the code is so slow (there are other, more formal ways too) and then you can think about why it's slow and how you could speed it up.
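One way to follow that last suggestion is to time only the first few hundred pages with itertools.islice. This is just an illustrative sketch; the `process` callback is a stand-in for whatever per-page work your notebook does:

```python
import itertools
import time

def time_first_n(pages, process, n=500):
    """Run `process` on the first n pages and return the elapsed seconds."""
    start = time.perf_counter()
    for page in itertools.islice(pages, n):
        process(page)
    return time.perf_counter() - start

# e.g. time_first_n(dump, count_edits, n=1000), then comment out parts of
# count_edits and rerun to see which step dominates the runtime.
```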

Hi everyone, my name is Zhansaya. I have a question regarding the documentation: should I explain every line of my code or is it OK to give a detailed summary of 10-15 lines? Thanks in advance!

Hi everyone, my name is Zhansaya. I have a question regarding the documentation: should I explain every line of my code or is it OK to give a detailed summary of 10-15 lines? Thanks in advance!

Welcome Zhansaya! There are no strict guidelines, but generally a concise summary for a given cell in your notebook is sufficient. You can always add Markdown cells if you need more as well. Most style guides essentially say that you don't want to describe the code -- i.e. you can assume the person accessing the tutorial knows basic Python and can figure out what each line does -- but do give a sense of the goal of the code or any unclear choices you might have made in the code.

Checking and not checking namespace 0 gives two different edit counts. I believe this is normal behavior, but I just want to make sure I'm not missing anything important (i.e., is it acceptable to analyze edits only in the article namespace?).

Hi @Isaac!

What is the reason for the revision id (ct_rev_id) field in the change table being NULL? I was under the impression that each edit would necessarily be associated with a revision id, but this is not the case for ~3.8% of the mobile edits.

@Isaac

Another question, not all that relevant for the microtask but would help me understand the larger project scope and ecosystem -

How would someone interested in analysing or otherwise building a pipeline around wiki data normally interact with it? Is accessing it through db dumps the typical way, or would it be more common to connect directly to a db through Toolforge?

Checking and not checking namespace 0 gives two different edit counts. I believe this is normal behavior, but I just want to make sure I'm not missing anything important (i.e., is it acceptable to analyze edits only in the article namespace?).

Yep -- it would not be surprising that the numbers would be different (as many editors just edit articles namespace 0 and don't bother with talk pages or other namespaces). Feel free to just proceed w/ namespace 0 for your analysis but if you still have the data on the difference, it could be interesting to include.

What is the reason for the revision id (ct_rev_id) field in the change table being NULL? I was under the impression that each edit would necessarily be associated with a revision id, but this is not the case for ~3.8% of the mobile edits.

@Slst2020 interesting -- you'd have to investigate further probably to determine what exactly is going on but it's likely one of these two things:

  • How edits etc. are tracked has changed over time and it might just be that these are older edits and so the data isn't all there.
  • Sometimes revisions are deleted. You can see an example of this on this page (the 6 March 2021‎ edit by ANANTSAINILAST). This might result in the revision ID being removed from the history but not comments/tags etc.

How would someone interested in analysing or otherwise building a pipeline around wiki data normally interact with it? Is accessing it through db dumps the typical way, or would it be more common to connect directly to a db through Toolforge?

Good question -- it varies (and others probably have different perceptions):

  • The databases on PAWS / Toolforge are kept very close to the current state of the wikis so if you need very recent data, they are good for that. They can handle larger queries decently well but if you're trying to do fancy things with the revision text or really large joins, they might time out.
  • The APIs are very quick, current, and handle small queries quite well -- e.g., all revisions for a page or by a given user -- but aren't good for collecting data in bulk.
  • The dumps are good if you want to process *all the data* but don't mind if it takes a little while. They also keep revisions in order by page so if you're examining how the history of a page has changed, they make that easy. Users of PAWS / Toolforge are less likely to use the dumps because the local database access handles many of those needs, but many researchers download the dumps locally and work with them offline so these tutorials can be very helpful for them.

I've reached the section working with mwapi and I'm trying to pull down revisions from a page with many edits, requiring use of the Session continuation feature as follows:

params = {'action': 'query',
          'prop': 'revisions',
          'titles': 'COVID-19 pandemic',
          'rvprop': 'ids|timestamp|user|tags',
          'rvdir': 'newer',
          'rvstart': datetime.datetime(2020, 1, 1),
          'rvend': datetime.datetime(2021, 1, 1)}

query = session.get(params)
query_c = session.get(query_continue=query['continue'], continuation=True, params=params)

for x in query_c:
    print(x)

This results in a JSONDecode error, possibly because an HTML (error?) page is being returned instead of JSON. Has anyone tried this successfully? Relevant docs

Hello @Isaac @srodlund and everyone! My name is Joan, and I am a 2021 Outreachy applicant. I am interested in working on this Wikimedia microtask. This is my first time working on open source. I hope to collaborate well with everyone and learn a lot along the way.

Hi Isaac and Sarah! I'm Daneshwari and I'm looking to contribute to this project for Outreachy 2021. It seems I started a bit late, but I will contribute to my full strength. Thanks!

I'm currently working on the TODO "Loop through TAG_DESC_DUMP_FN and identify the ctd_id associated with mobile edit" and I have completed the task using the .sql file via approach 1 below. But I later realized that my approach might be wrong.

  1. Using .sql files, regex, and basic Python to find the 'mobile edit' tag.
  2. Actually creating a database and all the tables from the .sql files and querying them, since researchers who work on the data will access the database. So the code with the queries has to be there in the tutorial.

I wanted to confirm which approach we have to follow for our documentation, so that it is followed throughout the tutorial. My question might seem basic, but I wanted to confirm before proceeding further.

If the tags field in the JSON is empty, should I count the revision as a non-mobile edit or just ignore it?

This results in a JSONDecode error, possibly because an HTML (error?) page is being returned instead of JSON. Has anyone tried this successfully?

@Christalee_b the documentation isn't great for that but here's a better example of how to do continuation that will hopefully solve your problem: https://github.com/mediawiki-utilities/python-mwapi#query-with-continuation
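For reference, with python-mwapi the continuation loop looks roughly like the sketch below: you pass `continuation=True` along with the query parameters and iterate over the resulting generator, and the library follows the 'continue' tokens for you. The parameter-building helper here is only illustrative, not part of any library:

```python
import datetime

def revision_params(title, start, end):
    """Build the query parameters for listing a page's revisions."""
    return {'action': 'query',
            'prop': 'revisions',
            'titles': title,
            'rvprop': 'ids|timestamp|user|tags',
            'rvdir': 'newer',
            'rvstart': start.isoformat(),
            'rvend': end.isoformat()}

# Network usage (sketch; fill in your own contact info for the user agent):
# session = mwapi.Session('https://en.wikipedia.org', user_agent='<contact>')
# params = revision_params('COVID-19 pandemic',
#                          datetime.datetime(2020, 1, 1),
#                          datetime.datetime(2021, 1, 1))
# for portion in session.get(continuation=True, **params):
#     for page in portion['query']['pages'].values():
#         for rev in page.get('revisions', []):
#             print(rev['revid'], rev['timestamp'], rev['tags'])
```

The key difference from the snippet above is that you never touch `query['continue']` yourself; the generator handles it.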

My name is Joan, and I am a 2021 Outreachy applicant. I am interested in working on this Wikimedia microtask.

Welcome and good luck @Joanastasia

  1. Using .sql files, regex, and basic Python to find the 'mobile edit' tag.
  2. Actually creating a database and all the tables from the .sql files and querying them, since researchers who work on the data will access the database. So the code with the queries has to be there in the tutorial.

@DaneshwariK welcome. If you had actually done approach #2, that would have been fine, but you are right that we were expecting something like approach #1.

If the tags field in the JSON is empty, should I count the revision as a non-mobile edit or just ignore it?

@Zhansayaa yes, you can assume that was a desktop edit. There are no explicit tags for desktop so it's the default.

@Isaac Thank you so much! I can't believe I misread the documentation so thoroughly D:

Are we allowed to install additional modules using the PAWS terminal?

I can't believe I misread the documentation so thoroughly D:

No worries -- this is why tutorials are useful :)

Are we allowed to install additional modules using the PAWS terminal?

@Zhansayaa sure. You should be able to pip install packages via the notebook and have access to them in your notebook.

@Isaac and @srodlund, if I email you the finished notebook and a draft of my final application by April 16, would that give you enough time to get back to me with feedback in time before the final deadline?

@Isaac and @srodlund, if I email you the finished notebook and a draft of my final application by April 16, would that give you enough time to get back to me with feedback in time before the final deadline?

@Slst2020 Yes. This should be enough time for me. I'll be looking at it more from a technical documentation perspective.

Hi everyone,

First of all, thanks for your interest in this project! I wanted to share some additional resources about technical documentation that you may find helpful when you are designing your tutorial.

We don't yet have a tutorial template just for Jupyter notebooks. This is something we will work on (or maybe include in the scope of the Outreachy project) but for now, this is okay. We'd like to see how you would approach formatting your tutorials, so we can better understand your approach to documentation overall.

While you're working on your tutorial, you may find some of the following resources useful:

Finally, here are some notebook tutorials for a variety of Wikimedia projects. It may be helpful to take a look at them to see how they are formatted.

@srodlund Thanks for all the links and resources, that's very helpful.

Does anyone know if Wikimedia SQL dumps always contain a single table, or if there can be several in the same file?

Does anyone know if Wikimedia SQL dumps always contain a single table, or if there can be several in the same file?

@Slst2020 the SQL dumps should only ever have a single table in them though obviously some tables are small while others are much larger.


Thanks @srodlund for these resources. They are really helpful!

Does anyone know if Wikimedia SQL dumps always contain a single table, or if there can be several in the same file?

@Slst2020 the SQL dumps should only ever have a single table in them though obviously some tables are small while others are much larger.

Regarding this, I also wanted to confirm one thing. When I'm working with this file, DUMP_DIR+TAG_DUMP_FN, I encounter more than just 'INSERT INTO' statements. Is it only me, or am I working with the correct file?

doubt1.JPG (878×1 px, 430 KB)

Hi @DaneshwariK! SQL dump files contain everything needed to restore a db from a file, including INSERT INTO statements but also information about the schema and other metadata. Not quite sure what you mean by "I encounter more than 'INSERT INTO' statements"? If you refer to the metadata, then yes, this is normal.

@Isaac, in the 'Example Analyses of Edit Tag Data' part, what is the dataset to read and plot? Where can I find it?
If you have any example of this, let me know.
Thanks!

All: I forgot to mention, but there are holidays from today (Friday) through Monday, so responses will be a bit slower than usual. I'll do my best to check occasionally though and respond if it is quick.

What is the dataset to read and plot? Where can I find it? If you have any example of this, let me know.

@Doanvd it is up to you to gather this dataset using the methods you build in the prior sections of the notebook. It is open-ended so feel free to work with the mobile tag edits or find another tag to examine that you think might be interesting. Hope that helps.

Hi everyone,
I was trying to open the XML file with mwxml:

dump = mwxml.Dump.from_file(open(DUMP_DIR + HISTORY_DUMP_FN))

But I get the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How do I find the encoding ?

I believe the file is .gz so you need to open it with gzip.open().

I believe the file is .gz so you need to open it with gzip.open().

Yes! this works like a charm. Thank you :)

file = gzip.open(DUMP_DIR+HISTORY_DUMP_FN)
dump = mwxml.Dump.from_file(file)

  • While looping through the data fetched from the API, I can see multiple tags associated with a particular revision ID, e.g.:

'tags': ['mobile edit', 'mobile web edit', 'mw-reverted']

Now, when I compare the mobile tags, should I include revisions with 'mobile edit' plus any other tags?

  • For the TODO "get all edits (and tags) for the article you chose for all of 2020 via the API": if I store the revision ID, comments, and tags for every revision, would that be sufficient?

Please guide me on these. Thanks :-)

If I store the revision ID, comments, and tags for every revision, would that be sufficient?

@Palak199 you're comparing the API results with the data you gathered from the dumps, so I'd just gather whatever information you stored from the dumps so that you can directly compare the two.

@Palak199 you're comparing the API results with the data you gathered from the dumps, so I'd just gather whatever information you stored from the dumps so that you can directly compare the two.

Ok, noted! I had another question too. In the task where we have to choose an article and find its mobile edits, I saw that there is a history dump file in the server directory and we have to loop through the whole of it to find a particular title. Is there any way we can convert an article like this one https://simple.wikipedia.org/w/index.php?title=india&action=history&year=2020 into its own independent XML dump file and use it?

Is there any way we can convert an article like this one https://simple.wikipedia.org/w/index.php?title=india&action=history&year=2020 into its own independent XML dump file and use it?

@Palak199 I've never tried it so don't know if it would work, but there is an export tool that might do this: https://simple.wikipedia.org/wiki/Special:Export
If you use this in your notebook, just make sure you explain clearly how to reproduce the results.

Hello everyone! I am Nirali, an applicant for Outreachy and I wanted to contribute to this project as well. :)

However, I am having some doubts regarding looping through the history dump and finding the revisions of a particular article.
The following is my code snippet: ("ids" - list of all the revision ids associated with mobile-edits)

dump = mwxml.Dump.from_file(gzip.open(DUMP_DIR + HISTORY_DUMP_FN))

count_dump = 0
for page in dump:
    if page.title == 'Bill Gates':
        print("Page title: " + page.title)
        for revision in page:
            if str(revision.timestamp).split('-')[0] == '2020':
                if revision.id in ids:
                    count_dump += 1
print("Total mobile edit count = %s" % count_dump)

The output is:

Page title: Bill Gates
Page title: Bill Gates
Page title: Bill Gates
Total mobile edit count = 0

Problem:

  1. In the code, I print the page's title every time a page with the title 'Bill Gates' is encountered. There happen to be multiple pages with the same title, and all of them seem to be different pages since the revision IDs associated with them differ.
  2. Also, the count of mobile edits is 0 even though it was non-zero when I found it using the API.

Both of these problems occurred for all the 4 to 5 articles that I tried. Can someone help me figure out where I might be going wrong?

And thank you @srodlund for the great resources! They have been really informative and helpful.

Thank you, in advance :)

Hi @Nizz009! For a start, you can try printing the page IDs to see whether these are the same or different pages, and then use those page IDs to check which revision IDs each page contains. I feel this will help solve the problem to some extent.

Hello @DaneshwariK! Thank you for the suggestion. However, both the page IDs and the revision IDs associated with each page are different.
The following is just a part of the output that I received (the numbers below each page ID are the revision IDs in that page):

Page id: 12548
48407
1207888
2713469
2738282
3579930
... (continued)
Page id: 107672
20777
23700
25149
25150
25151
... (continued)
Page id: 413520
4534153

(Apparently, the last page contains only one revision ID)

@Nizz009

Both of these problems occurred for all the 4 to 5 articles that I tried. Can someone help me figure out where I might be going wrong?

You can check whether the datatype of the items in your ids list matches that of revision.id.

In my case, a mismatched datatype was the reason for getting 0 as the result.
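A minimal illustration of that pitfall: revision IDs scraped out of the SQL dump's text are strings, while mwxml gives you revision.id as an int, so membership tests silently fail until you convert:

```python
# IDs scraped from the SQL dump text come out as strings:
ids = {'4534153', '48407'}
print(4534153 in ids)            # False: an int never equals a str

# Converting once up front fixes the comparison:
ids = {int(i) for i in ids}
print(4534153 in ids)            # True
```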

@Isaac and @srodlund
TODO: Loop through the history dump and record how many mobile vs. non-mobile edits were made in each year
Just to confirm, in this TODO are we supposed to output something like:

in year 2001:
  mobile edits = 12
  non-mobile = 1

in year 2002:
  mobile edits = 1
  non-mobile = 11

and so on, till 2020?

Hi @Palak199! Thank you! I forgot to convert it to a string even though I did so in the code snippet above. That was silly of me. :(

But shouldn't each article be allotted a single page? I was under the impression that there was only one page for each article, containing the data of all its revisions.

Can someone help me figure out where I might be going wrong?

@Nizz009 I forgot to mention namespaces in the tutorial template, but this page will explain how they work: https://en.wikipedia.org/wiki/Wikipedia:Namespace
In short -- articles with the same "title" can exist in different namespaces where they serve different purposes. You should focus on namespace 0, which is what we think of as traditional Wikipedia articles. Namespace is an attribute in the XML dump that you can easily access and filter on. In general, when trying to figure these things out, one trick is that you can easily figure out what article is associated with a page ID like this: if the page ID is 413520, then you can see the associated article by going to: https://simple.wikipedia.org/wiki/?curid=413520. Good catch!
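In code, that namespace filter is a one-line check inside the dump loop. A small sketch (any object with a `namespace` attribute works, which is what mwxml's pages provide):

```python
def article_pages(dump, namespace=0):
    """Yield only the pages in the given namespace (0 = articles)."""
    for page in dump:
        if page.namespace == namespace:
            yield page

# Usage in the notebook (sketch):
# dump = mwxml.Dump.from_file(gzip.open(DUMP_DIR + HISTORY_DUMP_FN))
# for page in article_pages(dump):
#     ...
```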

Just to confirm, in this todo are we supposed to output something like

@Palak199 yes, # of mobile vs. non-mobile edits by year.
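One hedged sketch of that tally, using a defaultdict keyed by year. Here `revisions` is simplified to (rev_id, year) pairs standing in for what you would pull from each mwxml revision's `id` and `timestamp`, and `mobile_rev_ids` is the set you built from the change_tag dump:

```python
from collections import defaultdict

def edits_by_year(revisions, mobile_rev_ids):
    """Count mobile vs. non-mobile edits per year."""
    counts = defaultdict(lambda: {'mobile': 0, 'non-mobile': 0})
    for rev_id, year in revisions:
        kind = 'mobile' if rev_id in mobile_rev_ids else 'non-mobile'
        counts[year][kind] += 1
    return dict(counts)

print(edits_by_year([(1, 2001), (2, 2001), (3, 2002)], {1}))
# {2001: {'mobile': 1, 'non-mobile': 1}, 2002: {'mobile': 0, 'non-mobile': 1}}
```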

I forgot to mention namespaces in the tutorial template, but this page will explain how they work: https://en.wikipedia.org/wiki/Wikipedia:Namespace

Thank you very much for the clarification and the resource! It was really helpful.

Hi @Isaac and @srodlund, I'm about to finish my tutorial notebook. The only thing left is: how can I make this content readable? This is a revision diff from the Craig Noone article:
https://en.wikipedia.org/w/index.php?title=Craig_Noone&diff=933163076&oldid=930870273

{{Use dmy dates|date=June 2013}}\n{{Use British English|date=June 2013}}\n{{infobox football biography\n| name = Craig Noone\n| image = CraigNoone.jpg\n| image_size = 175\n| caption = Noone playing for [[Cardiff City F.C.|Cardiff City]] in 2012\n| fullname = Craig Stephen Noone<ref name=Hugman>{{Hugman|24751|accessdate=12 May 2019}}</ref>\n| birth_date = {{birth date and age|1987|11|17|df=y}}<ref name=Hugman/>\n| birth_place = [[Kirkby]], England<ref>{{cite book|editor1-first=Glenda|editor1-last=Rollin|editor2-first=Jack|editor2-last=Rollin|title=Sky Sports Football Yearbook 2012\u20132013|year=2012|publisher=[[Headline Publishing Group|Headline]]|location=London|isbn=978-0-7553-6356-8|page=440|edition=43rd}}</ref>\n| height = {{height|m=1.78}}\n| position = [[Midfielder#Winger|Winger]]\n| currentclub = [[Melbourne City FC|Melbourne City]]\n| clubnumber = 11\n| years1 = 2005\u20132007 | clubs1 = [[Skelmersdale United F.C.|Skelmersdale United]] | caps1 = | goals1 =\n| years2 = 2007\u20132008 | clubs2 = [[Burscough F.C.|Burscough]] | caps2 = 24 | goals2 = 4\n| years3 = 2008 | clubs3 = [[Southport F.C.|Southport]] | caps3 = 1 | goals3 = 0\n| years4 = 2008\u20132011 | clubs4 = [[Plymouth Argyle F.C.|Plymouth Argyle]] | caps4 = 55 | goals4 = 5\n| years5 = 2009 | clubs5 = \u2192 [[Exeter City F.C.|Exeter City]] (loan)

I got this while querying using the Revisions API. This is not HTML, so an HTML parser (via BeautifulSoup) can't be used. So how can I get the text part and turn it into a readable format?

hey @DaneshwariK

I got this while querying using the Revisions API. This is not HTML, so an HTML parser (via BeautifulSoup) can't be used. So how can I get the text part and turn it into a readable format?

@Isaac shared this documentation earlier in this thread. I used the mwapi library and the example quoted in that documentation for reference:
https://github.com/mediawiki-utilities/python-mwapi#query-with-continuation
Hope it helps :')

The only thing left is: how can I make this content readable? This is a revision diff from the Craig Noone article:

@DaneshwariK I'm not sure if this is what you're asking, but the Compare API will provide you with HTML diffs for edits -- e.g., https://en.wikipedia.org/w/api.php?action=compare&fromrev=930870273&torev=933163076
The diff that you pasted there is wikitext, which is the raw markup used for writing Wikipedia articles. It must then be parsed into HTML to be "readable". You can also just screenshot diffs in the visual mode on Wikipedia if that's easier and what you want -- e.g., https://en.wikipedia.org/w/index.php?title=Craig_Noone&diff=933163076&oldid=930870273&diffmode=visual
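If you do want plain text rather than HTML diffs, a proper wikitext parser such as mwparserfromhell (`mwparserfromhell.parse(text).strip_code()`) is the usual tool. Purely as a toy illustration, a crude stdlib-only pass can drop templates and keep link labels; real wikitext (nested templates, refs, tables) will defeat these regexes:

```python
import re

def crude_strip(wikitext):
    """Very rough wikitext -> text: drop {{templates}}, keep [[link|labels]].

    Toy illustration only; use a real parser like mwparserfromhell for
    actual analysis.
    """
    text = re.sub(r'\{\{[^{}]*\}\}', '', wikitext)                  # {{...}} -> ''
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)   # [[A|B]] -> B
    return text

print(crude_strip("[[Kirkby]], England {{height|m=1.78}}").strip())
# Kirkby, England
```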

@DaneshwariK I'm not sure if this is what you're asking, but the Compare API will provide you with HTML diffs for edits -- e.g., https://en.wikipedia.org/w/api.php?action=compare&fromrev=930870273&torev=933163076

Thanks for this! I'll surely refer to it.

  1. I also had one thing in mind. Since there are already Wikipedia articles in English on en.wikipedia, and people searching for articles will frequently come across en.wikipedia articles in search engine results, why was simple.wikipedia needed?
  2. Also, in the history XML dump that has the revision history of simplewiki items (from where I got the item titles with their namespaces), when I used the API to get revision tags, I found that the revision history was sparse, particularly for the 'mobile edit' tag. So simplewiki items, if used for analysis, would not yield good results.

     Where can I find results about which wiki has the most edits per page, or which wiki is most frequently visited and edited?

so why was simple.wikipedia required?

We went with Simple Wikipedia because the size is much smaller than English Wikipedia so it was more reasonable that you could process it via these notebooks.

Where can I find results about which wiki has the most edits per page, or which wiki is most frequently visited and edited?

This page has links to a bunch of data on wiki sizes etc.: https://meta.wikimedia.org/wiki/List_of_largest_wikis

We went with Simple Wikipedia because the size is much smaller than English Wikipedia so it was more reasonable that you could process it via these notebooks.

Got it. Even the notebook on PAWS that I'm working on takes time to load because of its size.

This page has links to a bunch of data on wiki sizes etc.: https://meta.wikimedia.org/wiki/List_of_largest_wikis

Thanks! That's really helpful.

Hi @srodlund & @Isaac, could you please confirm that you both got my email with the notebook?

Cheers.

Hi @srodlund & @Isaac, could you please confirm that you both got my email with the notebook?

@Slst2020 thanks for letting us know. It went to my spam folder and I had not checked. I will try to get you the review by Monday.

Greetings everyone, my name is Tambe Tabitha and I am contributing to this task with you.

When I loop through the History dump, the notebook produces this error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

This suggestion on Stack Overflow says that I should change the limit like so:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

@srodlund @Isaac are we allowed to make this change? If so, where do we input it?

I have noticed that when I break the for loop after processing just one page, it works as expected; it only fails when I let the program run on the full dataset.

When I loop through the History dump, the notebook produces this error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

In my case, removing excess/bulky print statements helped. It still takes 15-20 mins to run but doesn't exceed the data rate limit. Though I don't know about changing rate_limit_window

In my case, removing excess/bulky print statements helped. It still takes 15-20 mins to run but doesn't exceed the data rate limit. Though I don't know about changing rate_limit_window

Thank you for that suggestion. I actually read about that here but my loop does not have any print statements. I only print at the end of the loop.

What I do is that for each page in the dump, I filter for namespace 0 and then loop through the revisions of that page, extracting the year value from the timestamp and appending it to a sequence.
I only print when the loop is fully done. The code does run for a long time before producing that error.

In my case, removing excess/bulky print statements helped. It still takes 15-20 mins to run but doesn't exceed the data rate limit. Though I don't know about changing rate_limit_window

Thank you for that suggestion. I actually read about that here but my loop does not have any print statements. I only print at the end of the loop.

What I do is that for each page in the dump, I filter for namespace 0 and then loop through the revisions of that page, extracting the year value from the timestamp and appending it to a sequence.
I only print when the loop is fully done. The code does run for a long time before producing that error.

This has been resolved now. Thank you.
What happened was that I tried to print the full results of the loop all at once at the end of the program, and it contained too many values. It was wiser to loop over the result instead. Thanks again.
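To make that fix concrete: instead of printing one huge list at the end, you can aggregate as you go and print only a small summary, which keeps the output well under the IOPub rate limit. A minimal sketch, assuming timestamps in the dump's ISO-8601 format (the sample data here is made up):

```python
from collections import Counter

def revisions_per_year(timestamps):
    """Count revisions per year from ISO-8601 timestamps
    like '2020-01-31T12:00:00Z' (the format used in the XML dumps)."""
    counts = Counter()
    for ts in timestamps:
        counts[ts[:4]] += 1  # the year is the first four characters
    return counts

# Made-up sample standing in for timestamps pulled from the dump:
sample = ["2019-05-01T10:00:00Z", "2019-06-02T11:30:00Z",
          "2020-01-15T08:45:00Z"]

by_year = revisions_per_year(sample)
for year in sorted(by_year):
    print(year, by_year[year])  # a few summary lines, not one giant list
```

The same pattern applies to any per-revision statistic: update a small accumulator inside the loop and print only the aggregate once the loop finishes.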

Hi @srodlund & @Isaac,

A couple questions regarding the final application -

On the Outreachy 'final application' form there's the "(Optional) Community-specific Questions" section - do you have any specific requirements for this part?

Also, should we create an application on Phabricator or is that just for GSoC? I've heard some mentors on other projects recommend it to Outreachy applicants even though it's not an official requirement.

I'd like to write my application in Google docs or Markdown format - would it be ok to do that and add a link? It'd be easier to read due to better formatting, possibility to add images, etc.

Thanks!

@Slst2020

A couple questions regarding the final application -

On the Outreachy 'final application' form there's the "(Optional) Community-specific Questions" section - do you have any specific requirements for this part?

We do not have any Community-specific questions.

Also, should we create an application on Phabricator or is that just for GSoC? I've heard some mentors on other projects recommend it to Outreachy applicants even though it's not an official requirement.

You do not need to create an application on Phabricator. Just do the application through Outreachy.

I'd like to write my application in Google docs or Markdown format - would it be ok to do that and add a link? It'd be easier to read due to better formatting, possibility to add images, etc.

We would prefer that you use the application form through Outreachy rather than linking out to a Google doc.

@srodlund: Outreachy Round 22 is over. Can this task be resolved?

@srodlund: Could you please answer the last comment? Thanks in advance!

srodlund claimed this task.