
Outreachy Application Task: Tutorial for Wikipedia Page Protection Data
Open, Needs Triage, Public

Description

Overview

Create your own PAWS notebook tutorial (see set-up below) that completes the TODOs provided in this notebook. The full Outreachy project will involve more comprehensive coding than what is being asked for here (and some opportunities for additional explorations as desired), but this task will introduce some of the APIs and concepts that will be used for that full task and give us a sense of your Python skills, how well you work with new data, and how well you document your code. We are not expecting perfection -- give it your best shot! See this notebook on working with Wikidata data for an example of what a completed notebook tutorial might look like.

Set-up

  • Make sure that you can login to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub
  • Using this notebook as a starting point, create your own notebook (see these instructions for forking the notebook to start with) and complete the functions / analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.
  • As you have questions, feel free to add comments to this task (and please don't hesitate to answer other applicants' questions if you can help).
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to your mentor with the link to your public PAWS notebook. We will try to make time to give this feedback at least once to anyone who would like it.
  • When you feel you are happy with your notebook, you should include the public link in your final Outreachy project application as a recorded contribution. You may record contributions as you go as well to track progress.

Event Timeline

There are a very large number of changes, so older changes are hidden.
Arfat2396 added a comment. Edited Fri, Oct 9, 2:52 PM

Does anyone know which library !zcat (used in line 14) belongs to? I can't find it in the gzip documentation: https://docs.python.org/3/library/gzip.html?highlight=gzip#

@Arfat2396 !zcat is a Jupyter built-in shell command. It is the same zcat utility found on Linux, mainly used for viewing compressed files without decompressing them first.
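For instance, previewing the dump either through the shell or with the gzip module directly might look like this sketch (assuming the DUMP_DIR and DUMP_FN variables defined in the notebook; illustrative only):

# Shell version inside a Jupyter cell: peek at the first lines without decompressing to disk
!zcat "{DUMP_DIR}{DUMP_FN}" | head -5

# Pure-Python equivalent using the gzip module
import gzip

with gzip.open(DUMP_DIR + DUMP_FN, 'rt') as fin:   # 'rt' reads the compressed file as text
    for i, line in enumerate(fin):
        print(line[:200])                           # only show the first 200 characters per line
        if i >= 4:
            break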

Arfat2396 added a comment. Edited Fri, Oct 9, 2:58 PM

So I can use it in Jupyter with the same rules as in Linux: https://www.howtoforge.com/linux-zcat-command/ Thanks!

Yes, you can, and you're welcome.

Perfect evening.
I'm really confused and need someone to clarify cell 5 for me.
For the example that loops through all pages and extracts data, how do I do that? The Python docs don't really give an example.

Isaac added a comment. Fri, Oct 9, 8:30 PM

Glad to see the questions and thanks @Amamgbu for helping out.

For the example that loops through all pages and extracts data, how do I do that?

@Thulieblack I'd suggest checking out the example notebook I provided. It won't give you everything you need but hopefully you'll be able to figure out how to adapt the code to the data format in the notebook you're working on.

Additionally for all: just a heads-up that as we enter the weekend, you should expect responses from us (the mentors) to slow down and possibly have to wait until the week starts again. Monday is also a holiday in the United States (Indigenous Peoples' Day) where I am based, so it is quite likely that responses will be slow then too. In the meantime, you should all still feel free to jump in and help each other. Thanks!

Amamgbu added a comment. Edited Fri, Oct 9, 8:31 PM

For the example that loops through all pages and extracts data, how do I do that?

You could check out this tutorial on how to write for loops in Python (https://www.w3schools.com/python/python_for_loops.asp). I think the idea is to first clean the data before you try to loop over it. I would be happy to help you out with any issue you face along the way.

Thulieblack reassigned this task from Isaac to Amamgbu. Sat, Oct 10, 6:30 AM
This comment was removed by Thulieblack.

Thanks so much for the pointers, let me try and work it out.

Thulieblack removed Amamgbu as the assignee of this task. Sat, Oct 10, 6:33 AM
Aklapper assigned this task to Isaac. Sat, Oct 10, 7:33 AM
Arfat2396 added a comment. Edited Sat, Oct 10, 8:03 AM

I'm trying to read the uncompressed data using the gzip library, but every time I run the cell I get this error. Does anyone know what's causing it?
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Update: I found a workaround by iterating and displaying selective parts of data.

For the example that loops through all pages and extracts data, how do I do that?

Hey, one way you can do this is to extract the lines in the SQL dump file that contain "INSERT INTO ... VALUES" and loop over each record in them. If anyone else has a better approach, please feel free to suggest it :)
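A rough sketch of that approach is below. Treat it as illustrative only: the column order is an assumption based on the tuples shown later in this thread, and simple string splitting can mis-handle values that themselves contain commas or '),(' sequences.

import gzip

records = []
with gzip.open(DUMP_DIR + DUMP_FN, 'rt') as fin:
    for line in fin:
        if not line.startswith('INSERT INTO'):
            continue
        # everything after VALUES is a comma-separated list of (...),(...) tuples
        values = line.split('VALUES', 1)[1].strip().rstrip(';')
        for record in values.split('),('):
            fields = record.strip('()').split(',')
            records.append(fields)   # e.g. [page_id, type, level, cascade, user, expiry, id]

print(len(records), records[:3])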

@Thulieblack I'd suggest checking out the example notebook I provided.

Thank you for this @Isaac. The example notebook would really go a long way.

I'm trying to read the uncompressed data using the gzip library, but every time I run the cell I get the "IOPub data rate exceeded" error. Does anyone know what's causing it?

I think it's because you're trying to display too much data in your PAWS notebook output. Try extracting fewer lines of the file.
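One way to stay under the output limit is to only ever print a small slice of the file, e.g. this sketch (again assuming the notebook's DUMP_DIR and DUMP_FN variables):

import gzip
from itertools import islice

with gzip.open(DUMP_DIR + DUMP_FN, 'rt') as fin:
    for line in islice(fin, 10):    # read only the first 10 lines
        print(line[:500])           # and print only the first 500 characters of each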

# Wikidata JSON dump we'll start processing (56 GB in size, compressed) so far too large to process the whole thing right now
!ls -shH "{WIKIDATA_DIR}{WIKIDATA_DUMP_FN}"

I'm still going through the Wikidata example. Do you know what the -shH options might mean? I can't find them online.

diego added a comment. Sat, Oct 10, 5:57 PM

Here you have: https://man7.org/linux/man-pages/man1/ls.1.html

Thulieblack added a comment. Edited Sat, Oct 10, 5:58 PM

@Toluwani7 I think it's the command used to read the compressed file, but I don't know if I'm wrong.

Amamgbu added a comment. Edited Sat, Oct 10, 6:08 PM

I am not too sure, but I think !ls -shH means list in short format with human-readable file sizes.

I believe it lists the files in a directory in short form and shows each file's size.

diego added a comment. Sat, Oct 10, 8:56 PM

Here you have: https://man7.org/linux/man-pages/man1/ls.1.html

-s, --size

print the allocated size of each file, in blocks

-H, --dereference-command-line

follow symbolic links listed on the command line

-h, --human-readable

with -l and -s, print sizes like 1K 234M 2G etc.

@Isaac
The protection data that I got from the MediaWiki dump and the API results seem to be in different formats.
For a particular page, we get data in the following formats:
In MediaWiki dump : (3664672,'edit','autoconfirmed',0,NULL,'infinity',717409)

In API result : {'pageid': 3664672, 'ns': 10, 'title': 'Template:Cyclopaedia 1728', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2020-10-10T18:45:26Z', 'lastrevid': 952065611, 'length': 3572, 'protection': [{'type': 'edit', 'level': 'autoconfirmed', 'expiry': 'infinity'}, {'type': 'move', 'level': 'autoconfirmed', 'expiry': 'infinity'}], 'restrictiontypes': ['edit', 'move']}

Isn't that a problem? How are we supposed to check for discrepancies between the two if they are in different formats? Moreover, the protection data in the API results doesn't show user-specific restrictions or sysop permissions. Is that OK?

The protection data that I got from the MediaWiki dump and the API results seem to be in different formats. Isn't that a problem?

Hi @SafiaKhaleel, it is possible to compare those tuples with the JSON objects through indexing. You can check the tutorial notebook @Isaac shared with us for more details.

@Isaac mentioned that the user-specific restrictions field is obsolete and we should disregard it. The sysop permissions are stored in the JSON data under "level".
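A comparison by indexing might look roughly like this sketch (the tuple positions are an assumption based on the dump rows shown above: page_id, type, level, cascade, user, expiry, id):

dump_row = (3664672, 'edit', 'autoconfirmed', 0, None, 'infinity', 717409)
api_page = {'pageid': 3664672,
            'protection': [{'type': 'edit', 'level': 'autoconfirmed', 'expiry': 'infinity'},
                           {'type': 'move', 'level': 'autoconfirmed', 'expiry': 'infinity'}]}

# does the API report a protection entry matching this dump row?
match = any(p['type'] == dump_row[1] and
            p['level'] == dump_row[2] and
            p['expiry'] == dump_row[5]
            for p in api_page.get('protection', []))
print(match)   # True if the dump row also appears in the API response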

Hi @SafiaKhaleel, it is possible to compare those tuples with the JSON objects through indexing. You can check the tutorial notebook @Isaac shared with us for more details.

Thanks @Amamgbu, I understood that. But the protection types in the two sources also seem to be different. In the API result, all the pages have both edit and move protection, as you can see here:
'protection': [{'type': 'edit', 'level': 'autoconfirmed', 'expiry': 'infinity'}, {'type': 'move', 'level': 'autoconfirmed', 'expiry': 'infinity'}]
whereas in the MediaWiki dump there is only one of the two types: (39620487,'edit','autoconfirmed',0,NULL,'infinity',692808)

This comment was removed by Divvya24.

In the API result, all the pages have both edit and move protection, whereas in the MediaWiki dump there is only one of the two types.

There are actually two entries if you inspect the data closely. I had the same issue until I inspected the page IDs.

There are actually two entries if you inspect the data closely. I had the same issue until I inspected the page IDs.

Oh, so do you mean there are two page-protection entries for a single page in the MediaWiki dump, one for edit and another for move?

Oh, so do you mean there are two page-protection entries for a single page in the MediaWiki dump, one for edit and another for move?

Yeah. One for edit and another for move.

[When appropriate, please strip unneeded full quotes of previous comments to keep them readable - thanks everyone!]

vesna added a subscriber: vesna. Sun, Oct 11, 8:06 PM

I think the idea is to first clean the data before you try to loop over it.

Hi everyone, I've been able to decompress the file, but please, how do I clean the data?

Hi @KemmieKemy, I would suggest extracting the "INSERT INTO" lines, since that is where the values start. After that you can clean further and extract the values as tuples or whatever format you're comfortable with.

KemmieKemy added a comment. Edited Mon, Oct 12, 7:40 AM

Okay, thank you @Amamgbu, I'll try this.

@Isaac
Hi, hoping to get some clarification on my interpretation of the API responses.
So when querying for a specific property to be returned, I noticed that the returned responses may or may not contain said property:

The way I interpret this is that the presence of the property acts as an indicator for it. This seems to be the case for the revisions_anon property, at least. Formatting as a dataframe for easier comparison (where NaN values mean that the property is absent from the entry), we see that the revisions_anon column matches up with the revisions_user column, where an IP address means an anonymous user.

Would it be safe to assume this? And is this generally the case for most properties? Eg. could I interpret the revisions_minor flags property similarly, where "" indicates a minor edit, and NaN otherwise?

Isaac added a comment. Mon, Oct 12, 6:49 PM

Thanks all for the questions and discussion! I appreciate that you have been helping out without giving exact answers -- i.e. letting people still write their own code but helping them with direction / assumptions etc. It looked to me that all the questions except the one below had been answered, but if I missed any, just remind me.

The way I interpret this is that the presence of the property acts as an indicator for it.
...
Would it be safe to assume this? And is this generally the case for most properties? Eg. could I interpret the revisions_minor flags property similarly, where "" indicates a minor edit, and NaN otherwise?

@0xkaywong it's probably safe to assume this though not all APIs are guaranteed to be stable. For the purposes of this project, I think the best thing to do is to document this assumption (just a comment in the code is fine), and where possible, include checks or test cases that would help tell you if the assumption is being violated.

That makes sense, thanks for answering!

Adding on to the above answer, I found this documentation which confirms that for Boolean parameters:

Boolean parameters work like HTML checkboxes: if the parameter is specified, regardless of value, it is considered true. For a false value, omit the parameter entirely.

See under Data Types: boolean

Like Isaac answered above, it's definitely still best to document your assumptions in case the API responses are unclean. But hope this helps someone else, as this aspect of the API response formatting was unclear for me.
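As a concrete illustration, the presence or absence of such keys can be turned into explicit boolean columns when flattening the revisions, e.g. this sketch (the revision dicts here are made up, not real API output):

revisions = [
    {'user': '203.0.113.5', 'anon': '', 'minor': ''},   # anonymous user, minor edit
    {'user': 'ExampleUser'},                            # registered user, non-minor edit
]

rows = [{'user': r['user'],
         'is_anon': 'anon' in r,     # presence of the key means True
         'is_minor': 'minor' in r}
        for r in revisions]
print(rows)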

This comment was removed by tanny411.
tanny411 added a comment. Edited Tue, Oct 13, 11:09 AM

Hi, regarding the comparison of dump and API data, should we compare all the data or just 10 randomly selected pages? I want to be sure the API will support calls for lots of page IDs.
Thanks

You can select 10 random IDs from the dump, get those 10 results back from the API, and work with those. Hope that helps!
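For example (a sketch; page_ids stands for whatever list of IDs you extracted from the dump, and the query assumes the prop=info / inprop=protection module):

import random
import requests

sample_ids = random.sample(page_ids, 10)    # 10 random page IDs from the dump

params = {'action': 'query', 'format': 'json',
          'prop': 'info', 'inprop': 'protection',
          'pageids': '|'.join(str(p) for p in sample_ids)}
result = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
print(result['query']['pages'])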

Has anybody else noticed the duplicate entries in the API response? They look like duplicates to me, or is there a pattern that I am overlooking?

Isaac added a comment. Tue, Oct 13, 7:30 PM

Has anybody else noticed the duplicate entries in the API response?

Hey @MelanieImf, interesting find -- I don't actually know what is causing that but I suspect one of two things:

  • it's a very old page and at some point, the way that protections were tracked by the MediaWiki software that runs Wikipedia was changed, which caused this duplicate
  • one set of protections was applied to the page directly and another was cascaded down and the API treats them as two separate things even though they have the same result.

Hopefully it shouldn't cause any issues (in this case, the protections seemed to agree) but if you encounter instances where e.g., edit protection expires both in a month and never, you can probably ignore the one-month expiration.

Additionally, a question was raised at one point about memory usage by the Notebooks. They are capped at I believe 3GB -- depending on how you process the data, you might reach this limit. If you do, you should either: a) see if you can change how you store the data as there may be a more efficient way, or, b) just take a sample of the data and state explicitly why you did that and how you selected a representative sample. While it can be a pain, memory constraints are useful for making sure you have code that can work in a variety of environments.

@Isaac thanks for the note! Potentially it's a mixture of both? There are indeed sets of protections that seem to be the result of cascading, as seen in the screenshot below. Other sets that have no "source" key might be explained by the former cause? In any case, I haven't encountered a result where the duplicates differ, so far!


tanny411 added a comment. Edited Wed, Oct 14, 12:31 AM

Thanks @MelanieImf. @Isaac thanks for addressing the memory issue. It turned out I had a bug in the code that caused high memory usage. Fixing that resolved the issue, so I removed my comment. 😃

In the section to analyse the data, we have the option to use the page table and the API. If I am not wrong, I should be using one of them since the data may be different. The fields don't seem to overlap, though, so can/should we use both sources?

@tanny411, I think you can use either, or even one to supplement the other. The API responses are more up to date, but you'd probably have to filter for protected pages yourself if you're generating from the API. The API does have an option to list pages with protections, but a caveat I found with that option is that results are enumerated sequentially, rather than randomly.

Does anyone know if we can obtain the timestamp for when a protection was added for a page? It seems to be an option for protected titles, but I haven't had any luck in finding out how to do this for protected pages.

Hi @0xkaywong, I'm not sure if that property is available for protected pages.

Isaac added a comment. Wed, Oct 14, 3:13 PM

I'm not sure if that property is available for protected pages.

Yeah, I can confirm that. There are methods for determining when a page was protected by parsing the logging table, if you're curious, but that table is 3 GB compressed for English Wikipedia, so I have no expectation that you would do that as part of this task.

Thanks @Amamgbu and @Isaac! Interesting to know, all the same.

Here you have: https://man7.org/linux/man-pages/man1/ls.1.html

Got it! Thanks

Hi,
I am unable to understand "head -46" in the line of code below.

!zcat "{DUMP_DIR}{DUMP_FN}" | head -46 | cut -c1-1000

Secondly, I am getting this error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

How do I solve this error?

In general, please see https://man7.org/linux/man-pages/man1/head.1.html for an explanation of Linux commands and parameters (like head) - thanks!

I am unable to understand "head -46", and I am getting an "IOPub data rate exceeded" error.

The error is coming from your code. You are probably iterating endlessly and printing so much output that you hit the notebook's data rate limit.

head displays the beginning of a file; -46 shows the first 46 lines.

Answering the question from T263646#6545673

Has anyone here been able to work with the page table dump without running out of memory? Would appreciate some tips :)

@Liz_Kariuki thanks for the question. In general you want to take two strategies when dealing with memory challenges:

  • Process the file incrementally so you don't have to store it in memory all at once
  • Only retain the data you need and store it efficiently
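In practice that might look something like this sketch: stream the gzipped dump one line at a time and keep only a small summary (here, counts of protection levels) instead of the full records. The column position is an assumption based on the tuples shown earlier in this thread.

import gzip
from collections import Counter

level_counts = Counter()
with gzip.open(DUMP_DIR + DUMP_FN, 'rt') as fin:
    for line in fin:                          # one line at a time; nothing else kept in memory
        if not line.startswith('INSERT INTO'):
            continue
        for record in line.split('VALUES', 1)[1].strip().rstrip(';').split('),('):
            fields = record.strip('()').split(',')
            level_counts[fields[2]] += 1      # retain only the protection level, not the full record

print(level_counts.most_common(5))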

Hi,
I am extremely sorry for this stupid question, but can anyone please guide me a bit through what is happening in cell 5 of the "example of working with Wikidata data" notebook? Basically, I am having difficulty understanding how the data is extracted from the JSON dump. What checks are the if and for statements performing?

Thanks

Amamgbu added a comment. Edited Fri, Oct 16, 9:46 PM

Hi @SIBGHAsheikh,
Could you please try to narrow down what you do not understand in that cell? This would help us explain things better to you.

From the code in the example, based on my understanding, the following checks were made:

  • A limit of 12,000 lines was processed, probably to conserve memory.
  • The number of sitelinks was calculated by accessing the JSON object's sitelinks and checking whether each sitelink ended with "wiki" but was not commonswiki or specieswiki.
  • The number of statements was calculated from the number of claims.
  • Next was determining whether the item was about a human or not, by extracting certain IDs and values from the JSON data; if they are found, the item is about a human.

Finally, the data needed is appended to a list, forming a list of tuples.

Hope this helps!
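A very rough sketch of the kinds of checks being described (assuming each line of the Wikidata dump holds one JSON entity, as in the example notebook; treat the exact key names as illustrative):

import gzip
import json

rows = []
with gzip.open(WIKIDATA_DIR + WIKIDATA_DUMP_FN, 'rt') as fin:
    for i, line in enumerate(fin):
        if i >= 12000:                       # only process a limited number of lines
            break
        line = line.strip().rstrip(',')
        if not line or line in ('[', ']'):   # skip the enclosing JSON array brackets
            continue
        item = json.loads(line)
        sitelinks = [s for s in item.get('sitelinks', {})
                     if s.endswith('wiki') and s not in ('commonswiki', 'specieswiki')]
        num_statements = sum(len(v) for v in item.get('claims', {}).values())
        is_human = any(c.get('mainsnak', {}).get('datavalue', {}).get('value', {}).get('id') == 'Q5'
                       for c in item.get('claims', {}).get('P31', []))   # P31 = instance of, Q5 = human
        rows.append((item.get('id'), len(sitelinks), num_statements, is_human))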

Amamgbu added a comment. Edited Fri, Oct 16, 10:16 PM

@Isaac and everyone.

After inspecting the data in the API response, I can see two types of protection in the "restrictiontypes" key in the image below. However, there is just a single protection type present in the "protection" key.

Is it safe to assume that no "edit" protection exists in this case or does it inherit same protection as that of "move"?

Hi @Amamgbu
I think that this issue was addressed here, I hope that this helps you.

I want to confirm: are the restriction types shown in the API response (please see attached screenshot) the same as the user-specific restrictions field in the MediaWiki dumps (please see attached screenshot)?

Restrictiontypes just lists the potential protections a page could have. Generally, it's going to be edit and move but, for example, there are articles that haven't been created yet but have a protected title (i.e. you can't create a page with that title) and they would have a create restriction.

user-specific restrictions

You can ignore this field -- it is not in use.

Thanks @0xkaywong for helping to answer!

Also, I want to know what this "source" field represents in the API response (please see attached screenshot).

@Sidrah_M_Siddiqui good question: I'm not sure which page that is, but I assume it's a cascading protection where the article you queried was included on the Main Page and therefore received the same protections as the Main Page.

I think that this issue was addressed here, I hope that this helps you.

Not sure if it's the same in this case. In this API response, the article exists. Here, it has a move protection but no edit protection, yet it has both move and edit restriction types in place.

Does that mean the title of the article cannot be changed except by an admin user (sysop), yet anyone can edit the body of the article, since no edit protection exists in the "protection" key?

Isaac added a comment. Sat, Oct 17, 4:47 PM

Does that mean anyone can edit the body of the article, since no edit protection exists in the "protection" key?

@Amamgbu that is correct and @Vanevela pointed to the appropriate prior discussion about this. More details: the restrictiontypes field is just what restrictions could be applied to the page, not which ones are applied -- a fuller description of what you could find in that field can be found here. For most pages, you'll see edit and move and can verify this by choosing a random page without restrictions and querying the API. I'd suggest ignoring the field as it won't tell you much.

I am extremely sorry for this stupid question

@SIBGHAsheikh nothing to apologize for. @Amamgbu gave a good overview in T263874#6556803 but if you have further questions, just try to be as specific as you can about what you don't understand. If you're having trouble understanding specific aspects of how I processed the Wikidata dumps in that notebook though, know that you don't have to fully understand that to work on the page protections notebook. They are two very different datasets -- the Wikidata notebook was just provided to give an example of the types of things you might want to do.

This comment was removed by Vanevela.

My notebook is frozen, did anyone face the same thing? How did you solve it?

Thanks @Isaac and @Vanevela for the help. I understand better now.

Please I don't know what I did wrong.

Is anyone having difficulty running the notebook?

@Isaac and everyone,
Is there any way to get the reason a protection was applied to a given page? If yes, where do we get that data?

Perfect day. I need to ask: my API response doesn't show the protection type, unlike the dump data. Is it something I need to worry about?

@Toluwani7 reload the page and run the server afresh; it worked for me.

Hi @Thulieblack,
Could you send a screenshot of the response? If you queried properly, you should be able to see the protection type. It is located in the "protection" key of the API response for each page ID.

@Amamgbu take a look: {F32397114}

I'm not sure you passed in any parameter for inprop when making your query. That is probably why you got only page info and no protection data.

@Amamgbu inprop gave me an error. I passed the prop*inf; that's how I got all the queries.

Amamgbu added a comment. Edited Sun, Oct 18, 10:41 AM

Please go through the API documentation for the query you want to make. Since you want to get the protection, you should be able to pass protection as an argument to inprop.
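Something along these lines should return the protection data for a page (a sketch using the requests library; the title is just an example):

import requests

params = {'action': 'query', 'format': 'json',
          'prop': 'info',
          'inprop': 'protection',             # ask specifically for protection info
          'titles': 'Albert Einstein'}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for page in resp['query']['pages'].values():
    print(page['title'], page.get('protection', []))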

Okay let me investigate and see what I can come up with

I just realised that I uploaded the wrong pictures. I don't get what I did wrong here

@Toluwani7 reload the page and run the server afresh; it worked for me.

It didn't work. My laptop froze and I had to hard-press its power button. I opened a new notebook, thanks.

@Isaac and everyone,
Can we access the 'page_counter' variable from the page table, given that it was removed completely in MediaWiki 1.25? Is there any other method to get the views of each page?

@Amamgbu thanks I have sorted it out 🤗🤗

Can we access the 'page_counter' variable from the page table? Is there any other method to get the views of each page?

You can query for page views; you can reference the MediaWiki query API documentation to get this info, though I think it returns a maximum of 60 days.
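A sketch of that kind of query (assuming the pageviews prop and its pvipdays parameter, which is capped at 60 days):

import requests

params = {'action': 'query', 'format': 'json',
          'prop': 'pageviews',
          'pvipdays': 60,                     # up to the last 60 days of views
          'titles': 'Albert Einstein'}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for page in resp['query']['pages'].values():
    views = page.get('pageviews') or {}       # dict of date -> view count (some values may be None)
    print(page['title'], sum(v for v in views.values() if v))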

@Amamgbu thanks I have sorted it out 🤗🤗

Awesome! You’re welcome

@Isaac

Hello! I am a bit confused. Like you mentioned earlier, it appears that most pages have edit and move protections, and most of those protections have autoconfirmed or sysop as the level. So I am a bit confused as to what we should be trying to predict. The whole idea is to study page protections, but I have a sample of more than 80,000 pages (from the latest page protections data dump) and the protections applied to them seem very similar. Am I missing something? Thanks in advance!

Thanks @Amamgbu, but a lot of pages seem to have a 'null' value for the pageviews variable.
Another problem I encountered is the limit up to which we can query page information using title/pageid. Getting only 50 instances at a time is not enough to work with, right? Does anyone know how to increase the limit to 500? (It says "500 for clients allowed higher limits".)
@Isaac

Amamgbu added a comment. Edited Mon, Oct 19, 2:58 PM

I think you can use pvipcontinue to extract more data, but I'm not sure it gives the definite value of 500 that you're looking for.

@Isaac would be in the best position to help you out.

Thulieblack added a comment. Edited Mon, Oct 19, 8:49 PM

Perfect evening. Sorry for asking a lot, I just need some clarity.
For the exploratory analysis, which data are we using: the dump or the API?

Isaac added a comment. Mon, Oct 19, 9:03 PM

I think you can use pvipcontinue to extract more data

@Amamgbu @SafiaKhaleel indeed -- depending on your exact query, you can use a continue parameter to get more results or just pass a new set of pageIDs to the API to get more data.
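Continuation can be handled with a small loop like this sketch (the API returns a 'continue' block whose values you merge back into the next request):

import requests

def query_all(params):
    """Yield result batches, following the API's 'continue' parameters."""
    params = dict(params, action='query', format='json')
    while True:
        resp = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
        yield resp.get('query', {})
        if 'continue' not in resp:
            break
        params.update(resp['continue'])    # carry the continuation tokens into the next request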

So I am a bit confused as to what we should be trying to predict. Am I missing something?

@YemiKifouly good question. If you don't see a good predictive problem with just the page protection data, you can also pull in data on pages without protections via the Random API (or any other page generator API).
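For example, a sketch that pulls a batch of random articles (most of which will have no protections) along with their protection info, assuming the random generator parameters behave as documented:

import requests

params = {'action': 'query', 'format': 'json',
          'generator': 'random',
          'grnnamespace': 0,                 # articles only
          'grnlimit': 10,                    # 10 random pages per request
          'prop': 'info', 'inprop': 'protection'}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for page in resp['query']['pages'].values():
    print(page['title'], page.get('protection', []))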

For the exploratory analysis, which data are we using: the dump or the API?

@Thulieblack You'll likely have more data from the dump, so that would be my suggestion, but in theory they should have the same data.

@Isaac thank you so much for the clarity much appreciated 👏🏿👏🏿

Tambe added a subscriber: Tambe. Tue, Oct 20, 4:07 AM

Greetings everyone. My name is Tambe Tabitha Achere and I am a Data scientist. I am Cameroonian and I code in Python. I look forward to working with you.