
[Abstract Wikipedia data science] Create scripts to fetch Module contents
Closed, Resolved · Public

Description

To fetch the Scribunto modules from _all_ Wikimedia wikis, several approaches can be taken. One of them is to use the MediaWiki API, i.e. api.php. The Wikimedia database replicas contain page titles for all wikis, but not page contents.
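
A minimal sketch of that API approach (assuming a plain requests session and that a base api.php URL is known for each wiki; the User-Agent string and helper name are made up for illustration):

  import requests

  def iter_scribunto_modules(api_url, session=None):
      """Yield (pageid, title, revid, content) for Scribunto pages in ns 828."""
      session = session or requests.Session()
      session.headers.setdefault("User-Agent", "abstract-wiki-modules (Toolforge tool; example contact)")
      params = {
          "action": "query",
          "format": "json",
          "formatversion": "2",
          "generator": "allpages",
          "gapnamespace": "828",       # the Module: namespace
          "gaplimit": "50",            # content requests are capped at 50 pages each
          "prop": "revisions",
          "rvprop": "ids|content|contentmodel",
          "rvslots": "main",
      }
      while True:
          data = session.get(api_url, params=params, timeout=60).json()
          for page in data.get("query", {}).get("pages", []):
              revs = page.get("revisions") or []
              if not revs:
                  continue                      # no revision returned, e.g. lastrevid = 0
              slot = revs[0]["slots"]["main"]
              if slot.get("contentmodel") != "Scribunto":
                  continue                      # skip /doc subpages and other non-Lua content
              yield page["pageid"], page["title"], revs[0]["revid"], slot.get("content", "")
          if "continue" not in data:
              break
          params.update(data["continue"])       # standard MediaWiki continuation
          # NB: a production script should also respect "batchcomplete" when batches are split.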

Contents are to be fetched through cron jobs on the Grid. For that, a Python environment has to be created in the tool, and bash scripts are set up to activate the environment and run the Python scripts. Multiple cron jobs are set up, each querying contents for a subset of wikis.

Tasks
Python Script
  • Read input (list of wikis) from a CSV file
  • Call API (pagination involved) and grab contents of Scribunto modules page by page
  • Save results as csv (temporary)
  • Re-fetch the contents every day: rewrites go into the database as updates
  • Fetch missed contents
  • Change output method to save results in the user database in ToolsDB (see the sketch after this list)
  • Add revision checks and update only when contents have changed (to be done later)
  • Check and add pages found in the DB but not via the API
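
A minimal sketch of the ToolsDB upsert mentioned in the list above; the database name (s12345__modules_p), the modules table and its columns are invented for illustration, and credentials come from the tool's replica.my.cnf as is usual on Toolforge:

  import pymysql

  def save_modules(rows, db="s12345__modules_p"):
      """rows: iterable of (dbname, pageid, title, revid, content) tuples;
      assumes (dbname, pageid) is the table's primary key."""
      conn = pymysql.connect(
          host="tools.db.svc.wikimedia.cloud",   # ToolsDB host (assumed)
          read_default_file="~/replica.my.cnf",  # the tool's own credentials
          database=db,
          charset="utf8mb4",
      )
      sql = (
          "INSERT INTO modules (dbname, pageid, title, revid, content) "
          "VALUES (%s, %s, %s, %s, %s) "
          "ON DUPLICATE KEY UPDATE "
          "title = VALUES(title), "
          # rewrite the (large) content column only when the revision has moved on
          "content = IF(revid <> VALUES(revid), VALUES(content), content), "
          "revid = VALUES(revid)"
      )
      try:
          with conn.cursor() as cur:
              cur.executemany(sql, rows)
          conn.commit()
      finally:
          conn.close()
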
Bash
  • Create bash script to run cron jobs
  • Add helper comments to facilitate env creation and addition of commands in crontab
  • Run tests for the cron jobs
    • Solve memory errors (test # of crons vs # of wikipages per cron job)
    • Solve other errors (e.g. "Can't fetch site") (could not reproduce; re-open the issue if it occurs again)
  • Compare page stats collected with the API to the DB (a query sketch follows this list)
  • Set final cron jobs
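
For comparing page stats with the DB, the module list can be read from a wiki replica roughly as below; the replica host name and the _p database suffix follow the usual Toolforge conventions and are assumptions here rather than values taken from this task:

  import pymysql

  def list_modules_from_replica(dbname):
      """Return (page_id, page_title) pairs for Scribunto modules on one wiki."""
      conn = pymysql.connect(
          host=f"{dbname}.analytics.db.svc.wikimedia.cloud",  # assumed replica host pattern
          read_default_file="~/replica.my.cnf",
          database=f"{dbname}_p",
          charset="utf8mb4",
      )
      try:
          with conn.cursor() as cur:
              # the replicas expose titles and metadata, not page contents
              cur.execute(
                  "SELECT page_id, page_title "
                  "FROM page "
                  "WHERE page_namespace = 828 "
                  "AND page_content_model = 'Scribunto'"
              )
              return cur.fetchall()
      finally:
          conn.close()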

Optional

  • Send error reports every day after cron job completion (additionally download .err and .out files automatically)
  • Remove missed pages that have already been checked
  • Remove DB pages not loaded (i.e. not Scribunto modules)
Update (18-12-20)

Collected all Lua contents as CSV. Missed contents still have to be re-fetched, and the scripts adjusted for long-term use.

  • Time taken: ~1 hour
  • Memory required: 1.8G
  • Missed?: Very little content missed (82 pages). All have lastrevid = 0, i.e. no contents.
  • Fetch Module list from db: ~30 minutes, 15M file size
Update (26-12-20)

After cleaning and scrutinizing the data more, here is the summary (taking only ns 828 and Scribunto modules):

  • 118 pages from the DB were not found in the API allpages list. Of them, 2 are actual Scribunto modules and so were loaded into our DB. The rest are not Scribunto modules although the DB says so. Ignored.
  • 98 pages were found via the API but not in the DB.

Event Timeline

I've tried to compare the pages collected via the API and from the DB (ids and titles only) by their ids. Had to go through a LOT of memory errors to run this script.
This is the output:

Number of db pages: 275154
Number of api pages: 274543
Number of unique pages in db: 740 # pages not found from API calls
Number of unique pages in api: 129 # pages not listed from db queries
Ok

It seems there are some discrepancies. I am looking into what these pages are and if there's any pattern here.

@LostEnchanter I think it's a good idea to save data into databases and process from there. Loading contents gives a couple of errors due to the presence of all kinds of symbols in the code (quotes and commas). Since we are going to use the DB anyway, I think it's best not to try to solve all these errors now. (I did spend a good amount of time trying to load the CSV to compare page entries with the DB, but then I went with a workaround for now.)
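
A rough sketch of how such a comparison can be done with plain sets, so that only the page ids (not the contents) have to sit in memory; the file names and column layout are placeholders:

  import csv

  def load_ids(path, id_column=0, has_header=True):
      """Read one column of page ids from a CSV file into a set."""
      with open(path, newline="", encoding="utf-8") as fh:
          reader = csv.reader(fh)
          if has_header:
              next(reader, None)          # skip the header row, if there is one
          return {int(row[id_column]) for row in reader if row}

  db_ids = load_ids("db_pages.csv")       # placeholder file names
  api_ids = load_ids("api_pages.csv")

  print("Number of db pages:", len(db_ids))
  print("Number of api pages:", len(api_ids))
  print("Only in db (not returned by the API):", len(db_ids - api_ids))
  print("Only in api (not listed by the db):", len(api_ids - db_ids))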

Can you please additionally describe, what do you mean by 'length' there? Amount of symbols in Lua sourcecode?

Ah, sorry for the confusion. It means number of rows basically. I've updated it to make it clearer.
So from the DB, I get 740 page ids that I didn't get from the API, and vice versa.

I guess I wanted to write length of the dataframe 😅

A couple of confusions I ran into:

  1. So far we've been collecting contents from namespace 828 with content_model = Scribunto. While collecting from the database I found some Scribunto modules in other namespaces (~100 pages) as well. I wanted to know if these are of concern to us.
namespace | # of pages
       10 | 58
        2 | 29
        1 | 11
        4 | 1
  2. Some pages listed from the database were not found through API pagination. For those I have separately run a script to grab their page content (a sketch of fetching these by page id follows below). Although most of these pages are in fact Scribunto modules, some are shown as wikitext by the API while the database says Scribunto, and their contents are in fact not Lua code. I am not sure why the database would mark them as Scribunto modules. See this section of the notebook.
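
A sketch of that separate re-fetch, assuming the missed pages are requested directly by page id (the API accepts up to 50 ids per request; names here are illustrative):

  import requests

  def fetch_by_pageids(api_url, pageids, session=None):
      """Fetch revisions/content for specific pages, 50 ids per request."""
      session = session or requests.Session()
      results = []
      for i in range(0, len(pageids), 50):
          batch = pageids[i:i + 50]
          data = session.get(api_url, params={
              "action": "query",
              "format": "json",
              "formatversion": "2",
              "pageids": "|".join(str(p) for p in batch),
              "prop": "revisions",
              "rvprop": "ids|content|contentmodel",
              "rvslots": "main",
          }, timeout=60).json()
          for page in data.get("query", {}).get("pages", []):
              revs = page.get("revisions") or []
              if not revs:
                  continue
              slot = revs[0]["slots"]["main"]
              # keep the content model so wikitext pages the DB mislabels as
              # Scribunto can be filtered out afterwards
              results.append((page["pageid"], page["title"], slot.get("contentmodel"), slot.get("content", "")))
      return results
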
Update:

After cleaning and scrutinizing the data more, here is the summary (taking only ns 828 and Scribunto modules):

  • 118 pages from the DB were not found in the API allpages list. Of them, 2 are actual Scribunto modules and so were loaded into our DB. The rest are not Scribunto modules although the DB says so. Ignored.
  • 98 pages were found via the API but not in the DB.

The reason I compared what we can find from the DB/API is that we are intending to collect other data (e.g. page views, usage etc.) from the DB. This may leave some of our pages out and give us redundant information (about wikitext pages, for example).
The discrepancy may be due to the fact that the API gives us more recent information and it takes some time for the replica DB to update. With this situation at hand, the most reliable pages to work with will be those found from both the API and the DB (and therefore will have all the data we need), and these form 99% of the data anyway.