Description
To fetch all the Scribunto modules across _all_ Wikimedia wikis, several approaches can be taken. One of them is to use the Wikimedia API, i.e. api.php. The Wikimedia database replicas do not contain page contents, but they do contain page titles for all wikis.
Contents are to be fetched through cron jobs on the Grid. For that, a Python environment has to be created in the tool, and bash scripts are to be set up to activate the environment and run the Python scripts. Multiple crons are set up, each querying contents for a subset of wikis.
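As a rough sketch of where the titles come from: each wiki's replica can be queried for Module-namespace pages. The host/database naming below follows the Wiki Replicas convention (`<dbname>_p` on a per-wiki host); the helper name and exact host suffix are assumptions, so adjust to your setup.

```python
# Sketch only: connection coordinates and title query for one wiki's replica.
# Host naming follows the Wiki Replicas docs (an assumption here, not taken
# from the actual scripts); the replicas hold titles, never page text.

def replica_coords(dbname):
    """Connection coordinates for one wiki's replica, e.g. dbname='enwiki'."""
    return {
        "host": f"{dbname}.analytics.db.svc.wikimedia.cloud",
        "database": f"{dbname}_p",
    }

# Titles of Scribunto modules: namespace 828, content model 'Scribunto'.
MODULE_TITLES_SQL = """
SELECT page_id, page_title
FROM page
WHERE page_namespace = 828
  AND page_content_model = 'Scribunto'
"""
```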
Tasks
Python Script
- Read input (list of wikis) from a CSV file
- Call API (pagination involved) and grab contents of Scribunto modules page by page
- Save results as csv (temporary)
- Re-fetch the contents every day: rewrite rows in the database as updates
- Fetch missed contents
- Change output method to save results in user_database in toolsdb
- Add checks for revision and update only when contents have changed (to be done later)
- Check and add pages found from db and not from api
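The API call with pagination can be sketched as a loop that merges the `continue` block from each response into the next request. This uses the MediaWiki Action API's `generator=allpages` over namespace 828 with `prop=revisions`; the function names are mine, not from the actual scripts, and the response shape assumes `formatversion=2`.

```python
# Sketch: fetch Scribunto module contents page by page, following API
# continuation. Parameter names are from the MediaWiki Action API; the
# helper functions are hypothetical.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

QUERY = {
    "action": "query", "format": "json", "formatversion": 2,
    "generator": "allpages", "gapnamespace": 828, "gaplimit": 50,
    "prop": "revisions", "rvprop": "content|ids", "rvslots": "main",
}

def extract_pages(resp):
    """Pull (title, lastrevid, content) tuples out of one API response."""
    out = []
    for page in resp.get("query", {}).get("pages", []):
        rev = (page.get("revisions") or [{}])[0]
        content = rev.get("slots", {}).get("main", {}).get("content", "")
        out.append((page["title"], page.get("lastrevid", 0), content))
    return out

def fetch_all(endpoint):
    """Yield module rows from api.php until the 'continue' token runs out."""
    cont = {}
    while True:
        url = endpoint + "?" + urlencode({**QUERY, **cont})
        with urlopen(url, timeout=60) as r:
            resp = json.load(r)
        yield from extract_pages(resp)
        if "continue" not in resp:
            return
        cont = resp["continue"]
```

Pages with no fetchable revision come back with empty content, which matches the `lastrevid = 0` cases noted in the updates below.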
Bash
- Create bash script to run cron jobs
- Add helper comments to facilitate env creation and addition of commands in crontab
- Run tests for the cron jobs
- Solve memory errors (test # of crons vs # of wikipages per cron job)
- Solve other errors (e.g. "Can't fetch site", etc.) (Could not reproduce; re-open issue if it occurs again)
- Compare page stats collected with API to db
- Set final cron jobs
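The memory tuning above (number of crons vs. number of wiki pages per cron job) amounts to splitting the wiki list into one chunk per cron job. A minimal sketch, assuming the chunk count is whatever the memory tests settle on:

```python
# Sketch: split the list of wikis into n_jobs nearly equal chunks,
# one input CSV per cron job. Hypothetical helper, not the actual script.
def chunk(items, n_jobs):
    """Return n_jobs contiguous chunks whose sizes differ by at most 1."""
    k, r = divmod(len(items), n_jobs)
    out, start = [], 0
    for i in range(n_jobs):
        size = k + (1 if i < r else 0)  # first r chunks get one extra item
        out.append(items[start:start + size])
        start += size
    return out
```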
Optional
- Send error reports everyday after cron job completion (additionally download .err and .out files automatically)
- Remove missed pages that have already been checked
- Remove db pages not loaded (i.e. not Scribunto modules)
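For the error-report idea, one simple shape is to collect the grid jobs' non-empty `.err` files into a single report after the crons finish (a sketch; the directory layout and mailing step are assumptions, not taken from the tool):

```python
# Sketch: gather non-empty .err output from the grid jobs into one report
# string, which could then be mailed or archived. Hypothetical helper.
from pathlib import Path

def error_summary(err_dir, pattern="*.err"):
    """Concatenate the contents of all non-empty .err files in err_dir."""
    report = []
    for f in sorted(Path(err_dir).glob(pattern)):
        text = f.read_text(errors="replace").strip()
        if text:  # skip jobs that finished cleanly
            report.append(f"== {f.name} ==\n{text}")
    return "\n\n".join(report)
```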
Update (18-12-20)
Collected all Lua contents as CSV. Missed contents still have to be re-fetched, and the scripts still need to be adjusted for long-term use.
- Time taken: ~1 hour
- Memory required: 1.8G
- Missed? : Very few pages missed (82). All have lastrevid = 0, i.e. no contents.
- Fetch Module list from db: ~30 minutes, 15M file size
Update (26-12-20)
After cleaning and scrutinizing the data further, here is the summary (counting only ns 828 and Scribunto modules):
- 118 pages from the DB not found in the API allpages list. Of these, 2 are actual Scribunto modules and so were loaded into our DB. The rest are not Scribunto modules, although the DB says so; ignored.
- 98 pages found from the API but not in the DB.
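The DB-vs-API comparison behind these two numbers is a plain set difference on page titles. A minimal sketch with toy data (function name is mine):

```python
# Sketch: pages each source has that the other is missing.
def compare(db_titles, api_titles):
    """Return (db_only, api_only) as sorted lists of titles."""
    db, api = set(db_titles), set(api_titles)
    return sorted(db - api), sorted(api - db)
```

On the real data, `db_only` held the 118 pages to vet by hand and `api_only` the 98 pages to add to the DB.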