#### Description
To fetch all the Scribunto modules across _all_ Wikimedia wikis, several approaches can be taken. One of them is to use the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page), a.k.a. api.php. The Wikimedia database replicas do not contain page contents, but they do contain page titles for all wikis.
Contents are fetched through cron jobs on the [Grid](https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid). For that, a Python environment has to be created in the tool account, and bash scripts have to be set up to activate the environment and run the Python scripts. Multiple cron jobs are set up, each querying contents for a subset of wikis.
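A minimal sketch of the title-listing step against a database replica, assuming `pymysql`, the standard Toolforge `~/replica.my.cnf` credentials, and the English Wikipedia replica as an example (an illustration of the approach, not the exact production script):

```python
import os
import pymysql

# Connect to a wiki replica on Toolforge; host and db names are examples.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)
with conn.cursor() as cur:
    # Scribunto modules live in the Module namespace (828); replicas expose
    # titles and metadata but not the page text itself.
    cur.execute(
        "SELECT page_title FROM page "
        "WHERE page_namespace = 828 AND page_content_model = 'Scribunto'"
    )
    titles = [row[0].decode("utf-8") for row in cur.fetchall()]
conn.close()
```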
#### Tasks
##### Python Script
- [x] Read input (list of wikis) from a CSV file
- [x] Call the API (pagination involved) and grab the contents of Scribunto modules page by page (see the sketch after this list)
- [x] Save results as CSV (temporary)
- [x] Re-write the contents every day: re-fetched contents are stored in the database as updates
- [x] Fetch missed contents
- [x] Change the output method to save results in the `user_database` in toolsdb
- [ ] Add revision checks and update only when the contents have changed (to be done later)
- [x] Check and add pages found in the DB but not via the API
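A rough sketch of the content-fetching step with API continuation handling, assuming `requests` and using English Wikipedia as an example endpoint; the actual scripts iterate over the wikis from the input CSV and write into toolsdb where the placeholder comment sits below:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # example endpoint; one per wiki in the CSV
session = requests.Session()
session.headers["User-Agent"] = "scribunto-content-fetch (Toolforge tool)"  # placeholder UA

params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "generator": "allpages",
    "gapnamespace": 828,      # Module: namespace
    "gaplimit": 50,
    "prop": "revisions",
    "rvprop": "ids|content",
    "rvslots": "main",
}

while True:
    data = session.get(API, params=params, timeout=60).json()
    for page in data.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if not revisions:
            continue  # e.g. pages with lastrevid = 0, i.e. no contents
        content = revisions[0]["slots"]["main"]["content"]
        revid = revisions[0]["revid"]
        # ... insert/update (page title, revid, content) in the toolsdb user database ...
    if "continue" not in data:
        break
    params.update(data["continue"])  # follow the continuation parameters
```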
##### Bash
- [x] Create a bash script to run the cron jobs
- [x] Add helper comments to facilitate environment creation and adding the commands to the crontab
- [x] Run tests for the cron jobs
- [x] Solve memory errors (tune the number of cron jobs vs. the number of wiki pages per cron job; see the chunking sketch after the optional items)
- [x] Solve other errors (e.g. "Can't fetch site"); could not reproduce, re-open the issue if it occurs again
- [x] Compare page stats collected via the API with the DB
- [x] Set up the final cron jobs
##### Optional
- [ ] Send error reports every day after cron-job completion (additionally, download the `.err` and `.out` files automatically)
- [x] Remove missed pages that have already been checked
- [x] Remove DB pages that were not loaded (i.e. not Scribunto modules)
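A hypothetical helper (file names and chunk count are assumptions, not taken from the actual scripts) illustrating how the wiki list could be split into one subset per cron job so that each job stays within its memory budget:

```python
import csv

def split_wikis(csv_path, n_jobs):
    """Read wiki names from a CSV and split them into n_jobs round-robin chunks."""
    with open(csv_path, newline="") as f:
        wikis = [row[0] for row in csv.reader(f) if row]
    # Round-robin assignment keeps chunk sizes within one wiki of each other.
    return [wikis[i::n_jobs] for i in range(n_jobs)]

if __name__ == "__main__":
    # Write one CSV per cron job, e.g. wikis_job0.csv ... wikis_job7.csv.
    for job_id, chunk in enumerate(split_wikis("wikis.csv", 8)):
        with open("wikis_job{}.csv".format(job_id), "w", newline="") as out:
            csv.writer(out).writerows([w] for w in chunk)
```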
#### Update (18-12-20)
Collected all Lua contents as CSV. Missed contents still have to be re-fetched, and the scripts still have to be adjusted for long-term use.
- Time taken: ~1 hour
- Memory required: 1.8 GB
- Missed?: Very few contents missed (82 pages). All have `lastrevid` = 0, i.e. no contents.
- Fetching the module list from the DB: ~30 minutes, 15 MB file size
#### Update (26-12-20)
After cleaning and scrutinizing the data further, here is the summary (considering only namespace 828 and Scribunto modules):
- 118 pages in the DB were not found in the API `allpages` list. Of these, 2 are actual Scribunto modules and were therefore loaded into our DB. The rest are not Scribunto modules, although the DB says so; they were ignored.
- 98 pages were found via the API but not in the DB.