#### Description
To fetch all the Scribunto modules across _all_ Wikimedia wikis, several approaches can be taken. One of them is to use the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page), a.k.a. api.php. The Wikimedia database replicas do not contain page contents, but they do contain page titles for all wikis.
Contents are fetched through cron jobs on the [Grid](https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid). For that, a Python environment has to be created in the tool account, and bash scripts have to be set up to activate the environment and run the Python scripts. Multiple cron jobs are set up, each querying contents for a subset of wikis.
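A minimal sketch of the title-listing step against a database replica, assuming `pymysql`, the standard Toolforge `~/replica.my.cnf` credentials, and the English Wikipedia replica as an example (an illustration of the approach, not the exact production script):

```python
import os
import pymysql

# Connect to a wiki replica on Toolforge; host and db names are examples.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)
with conn.cursor() as cur:
    # Scribunto modules live in the Module namespace (828); replicas expose
    # titles and metadata but not the page text itself.
    cur.execute(
        "SELECT page_title FROM page "
        "WHERE page_namespace = 828 AND page_content_model = 'Scribunto'"
    )
    titles = [row[0].decode("utf-8") for row in cur.fetchall()]
conn.close()
```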
#### Tasks
##### Python Script
- [x] Read input (list of wikis) from a CSV file
- [x] Call the API (pagination involved) and grab the contents of Scribunto modules page by page (see the sketch after this list)
- [x] Save results as CSV (temporary)
- [x] Re-write the contents every day: re-fetched contents are stored in the database as updates
- [x] Fetch missed contents
- [x] Change the output method to save results in the `user_database` in toolsdb
- [ ] Add revision checks and update only when the contents have changed (to be done later)
- [x] Check and add pages found in the DB but not via the API
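A rough sketch of the content-fetching step with API continuation handling, assuming `requests` and using English Wikipedia as an example endpoint; the actual scripts iterate over the wikis from the input CSV and write into toolsdb where the placeholder comment sits below:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # example endpoint; one per wiki in the CSV
session = requests.Session()
session.headers["User-Agent"] = "scribunto-content-fetch (Toolforge tool)"  # placeholder UA

params = {
    "action": "query",
    "format": "json",
    "formatversion": 2,
    "generator": "allpages",
    "gapnamespace": 828,      # Module: namespace
    "gaplimit": 50,
    "prop": "revisions",
    "rvprop": "ids|content",
    "rvslots": "main",
}

while True:
    data = session.get(API, params=params, timeout=60).json()
    for page in data.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if not revisions:
            continue  # e.g. pages with lastrevid = 0, i.e. no contents
        content = revisions[0]["slots"]["main"]["content"]
        revid = revisions[0]["revid"]
        # ... insert/update (page title, revid, content) in the toolsdb user database ...
    if "continue" not in data:
        break
    params.update(data["continue"])  # follow the continuation parameters
```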
##### Bash
- [x] Create a bash script to run the cron jobs
- [x] Add helper comments to facilitate environment creation and adding the commands to the crontab
- [x] Run tests for the cron jobs
- [x] Solve memory errors (tune the number of cron jobs vs. the number of wiki pages per cron job; see the chunking sketch after the optional items)
- [x] Solve other errors (e.g. "Can't fetch site"); could not reproduce, re-open the issue if it occurs again
- [x] Compare page stats collected via the API with the DB
- [x] Set up the final cron jobs
##### Optional
- [ ] Send error reports every day after cron-job completion (additionally, download the `.err` and `.out` files automatically)
- [x] Remove missed pages that have already been checked
- [x] Remove DB pages that were not loaded (i.e. not Scribunto modules)
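A hypothetical helper (file names and chunk count are assumptions, not taken from the actual scripts) illustrating how the wiki list could be split into one subset per cron job so that each job stays within its memory budget:

```python
import csv

def split_wikis(csv_path, n_jobs):
    """Read wiki names from a CSV and split them into n_jobs round-robin chunks."""
    with open(csv_path, newline="") as f:
        wikis = [row[0] for row in csv.reader(f) if row]
    # Round-robin assignment keeps chunk sizes within one wiki of each other.
    return [wikis[i::n_jobs] for i in range(n_jobs)]

if __name__ == "__main__":
    # Write one CSV per cron job, e.g. wikis_job0.csv ... wikis_job7.csv.
    for job_id, chunk in enumerate(split_wikis("wikis.csv", 8)):
        with open("wikis_job{}.csv".format(job_id), "w", newline="") as out:
            csv.writer(out).writerows([w] for w in chunk)
```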
#### Update (18-12-20)
Collected all Lua contents as CSV. Missed contents still have to be re-fetched, and the scripts still have to be adjusted for long-term use.
- Time taken: ~1 hour
- Memory required: 1.8 GB
- Missed?: Very few contents missed (82 pages). All have `lastrevid` = 0, i.e. no contents.
- Fetching the module list from the DB: ~30 minutes, 15 MB file size
#### Update (26-12-20)
After cleaning and scrutinizing the data further, here is the summary (considering only namespace 828 and Scribunto modules):
- 118 pages in the DB were not found in the API `allpages` list. Of these, 2 are actual Scribunto modules and were therefore loaded into our DB. The rest are not Scribunto modules, although the DB says so; they were ignored.
- 98 pages were found via the API but not in the DB.