[Abstract Wikipedia data science] Create parser for list of all existing wikis
Closed, Resolved · Public

Description

To fetch the Scribunto modules from _all_ Wikimedia wikis, the first step is to get a list of those wikis. That can be done by parsing an existing page on Meta-wiki. For later use, the parsed list should be saved in a text file.
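
A minimal sketch of this step, assuming the SiteMatrix API on Meta-wiki (action=sitematrix) rather than scraping a rendered page; the response keys used here follow the documented SiteMatrix output format:

```python
# Sketch: fetch the list of all Wikimedia wikis and save one URL per line.
# Assumes the action=sitematrix module; its output groups wikis by language
# under numeric keys (each with a "site" list) plus a flat "specials" list.
import requests

API_URL = "https://meta.wikimedia.org/w/api.php"

def fetch_wiki_urls():
    params = {"action": "sitematrix", "format": "json"}
    data = requests.get(API_URL, params=params, timeout=30).json()
    urls = []
    for key, group in data["sitematrix"].items():
        if key == "count":
            continue
        # Language groups keep their wikis under "site"; "specials" is a flat list.
        sites = group.get("site", []) if isinstance(group, dict) else group
        for site in sites:
            if "closed" not in site:  # skip closed wikis
                urls.append(site["url"])
    return urls

if __name__ == "__main__":
    with open("wikis.txt", "w") as f:
        f.write("\n".join(fetch_wiki_urls()) + "\n")
```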

Update (15.12.2020)

To fetch additional information we also need to know the wikis' database names. So it is reasonable to switch to fetching this information from the 'meta' wiki database copy, as described here.
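
A sketch of such a query, assuming the Toolforge Wiki Replicas' meta_p.wiki table; the host name, the credentials file, and the column names (dbname, url, is_closed) are taken from the standard Toolforge setup and may differ:

```python
# Sketch: read wiki database names and URLs from the meta_p.wiki table
# on the Wiki Replicas and dump them to CSV. Host, credentials file, and
# column set are assumptions based on the standard Toolforge setup.
import csv
import os
import pymysql

conn = pymysql.connect(
    host="meta.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="meta_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT dbname, url FROM wiki WHERE is_closed = 0")
        rows = cur.fetchall()
finally:
    conn.close()

with open("wikis.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dbname", "url"])
    writer.writerows(rows)
```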

But these tables don't have an update-time property: they are just copies, which are not updated in place; new copies are simply loaded to replace the old ones. So it makes sense to look at a table's creation time to check for updates.
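
A sketch of that "last update" check, assuming a MySQL/MariaDB replica where information_schema.TABLES exposes CREATE_TIME (connection details as in the previous sketch):

```python
# Sketch: use the table's creation time in information_schema as a proxy
# for "last update", since the replica copies carry no update timestamp.
import os
import pymysql

def table_created_at(conn, schema, table):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT CREATE_TIME FROM information_schema.TABLES "
            "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s",
            (schema, table),
        )
        row = cur.fetchone()
    return row[0] if row else None

conn = pymysql.connect(
    host="meta.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
try:
    print(table_created_at(conn, "meta_p", "wiki"))
finally:
    conn.close()
```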

Tasks
  • Parse the existing wiki links from the page
  • Save them in a text file, one link per line
  • Add checks for page parsing (page unavailable, page changed, ...)
  • Add a check for whether the page has been updated recently (API request?)
  • Move to fetching the info from the database copies
  • Save the query results as CSV
  • Make a "last update" checker
  • Put the updater on cron (see the example below)
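
For the last item, a hypothetical crontab entry (script path and schedule are placeholders; on Toolforge the command would normally be submitted through the job framework instead of raw cron):

```
# Hypothetical entry: refresh the wiki list daily at 03:00.
0 3 * * * python3 /path/to/update_wiki_list.py >> /path/to/update.log 2>&1
```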