Context
The current scraper downloads the per-wiki dumps from the public directory. This directory is no longer kept up to date, so we might have to look for alternatives.
Task
- See whether we can still retrieve dump files when the scraper is executed on the WMF servers
  - No: there is no filesystem access, so we will have to use the API.
- Implement an Enterprise API module in https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex with verbs for the dump actions we need.
- Check whether the dump format has changed in ways we need to account for when working with the data, and apply fixes if feasible
- Rewrite the application to retrieve dumps through the API.
- Set up our team's private Enterprise credentials on the analytics cluster and document internally.
- Implement unauthenticated mode once the analytics cluster is allow-listed.
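As a rough sketch of the "verbs" the scraper needs, the flow can be expressed as three curl calls: log in, list snapshots, download one snapshot. The endpoint URLs below follow the public Wikimedia Enterprise documentation at the time of writing and should be verified before use; the `WME_USERNAME`/`WME_PASSWORD` variable names and the `dewiki_namespace_0` identifier are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of the three Enterprise API actions the scraper needs.
# Endpoint paths are assumptions based on the public Enterprise docs.
AUTH_URL="https://auth.enterprise.wikimedia.com/v1/login"
API_URL="https://api.enterprise.wikimedia.com/v2"

# 1. Log in with the team credentials to obtain a bearer token (JSON response).
login() {
  curl -s -X POST "$AUTH_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"username\": \"$WME_USERNAME\", \"password\": \"$WME_PASSWORD\"}"
}

# 2. List the available snapshots (the per-wiki HTML dumps).
list_snapshots() {
  curl -s "$API_URL/snapshots" -H "Authorization: Bearer $1"
}

# 3. Download one snapshot (e.g. "dewiki_namespace_0") to a local tarball.
download_snapshot() {
  curl -s -o "$2.tar.gz" "$API_URL/snapshots/$2/download" \
    -H "Authorization: Bearer $1"
}
```

The unauthenticated mode would presumably drop the login step and the `Authorization` header once the analytics cluster is allow-listed.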
API docs:
Implementation notes
Work-in-progress branches:
- https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex/-/merge_requests/22
- https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex/-/merge_requests/25
- https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex/-/merge_requests/23
- https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/135
- The Enterprise integration belongs in the mediawiki_client library, with the general goal of accessing "Enterprise API" resources. It's fine to prototype inside the scraper and move the logic later, but note that the client library is already set up with API fixtures and tests. Anything beyond authentication and the Snapshot API is out of scope.
- The built-in Erlang zlib and erl_tar libraries can't parse the gzip archives, for reasons we haven't identified, but command-line tar xzf can. Happily, that is already the tooling the scraper is built on, so no extra work is needed. Unfortunately, this setup gives us no streaming, so memory usage will likely be several GB.
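A minimal self-contained demonstration of the fallback: build a small gzipped tarball and unpack it with the same `tar xzf` invocation the scraper shells out to (the file names here are made up for the demo). Unlike a streaming decoder, this writes the full archive to disk before the scraper can read any of it, which is where the multi-GB footprint comes from.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny stand-in for a per-wiki HTML dump archive.
mkdir dump
printf '<html>page one</html>\n' > dump/page1.html
tar czf dump.tar.gz dump

# Throw the originals away and extract with the command-line tooling,
# exactly as the scraper does instead of using zlib/erl_tar.
rm -r dump
tar xzf dump.tar.gz
cat dump/page1.html
```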