
Scraper: Use Enterprise API to retrieve dumps
Closed, Resolved · Public · Spike

Description

Context

The current scraper downloaded the per-wiki dumps from the public directory. That directory is no longer kept up to date, so we need to look for alternatives.

Task

API docs:

Implementation notes

Work in progress branches:

  • The Enterprise integration belongs in the mediawiki_client library, with the generalized goal of accessing "Enterprise API" resources. It's fine to prototype inside of the scraper and move the logic later, but note that the client library is already set up with API fixtures and tests. Anything beyond authentication and the Snapshot API is out of scope.
  • It seems that the built-in Erlang zlib and erl_tar libraries can't parse the gzip for unknown reasons, but command-line tar xzf can. Happily, this is already the tooling we've based the scraper on, so we don't have to do extra work. Unfortunately, we won't get streaming with this setup so memory usage will likely be several GB.
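The command-line fallback described above can be sketched as follows. This is a minimal illustration in Python, not the scraper's actual code: the function names `extract_snapshot` and `iter_articles` are hypothetical, and it assumes the snapshot is a `.tar.gz` archive containing newline-delimited JSON files and that a `tar` binary is on the `PATH`.

```python
import json
import subprocess
from pathlib import Path


def extract_snapshot(archive_path, dest_dir):
    """Unpack a .tar.gz archive with command-line tar and list the extracted files.

    Hypothetical helper: shells out to `tar xzf` instead of using a
    built-in gzip/tar library.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if tar exits with a non-zero status.
    subprocess.run(["tar", "xzf", str(archive_path), "-C", str(dest)], check=True)
    return sorted(p for p in dest.rglob("*") if p.is_file())


def iter_articles(ndjson_path):
    """Yield one decoded JSON object per non-empty line.

    Assumes each line of the extracted file is a standalone JSON document.
    """
    with open(ndjson_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

A caller would run `extract_snapshot` once per downloaded archive and then pass each returned path to `iter_articles`.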

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 12 2025, 1:59 PM
awight subscribed.

Happily, the Enterprise Dumps are still available from analytics clients. We'll run the job in that environment.

I've made some light updates to our internal docs and verified that the current scraper code can process a small wiki.

Update: I was wrong in my last comment. The Enterprise dumps will be behind the network paywall from all environments, even from analytics clients. Once task T403298 (Provide auth-less access to Enterprise APIs from WMF Analytics cluster) is completed, we should be able to use the API from analytics clients without needing access keys.

awight renamed this task from Scraper Spike: Look into how we can retrive the Enterprise dumps to Scraper: Use Enterprise API to retrieve dumps. Oct 16 2025, 9:42 AM
awight removed a project: Spike.
awight updated the task description.

Nothing to review here anymore; only two points are left:

  • Set up our team's private Enterprise credentials on the analytics cluster and document internally.
  • Implement unauthenticated mode once the analytics cluster is allow-listed.

Do we want to create another task for these?