My team is creating bi-weekly HTML dumps for all of the wikis except Wikidata, as the first exploratory part of an API project aimed at our largest users.
We've seen this ticket come up many times here on Phabricator, so we decided to create a forum for conversation. We've done some technical scoping to figure out the best way to approach this, but we'd be very open to any other approaches, ideas, shortcuts, or projects at WMF we should be aware of. We're currently building this out on AWS initially, so we're using them for serving/compute/storage/etc.
Tagging folks we've spoken with in the past about the scope of this. Feel free to add anyone else you think would be interested.
Our approach (rough code sketches for each step follow the list):
- Pull page titles from https://dumps.wikimedia.org/other/pagetitles/ and record them to our database
- Use the REST API (https://en.wikipedia.org/api/rest_v1/) to pull Parsoid HTML for each title, recording TIDs, and store the files in an S3 bucket
- Compress the HTML files into a single dump file and serve it.
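
To make the first step concrete, here's a minimal sketch of the title ingestion. The date, wiki name, and exact file pattern are placeholders (the pagetitles files are organized under dated directories), we're assuming a `page_title` header line at the top of the file, and SQLite stands in for whatever database we end up using:

```python
import gzip
import sqlite3
import urllib.request

# Placeholder URL -- the real files live in dated directories under
# https://dumps.wikimedia.org/other/pagetitles/ with per-wiki names.
DUMP_URL = ("https://dumps.wikimedia.org/other/pagetitles/"
            "20240101/enwiki-20240101-all-titles-in-ns-0.gz")

def load_titles(db_path: str = "titles.db") -> None:
    """Stream the gzipped title list and record each title in the database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT PRIMARY KEY)")
    with urllib.request.urlopen(DUMP_URL) as resp:
        with gzip.open(resp, mode="rt", encoding="utf-8") as fh:
            next(fh)  # assumption: the first line is a 'page_title' header
            conn.executemany(
                "INSERT OR IGNORE INTO titles VALUES (?)",
                ((line.strip(),) for line in fh),
            )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_titles()
```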
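For the second step, a sketch of the per-page fetch and upload. The REST API reports the render identity in the `ETag` header as `"revision/tid"` (sometimes with a weak-validator `W/` prefix), which is where we pick up the TID; the bucket name, key scheme, and User-Agent below are placeholders:

```python
import re
import boto3
import requests

REST_BASE = "https://en.wikipedia.org/api/rest_v1"
BUCKET = "my-html-dump-bucket"  # placeholder bucket name

s3 = boto3.client("s3")

def fetch_and_store(title: str) -> str:
    """Fetch Parsoid HTML for one title, record its revision/TID, upload to S3."""
    resp = requests.get(
        f"{REST_BASE}/page/html/{requests.utils.quote(title, safe='')}",
        headers={"User-Agent": "html-dump-prototype/0.1 (ops@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    # The ETag carries the render identity as "revision/tid".
    etag = resp.headers.get("ETag", "")
    m = re.match(r'(?:W/)?"(\d+)/([^"]+)"', etag)
    revision, tid = m.groups() if m else ("", "")
    s3.put_object(
        Bucket=BUCKET,
        Key=f"enwiki/{title}.html",  # placeholder key scheme
        Body=resp.content,
        ContentType="text/html; charset=utf-8",
        Metadata={"revision": revision, "tid": tid},
    )
    return tid
```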
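And for the last step, bundling everything into one archive to serve, assuming the per-page HTML files have first been synced from S3 to a local directory (e.g. with `aws s3 sync`); the output name is again a placeholder:

```python
import tarfile
from pathlib import Path

def build_dump(html_dir: str, out_path: str = "enwiki-html.tar.gz") -> None:
    """Bundle every per-page HTML file into one compressed dump archive."""
    with tarfile.open(out_path, "w:gz") as tar:
        for f in sorted(Path(html_dir).glob("*.html")):
            tar.add(f, arcname=f.name)
```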