Hello!
We have completed a very early version of full HTML dumps and are looking for feedback on their quality to help us plan the right ways to improve them. We have an HTML dump for each text-based wiki project (not Commons or Wikidata). We are building this as the first exploratory part of an API project aimed at our largest users.
I thought the best route here would be to provide access to a public drive folder that folks are free to download from and take a look. The folder currently contains the Simple English wiki (944 MB), which seems to be a good case to look at. Please let me know which other languages you'd like - I can put them in the drive folder (except the English wiki, which is absolutely massive...).
Things I'm curious to learn:
- Is the HTML complete/adequate for use at programmatic scale?
- Is the file structure clear and easy to work with?
- Is there content missing here that you would like to see?
- Is there extraneous content? Is there content that is not useful?
I will add more questions as I think of them. Let me know your thoughts. Below, I'm going to add some feedback I've already received to kick off the conversation - thanks @Isaac
Thanks in advance, everyone,
Ryan