
Explore moving MWOffliner/Kiwix from API calls to Wikimedia Enterprise HTML dumps
Open, Needs Triage, Public

Description

Context:
I met with @Kelson from Kiwix in Paris in February 2023, and we discussed moving the MWOffliner/Kiwix systems from hitting the public Wikimedia APIs to Wikimedia Enterprise, which batches Parsoid responses into dumps. We agreed in principle that this was a good idea and something to explore further.

What Wikimedia Enterprise has that is valuable here:
Dumps of all of the "text-based" language projects, available daily, that contain the Parsoid HTML (among other things). We have them publicly available here every two weeks. See the docs for the projects and namespaces covered by our APIs today.
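
To make the discussion concrete, here is a rough sketch of how a consumer such as MWOffliner might read article HTML out of a dump instead of calling the API per page. It is only an illustration under assumptions: it treats a dump as a gzip-compressed tar archive of newline-delimited JSON files, with the title in a `name` field and the Parsoid HTML in `article_body.html`. The layout and field names should be checked against the docs, and `iter_articles` is a hypothetical helper, not part of any existing tool.

```python
import io
import json
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_articles(dump_fileobj: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, html) pairs from a dump.

    Assumption: the dump is a .tar.gz whose members are NDJSON files,
    one JSON object per line, one object per article.
    """
    # Stream mode ("r|gz") reads the archive sequentially, so a large
    # dump is never loaded into memory as a whole.
    with tarfile.open(fileobj=dump_fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                if not raw_line.strip():
                    continue  # skip blank lines
                record = json.loads(raw_line)
                # Assumed field names; verify against the published schema.
                title = record.get("name", "")
                html = (record.get("article_body") or {}).get("html", "")
                yield title, html
```

A caller would open the downloaded file in binary mode and loop over `iter_articles(f)`, handing each HTML string to whatever currently processes the API response.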

Next Steps

  1. I am heading onto leave for the next few weeks and would like to bring in more technical folks from the Wikimedia Enterprise side to help run a true process. I am cc'ing @HShaikh and @Protsack.stephan from the Wikimedia engineering team. They will follow up.
  2. @Kelson, if you could look through our docs (which I think answer the questions in this GitHub issue about the namespaces we offer) and document anything else you might need, that could kick off the conversation.
  3. Use this ticket (or I am happy to do a call, etc.) to communicate what might be missing. I am happy to take a look if there are reasonable things we can add to make this work.

Other notes

Event Timeline

@RBrounley_WMF Thank you for opening this ticket, I will follow up.

@Kelson I am back from vacation and happy to help get things going on this as needed.

@Kelson pinging to see if this is still on the radar.

Yes, this is on the radar, but we are facing quite a few difficulties with the new mobile-html endpoint at the moment, so it is not a top priority on our side. I honestly don't think we will be able to tackle this before 2024.

AFAIK there are challenges with the HTML dump itself anyway, such as:

  • Mobile HTML is not available
  • JS/CSS resources are not available either