
HTML Dumps - June/2020
Closed, Resolved · Public

Description

My team is creating bi-weekly HTML dumps for all of the wikis except Wikidata, as the first exploratory part of an API project aimed at our largest users.

We've seen tickets about this many times here on Phabricator, so we decided to create a forum for conversation. We've done some technical scoping to figure out the best way to approach this, but would be very open to any other approaches, ideas, shortcuts, or general projects we should be aware of here at WMF. We're initially building this out on AWS, so we're using them for serving/compute/storage/etc.

Tagging folks we've spoken with in the past about the scope of this. Feel free to add anyone else you think would be interested.

Our approach:

  1. Pull titles from https://dumps.wikimedia.org/other/pagetitles/, record to our database
  2. Use the Parsoid API https://en.wikipedia.org/api/rest_v1/ to record TIDs and pull HTML files into an S3 bucket (a rough sketch follows this list)
  3. Compress HTML files to a single dump and serve it.
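
For concreteness, a minimal sketch of step 2, with the caveat that this is not the team's actual code: the ETag header is where RESTBase exposes the revision/TID pair (worth verifying against the live API), and the User-Agent string and title below are placeholders.

```python
import requests

REST_API = "https://en.wikipedia.org/api/rest_v1/page/html/"
HEADERS = {"User-Agent": "html-dumps-prototype/0.1 (contact@example.org)"}  # placeholder contact

def fetch_html(title):
    """Fetch Parsoid HTML for one title; return (etag, html).

    RESTBase exposes the revision/TID pair in the ETag header,
    e.g. "123456789/b8235e20-...". Titles use underscores for spaces.
    """
    resp = requests.get(REST_API + requests.utils.quote(title, safe=""),
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.headers.get("ETag"), resp.text

if __name__ == "__main__":
    etag, html = fetch_html("Albert_Einstein")
    print(etag, len(html))  # next step would be uploading `html` to the S3 bucket
```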

Event Timeline


I'm probably not the first person to think of this, but figuring out how to do all the extra page renders this will need in the "warm" datacenter (usually codfw) seems pretty ideal. I vaguely remember that a big issue the last time this was tried naively was that Parsoid + MediaWiki had a hard time keeping up with the extra traffic needed. Today hardware planning for that might be a bit easier, with Parsoid/PHP having replaced the Node.js sidecar service. But it seems like a nice win to put the unloaded DC to work producing the HTML renders.

The additional datacenter is for availability reasons, not capacity. In case we lose the main DC for any reason, we won't be able to sustain all the load on a single datacenter, and that's infeasible for a number of reasons, including operational ones (we need to completely depool one datacenter from time to time for very different reasons).

So while I agree it's probably a good idea to separate this traffic from user traffic, and one way to do that is to direct it *in normal operating conditions* to the read-only datacenter, we should still have the ability to sustain it from the main DC.

Having said that, not everything is so rosy, as the devil is in the details. How much staleness is acceptable? What do we do with pages that fail to render because they consume too many resources? And the most critical question: how would we do it in a RESTBase-less universe?

In a restbase-less universe, we would have caching at the CDN layer and Parsoid would pre-store its render cache in (some form of) parsercache, I guess. But I don't see RESTBase going away anytime soon anyway.

I have one question: I see you're using AWS to build the project. Do you plan to just run a prototype on AWS, or is it intended to be used in production as well?

If you plan to use it in production, I have a few additional questions :)

I'm not sure I understand why additional HTML dumps are necessary; as @ArielGlenn has written, we already do all of this on a monthly basis:

Our approach:

Another idea. Analytics Engineering generates other kinds of dumps in Hadoop. You could too!

  • Use mediawiki_history* in Hadoop to get all page titles (and anything else you might need); a rough sketch follows this list.
  • Use the Parsoid API https://en.wikipedia.org/api/rest_v1/ to record TIDs and pull HTML files into HDFS and/or a Hive table.
  • Compress the HTML files, export them to dumps.wikimedia.org, and serve them.
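
As an illustration of the first bullet, a minimal PySpark sketch: the table name wmf.mediawiki_history is the one published by Analytics, but the column names, snapshot value, and output path below are assumptions to check against the actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("html-dump-titles")
         .enableHiveSupport()
         .getOrCreate())

titles = (
    spark.table("wmf.mediawiki_history")              # table published by Analytics
    .where(F.col("snapshot") == "2020-05")            # assumed monthly snapshot partition
    .where(F.col("wiki_db") == "enwiki")
    .where(F.col("event_entity") == "page")           # assumed column names; check the schema
    .where(F.col("page_namespace") == 0)
    .select("page_id", "page_title")
    .dropDuplicates(["page_id"])
)

titles.write.mode("overwrite").parquet("/user/example/html_dump/titles")  # placeholder path
```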

Not that you should do it this way; just an idea, since we already have clustered compute resources for doing this type of stuff at WMF.

(The tricky part here is the need to use the Parsoid API to get the content you want to dump. If we could somehow run the parser in Hadoop... :D But I think this has been tried?)

I know there have been many WMFers (@leila?) that have wanted to have article HTML content in Hadoop. Having it in HDFS first would allow it to be more easily used by internal WMF researchers and analysts.

Having all the content in HDFS would be hugely useful, and doing it right would imply solving _so_ many problems. We should definitely talk about this.

Would the idea be to also pull the html out of hdfs to make it available to dump downloaders, @Ottomata ?

Would the idea be to also pull the html out of hdfs to make it available to dump downloaders

This already happens for the many pageview and edit dumps we release, on an hourly cadence for some files and monthly for others.

I have a number of questions folks may want to think about, after having read the document on officewiki.

Can this document be shared publicly? It's hard to ask questions without understanding the use cases/motivations behind this project. My first thought is why reinvent what Kiwix already does, given that's what the Movement is investing in, but maybe that's already been discussed.

Having all the content in HDFS would be hugely useful, and doing it right would imply solving _so_ many problems. We should definitely talk about this.

We already render that content for every edit in restbase, via changeprop. We'd just need to take the result of the re-parse and somehow save it in hdfs when it happens. It doesn't seem like an overly complex enterprise at first sight, but I don't think changeprop is able to do that as-is.

Pull titles from https://dumps.wikimedia.org/other/pagetitles/, record to our database
Use Parsoid API https://en.wikipedia.org/api/rest_v1/ to record TIDs and pull HTML files to an S3 bucket
Compress HTML files to a single dump and serve it.

The plan outlined on the ticket seems to be one produced by an external party; not sure if that is the case. If there is the ability to access our internal computation infrastructure, this task can be done much more efficiently (as others have pointed out).

Pull titles from https://dumps.wikimedia.org/other/pagetitles/, record to our database

Not sure if it is clear from T254275#6196493 above, but such a database of page titles (for content pages, not files on Commons) and revisions already exists historically and is used daily. It is updated monthly on Hadoop, and is also available publicly in the form of a JSON dump: https://dumps.wikimedia.org/other/mediawiki_history/readme.html

I can only emphasize that a ticket which does not transparently explain the problem it is trying to solve will be successful only by chance. Therefore, this is probably my last comment on this, as we are running this discussion blind. One thing I heard is that the dumps might have to include the Parsoid semantic tags, which is not the case for the dumps produced by MWoffliner (MWoffliner removes them). If this is the case, a POC to avoid removing them could be done within a few hours; we could even just store the raw HTML from the API JSON.

a ticket which does not transparently explain the problem

+1, could a problem statement be added?

having read the document on officewiki

Would it be possible to move this to a public place and link it?

Having it in HDFS first would allow it to be more easily used by internal WMF researchers and analysts.

Speaking personally but from the Research team, I also +1 this many many times over. There is so much text-based machine learning and analytics that would be many times easier / faster if we could access HTML in HDFS (because then we can take advantage of the SWAP system). Some recent research also recreated the full parsed HTML revision history for English Wikipedia and noted for example that over half of internal article links are only evident from the parsed article and not the raw wikitext. A few current examples of modeling etc. that would benefit that I know of:

  • Parsed versions of Wikipedia articles have way more links / content in them, which can be valuable for ML models like topic classification or quality prediction
  • Studying how much and what content is transcluded (has implications for patrolling etc.): T249654
  • Measuring the consistency of content in different language versions of the same article: T243256
  • Studying citation quality / usage, especially if templates like en:Cite Q see expanded usage in the wikis
  • For link recommendation -- i.e. suggesting to a user that they should insert a wikilink into an article -- you might want to verify that the link does not already exist in the article, which would be best done against the parsed version of the article

Thanks for the questions and queries, much appreciated; the team will try to give good responses ASAP. Thanks all for your patience.

Copying the email I sent to wikimedia-l. As a volunteer I have lots of questions (sorry):

  • What is the "paid API" project? Are we planning to make money from our API? Are we now selling our dumps?
  • If so, why is this not communicated before? Why are we kept in the dark?
  • This seems like a big change in our system. Does the board know and approve it?
  • How is this going to align with our core values like openness and transparency?
  • The ticket implicitly says these are going to be stored on AWS ("S3 bucket"). Is this thought through? Especially the ethical problems of feeding Jeff Bezos' empire (see Hasan Minhaj's episode on the ethical issues of using AWS [2])? Why can't we do/host this on Wikimedia infrastructure? Has this been evaluated?
  • Why is the community not consulted about this?

+1 to Joe's and Kelson's comments. Definitely start with the existing and excellent Kiwix pipeline where possible.

OpenStack is a solid alternative to AWS; it can be run within our own cloud, and until then hosted instances are available.

In T254275#6222327, @Sj wrote:

OpenStack is a solid alternative to AWS; it can be run within our own cloud, and until then hosted instances are available.

You know Wikimedia Cloud is openstack, right? :)

WMF is actually planning to charge for access? So what percentage of this do the creators of the content get? Otherwise it's stealing, cashing in on content that the WMF has not created.

I've just posted some context for this here: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-June/095008.html

More information and documentation will be coming soon, apologies this hasn't come sooner.

WMF is actually planning to charge for access? So what percentage of this do the creators of the content get? Otherwise it's stealing, cashing in on content that the WMF has not created.

My understanding is that we are talking about extremely high volume access here. Certainly the dumps are and will remain free, access to the api for editors, researchers, community members and anybody else is and will remain free, etc.

My understanding is that we are talking about extremely high volume access here

That does not seem to be the case, since the paid API is restricted to a few clients, and once an HTML dump is downloaded there is no need to re-fetch. High volume access would happen in a situation where there might be a lot of distinct clients (edit: or also, loads of fresh data available).

FYI: because it seems there is a knowledge/communication gap about the openZIM/Kiwix dumping solution, a tech talk is currently being planned (probably in August): https://phabricator.wikimedia.org/T255392. If you have questions/concerns/remarks, please comment on that ticket. I will make sure the presentation addresses them.

Hey all -

Hugely appreciate the interest, and thank you for providing some of the context for this project, @Seddon. There are a lot of questions to work through and I’ve been speaking with several of you on IRC, so I’ll work through answering and documenting them over the next couple of days. Please just keep in mind that we are doing a lot of discovery here around technical scoping. And bear with me - Phabricator is really amazing, but I’m just getting the hang of it.

I have a number of questions folks may want to think about, after having read the document on officewiki.

Which things do you want to dump? Just articles? Portal pages too? Anything else? What about Wikidata; do you want to dump those pages as HTML as well, given the nature of Wikidata entities? What about Commons? Do you want to dump the File: pages as HTML, since those are the ones of interest? What about media included on these pages?

For now, just text-based wiki projects; Wikidata/Commons may come at a later point in this project, but I don’t know yet. Things are very nascent right now given how early we are into this, hence the lack of details here.

How many queries do you expect to direct to Parsoid simultaneously? Do we know that RESTBase can support this, or that whatever will replace it, can? What do people think about the latency for requests coming from outside of our network (i.e. going to AWS)? Is there any discussion with the RESTBase maintainers or folks thinking about the new backend, regarding bulk requests for rendered HTML?

We’ve done some talking with the RESTBase engineers, and since most pages are cached already, it seems like the infrastructure should hold. We do know that some technology organizations use this endpoint as well, so we’ll eventually be able to divert some of that traffic onto our new system. The emphasis of this build, as it will be used at scale, is to keep our systems running smoothly so that we can make the most of the endpoint without pushing it over. I talked with @Pchelolo about the new backend and how it would affect things for us. We are currently getting 429 errors on the endpoints below when we surpass certain query limits; Petr didn’t think it was necessarily a latency issue on RESTBase but rather something at the CDN layer - perhaps @BBlack or @ema know more about this and how we might be able to work with it? I'll create a subticket for this as well, or so I'm told that is the correct thing to do here for this conversation :-P

https://en.wikipedia.org/api/rest_v1/page/title/ and https://en.wikipedia.org/api/rest_v1/page/html/

What do we do about pages that aren't available in RESTBase because they took too long to parse?

This is a good point and a known edge case in our current approach. We need to do more digging, but our understanding right now is that when an endpoint doesn’t have a result, it re-triggers the Parsoid conversion. Is this correct?

From where do we plan to serve these dumps?

Do we expect to use AWS on a permanent basis? Obviously this can't be answered here but are there discussions about budgeting for this?

I don’t actually know where the long-term plans sit here from a budget perspective. Corey and Grant would be the best ones to speak to this, and I know that budgeting for next year is a current focus for leadership. But that budgeting is probably out of scope for this Phab task. @Joe touches on your question as well.

What format will these dumps be in? HTML for each page, sure, but how will one page be separated from the next? What compression will be used? Will large files be split up into smaller ones for ease of download, or do we expect downloaders to want only a single file, or will we provide both?

When I get the engineers in here, I’ll have them answer these specific technical questions. Currently we’re just leveraging S3’s download capabilities and have yet to hit any bumps around broken downloads. We’re compressing the HTML and the dump file currently, but are still figuring out the details. Sorry I can’t be more helpful here at this moment.
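
For the format question above, one option (a sketch only, not necessarily what will ship): gzip-compressed NDJSON, where each line is a self-contained JSON object holding one page, so consumers can stream the dump and split it at line boundaries.

```python
import gzip
import json

def write_dump(pages, out_path):
    """Write an iterable of (title, tid, html) tuples as gzip-compressed NDJSON.

    One JSON object per line keeps pages separable without a custom parser,
    and the file can be split or streamed line by line.
    """
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for title, tid, html in pages:
            record = {"title": title, "tid": tid, "html": html}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage with placeholder data:
write_dump([("Example", "00000000-0000-0000-0000-000000000000", "<html>...</html>")],
           "enwiki-articles-html.ndjson.gz")
```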

How can old dumps be reused? (I know you're thinking about this, just listing it here for completeness.)

We’re using a lot of the dumps around titles, but decided against the approach of converting the full dumps into HTML; that was quickly ruled out as a very time-expensive option that would also stress Parsoid itself.

As for the RESTBase exporting perspective - theoretically, all HTML for all the pages that managed to render should be stored in the RESTBase Cassandra backend. We already have scripts that concurrently request pages from RESTBase via the HTTP API. Given that Cassandra is replicated multi-DC, we can expect codfw to have more-or-less the same state as eqiad, so we can use the RESTBase nodes in codfw for doing the dumps. The additional read load shouldn't be a problem - this is pretty much a round trip into Cassandra, so it's pretty quick. Reading directly from Cassandra would be discouraged - we don't really issue any DELETEs there, so some renders of restricted/deleted revisions might exist there; access checks are applied by RESTBase, and we absolutely don't want to leak deleted and especially suppressed revisions - that would be a major security flaw.
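
For illustration only (the existing scripts mentioned above are not shown here), a minimal sketch of bounded-concurrency fetching through the public RESTBase HTTP API; the concurrency cap is an assumption that would need to be agreed with the service owners.

```python
import asyncio
import aiohttp

REST_API = "https://en.wikipedia.org/api/rest_v1/page/html/"
CONCURRENCY = 10  # assumed cap; agree on a real value with the service owners

async def fetch_all(titles):
    """Fetch HTML for many titles with at most CONCURRENCY requests in flight."""
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession(
            headers={"User-Agent": "html-dumps-prototype/0.1"}) as session:

        async def fetch_one(title):
            async with sem:  # bound the number of concurrent requests
                async with session.get(REST_API + title,
                                       timeout=aiohttp.ClientTimeout(total=60)) as resp:
                    resp.raise_for_status()
                    return title, await resp.text()

        return await asyncio.gather(*(fetch_one(t) for t in titles))

# Titles use underscores; percent-encode anything with slashes or non-ASCII characters.
results = asyncio.run(fetch_all(["Albert_Einstein", "Ada_Lovelace"]))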

YES! Definitely not hitting Cassandra directly, for exactly these reasons. Using the high-traffic, cached public APIs seems like a good first start with the least amount of impact.

Hey @RBrounley_WMF, just to clarify: when I was talking about dump reuse, I meant your dumps - using the previous run so you don't request all the revs from RESTBase every time. It's something we'd talked about earlier, which is why I know your folks are thinking about it.

Thanks for the answers, looking forward to more info!

@ArielGlenn - oh great, yeah, I misunderstood that. So the first run is obviously expensive on RESTBase, since it grabs all of the pages, but we're thinking about listening to Kafka through this endpoint below or something similar, then just updating the dump via an upsert-type approach, using RESTBase only for the changes... @Ottomata, @Milimetric - want to make sure I have this right from our call. For now, we're running these bi-weekly and still designing the second dumps out haha.

https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_recentchange
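
A minimal sketch of what listening to that endpoint could look like from the outside, using the public EventStreams SSE stream behind the docs URL above. Field names follow the recentchange schema as I understand it and should be verified; a real consumer should also handle multi-line SSE data fields and reconnect with Last-Event-ID, which this sketch skips.

```python
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def follow_recent_changes(wiki="enwiki"):
    """Yield (title, new_revision_id) for edits on one wiki from the public SSE stream."""
    with requests.get(STREAM_URL, stream=True,
                      headers={"User-Agent": "html-dumps-prototype/0.1"}) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            # SSE events arrive as "data: {...}" lines; this skips comments and keep-alives.
            if line and line.startswith("data: "):
                event = json.loads(line[len("data: "):])
                if event.get("wiki") == wiki and event.get("type") == "edit":
                    yield event["title"], event["revision"]["new"]

for title, rev_id in follow_recent_changes():
    print(title, rev_id)  # in the real pipeline: queue this title for a re-fetch/upsert
    break
```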

We’ve done some talking with RESTBase engineers and since most pages everything is cached already, it seems like the infrastructure should hold. We do know that some technology organizations use this endpoint as well so we’ll actually be able to divert some of the traffic onto our new system eventually. The emphasis of this build, as it will be used at scale, is to keep our systems sliding so that we can maximize the endpoint without pushing it over. Talked with @Pchelolo about the new backend and how it would affect things for us. We are currently getting ‘429 errors’ on the below endpoints when we surpass certain query limits. Petr didn’t think it was necessarily a latency issue on RESTBase and a CDN error - perhaps @BBlack or @ema know more about this and how we might be able to work with it? I'll create a subticket for this as well, or so I'm told that is the correct thing to do here for this conversation :-P

Yes, when handling requests coming from AWS and other public clouds, our edge caches deliberately do per-IP rate limiting of anonymous requests whose responses aren't already cached, in an attempt to shield applications like RESTBase from being overwhelmed by a single user. You can see the code that does so here, although it's in a specialized language called VCL.
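
On the client side, a minimal sketch of backing off when the edge returns 429. Whether a Retry-After header is actually present on those responses is an assumption to verify; this is a workaround, not a substitute for agreeing on limits with the Traffic team.

```python
import time
import requests

def get_with_backoff(url, session=None, max_tries=5):
    """GET a URL, backing off when the edge caches answer 429.

    Honors the Retry-After header if present (assumed, not verified),
    otherwise doubles the wait on each attempt.
    """
    session = session or requests.Session()
    wait = 1.0
    for _ in range(max_tries):
        resp = session.get(url, headers={"User-Agent": "html-dumps-prototype/0.1"}, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else wait)
        wait *= 2
    raise RuntimeError(f"still rate limited after {max_tries} attempts: {url}")
```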

we're thinking about listening to Kafka through this endpoint below or something similar

The EventStreams external endpoint will work, but I don't think it should really be used to build reliable and scalable production upserts. If you build the HTML dump generation pipeline internally, you can subscribe to Kafka directly and use some stream processing system (Spark or Flink if you run in Hadoop, or Flink if you want to run in production kubernetes). The search team is building their new WDQS updater on Flink right now (T244590), and will eventually run it in Kubernetes, so you wouldn't have to set all that up from scratch when the time comes. A stream processor would let you do nice things like automatic windowing to wait for ORES scoring or perhaps quick content suppressions before deciding to render a revision.

Or, you could do what change-prop does and consume directly from Kafka. That would be a home-grown stream processor though, so it wouldn't be my recommendation since you are starting from scratch.
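
As a sketch of the Spark Structured Streaming option mentioned above: the broker address, topic name, and output paths are placeholders, and a real job would also parse the event JSON and use event-time windowing rather than a simple processing-time trigger.

```python
# Requires the spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("html-dump-updater").getOrCreate()

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-host.example.internal:9092")  # placeholder broker
    .option("subscribe", "eqiad.mediawiki.revision-create")                 # assumed topic name
    .load()
    .select(F.col("value").cast("string").alias("event_json"))
)

# Accumulate events and flush every 10 minutes, so quick suppressions or ORES
# scores have a chance to arrive before a revision is re-rendered. A real job
# would use event-time windows and watermarks instead of this simple trigger.
query = (
    changes.writeStream
    .trigger(processingTime="10 minutes")
    .format("parquet")
    .option("path", "/user/example/html_dump/pending")              # placeholder path
    .option("checkpointLocation", "/user/example/html_dump/ckpt")   # placeholder path
    .start()
)
query.awaitTermination()
```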

The EventStreams external endpoint will work, but I don't think it should really be used to build reliable and scalable production upserts

This is an important point: EventStreams is by no means guaranteed to have tier-1 availability, nor is it guaranteed that you will be able to get a working connection (connections are throttled/limited).

@RBrounley_WMF can you comment on @Kelson's questions and comments above? How does this workflow interact w/ existing uses of mwoffliner? Are you building scripts that could extend that toolchain? Thanks!

Yep, sorry about the delay here @Sj. @Kelson - this is really interesting, and I’d love to learn more about your work, how we might best collaborate with each other, and how to fill some of the technical gaps. I'll ping you off-Phab with some questions once I've done some more reading, and if you're available earlier than your (great-sounding) tech talk I'd love to have a quick video chat with you. And thank you for your patience whilst I'm digging into the many years of history here!

Would dedup be feasible at all, with a 4k block size, for reducing the storage size of .html files kept for archival purposes across differently dated dumps on the same pool?

Would dedup be feasible at all, with a 4k block size, for reducing the storage size of .html files kept for archival purposes across differently dated dumps on the same pool?

This question was originally raised on the xmldatadumps-l list and you can get a better understanding of the question by reading the thread there: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2020-July/001560.html Originally the discussion was about the sql/xml dumps available for download, but since the requester wants to convert the output to HTML, it seemed like he might want to raise the question here.

Note that the code is now available online at https://github.com/wikimedia/OKAPI

I can't find any license statement - can that be added please?

Note that the code is now available online at https://github.com/wikimedia/OKAPI

I can't find any license statement - can that be added please?

@RBrounley_WMF As an FYI, this one is a pretty high priority issue. Was a software license specified in the contract with the vendor?

Hey all - I'm starting to post our sprint overviews here to improve Okapi's dialogue on Phabricator. I will add tickets to the Okapi board; feel free to subscribe. The first one is at T262476.

Also, we are doing deployments at the end of each sprint in our github repo.

Is there a rough estimate of when these HTML dumps will be available? Is it dependent on when https://enterprise.wikimedia.com/ goes live?

Is there a rough estimate of when these HTML dumps will be available? Is it dependent on when https://enterprise.wikimedia.com/ goes live?

See T273585 for that. Well, I suppose that's for availability to the broader public, but as soon as they have URLs at all, the info will show up there.