
large amount of traffic to the action=parse API from MWOffliner
Open, Needs Triage, Public

Description

According to https://w.wiki/663g (note: sample rate is 1:128), MWOffliner hits us with about 7000 requests per minute to the action=parse API, apparently to retrieve a rendered version of a page for offline use, along with associated metadata. That's roughly 1/4 of the traffic on that API (the total is about 31k requests per minute).

This is not an immediate problem, but seems rather inefficient, both for us and for them. It also causes a lot of writes to the parser cache which may otherwise not be needed (*).

This seems like a good use case for WME, or at least the new page/{title}/html REST API.

(*) needs more investigation; the rate of ParserCache writes caused by api_parse is 18k/minute, that is 1/3 of total cache writes. It's not quite clear what percentage of these are caused by the 7000 requests from MWOffliner. It seems likely that many of them would be cache hits, rather than resulting in a cache miss+write.
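For illustration, a minimal sketch (not MWOffliner code; the base URL handling and error handling are assumptions) of what fetching pre-rendered HTML through the page/{title}/html REST endpoint mentioned above could look like:

```typescript
// Minimal sketch: fetch pre-rendered Parsoid HTML via the MediaWiki core REST
// API (/w/rest.php/v1/page/{title}/html) instead of action=parse.
// Illustrative only; not MWOffliner code.

async function fetchPageHtml(wikiBase: string, title: string): Promise<string> {
  const url = `${wikiBase}/w/rest.php/v1/page/${encodeURIComponent(title)}/html`;
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch ${title}: HTTP ${res.status}`);
  }
  // The /html variant returns the rendered HTML directly (not wrapped in JSON).
  return res.text();
}

// Example usage:
// const html = await fetchPageHtml('https://en.wikipedia.org', 'Earth');
```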

Event Timeline

Would https://github.com/openzim/mwoffliner/issues/1664 fix the issue? So far we are really not familiar with this new API endpoint.

I don't think it would fix the issue. The issue is that you shouldn't hit our API for every page every month. (It's not causing problems, but it is extremely inefficient.)

@Ladsgroup The MWoffliner scraper has already been quite optimised over the years. I have no obvious improvement in mind for the moment, but we will consider any concrete new proposal.

I can think of several (I don't know the details of your system and might have missed something):

  • Use Wikimedia Enterprise (@RBrounley_WMF in this ticket is PM of WME); I think your case can easily be justified for a free tier: https://dumps.wikimedia.org/other/enterprise_html/ (see the sketch after this list)
  • If not, incremental dumps: only rebuild pages that have been edited in the past month and reuse the HTML from the previous dump. This might miss template changes in some cases, but it's a good compromise.
  • If not, simply avoid scraping pages from Commons; it has 100M files
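To make the first suggestion concrete, here is a hedged sketch of how a dump download URL could be constructed; the file-naming pattern is an assumption based on the public listing and may differ per run:

```typescript
// Hedged sketch: construct the download URL for an Enterprise HTML dump.
// The naming pattern below is an assumption based on the listing at
// https://dumps.wikimedia.org/other/enterprise_html/ and may differ per run.

function enterpriseHtmlDumpUrl(wiki: string, runDate: string, namespace = 0): string {
  // e.g. wiki = 'enwiki', runDate = '20230201' (hypothetical run date)
  return (
    `https://dumps.wikimedia.org/other/enterprise_html/runs/${runDate}/` +
    `${wiki}-NS${namespace}-${runDate}-ENTERPRISE-HTML.json.tar.gz`
  );
}

// Each dump is a tarball of NDJSON files, one article object per line, already
// containing rendered HTML, so no per-page API calls are needed when building
// a snapshot from it.
```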

(Do you skip redirects?)

I can think of several (I don't know the details of your system and might have missed something):

This is a topic we discussed with Carol earlier this year before she left. Anyway, we are ready to resume the conversation with whoever is now in charge.

  • If not, incremental dumps: only rebuild pages that have been edited in the past month and reuse the HTML from the previous dump. This might miss template changes in some cases, but it's a good compromise.

One way or another, you need a cache to store the last version. The current approach assumes that using the Wikimedia cache was/is the best thing to do. But maybe there is a way to use it more cleverly?

  • If not, simply avoid scraping pages from Commons; it has 100M files

We don't do that (scrape all of Commons) AFAIK, and we already have a dedicated cache for pictures/media.

(Do you skip redirects?)

Lists of redirects (article IDs) are retrieved for each article and stored as redirects in the snapshots, and we don't do an HTML scrape for each redirect.
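For illustration, a hedged sketch of one way such a redirect list can be retrieved with the action API (prop=redirects); the exact query MWOffliner uses is not shown in this thread:

```typescript
// Hedged sketch: list the redirects pointing to an article via the action API
// (action=query&prop=redirects). Illustrative only; continuation handling
// (rdcontinue) for very long lists is omitted.

interface RedirectInfo {
  pageid: number;
  title: string;
}

async function listRedirects(apiBase: string, title: string): Promise<RedirectInfo[]> {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'redirects',
    rdlimit: 'max',
    titles: title,
  });
  const res = await fetch(`${apiBase}?${params}`);
  const data = await res.json();
  const redirects: RedirectInfo[] = [];
  for (const page of data.query?.pages ?? []) {
    for (const rd of page.redirects ?? []) {
      redirects.push({ pageid: rd.pageid, title: rd.title });
    }
  }
  return redirects;
}

// Example usage:
// const rds = await listRedirects('https://en.wikipedia.org/w/api.php', 'Earth');
```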

From a high-level POV it might be beneficial for both sides to have tighter collaboration around the API and the way MWoffliner uses it. I'm available for a call any time if needed.

One way or another, you need a cache to store the last version. The current approach assumes that using the Wikimedia cache was/is the best thing to do. But maybe there is a way to use it more cleverly?

Loading content in a way that makes good use of caches is definitely important. However, it seems like you are hitting a lot of cache misses, causing a reparse and cache write (this isn't quite confirmed: a lot of cache misses are coming from that API module, and you are hitting it a lot). Can you tell me what parameters you are passing to action=parse?

Have you experimented with using Parsoid output instead of the output from the old parser? I am asking because we are working on a REST endpoint for fetching Parsoid HTML (ready for experimentation, but caching is not up for full load yet!). Also, WME would be providing you with HTML coming from Parsoid.

Would https://github.com/openzim/mwoffliner/issues/1664 fix the issue? So far we are really not familiar with this new API endpoint.

That's an unrelated issue; this ticket is about MWOffliner hitting action=parse.

7000 rpm is about 300 million requests per month. Wikistats says we have about that many content-space pages in total, but that includes Wikidata (100M) and Commons (90M). Without Wikidata items and Commons file pages, and only parsing articles once a month, the request rate should be about a third of what it is.

If not, incremental dumps: only rebuild pages that have been edited in the past month and reuse the HTML from the previous dump. This might miss template changes in some cases, but it's a good compromise.

Not really realistic IMO. Templates change, linked Wikidata items change, the rendering software itself changes. Ignoring those changes is not at all a good compromise. You could handle templates and Wikidata by recording page_touched and only reparsing when it changes, I think? But you'd still want to reparse all articles once in a while because of changes in MediaWiki or in MWOffliner.

But yeah, using Enterprise HTML dumps would probably be the best approach for both sides (it would simplify things on the MWOffliner end as well: shorter dump generation times, no need to handle API errors, etc.).

At MWoffliner we have started to study the problem, because of this ticket but also because of our unstable experience with this endpoint. See https://github.com/openzim/mwoffliner/issues/1730. Hopefully Wikimedia provides other, better endpoints allowing us to retrieve the same information.

Meanwhile I'm in touch with Ryan about using the Enterprise backend, and we will meet in February.

Hello - good news here! @Kelson and I met last week and discussed the opportunity to move some of the MWOffliner systems to Wikimedia Enterprise HTML dumps. At first glance it seems feasible, but we need to discuss and dive deeper before having a clear answer or timeline. The initial two areas of concern are our use of Parsoid web HTML instead of mobile HTML, and information on the namespaces we cover.

Kelson made an issue ticket in the Kiwix source code here, and I spawned T329779 to track this investigation. Thank you @Kelson, I appreciate your flexibility here and am excited to see this work. I'm about to head on holiday and have subscribed @HShaikh to oversee this until I'm back.

If using the Enterprise streams is not a good fit (T329779), Kiwix could also start using the new(ish) REST API for fetching HTML from MW core: https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth/with_html (docs at https://api.wikimedia.org/wiki/API_reference/Core/Pages/Get_HTML).

But before you hit these APIs at full throttle, please talk to us. They have not been exposed to high load yet; it may turn out that we need to tweak the configuration.
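For illustration, a minimal sketch of calling the gateway endpoint linked above; the response field names follow the linked docs but should be treated as assumptions to verify, and the User-Agent value is a placeholder:

```typescript
// Minimal sketch: call the Wikimedia API gateway 'with_html' endpoint mentioned
// above. Field names (title, latest.id, html) follow the linked Core REST API
// docs and should be verified before relying on them.

async function fetchWithHtml(project: string, lang: string, title: string) {
  const url =
    `https://api.wikimedia.org/core/v1/${project}/${lang}/page/` +
    `${encodeURIComponent(title)}/with_html`;
  const res = await fetch(url, {
    // The gateway asks for a descriptive User-Agent; adjust to your tool.
    headers: { 'User-Agent': 'example-offliner/0.0 (contact@example.org)' },
  });
  if (!res.ok) {
    throw new Error(`HTTP ${res.status} for ${title}`);
  }
  const page = await res.json();
  return { title: page.title, revision: page.latest?.id, html: page.html };
}

// Usage, matching the example URL above:
// const { html } = await fetchWithHtml('wikipedia', 'en', 'Earth');
```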

Currently Kiwix hits the API twice for every page: once to fetch the list of module dependencies (modules|jsconfigvars|headhtml), and then once to fetch the rendered HTML (currently via the MCS mobile-sections API, unfortunately, since it will be deprecated soon). A patchset I have in progress adds prop=text and &parsoid=1 to the parse API query to allow Kiwix to fetch the Parsoid HTML in the same query as the modules/CSS etc. That will probably keep the number of parse API hits roughly the same, but will remove the MCS load.
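A sketch of what that combined query could look like (illustrative, not the actual patchset; the parameter names are taken from the comment above):

```typescript
// Sketch of the combined query described above: one action=parse call that
// returns the module/CSS metadata and the Parsoid HTML together. The parsoid=1
// parameter is the one the in-progress patchset relies on; older wikis may not
// support it yet.

async function parsePageCombined(apiBase: string, page: string) {
  const params = new URLSearchParams({
    action: 'parse',
    format: 'json',
    formatversion: '2',
    page,
    prop: 'modules|jsconfigvars|headhtml|text',
    parsoid: '1',
  });
  const res = await fetch(`${apiBase}?${params}`);
  const data = await res.json();
  return {
    html: data.parse?.text,           // Parsoid HTML when parsoid=1 is honoured
    modules: data.parse?.modules,     // ResourceLoader module dependencies
    jsconfigvars: data.parse?.jsconfigvars,
    headhtml: data.parse?.headhtml,
  };
}

// Usage:
// const page = await parsePageCombined('https://en.wikipedia.org/w/api.php', 'Earth');
```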

One of the reasons it hits production hundreds of millions of times a month is that it hits the infrastructure three times for each page because of the different flavors of Kiwix. A simple improvement on both sides would be to produce all three flavors at the same time while hitting the infrastructure once.

On top of that, as I said, diffing between the previous run and the current run would make a big difference (e.g. check whether page_touched has changed since the last run and reuse the value from the previous dump if it hasn't).
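A hedged sketch of that diffing idea, using the touched timestamp exposed by action=query&prop=info; the previous-run store shape is hypothetical, since MWOffliner's actual cache layout is not described in this thread:

```typescript
// Hedged sketch: compare the 'touched' timestamp from action=query&prop=info
// against the value recorded during the previous run, and only re-fetch pages
// whose timestamp changed. The TouchedStore type is hypothetical.

type TouchedStore = Map<string, string>; // title -> touched timestamp from last run

async function titlesNeedingRefresh(
  apiBase: string,
  titles: string[],
  previous: TouchedStore,
): Promise<string[]> {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'info',
    titles: titles.join('|'), // keep batches within the API's title limit (50/500)
  });
  const res = await fetch(`${apiBase}?${params}`);
  const data = await res.json();
  const stale: string[] = [];
  for (const page of data.query?.pages ?? []) {
    if (page.missing) continue;
    // 'touched' changes on edits, template/Wikidata updates and cache invalidations.
    if (previous.get(page.title) !== page.touched) {
      stale.push(page.title);
    }
  }
  return stale;
}
```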

This is putting a lot of pressure on the infrastructure, and if this continues, I will rate-limit Kiwix heavily.

@cscott, could you explain the difference between parsoid=1 and useparsoid=1?
I noticed that you applied useparsoid for Kiwix here, but this parameter doesn't seem to work as expected; I got the warning Unrecognized parameter: useparsoid.
Example of the request: https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=modules|jsconfigvars|headhtml|text&useparsoid=1&page=Barak_Obama