large amount of traffic to the action=parse API from MWOffliner
Open, Needs Triage, Public

Description

According to https://w.wiki/663g (note: sample rate is 1:128), MWOffliner hits us with about 7000 requests per minute to the action=parse API, apparently to retrieve a rendered version of a page for offline use, along with associated metadata. That's roughly 1/4 of the traffic on that API (the total is 31k).

This is not an immediate problem, but seems rather inefficient, both for us and for them. It also causes a lot of writes to the parser cache which may otherwise not be needed (*).

This seems like a good use case for WME (Wikimedia Enterprise), or at least the new page/{title}/html REST API.

(*) needs more investigation; the rate of ParserCache writes caused by api_parse is 18k/minute, that is 1/3 of total cache writes. It's not quite clear what percentage of these are caused by the 7000 requests from MWOffliner. It seems likely that many of them would be cache hits, rather than resulting in a cache miss+write.

Event Timeline

Would https://github.com/openzim/mwoffliner/issues/1664 fix the issue? So far, we are really not familiar with this new API endpoint.

I don't think it would fix the issue. The issue is that you shouldn't hit our API for every page every month. (It's not causing issues, but it is extremely inefficient.)

@Ladsgroup The MWOffliner scraper has already been quite optimised over the years. I have no obvious improvement in mind for the moment, but we will consider any concrete new proposal.

I can think of several (I don't know the details of your system and might have missed something):

  • Use Wikimedia Enterprise (@RBrounley_WMF in this ticket is PM of WME), I think your case can be easily justified for a free tier: https://dumps.wikimedia.org/other/enterprise_html/
  • If not, Incremental dumps, only rebuild for pages that have been edited in the past month and reuse the html from the previous dump. This might exclude template changes in some cases but it's a good compromise.
  • If not, simply avoid scraping pages from commons, they have 100M files

(Do you skip redirects?)

I can think of several (I don't know the details of your system and might have missed something):

This is a topic we had in discussion with Carol earlier this year before she left. Anyway, we are ready to resume the talk on this with whoever is now in charge.

  • If not, Incremental dumps, only rebuild for pages that have been edited in the past month and reuse the html from the previous dump. This might exclude template changes in some cases but it's a good compromise.

One way or the other, you need a cache to store the last version. The current approach is that using the Wikimedia cache was/is the best thing to do. But maybe there is a way to use it in a smarter way?

  • If not, simply avoid scraping pages from commons, they have 100M files

We don't do that (scraping full Commons) AFAIK, and we actually have a dedicated cache for pictures/media already.

(Do you skip redirects?)

The list of redirects (article IDs) is retrieved for each article and the redirects are then included in the snapshots; we don't do an HTML scrape for each redirect.

From a high-level POV, it might be beneficial for both sides to have a tighter collaboration around the API and the way MWOffliner uses it. I'm available for a call any time if needed.

One way or the other, you need a cache to store the last version. The current approach is that using the Wikimedia cache was/is the best thing to do. But maybe there is a way to use it in a smarter way?

Loading content in a way that makes good use of caches is definitely important. However, it seems like you are hitting a lot of cache misses, causing a reparse and a cache write (this isn't quite confirmed - a lot of cache misses are coming from that API module, and you are hitting it a lot). Can you tell me what parameters you are passing to action=parse?

Have you experimented with using Parsoid output instead of the output from the old parser? I am asking because we are working on a REST endpoint for fetching Parsoid HTML (ready for experimentation, but caching is not up for full load yet!). Also, WME would be providing you with HTML coming from Parsoid.

Would https://github.com/openzim/mwoffliner/issues/1664 fix the issue? So far, we are really not familiar with this new API endpoint.

That's an unrelated issue; this ticket is about MWOffliner hitting action=parse.

7000 rpm is about 300 million requests per month. Wikistats says we have about that many content-space pages in total, but that includes Wikidata (100M) and Commons (90M). Without Wikidata items and Commons file pages, and only parsing articles once a month, the request rate should be about a third of what it is.
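The back-of-envelope arithmetic above can be checked directly (assuming a 30-day month):

```python
# Rough check of the request volume cited above: 7000 requests per minute
# over a 30-day month.
requests_per_minute = 7_000
minutes_per_month = 60 * 24 * 30              # 43,200 minutes in a 30-day month
monthly_requests = requests_per_minute * minutes_per_month
print(f"{monthly_requests:,}")                # 302,400,000 -- roughly 300 million
```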

If not, Incremental dumps, only rebuild for pages that have been edited in the past month and reuse the html from the previous dump. This might exclude template changes in some cases but it's a good compromise.

Not really realistic IMO. Templates change, linked Wikidata items change, the rendering software itself changes. Ignoring those changes is not at all a good compromise. You could handle templates and Wikidata by recording page_touched and only reparsing pages when it changes, I think? But you'd still want to reparse all articles once in a while because of changes in MediaWiki or in MWOffliner.
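The page_touched idea above could be sketched as follows. This is a minimal illustration, not MWOffliner code: it assumes the previous dump recorded each page's page_touched timestamp (e.g. obtained via action=query&prop=info), and the helper name is ours.

```python
# Sketch: compare page_touched between runs; only pages whose timestamp
# changed (or that are new) get re-rendered, everything else can reuse
# the HTML from the previous dump.
def pages_to_rerender(current: dict[str, str], previous: dict[str, str]) -> set[str]:
    """Titles whose page_touched differs from the previous run, or that are new."""
    return {title for title, touched in current.items()
            if previous.get(title) != touched}

previous = {"Earth": "20230101000000", "Moon": "20230105000000"}
current = {"Earth": "20230101000000",   # untouched -> reuse old HTML
           "Moon": "20230201000000",    # touched   -> re-render
           "Mars": "20230202000000"}    # new page  -> render
print(sorted(pages_to_rerender(current, previous)))  # ['Mars', 'Moon']
```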

But yeah, using Enterprise HTML dumps would probably be the best approach, from both sides (it would simplify things on the MWOffliner end as well, shorter dump generation times, no need to handle API errors etc).

At MWOffliner we have started to study the problem, because of this ticket but also because of our unstable experience with this endpoint. See https://github.com/openzim/mwoffliner/issues/1730. Hopefully Wikimedia provides other, better endpoints allowing us to retrieve the same information.

I'm in touch with Ryan meanwhile for using Enterprise backend and we will meet in February.

Hello - good news here! @Kelson and I met last week and discussed the opportunity to move some of the MWOffliner systems to Wikimedia Enterprise HTML dumps. At the outset it seems feasible... but we need to discuss and dive deeper before having a clear answer or timeline. The initial two areas of concern are our use of Parsoid web HTML instead of mobile HTML, and information on the namespaces we cover.

Kelson made an issue ticket on Kiwix source code here and I spawned T329779 to track this investigation. Thank you @Kelson, I appreciate your flexibility here and excited to see this work...I'm about to head on holiday and subscribed @HShaikh to oversee this until I'm back.

Since using the Enterprise streams is not a good fit (T329779), Kiwix could also start using the new(ish) REST API for fetching HTML from MW core: https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth/with_html (docs at https://api.wikimedia.org/wiki/API_reference/Core/Pages/Get_HTML).

But before you hit these APIs at full throttle, please talk to us. They have not been exposed to high load yet, and it may turn out that we need to tweak the configuration.
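For illustration, the "with_html" URL pattern above can be built like this. The helper name is ours; the URL shape follows the example linked in the comment.

```python
from urllib.parse import quote

# Build a core REST "with_html" URL following the pattern shown above.
# Titles are percent-encoded so titles with spaces or slashes stay valid.
def rest_html_url(project: str, lang: str, title: str) -> str:
    return (f"https://api.wikimedia.org/core/v1/{project}/{lang}"
            f"/page/{quote(title, safe='')}/with_html")

print(rest_html_url("wikipedia", "en", "Earth"))
# https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth/with_html
```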

Currently Kiwix hits the API twice for every page: once to fetch the list of module dependencies (modules|jsconfigvars|headhtml), and then once to fetch the Parsoid HTML (currently via the MCS mobile-sections API, unfortunately, since it will be deprecated soon). A patchset I have in progress will add prop=text and &parsoid=1 to the parse API query to allow Kiwix to fetch the Parsoid HTML in the same query as the modules/CSS etc. So that will probably keep the parse API hits roughly the same, but will remove the MCS load.
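The combined request described above could look like this. The parameter names come from the comment (prop=modules|jsconfigvars|headhtml|text, parsoid=1); the helper itself is a hypothetical sketch, not the actual patchset.

```python
from urllib.parse import urlencode

# Sketch: one action=parse call fetching module/style metadata *and*
# Parsoid HTML, instead of two separate requests per page.
def combined_parse_url(host: str, page: str) -> str:
    params = {
        "action": "parse",
        "format": "json",
        "prop": "modules|jsconfigvars|headhtml|text",  # metadata + rendered HTML
        "parsoid": 1,                                  # request Parsoid output
        "page": page,
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"

print(combined_parse_url("en.wikipedia.org", "Earth"))
```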

One of the reasons it hits production hundreds of millions of times a month is that it hits the infra three times for each page, because of the different flavors of Kiwix. A simple improvement on both sides would be to produce all three at the same time while hitting the infra once.

On top of that, as I said, diffing between the previous run and the current run would make a big difference (e.g. check if page_touched has changed since the last run, and reuse the value from the previous dump if it hasn't).

This is putting a lot of pressure on the infra and if this continues, I will ratelimit kiwix heavily.

@cscott, could you explain the difference between parsoid=1 and useparsoid=1?
I noticed that you applied useparsoid for Kiwix here, but this parameter doesn't seem to work as expected; I got the warning Unrecognized parameter: useparsoid.
Example of the request: https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=modules|jsconfigvars|headhtml|text&useparsoid=1&page=Barak_Obama

For the record, we are in the process of implementing @cscott's suggestion of one single call to action=parse to retrieve both the Parsoid HTML and the modules.

We are also still discussing with Wikimedia Enterprise, but next steps are still to confirm feasibility (Wikimedia Enterprise is working on product discovery to confirm this)

Shall we close this issue? Time has passed, and the amount of traffic seems to be sustainable for now, given it has been quite a long time.

We are continuing to improve MWOffliner. The technical assessment of WME will be done in September. I do not see anything actionable left in this issue.

@daniel Hi Daniel, I think you are the most entitled to decide, as you opened the issue. Are we now back to normal?

@daniel Hi Daniel, I think you are the most entitled to decide, as you opened the issue. Are we now back to normal?

Looking at the current numbers, things got worse: we are now seeing nearly 10,000 requests per minute (shared between two versions of MWOffliner). See https://w.wiki/FCsA.

We were always at "normal" - there was no sudden increase; this large amount of traffic seems to be "normal". It's just that this is not a good situation. MWOffliner should not be using this endpoint.

Shall we close this issue? Time has passed, and the amount of traffic seems to be sustainable for now, given it has been quite a long time.

It's not causing an immediate problem; it just uses a lot of resources on our side that could be used for other things. But ultimately, it isn't for me to decide whether this is acceptable or not; that's up to SRE (e.g. @Ladsgroup) and Content Transform (e.g. @cscott).

We are continuing to improve MWOffliner. The technical assessment of WME will be done in September. I do not see anything actionable left in this issue.

The most important question is: can you use the REST API instead? If not, what's missing?

We currently don't have an issue, but 1) if we get spikes, it might turn into issues; 2) this is extremely wasteful. Just because something is not bringing down our production doesn't mean it's not costing us.

Looking at the current numbers, things got worse: we are now seeing nearly 10,000 requests per minute (shared between two versions of MWOffliner). See https://w.wiki/FCsA.

I don't have access to this link unfortunately...

The most important question is: can you use the REST API instead? If not, what's missing?

This has been advised by and discussed with @cscott in https://phabricator.wikimedia.org/T388514 ; short answer: it allows retrieving article content and article metadata (list of JS/CSS modules, ...) in one API call to action=parse, instead of one API call to rest.php to get article content plus one API call to action=parse to get article metadata, which seems to be even worse.

We currently don't have an issue, but 1) if we get spikes, it might turn into issues; 2) this is extremely wasteful. Just because something is not bringing down our production doesn't mean it's not costing us.

What cost ranges are we speaking about? I imagine it is impossible to get a precise number, but depending on the order of magnitude ($100/month, $1,000/month, ...), it could help to advocate for a project to significantly revise the scraper behavior (or to decide that the infrastructure cost is well below the engineering effort needed).

This has been advised by and discussed with @cscott in https://phabricator.wikimedia.org/T388514 ; short answer: it allows retrieving article content and article metadata (list of JS/CSS modules, ...) in one API call to action=parse, instead of one API call to rest.php to get article content plus one API call to action=parse to get article metadata, which seems to be even worse.

Hm... I wonder if it would be possible to include this information in the annotated flavor of the HTML we return from the REST API. @cscott, what do you think - is it reasonable to have a version of the HTML output that is essentially everything that's in ParserOutput?

What cost ranges are we speaking about? I imagine it is impossible to get a precise number, but depending on the order of magnitude ($100/month, $1,000/month, ...), it could help to advocate for a project to significantly revise the scraper behavior (or to decide that the infrastructure cost is well below the engineering effort needed).

With the network, the extra overhead of the CPUs being used, the power consumption, the storage, and the depreciation of the parts due to frequent use, I can assure you it's more than $1K a month. Just our ParserCache hosts, the bare devices, cost ~$400K every five years (granted, it's not only for MWOffliner, it's for all of the infra, but I'm just saying this stuff is expensive).
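To put the hardware figure above in monthly terms (ParserCache hosts only, ignoring network, power, and everything else mentioned):

```python
# ~$400K over five years for the ParserCache hosts, per the comment above.
hardware_cost_usd = 400_000
months = 5 * 12                  # five-year depreciation window
per_month = hardware_cost_usd / months
print(round(per_month))          # ~6667 USD/month for those hosts alone
```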