
Paragraph breaks returned by text API requests have changed
Open, Low, Public

Description

  1. Example: get text summary of a Wikipedia article using this link
  2. This summary of the artist John Constable consists of 2 paragraphs, the second starting, 'His most famous paintings...', but the text returned by the API has no break, not even a space. Instead we get '...feeling\".His most famous paintings...'
  3. This behaviour started about two months ago, I think. Before that, paragraphs were separated by a line feed. Not all paragraph breaks are affected; I suspect it has something to do with the reference [2] that ends the first paragraph. (A sketch of the request is shown after this list.)
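
For reference, here is a minimal sketch of the kind of call the report describes, in Python with the requests library (the parameters mirror the TextExtracts URLs quoted later in this task; the exact output may differ):

```python
import requests

# Fetch the plain-text intro extract for the article via the TextExtracts action API.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "John Constable",
    "exintro": 1,
    "explaintext": 1,
    "format": "json",
    "formatversion": 2,
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
extract = resp.json()["query"]["pages"][0]["extract"]

# Per the report, the two paragraphs arrive joined with no separator, e.g.
# '...feeling".His most famous paintings...' where a line feed used to appear.
print(extract)
```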

Event Timeline

Jdlrobson triaged this task as Low priority. (Edited) Aug 14 2018, 9:31 PM
Jdlrobson subscribed.

@Roblowe1953 is it possible for you to use the summary endpoint for your use case? It has a higher level of support and better performance, so if you can, I'd advise switching.

e.g. https://en.wikipedia.org/api/rest_v1/page/summary/John_Constable
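
For illustration, a minimal Python sketch of a call to that endpoint (the requests library is an assumption; the "extract" field name reflects the current response shape):

```python
import requests

# Fetch the REST summary for one page; the plain-text summary is in "extract".
title = "John_Constable"
resp = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}", timeout=10)
resp.raise_for_status()
print(resp.json().get("extract"))
```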

I've recorded this issue in https://www.mediawiki.org/wiki/Extension:TextExtracts#Caveats for the time being.

Thanks for the report!

Hi, I just ran into the same issue, and it is breaking one of my projects that used to work.

Unfortunately, I am unable to switch to the REST API, as it does not support batched calls the way the TextExtracts API does (e.g. https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts|revisions|info&titles=BBC_News|Washington_Post&exlimit=20&exintro=true&explaintext=true&exsectionformat=plain&redirects)
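
For readability, the same batched call expressed as request parameters (a sketch only; it is simply the URL above, with the titles joined by "|"):

```python
import requests

# One TextExtracts request covering several titles at once.
params = {
    "action": "query",
    "prop": "extracts|revisions|info",
    "titles": "BBC_News|Washington_Post",
    "exlimit": 20,
    "exintro": 1,
    "explaintext": 1,
    "exsectionformat": "plain",
    "redirects": 1,
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
pages = resp.json()["query"]["pages"]  # one entry per requested title
```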

Do you have any timeline on fixing this issue?

We do not plan on fixing this issue, as we don't use TextExtracts in this way for any of our maintained projects and we only have a limited number of engineers. We will, however, happily code-review any patches that are submitted.

In that case, for the REST api, do you have any plans to add batched support for requesting multiple pages at once?

I also see that this call is labeled as being unstable, which makes me nervous. https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title

How often and how wildly will this API be changing?

For now, I suppose I can just split each page into its own request, but I'm worried about the scalability of my project if we do so.

How well supported will the REST api be when we find bugs? This isn't the first time I've reported a bug with the api, and I imagine it won't be the last.

Pchelolo subscribed.

In that case, for the REST api, do you have any plans to add batched support for requesting multiple pages at once?

There was never an intention of supporting batch requests in the REST API, and it's not a bug, it's a feature. In our infrastructure we have a very efficient CDN network with Varnish caches in front of the actual app servers, so a Varnish cache hit is a lot faster than a cache miss. For the REST API, and summary in particular, the difference is less drastic since all the summaries are pre-generated and stored, but it is still high.

The Varnish cache is keyed on the resource URI, so by only supporting fetching a single page we are able to increase the cache hit ratio by an order of magnitude compared to the action API, and we're able to provide much higher max-age values because we can actively purge stale data.

On top of that, with wider adoption of HTTP/2, the socket you're using for all of these requests for individual items will be reused, so in the end by switching to the REST API for this and requesting all the items one-by-one, you will probably see lower backend latencies, especially from clients located geographically far away from our primary data centers.
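
As a concrete illustration, a minimal Python sketch of requesting items one-by-one over a reused connection (the requests library speaks HTTP/1.1 and reuses the socket via keep-alive; an HTTP/2-capable client would multiplex the same socket as described above):

```python
import requests

titles = ["BBC_News", "Washington_Post"]

# A Session keeps the underlying connection alive between the individual calls.
with requests.Session() as session:
    summaries = {}
    for title in titles:
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
        r = session.get(url, timeout=10)
        r.raise_for_status()
        summaries[title] = r.json()
```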

I also see that this call is labeled as being unstable, which makes me nervous. https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title

This is an oversight. Since the PagePreviews feature on all wikis is now powered by this API, we need to change the docs to mark it stable. I will do that shortly.

On top of that, with wider adoption of HTTP/2, the socket you're using for all of these requests for individual items will be reused, so in the end by switching to the REST API for this and requesting all the items one-by-one, you will probably see lower backend latencies,

Although if clients aren't built (or rebuilt) to send all those one-by-one requests in parallel, they'd likely suffer far more from the added round trips than they'd benefit from reduced backend latency on each one.

especially from clients located geographically far away from our primary data centers.

What does geographical distance have to do with backend latency? Greater geographical distance tends to mean more network latency, which is helped by reduced waiting on round trips.

@Anomie don't get me wrong, I'm not saying that batch requests are bad, or that we should abandon them or change how the action API works. I'm trying to explain the reasoning why we think batching in the REST API is not a good idea.

Although if clients aren't built (or rebuilt) to send all those one-by-one requests in parallel, they'd likely suffer far more from the added round trips than they'd benefit from reduced backend latency on each one.

That's for sure true, but the REST API is new, so all the clients using it are either newly built from scratch or rebuilt to use it instead of the action API - the switch doesn't happen automagically with no developer involvement. So why wouldn't the developer of the new app use parallel requests?

What does geographical distance have to do with backend latency? Greater geographical distance tends to mean more network latency, which is helped by reduced waiting on round trips.

Sorry, I phrased that incorrectly. I meant that with a much higher Varnish hit ratio we can use the CDN much more, lowering the network latency. I misused the term "backend latency", using it from the frontend code perspective as "network latency + true backend latency", i.e. the time from sending the request to receiving the response.

In the spirit of not getting people wrong, I'm not trying to convince you to add batching to the REST API. The goal of high cacheability isn't compatible with arbitrary batching, and the pathinfo-as-positional-parameter style used in most places by the current REST API gets awkward if you try to make it possible to request more than one resource per request (e.g. "/page/1|3|17|19|200/data"); it also makes it easy to break the one-URL-per-resource rule that positional parameters often give you by default.

That's for sure true, but the REST API is new, so all the clients using it are either newly built from scratch or rebuilt to use it instead of the action API - the switch doesn't happen automagically with no developer involvement. So why wouldn't the developer of the new app use parallel requests?

In this case, we have a user with an existing action API client who is being pushed to use the REST API because the team that nominally owns TextExtracts doesn't want to continue maintaining the action API module.

For the action API we specifically recommend against parallel requests in mw:API:Etiquette and existing client code is very likely written in a $response = send_http_request( ... ) style rather than using callbacks or some sort of select() loop. A newcomer to the REST API might not know that parallel requests are encouraged there or might not know how to make their HTTP library do it.
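
For what it's worth, a minimal sketch of one way to issue those individual requests in parallel from blocking-style code, using Python's standard thread pool around the requests library (titles and worker count are arbitrary examples):

```python
import concurrent.futures

import requests

def fetch_summary(title):
    # Each worker issues a plain blocking request for a single page.
    resp = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}", timeout=10)
    resp.raise_for_status()
    return title, resp.json()

titles = ["BBC_News", "Washington_Post", "John_Constable"]

# Run the single-page requests concurrently instead of batching them.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    summaries = dict(pool.map(fetch_summary, titles))
```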

I misused the term "backend latency", using it from the frontend code perspective as "network latency + true backend latency", i.e. the time from sending the request to receiving the response.

Ok, thanks for clarifying. (:

For the action API we specifically recommend against parallel requests in mw:API:Etiquette and existing client code is very likely written in a $response = send_http_request( ... ) style rather than using callbacks or some sort of select() loop. A newcomer to the REST API might not know that parallel requests are encouraged there or might not know how to make their HTTP library do it.

That's interesting; I did not know that. We have a small rules section in the REST API docs; it might be a good idea to note there that parallel requests are indeed encouraged instead of batching. We do get requests to add batching from time to time, so I will adapt the docs.

One last question about the new REST API (which may be slightly outside the scope of this task discussion):

  • for 404 errors that result from the page not being found, how stable is the "type" field containing "https://mediawiki.org/wiki/HyperSwitch/errors/not_found", or the title being set to "Not found."? I could just do json['type'] === 'https://mediawiki.org/wiki/HyperSwitch/errors/not_found' or json['title'] === 'Not found.', but those feel like pretty fragile checks if the API ever changes.

i.e.: is there a stable way of checking from the result JSON whether a page doesn't exist, similar to how with TextExtracts you could check whether the "missing" field existed, or is the only way to check the actual status code?

https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts|revisions|info&titles=asdfasdfdsafsadf&exlimit=20&exintro=true&explaintext=true&exsectionformat=plain&redirects

The reason I ask is that I currently only cache the resulting JSON so I can parse the results over X hours without a new API scrape, so I'd like to know whether I need to do additional cache work to store the status code now just to detect missing pages.

The 404 HTTP status code stands for "Not Found". The reason for it might be either a misspelled path or that the page indeed doesn't exist, so I think even relying just on the 404 would be fine. On top of that, these error types are stable and there's no intention to ever change them.
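
A minimal sketch of such a check in Python with the requests library (relying on the 404 alone is enough per the comment above; the "type" comparison is the optional extra):

```python
import requests

url = "https://en.wikipedia.org/api/rest_v1/page/summary/asdfasdfdsafsadf"
resp = requests.get(url, timeout=10)

if resp.status_code == 404:
    body = resp.json()
    # Optional extra check against the documented error type.
    missing = body.get("type") == "https://mediawiki.org/wiki/HyperSwitch/errors/not_found"
    print("page is missing:", missing)
else:
    resp.raise_for_status()
    print(resp.json().get("extract"))
```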

We do not plan on fixing this issue, as we don't use TextExtracts in this way for any of our maintained projects and we only have a limited number of engineers. We will, however, happily code-review any patches that are submitted.

Where could I take a look at the code that generates TextExtracts?

Where could I take a look at the code that generates TextExtracts?

https://gerrit.wikimedia.org/g/mediawiki/extensions/TextExtracts. Or there's a mirror at GitHub if you prefer that site: https://github.com/wikimedia/mediawiki-extensions-TextExtracts

See also https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_become_a_MediaWiki_hacker for general instructions on contributing to MediaWiki code.

Ping @MaxSem as the author. This is probably a quick fix and would be highly appreciated. It's sad to hear that the Wikimedia team lacks the resources to fix such a rather easy but annoying bug.

Is there any other API that could substitute for this for the time being, one that works for whole pages, not only the summary?