
/page/summary/<title> often returns type: 'no-extract'
Closed, Resolved (Public)

Description

We're using the /page/summary/<title> endpoint in the KaiOS-Wikipedia-app to show a preview before navigating to a new article.

We noticed that the endpoint is often returning empty extracts with type: 'no-extract' (examples below). Wondering if we're doing something wrong or if we should be using a different API.

Examples:

Note that some URLs work in the browser but not in the app. Is there anything in the request headers that may affect the response in this way?

Event Timeline

Mholloway subscribed.

I just tried each of the example links in Chromium, and all came back with the expected standard type. But I notice that they're all redirects, which probably isn't a coincidence. How are you requesting these that's resulting in no-extract? Are you using the public REST API endpoints? RESTBase should be taking care of resolving redirects.

Ah, OK, I see that if you are requesting directly from a mobileapps service instance, then the redirects won't be resolved and the type will indeed be no-extract. I recommend using the public REST API summary endpoint.

Hi @Mholloway, thanks for looking into this!

> Ah, OK, I see that if you are requesting directly from a mobileapps service instance, then the redirects won't be resolved and the type will indeed be no-extract. I recommend using the public REST API summary endpoint.

Can you elaborate here?

From the KaiOS app (which is basically a web app), we do fetch('https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous').

Strangely, if you go to https://en.wikipedia.org/wiki/Cat and hover over the word "carnivorous" in the first paragraph (with Page Previews enabled) it also gets a "no-extract" response and the popup shows an error message.
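
A minimal sketch of how a client such as the app might guard against an empty summary before rendering a preview. The function name `previewFromSummary` is hypothetical; the `type`, `extract`, `title`, and `displaytitle` fields are the ones visible in the summary responses discussed in this task.

```javascript
// Hypothetical helper: returns preview data, or null when there is nothing to show.
function previewFromSummary(summary) {
  // Treat a missing body, a 'no-extract' type, or an empty extract as "no preview".
  if (!summary || summary.type === 'no-extract' || !summary.extract) {
    return null;
  }
  return {
    title: summary.displaytitle ?? summary.title,
    text: summary.extract,
  };
}

// Made-up sample objects shaped like summary responses.
const emptySummary = { type: 'no-extract', title: 'Carnivorous', extract: '' };
const usableSummary = { type: 'standard', title: 'Carnivore', extract: 'Sample extract text.' };

const emptyPreview = previewFromSummary(emptySummary);
const usablePreview = previewFromSummary(usableSummary);
```

This only hides the symptom on the client; it does not explain why a redirect title comes back as no-extract.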

Mholloway added a subscriber: Pchelolo.

> Ah, OK, I see that if you are requesting directly from a mobileapps service instance, then the redirects won't be resolved and the type will indeed be no-extract. I recommend using the public REST API summary endpoint.
>
> Can you elaborate here?

Sure. All requests to the Wikimedia REST API are handled by RESTBase. When RESTBase does not have the requested resource in storage, it in turn requests the resource from the relevant backing service (mobileapps, in the case of PCS), and then stores the response before responding with it to the initial request.

When RESTBase receives a PCS request for a given title, it resolves any redirects for the title before passing the request on to mobileapps. So when it receives a request for https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous, the redirect is first resolved (e.g., Carnivorous is resolved to Carnivore), http://mobileapps.svc.discovery.wmnet/en.wikipedia.org/v1/page/summary/Carnivore is requested if necessary, and the standard summary response for Carnivore is returned.
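
The flow described so far can be sketched as a toy in-memory model. This is not RESTBase's actual code: `makeSummaryHandler`, `redirects`, `storage`, and `fetchFromBackingService` are hypothetical stand-ins, and the data is made up.

```javascript
// Toy in-memory model of the described flow; NOT RESTBase's implementation.
function makeSummaryHandler({ redirects, storage, fetchFromBackingService }) {
  return function handleSummaryRequest(title) {
    // 1. Resolve the redirect, if any, before doing anything else.
    const target = redirects.get(title) ?? title;
    // 2. Serve from storage when the resource is already there.
    if (storage.has(target)) return storage.get(target);
    // 3. Otherwise request it from the backing service, store it, then respond.
    const summary = fetchFromBackingService(target);
    storage.set(target, summary);
    return summary;
  };
}

// Usage with made-up data.
const backingServiceCalls = [];
const handleSummaryRequest = makeSummaryHandler({
  redirects: new Map([['Carnivorous', 'Carnivore']]),
  storage: new Map(),
  fetchFromBackingService: (title) => {
    backingServiceCalls.push(title);
    return { type: 'standard', title };
  },
});

const firstResponse = handleSummaryRequest('Carnivorous');  // backing service is asked for "Carnivore"
const secondResponse = handleSummaryRequest('Carnivorous'); // served from storage this time
```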

The exception to this redirect-resolving behavior is if the query parameter ?redirect=false is appended to the initial request, in which case RESTBase does not attempt to resolve the redirect. When I try making such a request (e.g., https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous?redirect=false), I (somewhat surprisingly) receive a response with an empty body. I actually don't know of any way to reproduce the no-extract behavior you're describing when using the public REST API summary endpoint.
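
For comparing the two request forms side by side, a small URL builder may help. This is a sketch: `summaryUrl` and its options are hypothetical names, and only the endpoint path and the `redirect=false` parameter come from this discussion.

```javascript
// Hypothetical helper that builds the public summary URL for a title.
function summaryUrl(title, { host = 'en.wikipedia.org', resolveRedirects = true } = {}) {
  const url = new URL(`https://${host}/api/rest_v1/page/summary/${encodeURIComponent(title)}`);
  // Appending redirect=false asks RESTBase not to resolve the redirect.
  if (!resolveRedirects) url.searchParams.set('redirect', 'false');
  return url.toString();
}

const defaultUrl = summaryUrl('Carnivorous');
const unresolvedUrl = summaryUrl('Carnivorous', { resolveRedirects: false });
```

Either URL can then be passed to `fetch` or `curl` to compare the responses.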

> Strangely, if you go to https://en.wikipedia.org/wiki/Cat and hover over the word "carnivorous" in the first paragraph (with Page Previews enabled) it also gets a "no-extract" response and the popup shows an error message.

I can't reproduce this, either. The popup shows the expected preview content for me.

This is all very strange. Adding RESTBase and @Pchelolo for any idea of why this could be happening.

This happens for me on every browser and device I try, but not for other people. Could the response be wrongly cached near me?

Here are the curl results:

$ curl -vvv "https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous"
*   Trying 208.80.154.224...
* TCP_NODELAY set
* Connected to en.wikipedia.org (208.80.154.224) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
*  start date: Nov  8 10:47:06 2019 GMT
*  expire date: Nov 22 07:59:59 2020 GMT
*  subjectAltName: host "en.wikipedia.org" matched cert's "*.wikipedia.org"
*  issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign ECC OV SSL CA 2018
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7faefa806600)
> GET /api/rest_v1/page/summary/Carnivorous HTTP/2
> Host: en.wikipedia.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200 
< cache-control: s-maxage=1209600, max-age=300
< content-language: en
< content-type: application/json; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/Summary/1.4.1"
< vary: Accept-Encoding
< content-location: https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous
< access-control-allow-origin: *
< access-control-allow-methods: GET,HEAD
< access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
< access-control-expose-headers: etag
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< referrer-policy: origin-when-cross-origin
< x-xss-protection: 1; mode=block
< content-security-policy: default-src 'none'; frame-ancestors 'none'
< x-content-security-policy: default-src 'none'; frame-ancestors 'none'
< x-webkit-csp: default-src 'none'; frame-ancestors 'none'
< x-request-id: 49c4dba2-b02f-47e1-88f9-31e33103210c
< server: restbase1025
< date: Tue, 14 Jan 2020 14:56:41 GMT
< x-envoy-upstream-service-time: 82
< x-ats-timestamp: 1579013948
< x-varnish: 473857095 913214852
< age: 74298
< etag: W/"932739206/80397760-3471-11ea-aa69-9fd8dec7efad"
< x-cache: cp1089 hit, cp1083 hit/41
< x-cache-status: hit-front
< server-timing: cache;desc="hit-front"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< set-cookie: WMF-Last-Access=15-Jan-2020;Path=/;HttpOnly;secure;Expires=Sun, 16 Feb 2020 00:00:00 GMT
< set-cookie: WMF-Last-Access-Global=15-Jan-2020;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Sun, 16 Feb 2020 00:00:00 GMT
< x-analytics: https=1;nocookies=1
< x-client-ip: 69.165.196.71
< set-cookie: GeoIP=CA:QC:Montreal:45.50:-73.58:v4; Path=/; secure; Domain=.wikipedia.org
< accept-ranges: bytes
< content-length: 1365
< 
* Connection #0 to host en.wikipedia.org left intact
{"type":"no-extract","title":"Carnivorous","displaytitle":"Carnivorous","namespace":{"id":0,"text":""},"titles":{"canonical":"Carnivorous","normalized":"Carnivorous","display":"Carnivorous"},"pageid":242468,"lang":"en","dir":"ltr","revision":"932739206","tid":"80397760-3471-11ea-aa69-9fd8dec7efad","timestamp":"2019-12-27T23:06:50Z","content_urls":{"desktop":{"page":"https://en.wikipedia.org/wiki/Carnivorous","revisions":"https://en.wikipedia.org/wiki/Carnivorous?action=history","edit":"https://en.wikipedia.org/wiki/Carnivorous?action=edit","talk":"https://en.wikipedia.org/wiki/Talk:Carnivorous"},"mobile":{"page":"https://en.m.wikipedia.org/wiki/Carnivorous","revisions":"https://en.m.wikipedia.org/wiki/Special:History/Carnivorous","edit":"https://en.m.wikipedia.org/wiki/Carnivorous?action=edit","talk":"https://en.m.wikipedia.org/wiki/Talk:Carnivorous"}},"api_urls":{"summary":"https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous","metadata":"https://en.wikipedia.org/api/rest_v1/page/metadata/Carnivorous","references":"https://en.wikipedia.org/api/rest_v1/page/references/Carnivorous","media":"https://en.wikipedia.org/api/rest_v1/page/media/Carnivorous","edit_html":"https://en.wikipedia.org/api/rest_v1/page/html/Carnivorous","talk_page_html":"https://en.wikipedia.org/api/rest_v1/page/html/Talk:Carnivorous"},"extract":"","extract_html":""}

Weird. When I do the same, I get a 302 with header Location: Carnivore.

This may be related to making the request from a different domain with CORS: the preflight requests, and how a returned redirect is handled.

Yes, it looks related to CORS. Is it possible that:

  1. RESTBase detects the CORS request and, since the redirect is unlikely to work, decides to send a "less complete" response as a 200?
  2. The above response gets cached and is then served for non-CORS requests as well?

Here's something I can reliably reproduce:

  1. Pick a page that I know is a redirect but that I have not queried before
  2. Make a non-CORS request to it -> 302 with Location header 👌
  3. Make a CORS request to it -> 200 with type: 'no-extract'
  4. Make a non-CORS request to it -> 200 with type: 'no-extract' 👎
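
The hypothesis and the four steps above can be illustrated with a toy model. It is not how RESTBase or Varnish are implemented. It assumes, purely for illustration, that the cache key ignores whether the request is CORS and that only 200 responses are cached; `makeEdge` and `toyBackend` are hypothetical names.

```javascript
// Toy cache in front of a backend. Assumption: keyed by title only, caches only 200s.
function makeEdge(backend) {
  const cache = new Map();
  return function request(title, { cors = false } = {}) {
    if (cache.has(title)) return cache.get(title);
    const response = backend(title, cors);
    if (response.status === 200) cache.set(title, response);
    return response;
  };
}

// Hypothetical backend: redirects plain requests, answers CORS requests inline
// without resolving the redirect.
function toyBackend(title, cors) {
  return cors
    ? { status: 200, body: { type: 'no-extract', title } }
    : { status: 302, location: 'Carnivore' };
}

const edge = makeEdge(toyBackend);
const observed = [
  edge('Carnivorous'),                 // step 2: non-CORS, 302
  edge('Carnivorous', { cors: true }), // step 3: CORS, 200 no-extract (now cached)
  edge('Carnivorous'),                 // step 4: non-CORS, served the cached 200
];
```

Under those assumptions, the third request receives the same no-extract body as the CORS request, which matches the observed sequence.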

It seems like RESTBase is not resolving the redirects when the request is CORS.

To reproduce, open a page on a non-Wikimedia domain and run the following in the developer tools console:

await (await fetch('https://en.wikipedia.org/api/rest_v1/page/summary/Carnivorous')).json()

In the network tab you will see a 200, instead of the 302 that you see if you visit the URL directly in the browser's address bar.

@Pchelolo thoughts?

I see a 200 in both Safari and Chrome, but the result is Carnivore, the target of the redirect. I believe the browser just follows the redirect by default and does not show it. Or am I supposed to see something else there?

The result that @SBisson is seeing is coming from Varnish:

age: 74298
etag: W/"932739206/80397760-3471-11ea-aa69-9fd8dec7efad"
x-cache: cp1089 hit, cp1083 hit/41

Neither of us seems to be able to reproduce this because of our geographic location: my requests are routed to a different data center.

The revision ID you're seeing, 932739206, belongs to the Carnivore article. I guess the article was renamed or moved and the Varnish purge got lost.
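
A small sketch for reading the same clues out of a response's headers. The function name `describeCachedResponse` is hypothetical. The parsing assumes the shapes seen in the headers quoted in this task: an `etag` of the form `W/"<revision>/<tid>"` and an `x-cache` value that is a comma-separated list of `<host> <status>` entries.

```javascript
// Hypothetical helper; header names are expected in lowercase, as curl prints them.
function describeCachedResponse(headers) {
  const ageSeconds = Number(headers['age'] ?? 0);
  // The etag carries the revision ID and the tid, separated by a slash.
  const etagMatch = /^(?:W\/)?"(\d+)\/([^"]+)"$/.exec(headers['etag'] ?? '');
  const cacheLayers = (headers['x-cache'] ?? '')
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => {
      const [host, status = ''] = part.split(/\s+/);
      return { host, hit: status.startsWith('hit') };
    });
  return {
    ageSeconds,
    revision: etagMatch ? etagMatch[1] : null,
    tid: etagMatch ? etagMatch[2] : null,
    servedFromCache: cacheLayers.some((layer) => layer.hit),
    cacheLayers,
  };
}

// Header values copied from the curl output in this task.
const cacheInfo = describeCachedResponse({
  age: '74298',
  etag: 'W/"932739206/80397760-3471-11ea-aa69-9fd8dec7efad"',
  'x-cache': 'cp1089 hit, cp1083 hit/41',
});
```

An `age` of 74298 seconds is a little over 20 hours, so the response had been sitting in the edge cache for most of a day.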

Strange: previously, Carnivorous redirected for me with a 302 to Carnivore, and now it returns a 200 with type no-extract.

What about https://en.wikipedia.org/api/rest_v1/page/summary/Domesticated?

SBisson raised the priority of this task from Medium to High. Jul 28 2020, 2:54 PM

This is happening to me again on the Wikipedia-Preview project. The redirects are not being processed from CORS requests.

Are we making the requests incorrectly, or is the problem in RESTBase? Who can help?

https://en.wikipedia.org/api/rest_v1/page/summary/Chinko_Project

@AMooney @Pchelolo from the Prod Infra side, this sounds like it falls on our side. Can we schedule it into Clinic Duty? Seems like a good task for knowledge transfer :) Maybe even for the skill matrix!

Target for this is within: Q1

https://github.com/wikimedia/restbase/pull/1272 might actually help with this. We resolve redirects internally for CORS requests because there was an issue with browsers handling them, but we were doing it wrong, making the Varnish caching layers go a little crazy. Let's see if the issue is resolved once that pull request is merged and deployed before digging further.

Coincidentally, I've been looking at this one today as well. FWIW, the only way I can reproduce the behavior with a local RESTBase is by including a Cache-Control: no-cache header on the incoming summary request, thereby short-circuiting redirect-resolving here. (Why is it skipped in the case of no-cache requests?) Is it possible that Varnish/ATS are injecting a Cache-Control: no-cache header on incoming CORS requests for some reason?
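
A minimal sketch of the request options for that experiment, assuming the summary request is made with `fetch`. The helper name `summaryRequestInit` is hypothetical; the `Cache-Control: no-cache` header is the one mentioned above.

```javascript
// Hypothetical helper: builds fetch options, optionally adding Cache-Control: no-cache.
function summaryRequestInit({ noCache = false } = {}) {
  const headers = { Accept: 'application/json' };
  if (noCache) headers['Cache-Control'] = 'no-cache';
  return { headers };
}

const plainInit = summaryRequestInit();
const noCacheInit = summaryRequestInit({ noCache: true });

// Usage (not executed here; `url` would point at the RESTBase instance under test):
//   const summary = await (await fetch(url, noCacheInit)).json();
//   console.log(summary.type);
```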

A case could be made that mobileapps should respond with 400 Bad Request if the requested title is a redirect.

MSantos assigned this task to Pchelolo.
MSantos subscribed.

I can't reproduce this error and it seems that it was fixed on the RESTBase side. I'll be bold and close this task as resolved, but please reopen it in case I'm missing something.