/feed/onthisday/selected latency is very high
Closed, ResolvedPublic

Description

Since the collection and assembly of RESTBase page summaries for feed responses was moved into Wikifeeds in T263133, the average latency for the /feed/onthisday/selected endpoint has been extremely high, between 35 and 40 seconds at the time of writing. It's not clear to me why this endpoint in particular would be slow, although at a glance it does seem to collect a rather large number of summaries.

Screenshot from 2020-09-24 17-16-25.png (1×1 px, 269 KB)

Event Timeline

Even more interesting: when requested with curl, the endpoint responds quickly. However, https://grafana.wikimedia.org/d/000000577/restbase-external-overview?viewPanel=17&orgId=1 indicates that RESTBase observes these latencies as well.

Ok, observing the logs I see quite a lot of timeouts for http://restbase.discovery.wmnet:7231/de.wikipedia.org/v1/page/summary/Datei%3AAdriana_Bisi_Fabbri_%E2%80%93_Aviatore.tiff

Since it's a file page, it's actually stored in Commons, so RESTBase returns a redirect to https://commons.wikimedia.org/api/rest_v1/page/summary/File%3AAdriana_Bisi_Fabbri_%E2%80%93_Aviatore.tiff

That redirect target is an external URI. I would guess that wikifeeds is not allowed to reach the public internet, so the request times out. Since errors in summary fetching are ignored, the timeout still produces a 200 response, but it drives the latency up.
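To illustrate why a single unreachable summary is enough to slow the whole response, here is a minimal sketch in Python (the real services are not written in Python). `fetch_summary`, `fetch_or_none`, `build_feed` and `TIMEOUT_S` are hypothetical names, and the hanging request is simulated with a sleep. The point is only that when summaries are fetched concurrently and errors are swallowed, the response still succeeds, but it takes at least as long as the slowest request's timeout.

```python
from __future__ import annotations

import asyncio
import time

TIMEOUT_S = 0.2  # assumption: stand-in for the real, much longer request timeout


async def fetch_summary(title: str) -> dict:
    """Hypothetical fetcher; file pages simulate a redirect to an unreachable host."""
    if title.startswith("Datei:"):
        await asyncio.sleep(3600)  # never answers before the timeout fires
    await asyncio.sleep(0.01)      # a normal, fast summary lookup
    return {"title": title}


async def fetch_or_none(title: str) -> dict | None:
    """Fetch one summary, swallowing timeouts so the feed still succeeds."""
    try:
        return await asyncio.wait_for(fetch_summary(title), timeout=TIMEOUT_S)
    except asyncio.TimeoutError:
        return None  # the error is ignored, so the overall response is still a 200


async def build_feed(titles: list[str]) -> tuple[list[dict], float]:
    """Fetch all summaries concurrently; return the successes and the elapsed time."""
    start = time.monotonic()
    results = await asyncio.gather(*(fetch_or_none(t) for t in titles))
    elapsed = time.monotonic() - start
    return [r for r in results if r is not None], elapsed


if __name__ == "__main__":
    summaries, elapsed = asyncio.run(
        build_feed(["Berlin", "Datei:Example.tiff", "Hamburg"])
    )
    print(len(summaries), "summaries in", round(elapsed, 2), "s")
```

The two ordinary pages come back almost immediately, yet the elapsed time is dominated by the one request that has to wait out its timeout.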

@Joe is my assumption correct - wikifeeds can't go to the public internet?

There are a few options for fixing it:

  • we can detect that the request is an internal request in RESTBase and return an internal URI.
  • we can manually resolve these redirects in wikifeeds. This is easier and more straightforward. However, fixing it in RESTBase might be more generic, since other services talking to it might have similar issues.
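Either option comes down to mapping the public redirect target back onto the internal RESTBase address. A rough sketch in Python (the services themselves are not Python): `to_internal_uri` is a hypothetical helper, and the path mapping is only inferred from the two URLs quoted above, so treat it as an assumption rather than the actual RESTBase routing logic.

```python
from urllib.parse import urlsplit

# Taken from the internal URL seen in the logs above.
INTERNAL_BASE = "http://restbase.discovery.wmnet:7231"
# Assumption: public REST API paths start with this prefix.
PUBLIC_PREFIX = "/api/rest_v1/"


def to_internal_uri(location: str, internal_base: str = INTERNAL_BASE) -> str:
    """Rewrite a public rest_v1 redirect target to an internal-style URI.

    Locations that do not look like public rest_v1 URLs are returned unchanged.
    """
    parts = urlsplit(location)
    if not parts.hostname or not parts.path.startswith(PUBLIC_PREFIX):
        return location
    # Keep the rest of the path exactly as received (percent-encoding included).
    rest = parts.path[len(PUBLIC_PREFIX):]
    uri = f"{internal_base}/{parts.hostname}/v1/{rest}"
    if parts.query:
        uri += "?" + parts.query
    return uri


if __name__ == "__main__":
    print(to_internal_uri(
        "https://commons.wikimedia.org/api/rest_v1/page/summary/"
        "File%3AAdriana_Bisi_Fabbri_%E2%80%93_Aviatore.tiff"
    ))
```

Done on the wikifeeds side, this would be applied to the `Location` header before following the redirect. Done on the RESTBase side, the same mapping would be applied when building the redirect for a request recognised as internal.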

Please advise which solution you think is better.

This is borderline unbreak-now from the apps team's perspective, as it's breaking a key component of the apps (the explore feed) for German users. The specific endpoint the apps use that is timing out is https://de.wikipedia.org/api/rest_v1/feed/featured/2020/09/24 cc @Charlotte @JMinor

I guess I'll just do this:

we can detect that the request is an internal request in RESTBase and return an internal URI.

Mentioned in SAL (#wikimedia-operations) [2020-09-25T02:14:16Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4eaad8f]: eqiad-only, T263798

Mentioned in SAL (#wikimedia-operations) [2020-09-25T02:20:24Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4eaad8f]: eqiad-only, T263798 (duration: 06m 09s)

Mentioned in SAL (#wikimedia-operations) [2020-09-25T02:20:39Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4eaad8f]: new codfw, T263798

Mentioned in SAL (#wikimedia-operations) [2020-09-25T02:29:44Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4eaad8f]: new codfw, T263798 (duration: 09m 05s)

Pchelolo claimed this task.

The redirect links issue needed to be addressed anyway, but FWIW, I think it was a bug in the service code that was causing a Commons image link to be included in the onthisday selected stories links. I'll file a separate task to follow up on that.