
Metrics api response sometimes returns cached 301 (from kubernetes ??)
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
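Request the pageviews/top endpoint for a recent day, roughly like this (reconstructed from the transcript below, so the exact invocation is approximate):

# curl negotiates HTTP/2 by default over HTTPS when built with HTTP/2 support, matching the transcript below
curl -v 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'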

What happens?:
Currently for ME, this returns:

> GET /api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03 HTTP/2
> Host: wikimedia.org
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 301
< date: Sat, 04 May 2024 20:04:03 GMT
< server: mw-web.eqiad.main-55b8c76fd7-k745s
< location: https://www.wikimedia.org/wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03
< cache-control: max-age=2592000
< expires: Mon, 03 Jun 2024 20:04:03 GMT
< content-length: 311
< content-type: text/html; charset=iso-8859-1
< vary: X-Forwarded-Proto
< age: 2592
< x-cache: cp3068 hit, cp3068 hit/9
< x-cache-status: hit-front
< server-timing: cache;desc="hit-front", host;desc="cp3068"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< set-cookie: WMF-Last-Access=04-May-2024;Path=/;HttpOnly;secure;Expires=Wed, 05 Jun 2024 12:00:00 GMT
< set-cookie: WMF-Last-Access-Global=04-May-2024;Path=/;Domain=.wikimedia.org;HttpOnly;secure;Expires=Wed, 05 Jun 2024 12:00:00 GMT
< x-client-ip: 217.159.212.51
< set-cookie: GeoIP=EE:37:Tallinn:59.44:24.74:v4; Path=/; secure; Domain=.wikimedia.org
< set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600

Note the server header mw-web.eqiad.main-55b8c76fd7-k745s and the cache hit on cp3068.
For @Jdlrobson the same request returns server: "envoy" and succeeds.

What should have happened instead?:
I should have received the results for that day, instead of a 301 redirecting to a destination that is itself broken, and which on top of that has a caching period of 30 days (max-age=2592000).
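The broken destination can be verified quickly with a headers-only request against the URL from the location header above:

# HEAD request against the redirect destination reported in the location header
curl -sI 'https://www.wikimedia.org/wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'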

Suspicion:
This page was first requested on the 4th of May. On the 4th the dataset might not YET have been available?
Possibly that unavailability produced this 301 redirecting to a wikimedia.org 404? And the 301 response then got cached for 30 days (if you happen to hit the same datacenter/edge layer/caching webserver somewhere in between), so you would be unable to get the data for 30 days?
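One way to check whether such a cached 301 is still being served from the edge is to look at just the status line and the cache-related headers (a quick sketch):

# GET the URL, discard the body, keep the response headers, and filter the cache-relevant ones
curl -s -o /dev/null -D - \
  'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03' \
  | grep -iE '^HTTP|^age:|^cache-control|^x-cache|^location'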

Interestingly enough however, requesting a date in the future right now returns a 404 with application/json contents, does NOT redirect, and I do get the response from server: envoy. Possibly something with these 404s in the new k8s hosts infra?

Event Timeline

TheDJ renamed this task from Metrics api response sometimes returns cached error (from kubernetes ??) to Metrics api response sometimes returns cached 301 (from kubernetes ??). May 4 2024, 9:08 PM
TheDJ updated the task description.
TheDJ updated the task description.

@TheDJ is this issue blocking for you? We will investigate next week.

@VirginiaPoundstone it's not blocking me (I just did the investigation); it's blocking Theklan.

@Theklan we will try to investigate this at the end of next week. Please share any relevant details about what this blocks and its urgency. It will help us prioritize appropriately.

I don't know the details of how much this blocks the deployment of the "Explore" section in the parent task. We are planning to deliver the new Grid Main Page in a couple of weeks, when we become the 33rd-largest Wikipedia by article count and the 17th in the List of articles every Wikipedia should have. I guess we can deploy it without the "Explore" section, but this was a highlight of the new design.

VirginiaPoundstone triaged this task as Medium priority.
VirginiaPoundstone edited projects, added AQS2.0; removed Metrics Platform Backlog.
VirginiaPoundstone added a subscriber: WDoranWMF.

@EChukwukere-WMF please try to reproduce this error and see what you find out. Thank you!

@VirginiaPoundstone @TheDJ interesting... I am getting a 200 status code and a proper JSON response body (it's large).

import requests

# Same AQS pageviews/top URL from the task description
prod_url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'

# Explicit accept header plus a browser-like user-agent
header = {"accept": "application/json",
          "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}

response = requests.get(prod_url, headers=header)

print(response.status_code)
print(response.json())

Response

status code: 200

Thanks @EChukwukere-WMF. @Theklan are you still able to reproduce the issue on your side?

> @VirginiaPoundstone @TheDJ interesting... I am getting a 200 status code and a proper JSON response body (it's large).

I note that I was using HTTP/2; I'm not sure if Python uses HTTP/2 by default. Also, it's been almost 30 days, so the chance that the URL survived for 30 days in the cache (across all routes that potentially ingress) is of course pretty low. We already knew from the hackathon that we had two different ingress routes in play.
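To rule the protocol in or out, both can be compared directly; curl's --http1.1 and --http2 flags force the respective versions (a quick sketch):

# Same request over HTTP/1.1 and HTTP/2; keep only status, server and x-cache headers for comparison
for proto in --http1.1 --http2 ; do
  echo "== ${proto} =="
  curl -s "${proto}" -o /dev/null -D - \
    'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03' \
    | grep -iE '^HTTP|^server:|^x-cache'
done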

I very much suspect that this is/was a corner case in the edge proxy or HTTPS layers, either with a specific host or a very specific situation (host restart?). I'm mostly wondering what generated that 301.

It went from /api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03
to /wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03

So what hosts would modify that first part of the URL? What has knowledge of wikimedia.org/v1? That seems like a rewrite to an internal path that accidentally got exposed externally?

I have never seen a 301 when requesting anything from any AQS services. I have no idea why that happened. In my case, all responses include envoy as the server value, even when there was a hit on the cache.
If the data were not available yet we should see a 404 (which could be cached as well). It seems you were able to skip envoy somehow and reached the wrong destination. Is mw-web.eqiad.main-55b8c76fd7-k745s a mediawiki pod?

According to https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Updates_and_backfilling, data is loaded at the end of the specific timespan (the next hour or day, for example). But in any case data could still be unavailable, because it seems loading can take up to 24 hours. In that case, you should receive a 404 response.

There are different cache hosts depending on where we live, right? Maybe it's something we cannot reproduce from where we are (in my case at least). Anyway, the caching period for AQS should be 14400 seconds (4 hours).

I'm sorry. For now I have no idea about this issue or why it could have happened.

@hnowlan Could you take a look at this? Any ideas?

> I have never seen a 301 when requesting anything from any AQS services. I have no idea why that happened. In my case, all responses include envoy as the server value, even when there was a hit on the cache.

That makes sense. Envoy is the building block behind the service mesh/proxy, the api/rest gateway, and the k8s ingress gateway. You should be seeing it almost everywhere in our infra these days.

> If the data were not available yet we should see a 404 (which could be cached as well). It seems you were able to skip envoy somehow and reached the wrong destination. Is mw-web.eqiad.main-55b8c76fd7-k745s a mediawiki pod?

It is. But this probably has nothing to do with envoy itself; it's more likely some routing logic, either in the CDN layer or the api-gateway layer.

> According to https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Updates_and_backfilling, data is loaded at the end of the specific timespan (the next hour or day, for example). But in any case data could still be unavailable, because it seems loading can take up to 24 hours. In that case, you should receive a 404 response.

> There are different cache hosts depending on where we live, right? Maybe it's something we cannot reproduce from where we are (in my case at least). Anyway, the caching period for AQS should be 14400 seconds (4 hours).

The following should allow you to iterate across all CDN PoPs and control for the "where I live" factor.

# Resolve wikimedia.org to each PoP's text-lb address and repeat the same request against every CDN site
for i in eqsin eqiad codfw esams ulsfo drmrs ; do
  curl -v \
    --resolve wikimedia.org:443:$(host text-lb.${i}.wikimedia.org | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}') \
    -X GET 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'
done

When the Brazil PoP is added, add magru after drmrs to include that one too.

I just ran the above and there are no differences between all the PoPs for that URL.

> I'm sorry. For now I have no idea about this issue or why it could have happened.

If we can have a reproduction it would be actionable. Otherwise, I'd suggest resolving this and reopening if something shows up again.

For what it's worth, the CDN's max TTL is 24 hours.

Thank you @akosiaris for looking into this.

@TheDJ and @Theklan I will close this as resolved. If the issue appears again, please let us know.

Yes, this can be marked as resolved. Thanks all.