
Metrics api response sometimes returns cached 301 (from kubernetes ??)
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
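Request the pageviews/top endpoint for a recent day, roughly like this (reconstructed from the transcript below, so the exact invocation is approximate):

# curl negotiates HTTP/2 by default over HTTPS when built with HTTP/2 support, matching the transcript below
curl -v 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'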

What happens?:
Currently for ME, this returns:

> GET /api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03 HTTP/2
> Host: wikimedia.org
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 301
< date: Sat, 04 May 2024 20:04:03 GMT
< server: mw-web.eqiad.main-55b8c76fd7-k745s
< location: https://www.wikimedia.org/wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03
< cache-control: max-age=2592000
< expires: Mon, 03 Jun 2024 20:04:03 GMT
< content-length: 311
< content-type: text/html; charset=iso-8859-1
< vary: X-Forwarded-Proto
< age: 2592
< x-cache: cp3068 hit, cp3068 hit/9
< x-cache-status: hit-front
< server-timing: cache;desc="hit-front", host;desc="cp3068"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< set-cookie: WMF-Last-Access=04-May-2024;Path=/;HttpOnly;secure;Expires=Wed, 05 Jun 2024 12:00:00 GMT
< set-cookie: WMF-Last-Access-Global=04-May-2024;Path=/;Domain=.wikimedia.org;HttpOnly;secure;Expires=Wed, 05 Jun 2024 12:00:00 GMT
< x-client-ip: 217.159.212.51
< set-cookie: GeoIP=EE:37:Tallinn:59.44:24.74:v4; Path=/; secure; Domain=.wikimedia.org
< set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600

Note the server header mw-web.eqiad.main-55b8c76fd7-k745s and the cache hit on cp3068.
For @Jdlrobson the same request returns server: "envoy" and succeeds.

What should have happened instead?:
I should have received the results for that day, instead of a 301 redirecting to a destination that is itself broken, and which on top of that has a caching period of 30 days (max-age=2592000).
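The broken destination can be verified quickly with a headers-only request against the URL from the location header above:

# HEAD request against the redirect destination reported in the location header
curl -sI 'https://www.wikimedia.org/wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'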

Suspicion:
This page was first requested on the 4th of May. On the 4th the dataset might not YET have been available?
Possibly that unavailability produced this 301 redirecting to a wikimedia.org 404? And the 301 response then got cached for 30 days (if you happen to hit the same datacenter/edge layer/caching webserver somewhere in between), so you would be unable to get the data for 30 days?
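One way to check whether such a cached 301 is still being served from the edge is to look at just the status line and the cache-related headers (a quick sketch):

# GET the URL, discard the body, keep the response headers, and filter the cache-relevant ones
curl -s -o /dev/null -D - \
  'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03' \
  | grep -iE '^HTTP|^age:|^cache-control|^x-cache|^location'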

Interestingly enough however, requesting a date in the future right now returns a 404 with application/json contents, does NOT redirect, and I do get the response from server: envoy. Possibly something with these 404s in the new k8s hosts infra?

Event Timeline

TheDJ renamed this task from Metrics api response sometimes returns cached error (from kubernetes ??) to Metrics api response sometimes returns cached 301 (from kubernetes ??). May 4 2024, 9:08 PM
TheDJ updated the task description.
TheDJ updated the task description.

@TheDJ is this issue blocking for you? We will investigate next week.

@VirginiaPoundstone it's not blocking me (I just did the investigation); it's blocking Theklan.

@Theklan we will try to investigate this at the end of next week. Please share any relevant details about what this blocks and its urgency. It will help us prioritize appropriately.

I don't know the details of how much this blocks the deployment of the "Explore" section in the parent task. We are planning to deliver the new Grid Main Page in a couple of weeks, when we become the 33rd-largest Wikipedia by article count and the 17th in the List of articles every Wikipedia should have. I guess we can deploy it without the "Explore" section, but this was a highlight of the new design.

VirginiaPoundstone triaged this task as Medium priority.
VirginiaPoundstone edited projects, added AQS2.0; removed Metrics Platform Backlog.
VirginiaPoundstone added a subscriber: WDoranWMF.

@EChukwukere-WMF please try to reproduce this error and see what you find out. Thank you!

@VirginiaPoundstone @TheDJ interesting... I am getting a 200 status code and a proper JSON response body (it's large).

import requests

# Same AQS pageviews/top URL from the task description
prod_url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'

# Explicit accept header plus a browser-like user-agent
header = {"accept": "application/json",
          "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}

response = requests.get(prod_url, headers=header)

print(response.status_code)
print(response.json())

Response

status code: 200

Thanks @EChukwukere-WMF. @Theklan are you still able to reproduce the issue on your side?

> @VirginiaPoundstone @TheDJ interesting... I am getting a 200 status code and a proper JSON response body (it's large).

I note that I was using HTTP/2; I'm not sure if Python uses HTTP/2 by default. Also, it's been almost 30 days, so the chance that the URL survived for 30 days in the cache (across all routes that potentially ingress) is of course pretty low. We already knew from the hackathon that we had two different ingress routes in play.
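To rule the protocol in or out, both can be compared directly; curl's --http1.1 and --http2 flags force the respective versions (a quick sketch):

# Same request over HTTP/1.1 and HTTP/2; keep only status, server and x-cache headers for comparison
for proto in --http1.1 --http2 ; do
  echo "== ${proto} =="
  curl -s "${proto}" -o /dev/null -D - \
    'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03' \
    | grep -iE '^HTTP|^server:|^x-cache'
done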

I very much suspect that this is/was a corner case in the edge proxy or HTTPS layers, either with a specific host or a very specific situation (host restart?). I'm mostly wondering what generated that 301.

It went from /api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03
to /wikimedia.org/v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03

So what hosts would modify that first part of the URL? What has knowledge of wikimedia.org/v1? That seems like a rewrite to an internal path that accidentally got exposed externally?

I have never seen a 301 when requesting anything from any AQS services. I have no idea why that happened. In my case, all responses include envoy as the server value, even when there was a hit on the cache.
If the data were not available yet we should see a 404 (which could be cached as well). It seems you were able to skip envoy somehow and reached the wrong destination. Is mw-web.eqiad.main-55b8c76fd7-k745s a mediawiki pod?

According to https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Updates_and_backfilling, data is loaded at the end of the specific timespan (the next hour or day, for example). But in any case data could still be unavailable, because it seems loading can take up to 24 hours. In that case, you should receive a 404 response.

There are different cache hosts depending on where we live, right? Maybe it's something we cannot reproduce from where we are (in my case at least). Anyway, the caching period for AQS should be 14400 seconds (4 hours).

I'm sorry. For now I have no idea about this issue or why it could have happened.

@hnowlan Could you take a look at this? Any ideas?

> I have never seen a 301 when requesting anything from any AQS services. I have no idea why that happened. In my case, all responses include envoy as the server value, even when there was a hit on the cache.

That makes sense. Envoy is the building block behind the service mesh/proxy, the api/rest gateway, and the k8s ingress gateway. You should be seeing it almost everywhere in our infra these days.

> If the data were not available yet we should see a 404 (which could be cached as well). It seems you were able to skip envoy somehow and reached the wrong destination. Is mw-web.eqiad.main-55b8c76fd7-k745s a mediawiki pod?

It is. But this probably has nothing to do with envoy itself; it's more likely some routing logic, either in the CDN layer or the api-gateway layer.

> According to https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Updates_and_backfilling, data is loaded at the end of the specific timespan (the next hour or day, for example). But in any case data could still be unavailable, because it seems loading can take up to 24 hours. In that case, you should receive a 404 response.

> There are different cache hosts depending on where we live, right? Maybe it's something we cannot reproduce from where we are (in my case at least). Anyway, the caching period for AQS should be 14400 seconds (4 hours).

The following should allow you to iterate across all CDN PoPs and control for the "where I live" factor.

# Resolve wikimedia.org to each PoP's text-lb address and repeat the same request against every CDN site
for i in eqsin eqiad codfw esams ulsfo drmrs ; do
  curl -v \
    --resolve wikimedia.org:443:$(host text-lb.${i}.wikimedia.org | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}') \
    -X GET 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/eu.wikipedia.org/all-access/2024/05/03'
done

When the Brazil PoP is added, add magru after drmrs to include that one too.

I just ran the above and there are no differences between all the PoPs for that URL.

> I'm sorry. For now I have no idea about this issue or why it could have happened.

If we can have a reproduction it would be actionable. Otherwise, I'd suggest resolving this and reopening if something shows up again.

For what it's worth, the CDN's max TTL is 24 hours.

Thank you @akosiaris for looking into this.

@TheDJ and @Theklan I will close this as resolved. If the issue appears again, please let us know.

Yes, this can be marked as resolved. Thanks all.