Page MenuHomePhabricator

Follow up on remaining requests to pageviews endpoints
Closed, ResolvedPublic3 Estimated Story Points

Description

There is still some traffic being sent to the AQS1 version of the pageviews endpoint - this traffic is coming from Wikifeeds and Mediawiki via restbase.

Wikifeeds requests pageviews here and will require addition of an extra configuration option to specify using the rest-gateway for these requests as opposed to restbase (which is used elsewhere).

I am unaware as to where in the codebase the requests from mediawiki are coming from. The URLs are mostly of the format /analytics.wikimedia.org/v1/pageviews/per-article/$SITE/all-access/user/$USER/daily/$START/$END

Event Timeline

This is a blocker for us decommissioning AQS1 in general

WDoranWMF subscribed.

Ideally, we would like to complete the Wikifeeds portion of this no later than mid November 2023.

Hmmm, not sure on the MW part. The most likely candidate I saw was the PageViewInfo Extension. Specifically, this bit. From a quick look, I think this extension is responsible for the "Page views in the past 30 days" number on Page Information pages (like https://en.wikipedia.org/w/index.php?title=Earth&action=info). I've only looked very briefly, though so I could be wrong.

Even if I'm right, tracing through config, I don't see anywhere that "analytics.wikimedia.org" shows up. And I didn't find any relevant hits for analytics.wikimedia.org in codesearch (there were a lot of irrelevant ones). So I'm not convinced PageViewInfo is responsible for the hits mentioned in the task description. Even if it isn't, it still might be a placed that needs adjusted, though.

Is there anything else about the requests we can use to narrow things down? Maybe they're only coming from certain clusters or projects or something? (Not sure what data we have to work with.)

What url format should things like PageViewInfo be using instead?

Hmmm, not sure on the MW part. The most likely candidate I saw was the PageViewInfo Extension. Specifically, this bit. From a quick look, I think this extension is responsible for the "Page views in the past 30 days" number on Page Information pages (like https://en.wikipedia.org/w/index.php?title=Earth&action=info). I've only looked very briefly, though so I could be wrong.

Even if I'm right, tracing through config, I don't see anywhere that "analytics.wikimedia.org" shows up. And I didn't find any relevant hits for analytics.wikimedia.org in codesearch (there were a lot of irrelevant ones). So I'm not convinced PageViewInfo is responsible for the hits mentioned in the task description. Even if it isn't, it still might be a placed that needs adjusted, though.

These requests will be going to restbase rather than using the external interface - here's a summarised packet capture of one of the requests:

16:48:39.073218 IP restbase1029.eqiad.wmnet.55912 > aqs.svc.eqiad.wmnet.7232: Flags [P.], seq 193222464:193222835, ack 370198178, win 83, options [nop,nop,TS val 2046820488 ecr 3846908372], length 371
GET /analytics.wikimedia.org/v1/pageviews/per-article/LANG.wikipedia.org/all-access/user/USERNAME/daily/START/END HTTP/1.1
x-client-ip: ::ffff:10.64.16.173
x-request-id: REQID
user-agent: MediaWiki/1.41.0-wmf.30
host: aqs.svc.eqiad.wmnet:7232
accept-encoding: gzip, deflate
content-length: 0
Connection: keep-alive

Is there anything else about the requests we can use to narrow things down? Maybe they're only coming from certain clusters or projects or something? (Not sure what data we have to work with.)

The client requests themselves are coming exclusively from restbase hosts so we can't drill down further into the requesting mw hosts without a lot more acrobatics, but it seems consistently from restbase

What url format should things like PageViewInfo be using instead?

Assuming I'm correct in stating that it's coming from restbase, the exact same URL format can be used, we just need the requests to go to https://rest-gateway.discovery.wmnet on port 4113 instead of https://restbase.discovery.wmnet (or equivalent configuration option).

Thanks Hugh.

I see where wikifeeds is configured to use restbase.discovery.wmnet, but still haven't found anything for MW. I'll look some more, but we might need to call in more help if I'm unable to find it.

Did a little more tracing. If I'm correct (it got pretty hairy), then requests from the PageViewInfo extension that I previously mentioned use this bit of config:

'restbase' => 'http://localhost:6011',

which comes from this mediawiki config file:

https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/4b0703bbb8939300cd64199a522a059533625cef/wmf-config/ProductionServices.php#72

Does that help, or no?

Looking a bit further, in envoy.yaml, I see:

- name: restbase
    port: 6011
    service: restbase-https
    upstream: restbase.discovery.wmnet
    timeout: "30s"
    keepalive: "4s" # this is needed to avoid reusing upstream-closed connections - see T263043, also T266855
    xfp: https # this is needed for private wikis - see T249535

If I'm putting this all together correctly, restbase requests coming from MW are going to localhost:6011, which in turn goes to restbase.discovery.wmnet, and we need to update that .yaml file to direct that to rest-gateway.discovery.wmnet instead. Am I on the right track?

Sorry, I'm turning this task into my personal journey of discovery. But I also see that we're referencing 6011 in fixtures.yaml:

restbase:
    keepalive: 4s
    port: 6011
    timeout: 10s
    xfp: https
    upstream:
      ips:
        - 10.2.1.17/32
        - 10.2.2.17/32
      address: restbase.discovery.wmnet
      port: 7443
      encryption: true

I'm speculating that the envoy.yaml might apply to k8s traffic, and fixtures.yaml might apply to traditional appservers, and we need to update both?

Sorry, I'm turning this task into my personal journey of discovery. But I also see that we're referencing 6011 in fixtures.yaml:

restbase:
    keepalive: 4s
    port: 6011
    timeout: 10s
    xfp: https
    upstream:
      ips:
        - 10.2.1.17/32
        - 10.2.2.17/32
      address: restbase.discovery.wmnet
      port: 7443
      encryption: true

I'm speculating that the envoy.yaml might apply to k8s traffic, and fixtures.yaml might apply to traditional appservers, and we need to update both?

I am fairly sure that appservers and k8s servers use restbase via the service mesh in the same way on the localhost port you have correctly sleuthed out!

fixtures.yaml are just test/templating fixtures and shouldn't need changing. I am fairly sure that we'll need to add another config option just for the pageviews calls rather than replacing the restbase entry as mediawiki is still gonna use restbase for a whole bunch of other stuff that we haven't ported to the rest-gateway yet (similar to wikifeeds!)

Great, it sounds like we're making progress. So to confirm, the change we're considering is:

  • add an additional entry to envoy.yaml
  • add whatever config and code plumbing is necessary to make use of that entry for these specific urls

Does that sound correct?

That all touches a few different repos, so we'll have to make sure we deploy it all in the correct order. But unless I'm missing something, it doesn't sound incredibly difficult or like a major coding project.

My main question now is what to call the new entry. @hnowlan said this:

we'll need to add another config option just for the pageviews calls

That might lead us to a name like "pageviews". But might this potentially apply to other calls that should be routed to AQS 2.0 from MW? Even if there aren't any today, someone might find a need for it tomorrow. So I'm wondering if the name should be a little more generic, something like "aqs2"? I personally don't like including the version number in the name, but there is already an "aqs" entry. Maybe someone has a better naming suggestion? Might this new entry apply to things other than AQS 2.0?

And to be sure, I see there's already a "rest-gateway" entry. That doesn't do what we need, does it? (We'd still need the other changes to make use of that, so even if the rest-gateway entry in envoy.yamlis helpful, it'd just save us a change to that one file.)

Here is a summary of the current code/config flow discussed above, staring in the PageViewInfo extension:

We will therefore need to:

  • find or create an entry in envoy.yaml suitable for these endpoints to use
  • modify ProductionServices.php to reference that entry
  • modify CommonSettings.php to use this value when settings $wgPageViewInfoWikimediaEndpoint

No changes to the extension itself will be required. If we need to create a new entry in envoy.yaml, we should ensure that is deployed before we deploy the config changes that make use of it.

The above is my current understanding. I would be happy to be corrected on anything that I missed or misunderstood.

We'll need to add the rest-gateway listener to the appservers on metal and k8s

We'll need to add the rest-gateway listener to the appservers on metal and k8s

Sorry, I'm not sure what this means. I see various things related to listeners in the operations/deployment-charts and operations/puppet repos. @hnowlan, is this:

  • something that occurs in one or both of those repos?
  • something you'll be responsible for?
  • a prerequisite for the mw-config changes?

Or am I completely off base?

Change 968384 had a related patch set uploaded (by BPirkle; author: BPirkle):

[operations/mediawiki-config@master] Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway

https://gerrit.wikimedia.org/r/968384

Uploaded prospective patch, but marked as Work In Progress until we're sure all necessary prerequisites have been met.

Change 968617 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service_proxy: add rest-gateway to listeners

https://gerrit.wikimedia.org/r/968617

We'll need to add the rest-gateway listener to the appservers on metal and k8s

Sorry, I'm not sure what this means. I see various things related to listeners in the operations/deployment-charts and operations/puppet repos. @hnowlan, is this:

  • something that occurs in one or both of those repos?
  • something you'll be responsible for?
  • a prerequisite for the mw-config changes?

Or am I completely off base?

You're correct on all three there - this change will open the listeners required on the appservers

Change 968617 merged by Hnowlan:

[operations/puppet@production] service_proxy: add rest-gateway to listeners

https://gerrit.wikimedia.org/r/968617

This comment was removed by hnowlan.

Hugh said this on the config change to use the rest gateway:

One caveat to this change - the pageviews endpoint on the rest gateway requires that the Host header be set to wikimedia.org. Would that be easily doable here or is it a larger change?

At first glance, I'm thinking this requires a change to the extension itself, and well as a config change. I don't see a way to tell the request to use a specific Host header, which. means Guzzle will take it from the url. And because the request is constructed by the extension, that's where that change will need to go. However, because it is an extension and potentially used by third parties, we'll need to take the value for that Host header from configuration (as using wikimedia.org would be in appropriate for third parties), presumably falling back to the current behavior if a specific value is not specified (making this a non-change for any such third parties). So I think this means:

  1. add the ability to set a specific Host header value via configuration to the PageViewInfo extension. This will be a new gerrit change, but a reasonably straightforward one.
  2. add the desired value to config (can be part of the current config change)

All that is just from looking through the code - I haven't actually tried anything. Please let me know if you see anything incorrect in all that. Otherwise, I'll try to get that together by early next week.

I was thinking about this a little bit - there is a change I could also make to make this a bit less dogmatic on the rest-gateway side. To explain a little bit further, currently all requests for the pageviews (and other metrics) endpoints should only be answered for wikimedia.org and no other domains, otherwise we'd pollute the cache with answers returning the same data for all of our potential sites. So currently we check the host header for wikimedia.org, which is the right thing to do for external clients, but it opens us up to problems like this internally.

This isn't really sustainable if we need to make changes like the above though, so I will try to come up with an alternative here.

In fact, looking at the change, it *looks* like it should be using wikimedia.org already:

$wmgLocalServices['rest-gateway'] . '/wikimedia.org/v1

And restbase doesn't reply to domains other than wikimedia.org for metrics, so I think we're good if we make the corresponding change in the rest-gateway.

Change 971456 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: change how AQS URLs enforce wikimedia.org domain

https://gerrit.wikimedia.org/r/971456

In fact, looking at the change, it *looks* like it should be using wikimedia.org already:

$wmgLocalServices['rest-gateway'] . '/wikimedia.org/v1

And restbase doesn't reply to domains other than wikimedia.org for metrics, so I think we're good if we make the corresponding change in the rest-gateway.

Just to be sure we're saying the same thing, the /wikimedia.org/ bit is the first path parameter, not the actual host. The host comes from $wmgLocalServices['rest-gateway'], which will have the value http://localhost:6033, meaning the actual url will start with http://localhost:6033/wikimedia.org/v1/, and the Host header will have the value localhost:6033.

It looks to me like the new change you posted takes that into account and everything will work happily without any changes to the extension (cool!). So hopefully this is just me echoing back what you said in different words, and we're in good shape.

Looks like we'll also need a rest-gateway entry in LabsServices.php to make the test happy (and presumably to avoid breaking beta, which is a good thing too...)

Any suggestions on where to point that entry?

The existing "restbase" entry in LabsServices.php is:
'restbase' => 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231',

Is there anything equivalent for AQS 2? Or anything that would be appropriate to use?

Change 971456 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: change how AQS URLs enforce wikimedia.org domain

https://gerrit.wikimedia.org/r/971456

Change 971940 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: fix regexes

https://gerrit.wikimedia.org/r/971940

Change 971940 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: fix regexes

https://gerrit.wikimedia.org/r/971940

Change 971944 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: correct match section for routes

https://gerrit.wikimedia.org/r/971944

Looks like we'll also need a rest-gateway entry in LabsServices.php to make the test happy (and presumably to avoid breaking beta, which is a good thing too...)

Any suggestions on where to point that entry?

The existing "restbase" entry in LabsServices.php is:
'restbase' => 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231',

Is there anything equivalent for AQS 2? Or anything that would be appropriate to use?

Oh this could be a problem. We don't have any instances of AQS2, nor do we have a Kubernetes cluster or a rest-gateway instance in beta. For now can we stay with restbase there? I'm not sure how the restbase instance itself is accessing AQS1 (if it is at all)

Change 971944 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: correct match section for routes

https://gerrit.wikimedia.org/r/971944

Oh this could be a problem. We don't have any instances of AQS2, nor do we have a Kubernetes cluster or a rest-gateway instance in beta. For now can we stay with restbase there? I'm not sure how the restbase instance itself is accessing AQS1 (if it is at all)

Sure, we can stay with that same host. I made a rest-gateway entry in labs config, pointed at the same value that restbase labs is currently using, with a comment that may change in the future if we get a beta version of rest gateway to point it at.

Change 968384 merged by jenkins-bot:

[operations/mediawiki-config@master] Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway

https://gerrit.wikimedia.org/r/968384

Config change to point the PageViewInfo extension at AQS 2.0 has been deployed. If I'm reading this correctly, there was a corresponding drop in internal requests to AQS 1 pageviews (click "internal 2.xx" below the graph to see just the orange line, I couldn't figure out how to link to just that one line on the graph).

Does that indicate success, at least for the portion of the requests coming from PageViewInfo?

Config change to point the PageViewInfo extension at AQS 2.0 has been deployed. If I'm reading this correctly, there was a corresponding drop in internal requests to AQS 1 pageviews (click "internal 2.xx" below the graph to see just the orange line, I couldn't figure out how to link to just that one line on the graph).

Does that indicate success, at least for the portion of the requests coming from PageViewInfo?

That certainly looks likely, thank you! Unfortunately we can't fully verify until we have pageviews routed back to the page-analytics service, but I think we're looking good for the purposes of this ticket. I think we can close for now and reopen if needs be, as wikifeeds is being tracked in another ticket (and is close to resolution also).