Page MenuHomePhabricator

Route to new AQS Knowledge Gaps endpoint
Closed, DeclinedPublic3 Estimated Story Points

Description

Instead of doing this in RESTBase, and to help with sunsetting, let's try using REST Gateway.

  • Knowledge Gaps endpoint:
    • <domain>/api/rest_v1/metrics/knowledge_gap

This should be routing to this AQS endpoint which is now live: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/aqs/+/refs/heads/master/v1/knowledge-gap.yaml

Context for project in T331159

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Currently we're deploying AQS 2 services to the api-gateway rather than the rest-gateway - would it be suitable to route the service there so as to enable a migration in future should there be an AQS2 service built to power this endpoint?

Change 943616 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] wmnet: add discovery record for aqs

https://gerrit.wikimedia.org/r/943616

Change 944888 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: add discovery for AQS

https://gerrit.wikimedia.org/r/944888

Change 944888 merged by Hnowlan:

[operations/puppet@production] service: add discovery for AQS

https://gerrit.wikimedia.org/r/944888

Change 943616 merged by Hnowlan:

[operations/dns@master] wmnet: add discovery record for aqs

https://gerrit.wikimedia.org/r/943616

Change 939656 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: add route for metrics/knowledge-gap

https://gerrit.wikimedia.org/r/939656

Change 945607 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] api-gateway: allow non-/metrics paths in AQS case

https://gerrit.wikimedia.org/r/945607

Change 945607 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: move knowledge-gap endpoint from api-gateway

https://gerrit.wikimedia.org/r/945607

Change 945806 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: fix pathing for knowledge-gap

https://gerrit.wikimedia.org/r/945806

Change 945806 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: fix pathing for knowledge-gap

https://gerrit.wikimedia.org/r/945806

Change 946928 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: route knowledge-gap path via rest-gateway

https://gerrit.wikimedia.org/r/946928

It appears the endpoint has not been deployed to all AQS servers:

hnowlan@deploy1002:~$ for i in `seq 0 9`; do echo aqs101${i}.eqiad.wmnet; curl aqs101${i}.eqiad.wmnet:7232/analytics.wikimedia.org/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201 2> /dev/null | grep -q errors/not_found && echo "failed" || echo "success"; done
aqs1010.eqiad.wmnet
success
aqs1011.eqiad.wmnet
success
aqs1012.eqiad.wmnet
success
aqs1013.eqiad.wmnet
success
aqs1014.eqiad.wmnet
success
aqs1015.eqiad.wmnet
success
aqs1016.eqiad.wmnet
failed
aqs1017.eqiad.wmnet
failed
aqs1018.eqiad.wmnet
failed
aqs1019.eqiad.wmnet
failed
hnowlan@deploy1002:~$ for i in `seq 0 9`; do echo aqs200${i}.codfw.wmnet; curl aqs101${i}.eqiad.wmnet:7232/analytics.wikimedia.org/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201 2> /dev/null | grep -q errors/not_found && echo "failed" || echo "success"; done
aqs2000.codfw.wmnet
success
aqs2001.codfw.wmnet
success
aqs2002.codfw.wmnet
success
aqs2003.codfw.wmnet
success
aqs2004.codfw.wmnet
success
aqs2005.codfw.wmnet
success
aqs2006.codfw.wmnet
failed
aqs2007.codfw.wmnet
failed
aqs2008.codfw.wmnet
failed
aqs2009.codfw.wmnet
failed

This means that when the rest-gateway queries the lvs endpoint we get inconsistent results and occasional failures.

Additionally, would it be possible to add either a last-modified or etag and also a cache-control header to the service for caching purposes please?

Change 947385 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/aqs/deploy@master] Update aqs scap targets with new hosts and tidy up

https://gerrit.wikimedia.org/r/947385

Thanks @hnowlan - I checked and it looks like the scap targets for aqs weren't up-to date, so I made https://gerrit.wikimedia.org/r/947385 to try to fix it.

What I don't quite understand yet is how we ever got a deployment of aqs to the servers in codfw, because they weren't in the scap targets at all.

The versions deployed on aqs2001 and 2006 are also the same as each other:

btullis@aqs2001:/srv/deployment/analytics/aqs$ ls -l
total 4
lrwxrwxrwx 1 deploy-service deploy-service   58 Jun 13  2022 deploy -> deploy-cache/revs/d273fdea53565bbc9d8791352531bd0e9e60a893
drwxr-xr-x 4 deploy-service deploy-service 4096 Jun 13  2022 deploy-cache
btullis@aqs2006:~$ ls -l /srv/deployment/analytics/aqs/
total 4
lrwxrwxrwx 1 deploy-service deploy-service   58 Jun 14  2022 deploy -> deploy-cache/revs/d273fdea53565bbc9d8791352531bd0e9e60a893
drwxr-xr-x 4 deploy-service deploy-service 4096 Jun 14  2022 deploy-cache

...so I'm not clear why your test works on aqs2001 but doesn't work on aqs2006.

Thanks @hnowlan - I checked and it looks like the scap targets for aqs weren't up-to date, so I made https://gerrit.wikimedia.org/r/947385 to try to fix it.

What I don't quite understand yet is how we ever got a deployment of aqs to the servers in codfw, because they weren't in the scap targets at all.

When a new machine is provisioned it will do a scap deploy-local which will pull the code onto the machine rather than have it pushed, which explains why there's any version at all, albeit an old one. I have absolutely zero explanation as to how the service works on some of the codfw hosts though!

Question: once the machines are updated with https://gerrit.wikimedia.org/r/c/analytics/aqs/deploy/+/947385/, will the inconsistency of their current state be resolved? Or is there more work required to update the configuration?

Question: once the machines are updated with https://gerrit.wikimedia.org/r/c/analytics/aqs/deploy/+/947385/, will the inconsistency of their current state be resolved? Or is there more work required to update the configuration?

That seems likely yep!

What I don't quite understand yet is how we ever got a deployment of aqs to the servers in codfw, because they weren't in the scap targets at all.

The versions deployed on aqs2001 and 2006 are also the same as each other:

btullis@aqs2001:/srv/deployment/analytics/aqs$ ls -l
total 4
lrwxrwxrwx 1 deploy-service deploy-service   58 Jun 13  2022 deploy -> deploy-cache/revs/d273fdea53565bbc9d8791352531bd0e9e60a893
drwxr-xr-x 4 deploy-service deploy-service 4096 Jun 13  2022 deploy-cache
btullis@aqs2006:~$ ls -l /srv/deployment/analytics/aqs/
total 4
lrwxrwxrwx 1 deploy-service deploy-service   58 Jun 14  2022 deploy -> deploy-cache/revs/d273fdea53565bbc9d8791352531bd0e9e60a893
drwxr-xr-x 4 deploy-service deploy-service 4096 Jun 14  2022 deploy-cache

...so I'm not clear why your test works on aqs2001 but doesn't work on aqs2006.

I realise now that this was a pebkac - I changed the hostname in my echo command but not in my curl command. codfw hosts never responded for the service 🤦

Change 947385 merged by Btullis:

[analytics/aqs/deploy@master] Update aqs scap targets with new hosts and tidy up

https://gerrit.wikimedia.org/r/947385

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:43:48Z] <btullis@deploy1002> Started deploy [analytics/aqs/deploy@ec5d4cd]: T342213

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:44:47Z] <btullis@deploy1002> deploy aborted: T342213 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:45:57Z] <btullis@deploy1002> Started deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:47:45Z] <btullis@deploy1002> Finished deploy [analytics/aqs/deploy@ec5d4cd] (aqs): T342213 (duration: 01m 48s)

Change 949557 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/aqs/deploy@master] Fix scap targets

https://gerrit.wikimedia.org/r/949557

Change 949557 merged by Btullis:

[analytics/aqs/deploy@master] Fix scap targets

https://gerrit.wikimedia.org/r/949557

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:51:48Z] <btullis@deploy1002> Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:52:59Z] <btullis@deploy1002> Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:57:00Z] <btullis@deploy1002> Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 04m 00s)

Mentioned in SAL (#wikimedia-operations) [2023-08-16T16:57:08Z] <btullis@deploy1002> Started deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213

Mentioned in SAL (#wikimedia-operations) [2023-08-16T17:06:08Z] <btullis@deploy1002> Finished deploy [analytics/aqs/deploy@cf0e57d] (aqs): T342213 (duration: 09m 04s)

Thanks for the clarification @hnowlan

I see the deployment have been done by @BTullis, should this data be available now?

I see the hosts that didn't work before are returning data now

$ curl aqs2009.codfw.wmnet:7232/analytics.wikimedia.org/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201 
{"items":[{"dt":"20230501","project":"en.wikipedia","content_gap":"gender","category":"female","metric":"article_count","value":375386}, [..snip...]

However, the public routing does not seem to work, or am I using the wrong url? https://wikimedia.org/api/rest_v1/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201`

Thanks for the clarification @hnowlan

I see the deployment have been done by @BTullis, should this data be available now?

I see the hosts that didn't work before are returning data now

$ curl aqs2009.codfw.wmnet:7232/analytics.wikimedia.org/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201 
{"items":[{"dt":"20230501","project":"en.wikipedia","content_gap":"gender","category":"female","metric":"article_count","value":375386}, [..snip...]

Great!

However, the public routing does not seem to work, or am I using the wrong url? https://wikimedia.org/api/rest_v1/v1/knowledge-gap/per-category/en.wikipedia/gender/female/20210101/20231201`

I couldn't leave the routing in place when it was discovered that these issues were present so it was rolled back. Before we can re-route traffic, we will need to add a last-modified or etag header, and also a cache-control header to the service for caching purposes.

@Milimetric & @fkaelin: is it trivial to add these headers to Knowledge Gaps in AQS? If it is non-trivial, should we instead migrate the service to AQS 2?

@SGupta and/or @Sfaci how are these headers handled in AQS 2? In the near mid term we will need to migrate to AQS 2 when we deprecate AQS. So what makes the most sense now: add the headers to the existing service or migrate to AQS 2?

Regarding AQS 2.0, I guess this new service could be managed in the same way we are managing the existing ones. At this moment, in AQS 2.0, those headers: etag, cache-control and others, are already managed for all the services. And it's really straightforward to add them for any new or existing one because we have a common middleware function to do that. It's an effortless task for us.

We are not managing last-modified header but, reading the last Hugh's comment, I assume it'd enough if we use etag.

• VirginiaPoundstone raised the priority of this task from Low to High.Sep 13 2023, 2:16 PM

@hnowlan: TL;DR; do you see the cache-control that AQS is already setting and do we need an ETag header or is that just a nice to have?

@BTullis: can you double check me here? The new endpoint is behaving in the same way as all the other AQS endpoints. So it adds cache-control and content-type headers here. These go out unchanged in the public-facing endpoints:

curl -s -D - -o /dev/null https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/monthly/2021080100/2023091800

...
cache-control: s-maxage=14400, max-age=14400
content-type: application/json; charset=utf-8
...

There is no etag header present on this endpoint. If that's a new requirement for the new routing, I'm not 100% sure what we'd set it to. I could checksum the content, which is what AQS 2.0 does, but it wouldn't help much. If the cache is talking to AQS, we'd still run the query to get the content to checksum, but maybe I'm not seeing another angle here.

In either case, if we want an etag, I can easily add a checksum, or a random guid, as people think.

@hnowlan: TL;DR; do you see the cache-control that AQS is already setting and do we need an ETag header or is that just a nice to have?

@BTullis: can you double check me here? The new endpoint is behaving in the same way as all the other AQS endpoints. So it adds cache-control and content-type headers here. These go out unchanged in the public-facing endpoints:

curl -s -D - -o /dev/null https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/monthly/2021080100/2023091800

...
cache-control: s-maxage=14400, max-age=14400
content-type: application/json; charset=utf-8
...

There is no etag header present on this endpoint. If that's a new requirement for the new routing, I'm not 100% sure what we'd set it to. I could checksum the content, which is what AQS 2.0 does, but it wouldn't help much. If the cache is talking to AQS, we'd still run the query to get the content to checksum, but maybe I'm not seeing another angle here.

In either case, if we want an etag, I can easily add a checksum, or a random guid, as people think.

I'm definitely not the expert on this, so I suspect that @hnowlan and @Sfaci know much more than I do but I'm happy to try to help.
Having a look here: https://wikitech.wikimedia.org/wiki/REST_Gateway#Acceptance_criteria
...it says:

The service must emit either an etag or last-modified header

Would a last-modified header be any easier to create than an etag header, or sufficient for this endpoint?

I'm sorry but I cannot say much more than you have said about these headers. I don't know the details about this topic. We are adding etag to AQS 2.0 headers because it was something that RESTBase did (AQS 2.0 is outside of there) and not feasible within envoy/. I guess that's the reason this tag was a new requirement for that project.

But I didn't know anything about we could choose between that header or last-modified. I haven't found any conversation about that related to AQS 2.0

Change 958945 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/aqs@master] Send etag header on all AQS responses

https://gerrit.wikimedia.org/r/958945

Ok, I hear Ben's concerns but kind of decided to risk updating everything at once (because it's easier to roll back now than when we move to AQS 2.0). When reviewed, deployed, and validated, I will move this to blocked until Hugh can review.

Change 958945 merged by Milimetric:

[analytics/aqs@master] Send etag header on all AQS responses

https://gerrit.wikimedia.org/r/958945

Milimetric updated Other Assignee, added: hnowlan.

AQS 1.0 is sending the required headers now, etag is enabled on all endpoints (not just knowledge gaps). Hugh, please verify and let us know if anything else needs to happen before we can route to the knowledge gaps endpoint. Thank you!

AQS 1.0 is sending the required headers now, etag is enabled on all endpoints (not just knowledge gaps). Hugh, please verify and let us know if anything else needs to happen before we can route to the knowledge gaps endpoint. Thank you!

Thanks! Responses on the service look good to me. I will try to make the routing change for this path on Thursday (I am PTO for a few days this week).

Change 960575 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: only pass requests for knowledge-gap on wikimedia.org

https://gerrit.wikimedia.org/r/960575

Change 960575 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: only pass requests for knowledge-gap on wikimedia.org

https://gerrit.wikimedia.org/r/960575

• VirginiaPoundstone lowered the priority of this task from High to Low.Oct 30 2023, 7:40 PM

As per a discussion with @fkaaelin, we are closing this task as declined and potentially reopen a new fresh one with a new plan to be prioritized for essential work.

Change #1071575 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: remove knowledge-gap configuration

https://gerrit.wikimedia.org/r/1071575

Change #1071575 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: remove knowledge-gap configuration

https://gerrit.wikimedia.org/r/1071575

Change #1072246 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox: bump image

https://gerrit.wikimedia.org/r/1072246

Change #1072246 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: bump image

https://gerrit.wikimedia.org/r/1072246

Change #946928 abandoned by Hnowlan:

[operations/puppet@production] trafficserver: route knowledge-gap path via rest-gateway

https://gerrit.wikimedia.org/r/946928