
AQS 2.0
Open, Medium, Public, 1 Estimated Story Points

Description

Analytics Query Service (AQS) is the software behind the /metrics family of endpoints in RESTBase. It is a read-only HTTP proxy to results served from Cassandra and Druid. It is currently based on a very old fork of RESTBase, and has received few updates over the years.

As a part of the goal to sunset RESTBase, AQS needs to be migrated to a bespoke service exposed via the API Gateway.

Overview

  1. Implement a new, stand-alone AQS service
  2. Deploy AQS 2.0 on k8s
  3. Expose the /metrics hierarchy from the new service using the API Gateway
  4. Switch RESTBase from proxying requests to the old AQS service to proxying them to the new k8s-based one
  5. Deprecate the http://{project}/api/rest_v1/metrics resources
  6. Eventually phase out the RESTBase /metrics hierarchy
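
Steps 3 to 5 above amount to forwarding the legacy /api/rest_v1/metrics paths to the new service until the old hierarchy can be retired. The following is a minimal sketch of that forwarding in Go, not the actual RESTBase or API Gateway configuration. The backend URL, the choice to serve the new hierarchy at /metrics on the backend, and the names `rewriteLegacyPath` and `newLegacyMetricsProxy` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// legacyPrefix is the RESTBase prefix under which /metrics is exposed today.
const legacyPrefix = "/api/rest_v1"

// rewriteLegacyPath maps a legacy path such as /api/rest_v1/metrics/...
// to the (assumed) path on the new service, /metrics/....
// It reports false for any path outside the /metrics hierarchy.
func rewriteLegacyPath(p string) (string, bool) {
	if !strings.HasPrefix(p, legacyPrefix) {
		return "", false
	}
	rest := strings.TrimPrefix(p, legacyPrefix)
	if rest == "/metrics" || strings.HasPrefix(rest, "/metrics/") {
		return rest, true
	}
	return "", false
}

// newLegacyMetricsProxy returns a handler that forwards read-only requests
// for the legacy /metrics hierarchy to the backend at target.
func newLegacyMetricsProxy(target string) (http.Handler, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(u)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		newPath, ok := rewriteLegacyPath(r.URL.Path)
		if !ok {
			http.NotFound(w, r)
			return
		}
		// AQS is read-only, so only GET and HEAD are forwarded.
		if r.Method != http.MethodGet && r.Method != http.MethodHead {
			w.Header().Set("Allow", "GET, HEAD")
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Clone the request so the caller's copy is left untouched.
		r2 := r.Clone(r.Context())
		r2.URL.Path = newPath
		r2.URL.RawPath = ""
		proxy.ServeHTTP(w, r2)
	}), nil
}

func main() {
	// Hypothetical backend address; the real one would come from configuration.
	if _, err := newLegacyMetricsProxy("http://aqs.example.internal:8080"); err != nil {
		fmt.Println("bad backend URL:", err)
		return
	}
	for _, p := range []string{"/api/rest_v1/metrics/pageviews", "/api/rest_v1/page/html/Foo"} {
		newPath, ok := rewriteLegacyPath(p)
		fmt.Printf("%s -> %q (forwarded: %v)\n", p, newPath, ok)
	}
}
```

The handler returned by `newLegacyMetricsProxy` can be passed to `http.ListenAndServe`. Keeping the path mapping in a small pure function makes the deprecation step easy to test separately from the network plumbing.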

Solving this will let us make progress on multiple fronts: T198901 T262315


NOTE: This will be picked up by Platform Engineering, with support from Analytics.

See also:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS
https://github.com/wikimedia/analytics-aqs
https://github.com/wikimedia/restbase
https://wikimedia.org/api/rest_v1/
https://gitlab.wikimedia.org/eevans/aqs (work-in-progress AQS 2.0 code repository)

Event Timeline

  1. (Optional) Add support in service-template-node for talking to Cassandra. Since we are planning to move storage from RESTBase down to individual services, we might benefit from some shared library support. This step, however, is optional - perhaps we could just use the Cassandra driver directly without additional abstractions. To be investigated.

agreed this could be done either way, the code that actually hits Cassandra is very tiny

  1. Finish support for Cassandra schema distribution, left over from the sessionstore project. Currently in RESTBase the schema is stored in code and can be created upon software startup. This pattern has proven unsuitable for production, so the schema is actually created manually. In Kask, we've moved away from this pattern, and the schema is created only manually. Automation for schema/options distribution has never been finished for Kask, but now, if we are to start migrating more services off RESTBase, we need better ways of distributing schemas. See T220246.

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

  1. Rewrite the AQS service using service-template-node. Because the AQS codebase is quite simple, this should not be too problematic. Some of the RESTBase built-in features, for example request parameter validation, will have to be reimplemented since there's no magic support for them in service-template. Upgrade to node10 only in the meantime.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?

  1. Deploy AQS 2.0 on k8s. Since AQS will be based on service-template-node, existing patterns for k8s deployments of node services can be reused. Cassandra connection setup could be borrowed from either of the Kask deployments.

Ping @elukey here, this would be the biggest change and I'm not exactly sure how it impacts us. Right now AQS and Cassandra instances are colocated and I know we took some steps to isolate them but I'm not sure if it would be a pain to reconfigure AQS to run on a separate cluster. I know there are some conveniences right now as the mapping is 1/1. So when we deploy a schema change, we can depool one, deploy, test, and do the rest. I guess we could keep that 1/1 mapping but I'm just not sure if that just sounds easy but isn't.

  1. Expose the /metrics hierarchy in the API Gateway. Deprecate the /api/rest_v1/metrics hierarchy. Switch RESTBase from proxying requests to the old AQS cluster to proxying them to the new k8s AQS cluster.
  2. Eventually phase out RESTBase /metrics hierarchy.

Sounds good, no strong opinion on how that layer works.

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

Agreed, no need to block on it, but I would love to have some progress on it or at least keep it in mind. It's not about saving work hours, it's much more about the recreatability of our environment.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?

Yeah, most of the "rewrite" will be copy-paste from one place to another. There will be a small change in how the Cassandra queries are built. But yeah, "rewrite" is too big of a word for it.

I know there are some conveniences right now as the mapping is 1/1

In terms of instances, it's actually N/N - all AQS service instances can talk to all Cassandra nodes. With k8s we just add N more AQS instances, and then remove the first N.

Naike set the point value for this task to 1. Oct 2 2020, 9:42 AM

Ping @Pchelolo, @lexnasser was looking at this as the next thing he might focus on. I hesitated to ping before because I know your plate's full. My question is, what are your plans with this upgrade, and can we take over part of it with Lex as a resource? So, one option that might work would be you & team do the service-template-node updates (Cassandra support, etc. as discussed above), and we (mostly Lex) do the AQS rewrite (maybe even TypeScript - if we can trick Lex :P). Thoughts?

Milimetric triaged this task as Medium priority. May 10 2021, 7:13 PM
Milimetric updated the task description.
Milimetric added a subscriber: WDoranWMF.
Eevans updated the task description.
Eevans edited subscribers, added: FGoodwin, Clarakosi; removed: lexnasser, elukey, Nuria.
Eevans updated the task description.

We can use https://gitlab.wikimedia.org/eevans/aqs to get started, and open merge-requests for code review there. Depending on the state of the Gitlab rollout as we get nearer completion, we can either move it to a dedicated project, or set it up in Gerrit in the usual way.

If there are no objections, we'll implement the new service in Go.
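
As noted earlier in the thread, a stand-alone service no longer gets RESTBase's built-in request parameter validation, so the new code would have to do that itself. Below is a minimal sketch of what such validation might look like in Go; it is not taken from the work-in-progress repository. The parameter names (`granularity`, `start`, `end`), the accepted values, the YYYYMMDD timestamp format, and the `parseRangeParams` helper are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dateLayout is the assumed timestamp format, YYYYMMDD.
const dateLayout = "20060102"

// rangeParams holds validated request parameters (hypothetical shape).
type rangeParams struct {
	Granularity string
	Start       time.Time
	End         time.Time
}

// parseRangeParams validates raw path or query parameters and returns
// them in typed form, or an error suitable for a 400 response.
func parseRangeParams(granularity, start, end string) (rangeParams, error) {
	switch granularity {
	case "daily", "monthly":
		// accepted values (assumed for this sketch)
	default:
		return rangeParams{}, fmt.Errorf("granularity must be daily or monthly, got %q", granularity)
	}
	s, err := time.Parse(dateLayout, start)
	if err != nil {
		return rangeParams{}, fmt.Errorf("invalid start %q: %w", start, err)
	}
	e, err := time.Parse(dateLayout, end)
	if err != nil {
		return rangeParams{}, fmt.Errorf("invalid end %q: %w", end, err)
	}
	if e.Before(s) {
		return rangeParams{}, errors.New("end must not be before start")
	}
	return rangeParams{Granularity: granularity, Start: s, End: e}, nil
}

func main() {
	p, err := parseRangeParams("daily", "20210501", "20210510")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(p.Granularity, p.Start.Format("2006-01-02"), p.End.Format("2006-01-02"))

	// An unsupported granularity is rejected before any backend is queried.
	_, err = parseRangeParams("hourly", "20210501", "20210510")
	fmt.Println("error:", err)
}
```

Validating up front keeps malformed requests away from Cassandra and Druid and gives handlers typed values to work with.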