
AQS 2.0
Open, In Progress, High, Public, 1 Estimated Story Points

Description

Analytics Query Service (AQS) is the software behind the /metrics family of endpoints in RESTBase. It is a read-only HTTP proxy to results served from Cassandra and Druid. It is currently based on a very outdated fork of RESTBase, and has received few updates over the years.

As a part of the goal to sunset RESTBase, AQS needs to be migrated to a bespoke service exposed via the API Gateway.


We propose to break down the rewrite largely along dataset boundaries — similar to the module structure in RESTBase — with a separate project used to implement each.

The services were renamed during development. The names as of Jan 2023 are:

  • Page Analytics (was pageviews)
  • Device Analytics (was unique devices)
  • Edit Analytics (the subset of endpoints previously considered under "wikistats2" that pertain to edits)
  • Editor Analytics (the subset of endpoints previously considered under "wikistats2" that pertain to editors)
  • Media Analytics (was mediarequests)
  • Geo Analytics (was called both geoeditors and editors in different contexts)

The breakdown of endpoints by service can be found here. The remainder of this task description has been left unedited, for comparison.


The resulting services will be proxied by RESTBase and/or the API Gateway (the former to eventually be deprecated in favor of the latter) in order to maintain complete compatibility with the existing API.

The target language for these implementations is Go. While a complete comparison of Javascript/NodeJS and Go is out of scope for this issue, the (simplified) rationale is:

  • Strong, static typing: Statically typed languages eliminate entire classes of bugs common to dynamic languages, improve security, and make code easier to reason about
  • Ease of use: Go is more obvious, more explicit, and easier to understand. Complicated concepts like concurrency are easier to get right
  • Performance: Service latency can be expected to be both lower and, more importantly, more predictable with Go
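
To make the rationale concrete, here is a minimal sketch of what a read-only metrics endpoint could look like in Go. Everything in it is an assumption made for illustration: the `Store` interface, the `PageViews` type, the in-memory `memStore` (standing in for Cassandra or Druid), and the simplified `/metrics/pageviews` route with query parameters are hypothetical and do not reflect the real AQS API or code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ErrNotFound is returned by a Store when no data matches the query.
var ErrNotFound = errors.New("not found")

// PageViews is a hypothetical, simplified response item.
type PageViews struct {
	Project string `json:"project"`
	Article string `json:"article"`
	Views   int    `json:"views"`
}

// Store abstracts the backing datastore (hypothetical interface).
type Store interface {
	PageViews(project, article string) (PageViews, error)
}

// memStore is an in-memory stand-in for a real datastore.
type memStore map[string]PageViews

func (m memStore) PageViews(project, article string) (PageViews, error) {
	pv, ok := m[project+"/"+article]
	if !ok {
		return PageViews{}, ErrNotFound
	}
	return pv, nil
}

// newHandler returns a read-only handler: only GET is accepted.
func newHandler(s Store) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics/pageviews", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		q := r.URL.Query()
		pv, err := s.PageViews(q.Get("project"), q.Get("article"))
		if errors.Is(err, ErrNotFound) {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(pv)
	})
	return mux
}

// get issues an in-process GET request and returns the status and body.
func get(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	store := memStore{"en.wikipedia/Go": {Project: "en.wikipedia", Article: "Go", Views: 42}}
	code, body := get(newHandler(store), "/metrics/pageviews?project=en.wikipedia&article=Go")
	fmt.Println(code, body)
}
```

Because the handler depends only on the `Store` interface, the compiler checks that any replacement backend supplies the same typed method, which is the kind of guarantee the first bullet refers to.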

Overview

  1. Implement the new, stand-alone AQS service(s)
  2. Deploy to k8s
  3. Expose the /metrics hierarchy from the new service(s) using the API Gateway
  4. Switch RESTBase from proxying requests to the old AQS service to proxying them to the new k8s-based one
  5. Deprecate the http://{project}/api/rest_v1/metrics resources
  6. Eventually phase out the RESTBase /metrics hierarchy
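
The proxy switch in step 4 amounts to path-based routing: requests under /metrics go to the new service, everything else stays where it was. The Go sketch below illustrates the idea with the standard library's `httputil.NewSingleHostReverseProxy`. It is not how RESTBase or the API Gateway is actually configured; `newRouter`, `stub`, `fetch`, `demo`, the backend names, and the sample paths are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// newRouter forwards /metrics/ requests to newAQS and all others to legacy.
func newRouter(legacy, newAQS *url.URL) http.Handler {
	legacyProxy := httputil.NewSingleHostReverseProxy(legacy)
	aqsProxy := httputil.NewSingleHostReverseProxy(newAQS)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.URL.Path, "/metrics/") {
			aqsProxy.ServeHTTP(w, r)
			return
		}
		legacyProxy.ServeHTTP(w, r)
	})
}

// stub starts a local test backend that reports its name and the path it saw.
func stub(name string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%s served %s", name, r.URL.Path)
	}))
}

// fetch performs a GET and returns the response body as a string.
func fetch(base, path string) string {
	resp, err := http.Get(base + path)
	if err != nil {
		return "error: " + err.Error()
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	return string(b)
}

// demo wires two stub backends behind the router and queries both routes.
func demo() (metrics, other string) {
	legacy := stub("legacy")
	defer legacy.Close()
	aqs := stub("aqs2")
	defer aqs.Close()

	legacyURL, _ := url.Parse(legacy.URL)
	aqsURL, _ := url.Parse(aqs.URL)
	front := httptest.NewServer(newRouter(legacyURL, aqsURL))
	defer front.Close()

	return fetch(front.URL, "/metrics/pageviews"), fetch(front.URL, "/page/html/Foo")
}

func main() {
	metrics, other := demo()
	fmt.Println(metrics)
	fmt.Println(other)
}
```

Clients keep using the same front-end URL throughout, which is what lets the backend change without breaking compatibility with the existing API.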

Solving this will help us make progress on multiple fronts: T198901 T262315


NOTE: This will be picked up by Platform Engineering, with support from Analytics.

See also:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS
https://github.com/wikimedia/analytics-aqs
https://github.com/wikimedia/restbase
https://wikimedia.org/api/rest_v1/
https://gitlab.wikimedia.org/eevans/aqs (work-in-progress AQS 2.0 code repository)

Event Timeline

There are a very large number of changes, so older changes are hidden.
  1. (Optional) Add support in service-template-node for talking to Cassandra. Since we are planning to move storage from RESTBase down to individual services, we might benefit from some shared library support. This step is optional, however: perhaps we could just use the Cassandra driver directly without additional abstractions. To be investigated.

agreed this could be done either way, the code that actually hits Cassandra is very tiny

  1. Finish support for Cassandra schema distribution, left over from the sessionstore project. Currently in RESTBase the schema is stored in code and can be created upon software startup. This pattern has proven unsuitable for production, so the schema is actually created manually. In Kask, we've moved away from this pattern, and the schema is only ever created manually. Automation for schema/options distribution was never finished for Kask, but if we are to start migrating more services off RESTBase, we need better ways to distribute schemas (T220246).

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

  1. Rewrite the AQS service using service-template-node. Because the AQS codebase is quite simple, this should not be too problematic. Some of the RESTBase built-in features, for example request parameter validation, will have to be reimplemented since there's no magic support for them in service-template. Upgrade to node10 only in the meantime.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?
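
For illustration, the request parameter validation mentioned above could be reimplemented by hand roughly as follows. The sketch is in Go, the language later named in the task description, rather than Node. The `Query` type, the `parseQuery` function, the accepted granularities, and the YYYYMMDD timestamp format are assumptions made for this example, not the actual AQS parameter rules.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Query holds validated request parameters (hypothetical shape).
type Query struct {
	Project     string
	Granularity string
	Start, End  time.Time
}

// granularities lists the values accepted in this sketch (an assumption).
var granularities = map[string]bool{"daily": true, "monthly": true}

// dateLayout is Go's reference-time layout for YYYYMMDD.
const dateLayout = "20060102"

// parseQuery validates raw string parameters and returns a typed Query.
func parseQuery(project, granularity, start, end string) (Query, error) {
	if project == "" {
		return Query{}, errors.New("project must not be empty")
	}
	if !granularities[granularity] {
		return Query{}, fmt.Errorf("invalid granularity %q", granularity)
	}
	s, err := time.Parse(dateLayout, start)
	if err != nil {
		return Query{}, fmt.Errorf("invalid start %q: expected YYYYMMDD", start)
	}
	e, err := time.Parse(dateLayout, end)
	if err != nil {
		return Query{}, fmt.Errorf("invalid end %q: expected YYYYMMDD", end)
	}
	if e.Before(s) {
		return Query{}, errors.New("end must not be before start")
	}
	return Query{Project: project, Granularity: granularity, Start: s, End: e}, nil
}

func main() {
	q, err := parseQuery("en.wikipedia", "daily", "20220101", "20220131")
	fmt.Println(q.Granularity, q.Start.Format("2006-01-02"), err)

	_, err = parseQuery("en.wikipedia", "hourly", "20220101", "20220131")
	fmt.Println(err)
}
```

Handlers would then work only with the typed `Query`, so the logic that builds Druid or Cassandra queries never sees unvalidated strings.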

  1. Deploy AQS 2.0 on k8s. Since AQS will be based on service-template-node, existing patterns for k8s deployments of node services can be reused. Cassandra connection setup could be borrowed from either of the Kask deployments.

Ping @elukey here, this would be the biggest change and I'm not exactly sure how it impacts us. Right now AQS and Cassandra instances are colocated and I know we took some steps to isolate them but I'm not sure if it would be a pain to reconfigure AQS to run on a separate cluster. I know there are some conveniences right now as the mapping is 1/1. So when we deploy a schema change, we can depool one, deploy, test, and do the rest. I guess we could keep that 1/1 mapping but I'm just not sure if that just sounds easy but isn't.

  1. Expose /metrics hierarchy in API Gateway. Deprecate /api/rest_v1/metrics hierarchy. Switch RESTBase to proxying requests from old AQS cluster to the new, k8s AQS cluster.
  2. Eventually phase out RESTBase /metrics hierarchy.

Sounds good, no strong opinion on how that layer works.

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

Agreed, no need to block on it, but I would love to have some progress on it or at least keep it in mind. It's not about saving work hours, it's much more about the recreate-ability of our environment.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?

Yeah, most of the "rewrite" will be copy-paste from one place to another. There will be a small change in how the Cassandra queries are built. But yeah, "rewrite" is too big of a word for it.

I know there are some conveniences right now as the mapping is 1/1

In terms of instances, it's actually N/N: all AQS service instances can talk to all Cassandra nodes. With k8s we just add N more AQS instances, and then remove the first N.

Naike set the point value for this task to 1. Oct 2 2020, 9:42 AM

Ping @Pchelolo, @lexnasser was looking at this as the next thing he might focus on. I hesitated to ping before because I know your plate's full. My question is, what are your plans with this upgrade, and can we take over part of it with Lex as a resource? So, one option that might work would be you & team do the service-template-node updates (Cassandra support, etc. as discussed above), and we (mostly Lex) do the AQS rewrite (maybe even TypeScript - if we can trick Lex :P). Thoughts?

Milimetric triaged this task as Medium priority. May 10 2021, 7:13 PM
Milimetric updated the task description.
Milimetric added a subscriber: WDoranWMF.
Eevans updated the task description.
Eevans edited subscribers, added: FGoodwin, Clarakosi; removed: lexnasser, elukey, Nuria.
Eevans updated the task description.

We can use https://gitlab.wikimedia.org/eevans/aqs to get started, and open merge-requests for code review there. Depending on the state of the Gitlab rollout as we get nearer completion, we can either move it to a dedicated project, or set it up in Gerrit in the usual way.

If there are no objections, we'll implement the new service in Go.

DAbad added subscribers: nnikkhoui, BPirkle, DAbad.

AQS 2.0 to API Platform

  • as per team meeting AQS 2.0 work will begin to fall under the API Platform workstream
  • @Eevans will start to include @nnikkhoui and @BPirkle in code reviews & in 2x weekly standups for AQS 2.0
DAbad changed the task status from Open to In Progress. Apr 26 2022, 2:22 PM
DAbad claimed this task.
DAbad raised the priority of this task from Medium to High.

August 30, 2022

BPirkle added a subscriber: Unknown Object (User). Sep 7 2022, 1:43 PM