
AQS 2.0
Open, Needs Triage · Public · 1 Estimated Story Points

Description

Analytics Query Service (AQS) is the software behind the /metrics family of endpoints in RESTBase. Essentially, it's a read-only HTTP proxy to Cassandra and Druid backends. It's currently based on a very old version of RESTBase: the codebase was initially forked from the RESTBase codebase and has received few updates over the years.

As part of the goal to sunset RESTBase, AQS has to be migrated from the RESTBase codebase to service-template-node, to be in line with the rest of the services, and exposed in API Gateway via envoy.

The plan:

  1. (Optional) Add support in service-template-node for talking to Cassandra. Since we are planning to move storage from RESTBase down to individual services, we might benefit from some shared library support. This step however is optional - perhaps we could just use Cassandra driver directly without additional abstractions. To be investigated.
  2. Finish support for Cassandra schema distribution, left over from the sessionstore project. Currently in RESTBase the schema is stored in code and can be created upon software startup. This pattern has proven unsuitable for production, so in practice the schema is created manually. In Kask, we've moved away from this pattern entirely, and the schema is only ever created manually. Automation for schema/options distribution was never finished for Kask, but if we are to start migrating more services off RESTBase, we need better ways to distribute schemas: T220246
  3. Rewrite the AQS service using service-template-node. Because the AQS codebase is quite simple, this should not be too problematic. Some of the RESTBase built-in features, for example request parameter validation, will have to be reimplemented since there's no magic support for them in service-template. Upgrade to node10 only in the meantime.
  4. Deploy AQS 2.0 on k8s. Since AQS will be based on service-template-node, existing patterns for k8s deployments of node services can be reused. Cassandra connection setup could be borrowed from either of the Kask deployments.
  5. Expose the /metrics hierarchy in API Gateway. Deprecate the /api/rest_v1/metrics hierarchy. Switch RESTBase to proxying requests to the new k8s AQS cluster instead of the old AQS cluster.
  6. Eventually phase out RESTBase /metrics hierarchy.
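
For step 3, the request parameter validation that would need reimplementing could look roughly like the sketch below. It is only an illustration: the parameter names (`granularity`, `start`, `end`), the accepted values and the timestamp formats are assumptions, not taken from the AQS codebase.

```javascript
'use strict';

// Hypothetical rules for illustration; the real AQS parameters may differ.
const GRANULARITIES = new Set(['hourly', 'daily', 'monthly']);

// Accepts YYYYMMDD or YYYYMMDDHH and checks that it denotes a real UTC date.
function isValidTimestamp(ts) {
    if (!/^\d{8}(\d{2})?$/.test(ts)) {
        return false;
    }
    const year = Number(ts.slice(0, 4));
    const month = Number(ts.slice(4, 6));
    const day = Number(ts.slice(6, 8));
    const hour = ts.length === 10 ? Number(ts.slice(8, 10)) : 0;
    // Date.UTC silently rolls over out-of-range values (e.g. Feb 30),
    // so compare the round-tripped fields with the input.
    const date = new Date(Date.UTC(year, month - 1, day, hour));
    return date.getUTCFullYear() === year &&
        date.getUTCMonth() === month - 1 &&
        date.getUTCDate() === day &&
        date.getUTCHours() === hour;
}

// Returns a list of problems; an empty list means the params are valid.
function validateParams(params) {
    const errors = [];
    if (!GRANULARITIES.has(params.granularity)) {
        errors.push(`invalid granularity: ${params.granularity}`);
    }
    for (const name of ['start', 'end']) {
        if (typeof params[name] !== 'string' || !isValidTimestamp(params[name])) {
            errors.push(`invalid ${name} timestamp: ${params[name]}`);
        }
    }
    // Pad YYYYMMDD to YYYYMMDD00 so both formats compare consistently.
    const pad = (ts) => ts.padEnd(10, '0');
    if (errors.length === 0 && pad(params.start) > pad(params.end)) {
        errors.push('start must not be after end');
    }
    return errors;
}
```

A route handler could call `validateParams(req.params)` and answer with an HTTP 400 when the returned list is non-empty.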

Solving this will let us make progress on multiple fronts: T198901 T210704 T262315

Event Timeline

Pchelolo created this task. Sep 21 2020, 6:18 PM
Restricted Application added a subscriber: Aklapper. Sep 21 2020, 6:18 PM
Pchelolo updated the task description. Sep 21 2020, 6:33 PM
  1. (Optional) Add support in service-template-node for talking to Cassandra. Since we are planning to move storage from RESTBase down to individual services, we might benefit from some shared library support. This step however is optional - perhaps we could just use Cassandra driver directly without additional abstractions. To be investigated.

agreed this could be done either way, the code that actually hits Cassandra is very tiny

  2. Finish support for Cassandra schema distribution, left over from the sessionstore project. Currently in RESTBase the schema is stored in code and can be created upon software startup. This pattern has proven unsuitable for production, so in practice the schema is created manually. In Kask, we've moved away from this pattern entirely, and the schema is only ever created manually. Automation for schema/options distribution was never finished for Kask, but if we are to start migrating more services off RESTBase, we need better ways to distribute schemas: T220246

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

  3. Rewrite the AQS service using service-template-node. Because the AQS codebase is quite simple, this should not be too problematic. Some of the RESTBase built-in features, for example request parameter validation, will have to be reimplemented since there's no magic support for them in service-template. Upgrade to node10 only in the meantime.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?

  4. Deploy AQS 2.0 on k8s. Since AQS will be based on service-template-node, existing patterns for k8s deployments of node services can be reused. Cassandra connection setup could be borrowed from either of the Kask deployments.

Ping @elukey here, this would be the biggest change and I'm not exactly sure how it impacts us. Right now AQS and Cassandra instances are colocated and I know we took some steps to isolate them but I'm not sure if it would be a pain to reconfigure AQS to run on a separate cluster. I know there are some conveniences right now as the mapping is 1/1. So when we deploy a schema change, we can depool one, deploy, test, and do the rest. I guess we could keep that 1/1 mapping but I'm just not sure if that just sounds easy but isn't.

  5. Expose the /metrics hierarchy in API Gateway. Deprecate the /api/rest_v1/metrics hierarchy. Switch RESTBase to proxying requests to the new k8s AQS cluster instead of the old AQS cluster.
  6. Eventually phase out RESTBase /metrics hierarchy.

Sounds good, no strong opinion on how that layer works.
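
For reference, the path side of that proxying layer might be as small as the sketch below. It assumes a deprecated `/api/rest_v1/metrics/...` path maps one-to-one onto `/metrics/...` on the new cluster; the plan does not confirm that mapping, and the function name is hypothetical.

```javascript
'use strict';

// Assumed prefixes, based on the hierarchy names in the plan; the real
// gateway routing may differ.
const OLD_PREFIX = '/api/rest_v1/metrics';
const NEW_PREFIX = '/metrics';

// Maps a deprecated RESTBase path to its assumed equivalent on the new cluster.
// Returns null when the path is not under the old /metrics hierarchy.
function toGatewayPath(path) {
    if (path === OLD_PREFIX) {
        return NEW_PREFIX;
    }
    // Require a trailing slash so that e.g. /api/rest_v1/metricsfoo does not match.
    if (path.startsWith(`${OLD_PREFIX}/`)) {
        return NEW_PREFIX + path.slice(OLD_PREFIX.length);
    }
    return null;
}
```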

This would be nice, we sometimes update schemas. But it is very rare, so if this is the blocker, maybe we can talk and work around it. I'd estimate implementing it would only save us 10-20 hours a year at most.

Agreed, no need to block on it, but I would love to have some progress on it or at least keep it in mind. It's not about saving work hours, it's much more about the recreate-ability of our environment.

When you say "Rewrite" I feel like you kind of just mean we'll reconfigure the interfaces, right? Like, there's no need to rewrite how the druid queries are built, but what we get passed in our handlers will just get reimplemented?

Yeah, most of the "rewrite" will be copy-paste from one place to another. There will be a small change in how the Cassandra queries are built. But yeah, "rewrite" is too big of a word for it.
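
To give a rough idea of what building a Cassandra query without extra abstractions could look like, here is a minimal sketch. The table and column names (`project`, `ts`) and the function name are hypothetical, and this is not how the current AQS code builds its queries.

```javascript
'use strict';

// Hypothetical helper: builds a parameterized CQL range query.
// Table and column names are placeholders, not the real AQS schema.
function buildRangeQuery(table, project, start, end) {
    // Identifiers cannot be bound as parameters in CQL, so the table name
    // is checked against a strict pattern before being interpolated.
    if (!/^[a-zA-Z][a-zA-Z0-9_]*(\.[a-zA-Z][a-zA-Z0-9_]*)?$/.test(table)) {
        throw new Error(`unsafe table name: ${table}`);
    }
    return {
        query: `SELECT * FROM ${table} WHERE project = ? AND ts >= ? AND ts <= ?`,
        // Values are passed separately and bound by the driver.
        params: [project, start, end]
    };
}
```

With the `cassandra-driver` package, the result could then be passed along as `client.execute(q.query, q.params, { prepare: true })`.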

I know there are some conveniences right now as the mapping is 1/1

In terms of instances, it's actually N/N: all AQS service instances can talk to all Cassandra nodes. With k8s we just add N more AQS instances and then remove the first N.

Naike set the point value for this task to 1. Fri, Oct 2, 9:42 AM