
AQS 2.0: Editor Analytics: Implement endpoints
Closed, ResolvedPublic

Description

Implement endpoints for the Editor Analytics service. These endpoints are Druid-based.

Completion criteria: the following endpoints are implemented:

  • editors/aggregate
  • editors/top-by-edits
  • editors/top-by-net-bytes-difference
  • editors/top-by-absolute-bytes-difference
  • registered_users/new

See the parent task for discussion of a reusable package for commonalities between endpoints in this service and endpoints in the Edit Analytics service. There is additional related discussion in T288301: AQS 2.0: Wikistats 2 service.

Extremely rough proof-of-concept code for querying Druid can be found here.

Useful information about the Druid schema can be found here.

Event Timeline


At this moment, all the endpoints are implemented with the following considerations:

  • All the endpoints use the builder pattern to build and run the query. Is the exploratory Druid code OK? (builder pattern vs. SQL vs. native queries)
  • I think the aggregate query should be improved, because I had to filter some data manually outside the Druid query (aggregate_data.go). Is there a builder pattern limitation?
  • Responses have been compared with AQS 1.0 in production. In a couple of cases, e.g. new_registered_users, the AQS 1.0 response values don't match the AQS 2.0 ones, but the latter do match the production data. More exploration is needed to figure out what's wrong.

All the endpoints use the builder pattern to build and run the query. Is the exploratory Druid code OK? (builder pattern vs. SQL vs. native queries)

I like the builder pattern.

I think the aggregate query should be improved, because I had to filter some data manually outside the Druid query (aggregate_data.go). Is there a builder pattern limitation?

You could:

  • Look through the existing AQS 1.0 code to see how it constructs its query.
  • If you're brave enough (I'm not sure I would be), set up a local installation of AQS 1.0 that executes against our mock datasets and have it print its query.
  • Ask the Data Engineering folks. They may be able to just tell you what query they're executing and/or pull an example from Druid query logging (if such logging exists).

Thanks @BPirkle!

Both issues (registered users and aggregate) have been fixed after asking the Data Platform team and debugging the AQS 1.0 test suite a bit. They gave me a different point of view on how to use the data to build the right Druid queries.

Because it could be useful to look at AQS 1.0 again to learn how to build the right queries against Druid, I'm going to paste here some tips that @Milimetric gave me to fix these issues:

hi @Sfaci, when we developed this we used a query builder pattern to abstract a little on top of straight Druid. We went back and forth on that, but as a result the queries are hard to read. As Bill suggested on the task, you can look at the code implementing this endpoint. You'd find it by looking at this yaml (line 24 is the endpoint you're looking for), then the js code implementing that operation. You can either add a console.log after line 391 there and run the unit tests, or you can trace the calls to the druid util lib to figure out what it's building. If you need to run AQS, just install nvm and use node 10.24 before you npm install and npm test, as that's the last version of node this thing runs on :)
similarly, I see in the task that your newly registered editors numbers don't match; you can find that code here.

I have pushed a new patch set with all the changes mentioned in my last comment.

At this moment, the MR remains marked DNM, but the builder pattern queries seem to be working well for all endpoints. That's why I think we can move this ticket to code review, just to start discussing whether all this code is a reasonable approach or whether we need to think about it more.

Change 924549 abandoned by Santiago Faci:

[generated-data-platform/aqs/editor-analytics@main] [DNM] All endpoints are already added and they seem to work. Some improvements have been added regarding how to write the code that communicates with Druid and manages its structures

Reason:

https://gerrit.wikimedia.org/r/924549

Change 930760 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/editor-analytics@main] all endpoints implemented

https://gerrit.wikimedia.org/r/930760

In the last few days, some validation issues have been fixed and the middleware function has been changed to include the 'ETag' header.

Change 924549 restored by Santiago Faci:

[generated-data-platform/aqs/editor-analytics@main] [DNM] All endpoints are already added and they seem to work. Some improvements have been added regarding how to write the code that communicates with Druid and manages its structures

https://gerrit.wikimedia.org/r/924549

Change 924549 merged by Santiago Faci:

[generated-data-platform/aqs/editor-analytics@main] All endpoints are already added and they seem to work. Some improvements have been added regarding how to write the code that communicates with Druid and manages its structures

https://gerrit.wikimedia.org/r/924549

Change 930760 merged by Santiago Faci:

[generated-data-platform/aqs/editor-analytics@main] all endpoints implemented

https://gerrit.wikimedia.org/r/930760

Sfaci removed a project: Patch-For-Review.

After the last merge (which includes endpoints + unit tests + integration tests + docs + spec endpoint), some refactoring of the common Druid code must be done. Moving to "Next Up" to keep working on that.

This Druid-based service should be the first one to be refactored, because its tests and itests are already done and we can take advantage of that during and after refactoring (just running them to be sure the functionality hasn't been broken). I'll take this ticket while extracting common Druid-based code in T343907: AQS 2.0: aqsassist - Extract some repeated common code from services to this module, just to point that out. In theory, service endpoints are the only thing we should have to refactor.
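To make the refactoring goal concrete, the kind of extraction T343907 aims at might look like the sketch below: a shared, generic runner that each endpoint feeds a query and a row-mapping function. Everything here (DruidClient, RunQuery, fakeClient) is an illustrative assumption, not the actual aqsassist API:

```go
package main

import "fmt"

// DruidClient abstracts "send a query to the Druid broker, get rows back".
// A real implementation would do an HTTP POST; fakeClient stands in for it.
type DruidClient interface {
	Execute(query string) ([]map[string]any, error)
}

type fakeClient struct{ rows []map[string]any }

func (c fakeClient) Execute(query string) ([]map[string]any, error) {
	return c.rows, nil
}

// RunQuery is the common code path each endpoint could reuse: execute the
// query, then hand every row to an endpoint-specific mapping function that
// turns the generic map into that endpoint's response type.
func RunQuery[T any](c DruidClient, query string, mapRow func(map[string]any) T) ([]T, error) {
	rows, err := c.Execute(query)
	if err != nil {
		return nil, err
	}
	out := make([]T, 0, len(rows))
	for _, r := range rows {
		out = append(out, mapRow(r))
	}
	return out, nil
}

func main() {
	c := fakeClient{rows: []map[string]any{{"editors": 42.0}}}
	res, _ := RunQuery(c, "{}", func(r map[string]any) int {
		return int(r["editors"].(float64))
	})
	fmt.Println(res) // [42]
}
```

With a shape like this, only the query construction and the row mapping remain per-endpoint, which matches the comment above that "service endpoints are the only thing we should have to refactor".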

Don't forget to review how we are using the 'all-projects', 'all-wikipedia-projects', . . . filters, because we only realized this after finishing the endpoint implementations. In Cassandra, the data is already prepared to use them directly (it is denormalized by project family), but in Druid the data is not denormalized by that criterion. We need to code some logic that can look for a specific project, for all projects, and also for a project family (all-wikipedia-projects, all-wiktionary-projects, . . . ) for some specific endpoints. We should take this opportunity to implement that properly.
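A rough sketch of that filter logic, assuming the Druid datasource has a project dimension holding values like en.wikipedia. ProjectFilter and the exact filter shapes are hypothetical, but the three cases (everything, one family, one project) are the ones described above:

```go
package main

import (
	"fmt"
	"strings"
)

// ProjectFilter maps the API's project parameter to a Druid filter fragment,
// expressed here as a simple string map. Since the Druid data is not
// denormalized by project family, the family case has to be handled with a
// pattern match instead of a direct lookup.
func ProjectFilter(project string) map[string]string {
	switch {
	case project == "all-projects":
		// No project filter at all: aggregate over everything.
		return nil
	case strings.HasPrefix(project, "all-") && strings.HasSuffix(project, "-projects"):
		// e.g. "all-wikipedia-projects" -> regex filter on the family suffix.
		family := strings.TrimSuffix(strings.TrimPrefix(project, "all-"), "-projects")
		return map[string]string{
			"type":      "regex",
			"dimension": "project",
			"pattern":   ".*\\." + family + "$",
		}
	default:
		// A single concrete project, e.g. "en.wikipedia".
		return map[string]string{
			"type":      "selector",
			"dimension": "project",
			"value":     project,
		}
	}
}

func main() {
	fmt.Println(ProjectFilter("all-wikipedia-projects")["pattern"])
	fmt.Println(ProjectFilter("en.wikipedia")["type"])
}
```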

Change 952469 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/editor-analytics@main] Druid code refactoring

https://gerrit.wikimedia.org/r/952469

Change 952469 merged by Santiago Faci:

[generated-data-platform/aqs/editor-analytics@main] Druid code refactoring

https://gerrit.wikimedia.org/r/952469

Sfaci moved this task from Ready for Testing to In Process on the Data Products (Sprint 01) board.
Sfaci added a subscriber: EChukwukere-WMF.

Some issues were found while working on the AQS test suite for this service, so the ticket has been moved back to "In Process" to fix them:

Before pushing the MR for the open issues, a necessary change was made in aqsassist to skip zero-fill values, so that we can respond with the right 404 Not Found error when an invalid project is requested. An MR for aqsassist is waiting to be reviewed: https://gitlab.wikimedia.org/frankie/aqsassist/-/merge_requests/24
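To illustrate why that change matters: Druid zero-fills missing time buckets, so a query for a nonexistent project can come back as all-zero rows rather than an empty result. Dropping the zero-filled rows lets the handler distinguish "no data" from "data with zeros". This is a hypothetical helper, not the actual aqsassist code:

```go
package main

import "fmt"

// Row is a simplified response row for an editors endpoint.
type Row struct {
	Timestamp string
	Editors   int
}

// DropZeroFill removes rows whose measure is zero-filled by Druid. If
// nothing survives, the caller should treat the result as "project not
// found" and return 404 instead of a list of zeros.
func DropZeroFill(rows []Row) []Row {
	out := rows[:0:0] // fresh slice, never aliases the input on append
	for _, r := range rows {
		if r.Editors != 0 {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []Row{{"2023-01-01", 0}, {"2023-01-02", 7}}
	kept := DropZeroFill(rows)
	if len(kept) == 0 {
		fmt.Println("404 Not Found")
	} else {
		fmt.Println(kept)
	}
}
```

Note that this only works for measures where a genuine zero is indistinguishable from fill; whether that holds per endpoint is exactly the kind of thing the review of the aqsassist MR should confirm.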

Change 960079 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/editor-analytics@main] Several post-implementation issues fixed

https://gerrit.wikimedia.org/r/960079

The last change includes fixes for all the issues mentioned above, with all unit and integration tests adjusted to the new behaviour.

Change 960079 merged by jenkins-bot:

[generated-data-platform/aqs/editor-analytics@main] Several post-implementation issues fixed

https://gerrit.wikimedia.org/r/960079

Change 965167 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/editor-analytics@main] Removing IP addresses where present as user_text values in the top endpoints

https://gerrit.wikimedia.org/r/965167

A new MR has been pushed to remove the IP addresses shown when they come as user_text values in some top endpoints (they have been replaced with a null value).
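The scrubbing can be done with the standard library's net.ParseIP, which recognizes both IPv4 and IPv6 literals. In this sketch (TopRow and ScrubIPs are hypothetical names, not the service's actual types), user_text is a pointer so that a nulled value serializes as JSON null:

```go
package main

import (
	"fmt"
	"net"
)

// TopRow is a simplified row from a top-by-* endpoint response.
type TopRow struct {
	UserText *string `json:"user_text"` // pointer so it can serialize as null
	Edits    int     `json:"edits"`
}

// ScrubIPs nulls out user_text values that parse as IP addresses, i.e.
// anonymous editors whose "username" is their address.
func ScrubIPs(rows []TopRow) {
	for i := range rows {
		if rows[i].UserText != nil && net.ParseIP(*rows[i].UserText) != nil {
			rows[i].UserText = nil
		}
	}
}

func main() {
	user := "ExampleUser"
	addr := "203.0.113.7"
	rows := []TopRow{{UserText: &user, Edits: 10}, {UserText: &addr, Edits: 5}}
	ScrubIPs(rows)
	fmt.Println(rows[0].UserText != nil, rows[1].UserText == nil) // true true
}
```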

Change 965167 merged by jenkins-bot:

[generated-data-platform/aqs/editor-analytics@main] Removing IP addresses where present as user_text values in the top endpoints

https://gerrit.wikimedia.org/r/965167