
AQS 2.0:Wikistats 2 service
Closed, Invalid · Public

Description

The Wikistats 2 API provides access to edits, edited-pages, editors and newly-registered-users.

NOTE: This API uses a Druid back-end.

See: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

Event Timeline

Eevans updated the task description.

Quick note on this one: the backend for these metrics is not Cassandra but Druid, making the implementation somewhat different. Happy to talk about this more as needed.

Quick note on this one: the backend for these metrics is not Cassandra but Druid, making the implementation somewhat different. Happy to talk about this more as needed.

Yes, we were (so far) putting a pin here and planning to circle back (because it is more opaque to us than the others). For example: Ideally the dev/test setup (T288160) would have Druid running with test data in place, and I'm not (yet) sure how deep that rabbit hole goes.

Wikistats is the only one that uses Druid though, yes?

Wikistats is the only one that uses Druid though, yes?

Yes!

@JAllemandou , we're finally getting ready to do actual work on this endpoint and could use some advice, specifically on best practices for testing.

For the endpoints that pull data from Cassandra, we have a Docker Compose test environment that Frankie and Erik set up. It lets us conveniently run local Cassandra with known test data against which to execute our (still somewhat formative) unit tests.

We're wondering if (1) it is feasible to do something similar for Druid, and (2) if there is existing work we could leverage, or if (3) we should use a different approach altogether for Druid.

My personal knowledge of Druid consists of playing one in D&D when I was a teenager, which so far has proved less than helpful. Although sometimes it feels like I've cast Entangling Roots on myself. ;-)

I see there is an apache/druid image on Docker Hub. I also see some bits about fake-druid in the existing production repo.

We can continue digging into all this on our own, but thought we'd ask before we burned too much time, in case there was an existing path (or at least an idea about one), so that we at least start in the right general direction.

[edited with correct link - thanks @Milimetric ] Hi @BPirkle - I can't help with the entangled roots, unfortunately - the poor warrior I am would not deal with any magic by any means :)

As for the system behind AQS, I used two ways of testing when I developed the endpoints (it was quite some time ago!):

  • Using SSH tunnels against the production cluster (not great :S)

The way the code for the mediawiki-history endpoint is built is using a Druid query DSL to help generate queries: https://github.com/wikimedia/analytics-aqs/blob/master/lib/druidUtil.js
I don't know how you plan on building for the new endpoint but the DSL has proven useful in the previous case.

Also, in the existing version, multiple external endpoints in https://github.com/wikimedia/analytics-aqs/tree/master/v1 (to be precise bytes-difference, edited-pages, editors, edits and registered-users) were all redirecting toward a single internal module in https://github.com/wikimedia/analytics-aqs/blob/master/sys (mediawiki-history-metrics). I wonder whether the same pattern will be replicated or not ...

Happy to discuss more on anything related, don't hesitate to ping :)

It took both @JAllemandou 's comments above and @Milimetric 's comments on T311190 before I really understood how the existing production system works. At least, I think I understand it now, so I'm going to restate it and someone can correct me if I'm still off-base.

wikistats2 contains the following endpoints:

/metrics/edits/
/metrics/bytes-difference/
/metrics/edited-pages/
/metrics/editors/
/metrics/registered-users/

(Some of these offer various statistics under subpaths such as metrics/edited-pages/top-by-edits vs metrics/edited-pages/top-by-net-bytes-difference, see the docs for details.)

All this data ultimately comes from Druid, but it does so (at least in some cases) by way of these internal (non-public) endpoints:

/digests/
/revisions/

We can see this in places like this yaml file.

Whether or not we use this model in AQS 2.0 is still TBD, but it'd make sense to consider some variation on it. I'm not sure whether exposing a private endpoint fits our model, or if we might instead use something like common functions or a common library to accomplish something similar. But we might use the existing production system to help us identify what functionality would be useful to group.

Here's my current understanding of the full set of endpoints we need for AQS 2.0 (including all endpoints, not just wikistats2):
https://docs.google.com/spreadsheets/d/1nl-4zjd5OfbgINsVGwEc5jh5_xEexz8H7-c5ZIFpopk/edit#gid=0

BPirkle renamed this task from AQS 2.0: Implement wikistats 2 endpoints to AQS 2.0:Wikistats 2 service.Sep 14 2022, 2:56 AM

@BPirkle: sorry so late. Only one tiny misunderstanding left, I think. Wikistats 2 pulls data from all the endpoints you have in your spreadsheet. For example:

pageviews: https://stats.wikimedia.org/#/all-projects/reading/total-page-views
mediarequests: https://stats.wikimedia.org/#/all-projects/content/total-mediarequests
editors: https://stats.wikimedia.org/#/ro.wikipedia.org/contributing/editors

This may be useful if you want to see data flowing, because you can navigate the interface and log XHRs on the console to see how it's hitting the APIs.

@BPirkle: sorry so late. Only one tiny misunderstanding left, I think. Wikistats 2 pulls data from all the endpoints you have in your spreadsheet.

Thanks! This is actually great timing, because we're close to starting what we were about to call the "Wikistats 2" service and are therefore finally starting to think in detail about it.

I was already wondering if "Wikistats 2" was really the best name, and also if it would be more consistent with how we're approaching other endpoints in AQS 2.0 to subdivide the remaining endpoints into separate services. (I had a conversation with Eric to that effect, and he didn't disagree.) It seems to me like we use the term "Wikistats 2" a little imprecisely right now (or are at least in danger of doing so), by having it refer to both a user-facing client and to a subset of endpoints used by that client. So maybe this is an opportunity to clarify.

There are several lines that we could split on. Thus far, we've been more-or-less splitting on RESTBase module boundaries, which correspond roughly to public paths (at least, the path component right after "metrics/" in urls like "/metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"). That's reasonable, works pretty well, and I don't think we need to change much that we've already done. (We might change the editors service, see below.)

However, what we have been calling "wikistats 2" doesn't divide quite so cleanly. And in fact, the term "wikistats" does not appear in the existing production code. Some of its public-facing paths that look similar are, from a db query perspective, very different. Some endpoints that involve very similar queries are located at very different paths. So if we group by path, we end up duplicating query code. If we group by similar query, we end up splitting similar paths into several services.

The existing production AQS code grapples with this by having private endpoints at paths "/revisions/" and "/digests/". The public facing endpoints internally either handle the request themselves, or point at either /revisions/ or /digests/. Then the code that actually handles "/revisions/" and "/digests/" makes one kind of query. In this way, endpoints located at very different public paths but with similar query needs don't duplicate db query code.

Here's a little table that hopefully explains this better than all those words:

| public path | private path(s) |
|---|---|
| bytes-difference | revisions |
| registered-users | registered-users |
| edits | revisions |
| edited-pages | edited-pages, revisions, digests |
| editors | editors, revisions, digests, plus a Cassandra-based endpoint |

When the table says something like "registered-users" as a private path for the registered-users public path, it just means that registered-users has its own query code (instead of using revisions or digests). So our messiest public path, "editors", has some subpaths handled by its own query code, some handled internally by "/revisions/", some handled internally by "/digests/", and one that hits Cassandra instead of Druid and lives in a completely separate service.
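As a sketch of the pattern described above, here is a minimal Go illustration of several public routes sharing one internal "revisions"-style query function instead of each owning its own query code. The route paths mirror the public API, but the function names and the JSON it produces are invented for illustration.

```go
// Two very different public paths funnel into one shared internal
// implementation, mirroring the existing /revisions/ pattern.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// revisionsQuery stands in for the shared "/revisions/"-style query code.
// In the real system this would build and run a Druid query; here it just
// returns a marker showing which metric was requested.
func revisionsQuery(metric string) string {
	return fmt.Sprintf(`{"source":"revisions","metric":%q}`, metric)
}

func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics/edits/", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, revisionsQuery("edits"))
	})
	mux.HandleFunc("/metrics/bytes-difference/", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, revisionsQuery("bytes-difference"))
	})
	return mux
}

func main() {
	srv := httptest.NewServer(newRouter())
	defer srv.Close()
	for _, p := range []string{"/metrics/edits/", "/metrics/bytes-difference/"} {
		resp, err := http.Get(srv.URL + p)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Println(p, "->", string(body))
	}
}
```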

Looking at this another way, here are the public paths that each of "/revisions/" and "/digests/" handles:

| revisions | digests |
|---|---|
| edited-pages | edited-pages |
| editors | editors |
| bytes_difference | |
| edits | |

We could do a similar grouping without the internal paths, by instead having helper functions in the service that make the similar queries. But that leaves us putting 17 of our 27 AQS 2.0 endpoints in one service. If part of our goal was to try using something more like microservices, this starts feeling less "micro".

Or we could put the db query code into a different repository and import that into multiple services, broken up in whatever way we think makes sense.

The "editors" path is the trickiest, because most of its subpaths hit Druid data, but one hits Cassandra data. The way we've divided up testing environments makes it inconvenient (but not impossible) to have one "microservice" execute tests against multiple testing environments. At a minimum, testing a service that hits both Cassandra and Druid requires spinning up two Docker environments, which is slower and more memory-intensive than spinning up just one.

Options:

  1. Instead of creating a "Wikistats 2" service for AQS 2.0, create a "mediawiki history" service. This would contain all the endpoints that expose data from the mediawiki_history_reduced dataset. Write helper functions to implement similar queries shared by different public paths. (Alternative service names I considered: "mediawiki history reduced", which seemed too long and awkward for a minimal gain in precision, and just "history", which seemed too vague.)
  2. Create multiple services, with naming inspired by the existing AQS code:
    • revisions (for bytes-difference, edits, some edited-pages endpoints, and some Druid-based editors paths)
    • digests (for some edited-pages endpoints and some Druid-based editors paths)
    • registered users
    • edited pages
    • editors
    • rename the existing "editors" service to something like "editors by country". Alternatively, call the new one something like "editors history".
  3. Add helper functions that serve the same purpose as "/revisions/" and "/digests/" into the aqsassist repository. Then create multiple services that import and use those functions as needed:
    • bytes-difference
    • registered-users
    • edits
    • edited-pages
    • editors
    • rename the existing "editors" service to something like "editors by country". Alternatively, call the new one something like "editors history".
  4. Same as #3, but create something like "aqsquery" instead of putting the new query functions into aqsassist, if we like that better.
  5. Same as #2, #3, or #4, but with just one "editors" service that serves data out of both Cassandra and Druid. Work through how to make both testing envs happy.
  6. Whatever anyone else can think of.

That was a lot of words. We should decide what we want to do in the next few days, so I can adjust tickets for next sprint. But we don't have to decide today. So please read over this, think about the options, see if you have any questions or different/additional suggestions, and we can decide how to proceed.

(Also, Dan or Joseph, please correct me if you see any of that I got wrong!)

This comment was removed by Milimetric.

@BPirkle: that's well summarized, I think you captured the messiness of this in ways that I didn't see as we were building it. I think the cleanest way forward is to group endpoints into services based on query similarity, but not to force it where it's not natural. Basically, your option 1 with a tweak I'm going to think out loud about below.

I hope we don't change data stores often, but if we moved mediawiki_history from druid to clickhouse, for example, the organization in option 1 would be the easiest to interact with. In simpler cases, like if Druid launches a new feature, the same reasoning applies. And endpoints themselves are very unlikely to change in ways that are unrelated to their underlying storage. So option 1 seems solid. And that leaves two questions: how does that look for the rest of the datasets we query and, of course, naming.

First, how rigid should the endpoint -> service -> dataset -> data store grouping be? A service with all the endpoints querying mediawiki-history makes sense, but editors/by-country makes sense in there too. But it queries Cassandra. Writing this down so I can see it:

| endpoints | service | source data | datastore |
|---|---|---|---|
| bytes-difference, registered-users, edits, edited-pages, editors (except editors/by-country) | editing_analytics | mediawiki_history | Druid |
| pageviews (per article, aggregate, etc.) | page_analytics | pageview_hourly | Cassandra |
| mediarequests (top, per file) | media_analytics | mediarequest_hourly | Cassandra |
| editors (just editors/by-country) | ?? | geoeditors | Cassandra |

It doesn't really make sense to group editors/by-country in page_analytics just because it queries Cassandra, but it might get lost in editing_analytics. I can see two options:

  1. For every (group of endpoints, datastore) pair, we make a service, something like:
    • editing_analytics_druid
    • editing_analytics_cassandra
    • page_analytics_cassandra
    • media_analytics_cassandra
  2. Separate services and factor out store-specific logic as much as possible. So editors/by-country would go into geo_analytics or something like that.

Finally, naming. For better or worse, the dataset is called mediawiki_history, and we named it that, so we're biased. If you wanted something more specific we could maybe go with mediawiki_olap or mediawiki_analytics? @JAllemandou or @mforns may have thoughts on this, since we worked on it together.

Circling back to this finally and reading @Milimetric's comment in detail. I mostly agree. Some thoughts after further consideration:

I like the "*_analytics" naming suggestion, because it implies connection among the various services, while still being clear in how they differ. Our current naming convention lacks that.

The single "editors" endpoint that hits Cassandra is indeed awkward, however we handle it. For me, the need to spin up a different testing container for Cassandra vs Druid is the deciding factor in keeping it separate from the rest of the "editors" endpoints - I really dislike the idea of requiring devs to fire up both testing envs to run tests locally, and requiring CI to spin up only one testing env per push (assuming we're able to integrate this testing approach into CI) seems friendlier.

Until now, our service names have followed the url path. If we use the service names from @Milimetric's comment above, we're deviating slightly from that. And I'm starting to feel that's a good thing. Changing the naming to be less strict would be a bit more flexible for future additions. And it would also allow us to use the service name "geoeditor_analytics" for that one awkward endpoint. That seems to me as good a name as any, and provides a reasonably intuitive distinction between that one endpoint and the other "editors" endpoints.

As for the endpoints that hit mediawiki_history: in a comment on T317728: AQS 2.0: Wikistats2: Create and initialize GitLab repository, @Aklapper made some good points about why using any form of the term "wikistats" may be inadvisable. If we keep them all together, I like the "editing_analytics" suggestion.

Putting all that together gives us the following:

| endpoint(s) | service name | datastore |
|---|---|---|
| pageviews/per-article, pageviews/aggregate, top, pageviews/top-by-country, pageviews/top-per-country | page_analytics | Cassandra |
| unique-devices | device_analytics | Cassandra |
| mediarequests/aggregate, mediarequests/per-file, mediarequests/top | media_analytics | Cassandra |
| editors/by-country | geoeditor_analytics | Cassandra |
| edits, bytes_difference, edited_pages, editors, registered_users | editing_analytics | Druid |

I still wonder if we should break up the editing_analytics endpoints somehow, but I don't see a good way that isn't messy. So maybe we just leave them all together.

Renaming codebases that we're pretty far along with would normally be disruptive. But it appears we're going to need to switch from GitLab to Gerrit for deployment anyway. So that gives us an at-least-slightly-less annoying opportunity to change the naming, if we decide to.

Opinions welcome.

Edit to add: I typed "geoeditor_analytics", but Dan suggested "geo_analytics". I'm good either way.

It sounds like the first four lines of the above table are non-controversial. This allows us to move forward with moving the current GitLab Unique Devices service to gerrit under the name "device_analytics". That means T320983: Move uniqueDevices service repo from Gitlab to Gerrit is unblocked, and we can proceed with using device_analytics to work through the deployment process.

We may still break up the proposed "editing_analytics" service into multiple services (and therefore can't close this task yet). But that shouldn't affect device_analytics.

I am arriving at the conversation a little late. I am curious about the reason to separate geoeditor from the editing analytics services?

I am arriving at the conversation a little late. I am curious about the reason to separate geoeditor from the editing analytics services?

Mostly because it is Cassandra-backed while the others are Druid-backed.

AQS 2.0 is envisioned as a collection of focused services, rather than a single service. Part of this focus is (or at least, could be) that each service needs to retrieve data from only one location.

A non-obvious implication is that for local development, the AQS 2.0 services fire up a Docker-based environment as a test data source. There is one such environment for Cassandra and another still-somewhat-formative one for Druid. The hope is that these environments/containers/images/whatever-they-become can be adapted for use in CI testing. Requiring that both Cassandra and Druid containers be spun up for testing a single service seemed like unnecessary overhead, especially given that the decision was already made to break out endpoints into separate services, and given that there was already precedent for a single-endpoint service (Unique Devices/device_analytics).

Does that sufficiently answer the question? Happy to talk more about it.

Looking at this again, I feel like there are no objectively ideal lines on which to divide these endpoints, and that any particular person's preference would depend on their perspective:

  • as someone focused on data/storage, it'd make sense to keep them all together. They all hit the same datastore, so if schemas or storage technology change, only one service needs to be adjusted
  • as someone focused on coding, it'd make sense to divide them by query type and response format. That keeps the code for each service more focused.
  • as someone focused on paths/routing, it'd make sense to divide them by the portion of the path after "metrics", thereby conveniently mapping paths to services

None of those arguments sounds bad to me. But we have to pick something.

I'm concerned we could end up in a position of paralysis-by-analysis (and maybe already have) on this. So I'm going to make a new proposal, and unless someone absolutely hates it and just can't live with it, I'm going to create implementation tasks and start this moving forward. That doesn't mean I'm trying to act unilaterally and/or am not interested in hearing objections. But if you just have a mild preference for a different division, I'm going to ask you to just live with it. If you have a strong opinion and/or see objective technical reasons why what I propose below is a terrible idea, please do state them. I also realize that I'm posting this in the afternoon of the last workday before the long winter holiday. I won't actually start any of this until after the new calendar year, to give people time to read and comment.

Proposal:

Break the Druid-based endpoints into two services:

  • editor_analytics
  • edit_analytics

This division is reasonably straightforward to understand and explain, and it maps paths to services in an intuitive way. It keeps similar code together for cohesion. And by pushing as much code as is reasonably possible into a reusable package, it minimizes the effects of storage/schema changes.

The "reusable package" will contain (at least):

  • two common query functions, to reduce code duplication. These will fill a similar need as the existing revisions/digests functionality.
  • any other common druid-specific helper function we find useful, covering similar functionality as the existing druidUtil.js
  • any common schema definitions that we find convenient, filling similar functionality as the existing mediawiki-history-schemas.yaml. We already have a pattern (used in our service config) for importing yaml info into the Go service, so we may choose to continue using yaml for schema info.
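On the last point, a minimal sketch of the config-import idea might look like the following. The schema fields, file layout, and function names here are invented for illustration, and the real services would load YAML rather than the JSON shown in this stdlib-only sketch.

```go
// Sketch of loading shared schema definitions from a config file into a Go
// service, in the spirit of the existing mediawiki-history-schemas.yaml.
package main

import (
	"encoding/json"
	"fmt"
)

// MetricSchema describes one endpoint's response shape, as it might be
// declared in a schema file shipped with the reusable package.
type MetricSchema struct {
	Name       string   `json:"name"`
	Dimensions []string `json:"dimensions"`
	ValueField string   `json:"value_field"`
}

// loadSchemas parses raw schema config and indexes the entries by name.
func loadSchemas(data []byte) (map[string]MetricSchema, error) {
	var schemas []MetricSchema
	if err := json.Unmarshal(data, &schemas); err != nil {
		return nil, err
	}
	byName := make(map[string]MetricSchema, len(schemas))
	for _, s := range schemas {
		byName[s.Name] = s
	}
	return byName, nil
}

func main() {
	// Stand-in for a shared schema file; contents are invented examples.
	raw := []byte(`[
		{"name": "edits", "dimensions": ["project", "editor-type"], "value_field": "edit_count"},
		{"name": "editors", "dimensions": ["project", "activity-level"], "value_field": "editor_count"}
	]`)
	schemas, err := loadSchemas(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(schemas["edits"].ValueField)
}
```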

This means the Druid-based endpoints will be divided as follows:

| service | endpoints |
|---|---|
| editor_analytics | metrics/editors/, metrics/registered_users/ |
| edit_analytics | metrics/edits/, metrics/bytes_difference/, metrics/edited_pages/ |

If you feel like this is an absolutely horrible idea, please say something. Otherwise, we'll proceed along these lines in January.

[ ... ]

Proposal:

Break the Druid-based endpoints into two services:

  • editor_analytics
  • edit_analytics

This division is reasonably straightforward to understand and explain, and it maps paths to services in an intuitive way. It keeps similar code together for cohesion. And by pushing as much code as is reasonably possible into a reusable package, it minimizes the effects of storage/schema changes.

The "reusable package" will contain (at least):

  • two common query functions, to reduce code duplication. These will fill a similar need as the existing revisions/digests functionality.
  • any other common druid-specific helper function we find useful, covering similar functionality as the existing druidUtil.js
  • any common schema definitions that we find convenient, filling similar functionality as the existing mediawiki-history-schemas.yaml. We already have a pattern (used in our service config) for importing yaml info into the Go service, so we may choose to continue using yaml for schema info.

This means the Druid-based endpoints will be divided as follows:

| service | endpoints |
|---|---|
| editor_analytics | metrics/editors/, metrics/registered_users/ |
| edit_analytics | metrics/edits/, metrics/bytes_difference/, metrics/edited_pages/ |

If you feel like this is an absolutely horrible idea, please say something. Otherwise, we'll proceed along these lines in January.

LGTM ¯\_(ツ)_/¯