The Wikistats 2 API provides access to edits, edited-pages, editors and newly-registered-users.
See: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
Status | Subtype | Assigned | Task
---|---|---|---
Stalled | | None | T324931 Clean up open RESTBase related tickets
In Progress | | None | T262315 &lt;CORE TECHNOLOGY&gt; API Migration & RESTbase Sunset
In Progress | | DAbad | T263489 AQS 2.0
Invalid | | BPirkle | T288301 AQS 2.0: Wikistats 2 service
Invalid | | FGoodwin | T317728 AQS 2.0: Wikistats2: Create and initialize GitLab repository
Invalid | | None | T317729 AQS 2.0: Wikistats2: Implement endpoints
Invalid | | None | T317730 AQS 2.0: Wikistats2: Implement Unit Tests
Invalid | | None | T317731 AQS 2.0: Wikistats2: Implement Test Framework
Invalid | | None | T317732 AQS 2.0: Wikistats2: Create OpenAPI Spec
Resolved | Spike | SGupta-WMF | T319686 Define technical requirements for remaining scope of AQS 2.0
Resolved | Spike | VirginiaPoundstone | T328614 AQS 2.0 remaining scope work breakdown structure
Quick note on this one: the backend for these metrics is not Cassandra but Druid, making the implementation somewhat different. Happy to talk about this more as needed.
Yes, we were (so far) putting a pin here and planning to circle back (because it is more opaque to us than the others). For example: Ideally the dev/test setup (T288160) would have Druid running with test data in place, and I'm not (yet) sure how deep that rabbit hole goes.
Wikistats is the only one that uses Druid though, yes?
@JAllemandou , we're finally getting ready to do actual work on this endpoint and could use some advice, specifically on best practices for testing.
For the endpoints that pull data from Cassandra, we have a Docker Compose test environment that Frankie and Erik set up. It lets us conveniently run local Cassandra with known test data against which to execute our (still somewhat formative) unit tests.
We're wondering (1) whether it is feasible to do something similar for Druid, (2) whether there is existing work we could leverage, or (3) whether we should use a different approach altogether for Druid.
My personal knowledge of Druid consists of playing one in D&D when I was a teenager, which so far has proved less than helpful. Although sometimes it feels like I've cast Entangling Roots on myself. ;-)
I see there is an apache/druid image on Docker Hub. I also see some bits about fake-druid in the existing production repo.
We can continue digging into all this on our own, but thought we'd ask before we burned too much time, in case there was an existing path (or at least an idea about one), so that we at least start in the right general direction.
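In case it helps the discussion, here's a minimal sketch of what the fake-druid approach could look like in Go (assuming Go for the new services; the package and function names are hypothetical): an httptest stub that answers Druid's native query endpoint (POST /druid/v2) with canned JSON, so unit tests don't need a real Druid cluster.

```go
// Hypothetical sketch of a fake-Druid stub for unit tests; names and the
// canned response are illustrative, not the real production test setup.
package druidtest

import (
	"io"
	"net/http"
	"net/http/httptest"
)

// NewFakeDruid returns a test server that answers any POST to /druid/v2
// with a fixed JSON body, so endpoint code can be exercised without a
// real Druid cluster.
func NewFakeDruid(cannedResponse string) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/druid/v2", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "expected POST", http.StatusMethodNotAllowed)
			return
		}
		io.Copy(io.Discard, r.Body) // drain the query body
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, cannedResponse)
	})
	return httptest.NewServer(mux)
}
```

Tests would point the service's Druid base URL at the stub server's URL; the apache/druid image could still back a heavier integration-test tier if we want one.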
[edited with correct link - thanks @Milimetric ] Hi @BPirkle - I can't help with the entangled roots unfortunately - the poor warrior I am would not deal with any magic by any means :)
As for the system behind AQS, I used two ways of testing when I developed the endpoint (it was quite some time ago!):
The code for the mediawiki-history endpoint is built using a Druid query DSL to help generate queries: https://github.com/wikimedia/analytics-aqs/blob/master/lib/druidUtil.js
I don't know how you plan on building the new endpoint, but the DSL proved useful in the previous case.
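To make the DSL idea concrete, here is a hypothetical Go equivalent (not the actual druidUtil.js API): queries assembled from typed structs and marshaled into Druid's native query JSON. The datasource and aggregation names below are placeholders, not the real schema.

```go
// Hypothetical sketch of a typed Druid query builder; field names follow
// Druid's native query JSON, everything else here is a placeholder.
package druid

import "encoding/json"

type Aggregation struct {
	Type      string `json:"type"`
	Name      string `json:"name"`
	FieldName string `json:"fieldName"`
}

type TimeseriesQuery struct {
	QueryType    string        `json:"queryType"`
	DataSource   string        `json:"dataSource"`
	Granularity  string        `json:"granularity"`
	Intervals    []string      `json:"intervals"`
	Aggregations []Aggregation `json:"aggregations"`
}

// NewEditsQuery assembles a monthly edit-count query; the JSON it
// produces is the kind of body that gets POSTed to /druid/v2.
func NewEditsQuery(datasource, start, end string) ([]byte, error) {
	q := TimeseriesQuery{
		QueryType:   "timeseries",
		DataSource:  datasource,
		Granularity: "month",
		Intervals:   []string{start + "/" + end},
		Aggregations: []Aggregation{
			{Type: "longSum", Name: "edits", FieldName: "events"},
		},
	}
	return json.Marshal(q)
}
```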
Also, in the existing version, multiple external endpoints in https://github.com/wikimedia/analytics-aqs/tree/master/v1 (to be precise: bytes-difference, edited-pages, editors, edits and registered-users) were all redirecting toward a single internal module in https://github.com/wikimedia/analytics-aqs/blob/master/sys (mediawiki-history-metrics.yaml and mediawiki-history-metrics.js). I wonder whether the same pattern is to be replicated or not ...
Happy to discuss more on anything related, don't hesitate to ping :)
It took both @JAllemandou 's comments above and @Milimetric 's comments on T311190 before I really understood how the existing production system works. At least, I think I understand it now, so I'm going to restate it and someone can correct me if I'm still off-base.
wikistats2 contains the following endpoints:
/metrics/edits/
/metrics/bytes-difference/
/metrics/edited-pages/
/metrics/editors/
/metrics/registered-users/
(Some of these offer various statistics under subpaths such as metrics/edited-pages/top-by-edits vs metrics/edited-pages/top-by-net-bytes-difference, see the docs for details.)
All this data ultimately comes from Druid, but it does so (at least in some cases) by way of these internal (non-public) endpoints:
/digests/
/revisions/
We can see this in places like this yaml file.
Whether or not we use this model in AQS 2.0 is still TBD, but it'd make sense to consider some variation on it. I'm not sure whether exposing a private endpoint fits our model, or if we might instead use something like common functions or a common library to accomplish something similar. But we might use the existing production system to help us identify what functionality would be useful to group.
Here's my current understanding of the full set of endpoints we need for AQS 2.0 (including all endpoints, not just wikistats2):
https://docs.google.com/spreadsheets/d/1nl-4zjd5OfbgINsVGwEc5jh5_xEexz8H7-c5ZIFpopk/edit#gid=0
@BPirkle: sorry so late. Only one tiny misunderstanding left, I think. Wikistats 2 pulls data from all the endpoints you have in your spreadsheet. For example:
pageviews: https://stats.wikimedia.org/#/all-projects/reading/total-page-views
mediarequests: https://stats.wikimedia.org/#/all-projects/content/total-mediarequests
editors: https://stats.wikimedia.org/#/ro.wikipedia.org/contributing/editors
This may be useful if you want to see data flowing, because you can navigate the interface and log XHRs on the console to see how it's hitting the APIs.
Thanks! This is actually great timing, because we're close to starting what we were about to call the "Wikistats 2" service and are therefore finally starting to think in detail about it.
I was already wondering if "Wikistats 2" was really the best name, and also if it would be more consistent with how we're approaching other endpoints in AQS 2.0 to subdivide the remaining endpoints into separate services. (I had a conversation with Erik to that effect, and he didn't disagree.) It seems to me like we use the term "Wikistats 2" a little imprecisely right now (or are at least in danger of doing so), by having it refer to both a user-facing client and to a subset of endpoints used by that client. So maybe this is an opportunity to clarify.
There are several lines that we could split on. Thus far, we've been more-or-less splitting on RESTBase module boundaries, which correspond roughly to public paths (at least, the path component right after "metrics/" in URLs like "/metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"). That's reasonable, works pretty well, and I don't think we need to change much that we've already done. (We might change the editors service, see below.)
However, what we have been calling "wikistats 2" doesn't divide quite so cleanly. And in fact, the term "wikistats" does not appear in the existing production code. Some of its public-facing paths that look similar are, from a db query perspective, very different. Some endpoints that involve very similar queries are located at very different paths. So if we group by path, we end up duplicating query code. If we group by similar query, we end up splitting similar paths into several services.
The existing production AQS code grapples with this by having private endpoints at paths "/revisions/" and "/digests/". The public facing endpoints internally either handle the request themselves, or point at either /revisions/ or /digests/. Then the code that actually handles "/revisions/" and "/digests/" makes one kind of query. In this way, endpoints located at very different public paths but with similar query needs don't duplicate db query code.
Here's a little table that hopefully explains this better than all those words:
public path | private path(s)
---|---
bytes-difference | revisions
registered-users | registered-users
edits | revisions
edited-pages | edited-pages, revisions, digests
editors | editors, revisions, digests, plus a Cassandra-based endpoint
When the table says something like "registered-users" as a private path for the registered-users public path, it just means that registered-users has its own query code (instead of using revisions or digests). So our messiest public path, "editors", has some subpaths handled by its own query code, some handled internally by "/revisions/", some handled internally by "/digests/", and one that hits Cassandra instead of Druid and lives in a completely separate service.
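Schematically, the sharing pattern looks something like this (a hypothetical Go sketch, not the actual production code): public handlers for very different paths delegate to one shared revisions-style query implementation, so the Druid query code lives in a single place.

```go
// Hypothetical sketch of the sharing pattern; none of these names come
// from the production code.
package service

import "context"

// RevisionsStore plays the role of the private "/revisions/" endpoint:
// the single place that builds and runs the revisions Druid query.
type RevisionsStore struct {
	// druid client, datasource name, etc. would live here
}

// CountRevisions is the one shared implementation of the revisions query.
func (s *RevisionsStore) CountRevisions(ctx context.Context, project, filter string) (int64, error) {
	// build the Druid query from project/filter and execute it (elided)
	return 0, nil
}

// Public endpoints at very different paths reuse the same store instead
// of duplicating query code.
func EditsPerProject(ctx context.Context, s *RevisionsStore, project string) (int64, error) {
	return s.CountRevisions(ctx, project, "edits")
}

func BytesDifferencePerProject(ctx context.Context, s *RevisionsStore, project string) (int64, error) {
	return s.CountRevisions(ctx, project, "bytes-difference")
}
```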
Looking at this another way, here are the public paths that each of "/revisions/" and "/digests/" handles:
revisions | digests
---|---
edited-pages | edited-pages
editors | editors
bytes-difference |
edits |
We could do a similar grouping without the internal paths, by instead having helper functions in the service that make the similar queries. But that leaves us putting 17 of our 27 AQS 2.0 endpoints in one service. If part of our goal was to try using something more like microservices, this starts feeling less "micro".
Or we could put the db query code into a different repository and import that into multiple services, broken up in whatever way we think makes sense.
The "editors" path is the trickiest, because most of its subpaths hit Druid data, but one hits Cassandara data. The way we've divided up testing environments makes it inconvenient (but not impossible) to have one "microservice" execute tests against multiple testing environments. At a minimum, it means that testing a service that hits both Cassandra and Druid means spinning up two Docker environments, which is slower and more memory intensive than spinning up just one.
Options:
That was a lot of words. We should decide what we want to do in the next few days, so I can adjust tickets for next sprint. But we don't have to decide today. So please read over this, think about the options, see if you have any questions or different/additional suggestions, and we can decide how to proceed.
(Also, Dan or Joseph, please correct me if you see any of that I got wrong!)
@BPirkle: that's well summarized, I think you captured the messiness of this in ways that I didn't see as we were building it. I think the cleanest way forward is to group endpoints into services based on query similarity, but not to force it where it's not natural. Basically, your option 1 with a tweak I'm going to think out loud about below.
I hope we don't change data stores often, but if we moved mediawiki_history from Druid to ClickHouse, for example, the organization in option 1 would be the easiest to interact with. In simpler cases, like if Druid launches a new feature, the same reasoning applies. And endpoints themselves are very unlikely to change in ways that are unrelated to their underlying storage. So option 1 seems solid. And that leaves two questions: how does that look for the rest of the datasets we query and, of course, naming.
First, how rigid should the endpoint -> service -> dataset -> data store grouping be? A service with all the endpoints querying mediawiki-history makes sense, but editors/by-country makes sense in there too. But it queries Cassandra. Writing this down so I can see it:
endpoints | service | source data | datastore
---|---|---|---
bytes-difference, registered-users, edits, edited-pages, editors (except editors/by-country) | editing_analytics | mediawiki_history | Druid
pageviews (per article, aggregate, etc.) | page_analytics | pageview_hourly | Cassandra
mediarequests (top, per file) | media_analytics | mediarequest_hourly | Cassandra
editors (just editors/by-country) | ?? | geoeditors | Cassandra
It doesn't really make sense to group editors/by-country in page_analytics just because it queries Cassandra, but it might get lost in editing_analytics. I can see two options:
Finally, naming. For better or worse, the dataset is called mediawiki_history, and we named it that, so we're biased. If you wanted something more specific we could maybe go with mediawiki_olap or mediawiki_analytics? @JAllemandou or @mforns, do you have thoughts on this, since we worked on it together?
Circling back to this finally and reading @Milimetric comment in detail. I mostly agree. Some thoughts after further consideration:
I like the "*_analytics" naming suggestion, because it implies connection among the various services, while still being clear in how they differ. Our current naming convention lacks that.
The single "editors" endpoint that hits Cassandra is indeed awkward, however we handle it. For me, the need to spin up a different testing container for Cassandara vs Druid is the deciding factor in keeping it separate from the rest of the "editors" endpoints - I really dislike the idea of requiring devs to fire up both testing envs to run tests locally, and it requiring CI to only spin up one testing env per push (assuming we're able to integrate this testing approach into CI) seems friendlier.
Until now, our service names have followed the url path. If we use the service names from the @Milimetric comment above, we're deviating slightly from that. And I'm starting to feel that's a good thing. Changing the naming to be less strict would be a bit more flexible for future additions. And it would also allow us to use the service name "geoeditor_analytics" for that one awkward endpoint. That seems to me as good a name as any, and provides a reasonably intuitive distinction between that one endpoint and the other "editors" endpoints.
As for the endpoints that hit mediawiki_history: in a comment on T317728: AQS 2.0: Wikistats2: Create and initialize GitLab repository, @Aklapper made some good points about why using any form of the term "wikistats" may be inadvisable. If we keep them all together, I like the "editing_analytics" suggestion.
Putting all that together gives us the following:
endpoint(s) | service name | datastore
---|---|---
pageviews/per-article, pageviews/aggregate, pageviews/top, pageviews/top-by-country, pageviews/top-per-country | page_analytics | Cassandra
unique-devices | device_analytics | Cassandra
mediarequests/aggregate, mediarequests/per-file, mediarequests/top | media_analytics | Cassandra
editors/by-country | geoeditor_analytics | Cassandra
edits, bytes-difference, edited-pages, editors, registered-users | editing_analytics | Druid
I still wonder if we should break up the editing_analytics endpoints somehow, but I don't see a good way that isn't messy. So maybe we just leave them all together.
Renaming codebases that we're pretty far along with would normally be disruptive. But it appears we're going to need to switch from GitLab to Gerrit for deployment anyway. So that gives us an at-least-slightly-less annoying opportunity to change the naming, if we decide to.
Opinions welcome.
Edit to add: I typed "geoeditor_analytics", but Dan suggested "geo_analytics". I'm good either way.
It sounds like the first four lines of the above table are non-controversial. This allows us to move forward with moving the current GitLab Unique Devices service to gerrit under the name "device_analytics". That means T320983: Move uniqueDevices service repo from Gitlab to Gerrit is unblocked, and we can proceed with using device_analytics to work through the deployment process.
We may still break up the proposed "editing_analytics" service into multiple services (and therefore can't close this task yet). But that shouldn't affect device_analytics.
I am arriving at the conversation a little late. I am curious about the reason to separate geoeditor from the editing analytics services?
Mostly because it is Cassandra-backed while the others are Druid-backed.
AQS 2.0 is envisioned as a collection of focused services, rather than a single service. Part of this focus is (or at least, could be) that each service needs to retrieve data from only one location.
A non-obvious implication is that for local development, the AQS 2.0 services fire up a Docker-based environment as a test data source. There is one such environment for Cassandra and another still-somewhat-formative one for Druid. The hope is that these environments/containers/images/whatever-they-become can be adapted for use in CI testing. Requiring that both Cassandra and Druid containers be spun up to test a single service seemed like unnecessary overhead, especially given that the decision was already made to break out endpoints into separate services, and that there was already precedent for a single-endpoint service (Unique Devices/device_analytics).
Does that sufficiently answer the question? Happy to talk more about it.
Looking at this again, I feel like there are no objectively ideal lines on which to divide these endpoints, and that any particular person's preference would depend on their perspective:
None of those sound like bad arguments to me. But we have to pick something.
I'm concerned we could end up in a position of paralysis-by-analysis (and maybe already have) on this. So I'm going to make a new proposal, and unless someone absolutely hates it and just can't live with it, I'm going to create implementation tasks and start this moving forward. That doesn't mean I'm trying to act unilaterally and/or am not interested in hearing objections. But if you just have a mild preference for a different division, I'm going to ask you to just live with it. If you have a strong opinion and/or see objective technical reasons why what I propose below is a terrible idea, please do state them. I also realize that I'm posting this in the afternoon of the last workday before the long winter holiday. I won't actually start any of this until after the new calendar year, to give people time to read and comment.
Proposal:
Break the Druid-based endpoints into two services, as detailed in the table below.
This is a reasonably straightforward division to understand and explain, and it maps paths to services in a conceptually clean way. It keeps similar code together for cohesion. And by pushing as much code as is reasonably possible into a reusable package, it minimizes the effects of storage/schema changes.
The "reusable package" will contain (at least):
This means the Druid-based endpoints will be divided as follows:
service | endpoint paths
---|---
editor_analytics | metrics/editors/, metrics/registered-users/
edit_analytics | metrics/edits/, metrics/bytes-difference/, metrics/edited-pages/
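To sketch how the reusable package could tie this together (the module path and NewEditsQuery helper below are made up for illustration), both services would import the shared Druid query code, so a storage or schema change lands in one place:

```go
// Hypothetical usage from either service's code; the module path and
// NewEditsQuery helper are illustrative, not an existing package.
package main

import (
	"fmt"

	druid "example.org/aqs/mediawiki-history-druid" // the shared "reusable package"
)

func main() {
	// Build the JSON body that would be POSTed to Druid.
	body, err := druid.NewEditsQuery("mediawiki_history", "2022-01-01", "2023-01-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```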
If you feel like this is an absolutely horrible idea, please say something. Otherwise, we'll proceed along these lines in January.
Closing as invalid. We will instead do T327817: Edit Analytics Service and T327818: Editor Analytics Service