
Statistics generation when user specifies categories to track metrics
Closed, Resolved | Public | 5 Story Points

Description

  1. When user specifies only categories and no participants -
    • Track all edits made to all pages in that category in the specified time period
    • Note: We probably need some sane limits just in case someone adds a huge category
  2. When user specifies categories and participants -
    • Track all edits made by those participants to pages in those categories in the specified time period

Note:

  • If there is a wiki specified for an event but no categories are listed for that wiki and no participants are specified then no stats would be generated for that wiki.
  • If there is a wiki specified for an event but no categories are listed for that wiki but participants are specified, then the behavior for that wiki remains the same as before - stats for that wiki would be generated from the edits made by the participants during the event on that wiki.
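The rules above can be summarized as a small decision function. This is only a sketch in Python (Grant Metrics itself is not written in Python); the function name `stats_scope` and the returned labels are hypothetical and are not identifiers from the codebase.

```python
from typing import Sequence


def stats_scope(categories: Sequence[str], participants: Sequence[str]) -> str:
    """Return which edits are counted for one wiki of an event.

    The labels are illustrative only; they are not names used by Grant Metrics.
    """
    if categories and participants:
        # Edits by the listed participants to pages in the listed categories.
        return "participant_edits_in_categories"
    if categories:
        # All edits to pages in the listed categories, by anyone.
        return "all_edits_in_categories"
    if participants:
        # Unchanged behavior: all edits by the participants on this wiki.
        return "participant_edits_on_wiki"
    # Neither categories nor participants: no stats for this wiki.
    return "no_stats"
```

In every case the edits are further restricted to the event's time period, which is left out here for brevity.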

Event Timeline

Vvjjkkii renamed this task from Statistics generation when user specifies categories to track metrics to sycaaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from sycaaaaaaa to Statistics generation when user specifies categories to track metrics. Jul 2 2018, 4:11 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

@Niharika Would we want to add an option to include subcategories?

We briefly discussed this in our standup today.

When I wrote the ticket, I did not intend it to include subcategories for the simple reason that it would be a challenge - someone might add a mammoth category and it could go on forever. However, Leon mentioned today that he does use subcategories for Massviews and it works fine for a depth of 50.

@Mooeypoo had some concerns. Let's discuss the implementation here first and then work on it.

In any case, we should make the depth configurable so we can change it whenever need be.

MusikAnimal added a comment (edited). Jul 24 2018, 4:04 AM

In Massviews there is only an option for "include all subcategories", or by default query against a single category. I'm not sure what our users want, but from my experience it's usually zero or all subcategories. This approach seems to work so long as you keep track of what categories you've already traversed. The 50 depth limit is an additional safeguard against stack overflow. I think most categories that people would be using don't go that deep. Some relevant code at https://github.com/MusikAnimal/pageviews/blob/master/public_html/massviews/api.php#L48-L140

BUT, if the category has enough subcategories, you can still run out of memory, even with a depth of 5 or something small. We'd need to impose a limit on the number of categories to process (not just depth), and then also a limit on the pages within those categories.
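A minimal sketch of the traversal described above, written in Python rather than taken from Massviews. It assumes a hypothetical `get_subcategories(name)` callback standing in for the database query; the function name is made up, and both limits are parameters so they stay configurable. Only the depth default of 50 comes from this discussion.

```python
from typing import Callable, Iterable, List, Set


def collect_categories(
    root: str,
    get_subcategories: Callable[[str], Iterable[str]],
    max_categories: int,
    max_depth: int = 50,
) -> List[str]:
    """Breadth-first walk of the category graph with cycle protection."""
    seen: Set[str] = {root}
    ordered: List[str] = [root]
    frontier: List[str] = [root]
    depth = 0
    while frontier and depth < max_depth and len(ordered) < max_categories:
        next_frontier: List[str] = []
        for cat in frontier:
            for sub in get_subcategories(cat):
                if sub in seen:
                    continue  # already traversed: this is what breaks cycles
                if len(ordered) >= max_categories:
                    return ordered  # category cap reached: stop early
                seen.add(sub)
                ordered.append(sub)
                next_frontier.append(sub)
        frontier = next_frontier
        depth += 1
    return ordered
```

The `seen` set prevents infinite loops, the depth limit bounds how far the walk goes, and the category cap bounds memory even when a shallow category is very wide.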

We might also want a "Use subject page instead of talk page" option, to support WikiProjects, since the relevant categories go on the talk page and not the article itself.

I know it's possible to recursively go through the categories, but whenever we do that, there's usually a bunch of deeper challenges, which is why I have reservations.

  1. Categories in MW aren't a tree, they're a (sometimes a little messed-up) graph of loop-de-loops. Recursion is possible, but is usually not as straightforward.
  2. If we only wanted to show a tree (or graph, or even just 'what categories are included in parent-category X', etc.) then it's probably not too bad, but my concern is that we want to check into pages that are in these categories -- that makes it a lot more complex, and has a lot more risk of getting us into a pretty big challenging technical pit here. Articles can have multiple categories, each may be a parent or a child of the requested category, and sometimes even both (!!) and when you look for articles (for the edits that were done) that are inside the graph of connected categories, the operation can get really tangled REALLY quickly. And very expensive and non-performant.

There might be ways of overcoming the challenge, but I don't think it's as straightforward. @MusikAnimal, given that, I think the best way is to go into it with a limited test to see if it's actually doable while answering the concerns. This will give us also an idea of how deeply to go and what limitations we might want to put up to make this work right. What do you think?

aezell added a subscriber: aezell. Jul 24 2018, 9:17 PM

Since I'm not afraid to look dumb, I'll wade in. It appears from the code that @MusikAnimal linked that the recursion only happens at the category level and there's a limit of 50 levels/loops/strata/etc. on that recursion. That seems to build the list of categories which is then used as a finite list to get a list of the pages in those categories.

If I'm understanding that correctly, it seems like we would be fairly safe on the category recursion or at least as safe as a limit of 50 is. What we'd need to model is the worst-case query for pages if a large, nested, circular category setup, such as @Mooeypoo describes, is our target category. I don't know the DB schema and indices involved to know what gotchas to look for.

As I understand it, these statistics are generated on-demand. It seems to me that an even worse case might exist if a given setup for an event had multiple categories across multiple wikis that all reached our recursion limit and therefore hit our theoretical worst case for each category/wiki combination. It might not be an issue for the database but it may well be an issue for the code that parses the statistics and generates the output.

MusikAnimal added a comment (edited). Jul 24 2018, 10:25 PM

I can tell you the Massviews approach works well, but only because users tend to have some very limited categories/subcategories that they work with. It would seem from the implementation approach, the infinite "loop-de-loops" won't happen, but you could still easily run out of memory trying to recurse through a big giant container category like Category:Cities in the United States by state. Attempting to use that category in Massviews will fail. From my tests, the depth in this case must be no greater than 3. Never mind how many pages there are, or how many circular categories there are, rather the sheer number of categories itself. I think that's the issue anyway :)

The anti-infinite recursion tactic is from this snippet of code:

$newCats = array_diff(
  array_column( $res->fetch_all(), 0 ),
  $allCats
);

I just put a sane limit on that (5,000 categories is PLENTY :), and now Massviews works for Cities in the US:

$newCats = array_slice( array_unique( array_diff(
  array_column( $res->fetch_all(), 0 ),
  $allCats
) ), 0, 5000 );

That doesn't get every single page, because it's not going through every single category, but good enough, methinks. I tried this for a much higher-up category, Cities (in the whole world), and it still didn't fail. It returned roughly 18,800 pages.

But for the record, I don't think our users will want something like Cities in US/planet Earth, or Living people, because that could amount to hundreds of thousands if not millions of pages. There's no way in hell someone running an editathon will be able to get useful metrics from that.

So, I think if we put a limit on the number of categories to process (5,000 seems to work), and the number of pages to process (maybe 20,000 to match Massviews), then we're in OK shape.
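A sketch of how a page cap could be applied while remembering whether it was actually hit, so that the interface can tell the user. This is Python for illustration only; `cap_pages` is a hypothetical helper, and the 20,000 default is simply the figure suggested above.

```python
from itertools import islice
from typing import Iterable, List, Tuple

_MISSING = object()  # sentinel so that any real page ID counts as "more data"


def cap_pages(page_ids: Iterable[int], max_pages: int = 20_000) -> Tuple[List[int], bool]:
    """Return at most max_pages IDs plus a flag saying whether more existed."""
    it = iter(page_ids)
    kept = list(islice(it, max_pages))
    # Peek at one more item: if there is one, the result was truncated.
    capped = next(it, _MISSING) is not _MISSING
    return kept, capped
```

Returning the flag alongside the list keeps the "was this capped?" decision next to the code that enforces the limit.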

Note also T194707#4448979, which isn't hard to implement, but it will require an additional checkbox in the UI. So each row would have to look something like (going off of F18229193):

[category input field] [wiki dropdown] [x] All subcategories (?) [x] Use subject page instead of talk page (?)

The (?) here could be tooltips to explain what those options are for.

Finally, we should show some message if the number of pages was capped, say:

[WikiProject Women in Red] [en.wikipedia] [x] All subcategories (?) [x] Use subject page instead of talk page (?)
  ⚠️ For performance reasons this category was capped at 18,500 pages.
[Another category        ] [fr.wikipedia] [x] All subcategories (?) [ ] Use subject page instead of talk page (?)
[Foo bar category        ] [de.wikipedia] [ ] All subcategories (?) [x] Use subject page instead of talk page (?)
...

Something like that?

Also, would they not want to see how many pages were created/improved in each category? That adds a whole nother layer of complexity in the UI, so maybe save that for later.

As an aside, in looking at the mockup again, I wonder if we'd want validation to ensure that if any of the category items are blank, we disable the + button to add more. Or, do we appreciate the use case of someone knowing they want to include four categories and then clicking the + button four times before filling in any of the boxes?

Mooeypoo added a comment (edited). Jul 24 2018, 10:49 PM

I can tell you the Massviews approach works well, but only because users tend to have some very limited categories/subcategories that they work with. It would seem from the implementation approach, the infinite "loop-de-loops" won't happen, but you could still easily run out of memory trying to recurse through a big giant container category like Category:Cities in the United States by state. Attempting to use that category in Massviews will fail. From my tests, the depth in this case must be no greater than 3. Never mind how many pages there are, or how many circular categories there are, rather the sheer number of categories itself. I think that's the issue anyway :)

Yeah my concern here is that in GrantMetrics, users can have an event that includes multiple wikis (or even "All Wikipedias") and have an event that spans a long time. So far it seems most events are limited to a couple of wikis and a couple of days, but there are events (and we might see them coming in) that span months, if not sometimes a whole year, that touch multiple wikis. This has a potential to get problematic very fast.

And I agree with the general approach you're giving -- as well as the need for messaging, I just think that we need to explore how far we can go with this before we encounter more serious issues.

And we could mitigate those issues with several technical approaches, depending on what we think the users will want, how long they'll need to wait, and, potentially, a way to store/cache results.

For example, if we see that for long-term events, or for events that have multiple wikis with multiple categories, we need to split the tests, we could consider having some asynchronous testing that is then stored.
Numbers in the past don't change anyways, they just get updated as time goes further. So, we could run statistics that are limited to time periods, and add things together (and then store the results).

If I want to know the stats for eventA that took a whole year, odds are I'll be checking the stats every x weeks or so. If we store those results, we could do these queries on the difference -- ask the DB what changes it has seen from [LAST TIME WE UPDATED] until now, and add the results, then store that with the new date.
We can even store the participant names, so we validate that we're checking the same thing.

The good thing about this idea is that we can also do this to all searches that are harder; we store, recheck only what we need to, and move on.
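The store-and-update idea could look roughly like the following Python sketch. All names here (`StoredStat`, `refresh`, the `count_edits_between` callback) are hypothetical, and the callback stands in for whatever query counts edits in a time window.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StoredStat:
    total_edits: int    # running total saved by earlier refreshes
    counted_until: int  # timestamp up to which edits are already counted


def refresh(
    stored: StoredStat,
    count_edits_between: Callable[[int, int], int],
    now: int,
    event_end: int,
) -> StoredStat:
    """Query only the window since the last refresh and add it to the stored total."""
    until = min(now, event_end)
    if until <= stored.counted_until:
        return stored  # nothing new to count
    # The callback is expected to count the half-open window
    # (counted_until, until], so no edit is counted twice.
    delta = count_edits_between(stored.counted_until, until)
    return StoredStat(stored.total_edits + delta, until)
```

One caveat: this only works directly for additive numbers such as a plain edit count. A count of distinct pages edited cannot simply be summed across windows, because the same page may be edited in more than one window.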

All that said, though, I think it's good we're thinking ahead, but we need to think what the next steps are. A lot of the things that we raised make assumptions (and fears? :P ) about what might be issues. I think our next step should be doing a small-scale change for small-scale events (99% of the events are those anyways right now) and if we see there is a pattern of heavy-search events, we temporarily block them from category searches with an explanation. I think those, for the moment with the current usage will be very few and edge-case'y anyways, and it will allow us to move forward and during the exploration into implementing it, seeing what are the actual challenges we see, and allow us to make better judgment/decision for the next steps.

We might need to scope the MVP of this very tightly as a first step, and/or timebox this as a way to make sure we're not suddenly going down an endless pit without a plan.

As an aside, in looking at the mockup again, I wonder if we'd want validation to ensure that if any of the category items are blank, we disable the + button to add more. Or, do we appreciate the use case of someone knowing they want to include four categories and then clicking the + button four times before filling in any of the boxes?

Ooh, yeah, we should have validation for sure. Also, do we have category searching / autocomplete on these? We might want to use something similar to TagMultiselectWidget (I know GrantMetrics doesn't use OOUI, but there are alternatives) so it's clearer that you add multiple, but search/type once each time.
Anyways, just piggybacking on @aezell's point here with an expanded idea :p

For example, if we see that for long-term events, or for events that have multiple wikis with multiple categories, we need to split the tests, we could consider having some asynchronous testing that is then stored.
Numbers in the past don't change anyways, they just get updated as time goes further. So, we could run statistics that are limited to time periods, and add things together (and then store the results).
If I want to know the stats for eventA that took a whole year, odds are I'll be checking the stats every x weeks or so. If we store those results, we could do these queries on the difference -- ask the DB what changes it has seen from [LAST TIME WE UPDATED] until now, and add the results, then store that with the new date.

Another nice benefit of this approach is that if we store this information in the right way, we could potentially build graphs of these stats over time which might be interesting for a future iteration. Users love graphs!

We might need to scope the MVP of this very tightly as a first step, and/or timebox this as a way to make sure we're not suddenly going down an endless pit without a plan.

This seems like a good thought to keep in mind as @Niharika has indicated that her initial plans didn't include subcategories at all. Iterating to them cautiously seems like a good approach.

Still, defensive programming around the recursion, as we are already doing, is the safe bet.

I'm totally fine if we want to hold off on recursive categories, I just assumed the users would want it. Getting the subcategories isn't that bad, though -- no more than a few seconds if we use that 5,000 category limit. The number of pages returned can be quite large, but so can a single category (such as Living People).

Yeah my concern here is that in GrantMetrics, users can have an event that includes multiple wikis (or even "All Wikipedias") and have an event that spans a long time. So far it seems most events are limited to a couple of wikis and a couple of days, but there are events (and we might see them coming in) that span months, if not sometimes a whole year, that touch multiple wikis. This has a potential to get problematic very fast.

Yep, I've tried to do some things to help with this, but there is definitely room for improvement. We're currently using a job queue to distribute load. So when you request statistics, it won't run them until there's enough DB quota so as to not interrupt other queries. From our tests, generating stats isn't usually a problem. It may take a while, but it doesn't seem to ever fail.

https://tools.wmflabs.org/grantmetrics/programs/Women_in_Red/Women_in_Red_-_February_hackathon is our extreme example. All wikipedias, one year timespan, 500+ participants (including some very prolific cross-wiki editors). I just re-ran it and it took about 90 seconds to generate all the stats. I think this is acceptable, given the UI suggests it's working in the background, and that immediate results shouldn't be expected.

The biggest problem is the so-called "revision browser", accessed via the "View all data" button on the event page. For very large events, lots of wikis, lots of users, etc., it can run very slow. It shows every revision (paginated) made as part of the event, so we can't cache it indefinitely as it would require too much storage. For the above example (Women In Red hackathon), the revision browser throws a 500 due to running out of memory :( The revision browser also doesn't use the job queue, which is something that really should be fixed.

There's also a max statement time set to 900 seconds on all queries. This is just a safeguard to auto-kill queries gone wild. Looking at the logs, it does get hit occasionally. We do show an error message.

If I want to know the stats for eventA that took a whole year, odds are I'll be checking the stats every x weeks or so. If we store those results, we could do these queries on the difference -- ask the DB what changes it has seen from [LAST TIME WE UPDATED] until now, and add the results, then store that with the new date.

Another nice benefit of this approach is that if we store this information in the right way, we could potentially build graphs of these stats over time which might be interesting for a future iteration. Users love graphs!

We sort of do that now. All the stats you see are in our own db, and are only refreshed when you request them. Historical data is not kept, however. That's something we've thought about before, and I like the idea, but I don't think it makes as much sense unless we do automated refreshes of the data (T189911). My concern here is organizers who never check back on their past events, and we're still unnecessarily firing away expensive queries.

Mooeypoo added a comment (edited). Jul 25 2018, 4:06 AM

https://tools.wmflabs.org/grantmetrics/programs/Women_in_Red/Women_in_Red_-_February_hackathon is our extreme example. All wikipedias, one year timespan, 500+ participants (including some very prolific cross-wiki editors). I just re-ran it and it took about 90 seconds to generate all the stats. I think this is acceptable, given the UI suggests it's working in the background, and that immediate results shouldn't be expected.

I wonder what the number would be if we had categories and recursion to sub-categories in multiple wikis in there... :\

The biggest problem is the so-called "revision browser", accessed via the "View all data" button on the event page. For very large events, lots of wikis, lots of users, etc., it can run very slow. It shows every revision (paginated) made as part of the event, so we can't cache it indefinitely as it would require too much storage. For the above example (Women In Red hackathon), the revision browser throws a 500 due to running out of memory :( The revision browser also doesn't use the job queue, which is something that really should be fixed.
There's also a max statement time set to 900 seconds on all queries. This is just a safeguard to auto-kill queries gone wild. Looking at the logs, it does get hit occasionally. We do show an error message.

That's good, but we should try and figure out how best to avoid even getting there, especially as a general use case, which I think the heavy-load part of it might end up being. Bigger events (like the faux example of Women in Red February -- which is actually not that far from actual Women in Red events, and slightly similar ones) might end up getting super close to that limit, if not hitting it outright, and often, if we add in the recursive categories.

If I want to know the stats for eventA that took a whole year, odds are I'll be checking the stats every x weeks or so. If we store those results, we could do these queries on the difference -- ask the DB what changes it has seen from [LAST TIME WE UPDATED] until now, and add the results, then store that with the new date.
Another nice benefit of this approach is that if we store this information in the right way, we could potentially build graphs of these stats over time which might be interesting for a future iteration. Users love graphs!

We sort of do that now. All the stats you see are in our own db, and are only refreshed when you request them. Historical data is not kept, however. That's something we've thought about before, and I like the idea, but I don't think it makes as much sense unless we do automated refreshes of the data (T189911). My concern here is organizers who never check back on their past events, and we're still unnecessarily firing away expensive queries.

This is a good question, and a good challenge to solve. I have some thoughts on how to maybe try and do this, but I think it should be a separate question, especially if we want to do this automatically to all results as T189911: Consider automatic updating of event metrics suggests, so I'll wait with that until we're ready to delve into that ticket :)

MusikAnimal added a comment (edited). Jul 25 2018, 4:50 AM

That's good, but we should try and figure out how best to avoid even getting there, especially as a general use case, which I think the heavy-load part of it might end up being. Bigger events (like the faux example of Women in Red February -- which is actually not that far from actual Women in Red events, and slightly similar ones) might end up getting super close to that limit, if not hitting it outright, and often, if we add in the recursive categories.

Yeah the revision browser thing is not that well-suited for big events. For this we're doing a UNION for each wiki then putting a LIMIT on the combined results. I'm not sure of a better way to do it if we want to sort by timestamp and not wiki. Worse is attempting to download the CSV or Wikitext, which is meant to include all revisions.
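For illustration, the query shape described here (one SELECT per wiki, combined, then sorted by timestamp with a single LIMIT) could be assembled as in the Python sketch below. The helper name `build_revision_query` and the selected columns are assumptions; only the `<dbname>_p.revision_userindex` and `<dbname>_p.page` naming follows the replica tables used in this task. The sketch uses UNION ALL, which is a choice made here (the literal `wiki` column already makes rows from different wikis distinct), and it leaves out the participant and date filters.

```python
import re
from typing import List, Sequence, Tuple


def build_revision_query(
    dbnames: Sequence[str], limit: int, offset: int = 0
) -> Tuple[str, List[int]]:
    """Build one UNION ALL query across wikis, sorted by timestamp, with one LIMIT."""
    if not dbnames:
        raise ValueError("at least one wiki is required")
    selects = []
    for db in dbnames:
        # Database names cannot be bound as parameters, so check their shape first.
        if not re.fullmatch(r"[a-z0-9_]+", db):
            raise ValueError(f"unexpected database name: {db!r}")
        selects.append(
            f"SELECT '{db}' AS wiki, rev_id, rev_timestamp, page_title "
            f"FROM {db}_p.revision_userindex "
            f"INNER JOIN {db}_p.page ON rev_page = page_id"
        )
    sql = (
        " UNION ALL ".join(f"({s})" for s in selects)
        + " ORDER BY rev_timestamp DESC LIMIT %s OFFSET %s"
    )
    return sql, [limit, offset]
```

Because the ORDER BY applies to the combined result, the database still has to consider matching rows from every wiki before it can return the first page, which is consistent with the slowness described above.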

Indeed adding recursive categories into the mix could make the revision browser unusable (though I still think doing the COUNTs for the stats will be feasible). We'd at least cache all the subcategories for, say, 10 minutes, so that much doesn't have to be re-queried when browsing through the pages of results. That said, if no one has requested that we support an "include subcategories" option, there's no reason to spend too much time on it.

Overall, if we want Grant Metrics to scale we'll probably need a Cloud VPS instance where we have considerably more RAM. We might even make some of these queries on the analytics.db.svc.eqiad.wmflabs host, which is meant for longer-running queries. Just like with XTools, people who use Grant Metrics are real data-hungry, and sometimes request things that can't feasibly be computed with available infrastructure. Breaking it will always be an option :) But we could at least impose some sane limits. What I hope to do one day is run EXPLAIN on expensive queries in order to test the waters (T188677), and if the query is too expensive, then further limit the results or abort with an error message.
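A rough sketch of the EXPLAIN idea, in Python. It assumes the caller has already run EXPLAIN and holds the result as one dictionary per plan step with a `rows` estimate, as MariaDB and MySQL report. The function name and the threshold are hypothetical, and looking only at the largest per-step estimate is just one crude heuristic.

```python
from typing import Any, Iterable, Mapping


def is_too_expensive(explain_rows: Iterable[Mapping[str, Any]], max_rows: int) -> bool:
    """Return True if any plan step is estimated to examine more than max_rows rows."""
    worst = 0
    for step in explain_rows:
        estimate = step.get("rows") or 0  # EXPLAIN may report NULL for some steps
        worst = max(worst, int(estimate))
    return worst > max_rows
```

A caller could use the result to either tighten the limits before running the real query or abort with an error message.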

Hello. This discussion has been pretty interesting to follow. I think there's several important points that were brought up here so I'll try to summarize them. We can create separate issues to discuss things, as we see fit. The intention is not to fragment the discussion but rather make it more focused on specific, independent issues.

  • Subcategories in search
    • There's some concerns given the weirdness around categories (cyclical categories etc.)
    • Leon left a pretty in-depth explanation of how Massviews deals with this here: T194707#4449135, however Grant Metrics is its own beast and we're likely to run into different challenges.
    • Let's wait for Sati's opinion on this (we have a meeting tomorrow) and see if we should do this. In any case, it should not be a part of this ticket but rather have its own.
  • Using subject page instead of talk page when searching by categories
    • Brought up by Leon in T194707#4448979
    • This is pretty important and deserves its own discussion thread. I created T200373 for it.
  • Validation for categories
    • Re: the point Alex made about blank categories in T194707#4449161 - We do a very similar workflow in the app on the program creation screen where users can add organizers by hitting the + button. What happens there is that while saving, all blank fields are simply ignored. We should do the same thing here, when the user hits save, we discard empty fields (if either the wiki or the category is empty/not specified, it gets discarded). You can play around with the tool here to see.
    • About auto-completion of categories. It's a nice-to-have but let's save it for future.
    • About validation for the category itself - I think there is a case to be made for when an organizer might want to go in and add categories not yet created on the wiki in the tool ahead of the event. I will flag this for Sati's input.
  • Improving the way stats are updated for more efficiency
    • Moriel suggested a good approach for improving the way we currently generate stats in T194707#4449165
    • This is a good thing to keep in mind for when/if the tool starts becoming slow for large events
  • Graphs of stats
    • Brought up by Alex in T194707#4449177
    • This is actually something that came up a while ago (documented in T189917). We don't have any immediate plans of working on it but that can change as we get more user feedback.
  • Revision browser and categories
    • First brought up in T194707#4449342
    • My initial thought is that by adding categories, the number of edits would only go down if anything. Because it's an intersection of the participant edits on pages in those categories. But the query time will be longer, of course.
    • This is something we should break off into a sub-ticket if we see the revision browser becoming unusable as things progress.
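The "discard blank fields on save" behavior from the validation point above can be sketched as follows. This is illustrative Python; the helper name `discard_blank_rows` and the (category, wiki) tuple layout are assumptions, not the tool's actual form handling.

```python
from typing import Iterable, List, Tuple


def discard_blank_rows(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (category, wiki) rows where both fields are non-empty after trimming."""
    cleaned = []
    for category, wiki in rows:
        category, wiki = category.strip(), wiki.strip()
        if category and wiki:
            cleaned.append((category, wiki))
    return cleaned
```

With this approach the + button never needs to be disabled: rows left empty are dropped silently when the form is saved.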

About auto-completion of categories. It's a nice-to-have but let's save it for future.

It was my intention to add this with the same PR. Maybe 5-10 lines of code :)

About validation for the category itself - I think there is a case to be made for when an organizer might want to go in and add categories not yet created on the wiki in the tool ahead of the event. I will flag this for Sati's input.

Just as with usernames, categories can be renamed. For this reason we should be storing IDs. They can come back later and add the category whenever they have created it on-wiki.

Revision browser and categories
This is something we should break off into a sub-ticket if we see the revision browser becoming unusable as things progress.

It already is virtually unusable for some real events, e.g. https://tools.wmflabs.org/grantmetrics/programs/les_sans_pagEs/les_sans_pages_%28All%29. Attempting to browse the revisions is futile. There is also one event created by Pharos at WMNYC for which the revision browser errors out completely, but that appears to be a test. At any rate, it's already proven itself that this feature doesn't scale :(

Revision browser and categories
This is something we should break off into a sub-ticket if we see the revision browser becoming unusable as things progress.

It already is virtually unusable for some real events, e.g. https://tools.wmflabs.org/grantmetrics/programs/les_sans_pagEs/les_sans_pages_%28All%29. Attempting to browse the revisions is futile. There is also one event created by Pharos at WMNYC for which the revision browser errors out completely, but that appears to be a test. At any rate, it's already proven itself that this feature doesn't scale :(

I agree that we should probably make this a new ticket and that it should be fixed. Maybe there's something in the way we do the pagination that could help with the errors. What if we did a sort of infinite scroll type of behavior where we don't need to know how many there are? We just let the user page through until they are finished. If the API/function that provides the data lets us grab the first X number of items without needing to know how many there are, it might be a feasible band-aid. It seems like that code then wouldn't have to run the giant query to get ALL of the data.
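One common way to page without knowing the total is to ask for one row more than the page size and use its presence as the "there is more" signal. A Python sketch, assuming a hypothetical `fetch(limit, offset)` callback that stands in for the real query:

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fetch_page(
    fetch: Callable[[int, int], Sequence[T]],
    page: int,
    page_size: int,
) -> Tuple[List[T], bool]:
    """Return one page of items and whether a next page exists, without a COUNT query."""
    offset = page * page_size
    # Ask for one extra row; if it comes back, there is at least one more page.
    rows = list(fetch(page_size + 1, offset))
    has_more = len(rows) > page_size
    return rows[:page_size], has_more
```

This avoids counting all revisions up front, although a large OFFSET can still be slow on the database side, so it would be a band-aid rather than a full fix.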

Niharika set the point value for this task to 5. Jul 25 2018, 11:28 PM

@Niharika - I know we're talking soon, but I wanted to provide some examples re: category depth/sub-category topic.

Category depth is a really important feature if you're going to use categories to collect metrics. One of the most used tools for this is GLAMourous 2, which is a simple tool that just tells you some simple stats about that category. But everyone I've talked with mentions the "depth" part as integral to its usefulness.

Here are a few examples (all from Commons, because it's easier) to show you how sub-categories are being used:

  • Extensive (and well organized) category trees, like WLM. When a program gets to this size, some move onto more advanced tools (like SPARQL) or specialized tools (like Montage). BUT the ability for a program to move onto these advanced or specialized tools depends on them having someone in their group who knows this tech (or can get a new tool developed), and that's not common.
  • Very simple categories, like with Just for the Record, who has a "main" category, but each sub-category is basically an event.
  • Mid-sized programs where sub-categories have sub-categories, for reasons that vary, like they group things by year or topic, like AfroCROWD.

Revision browser and categories
This is something we should break off into a sub-ticket if we see the revision browser becoming unusable as things progress.

It already is virtually unusable for some real events, e.g. https://tools.wmflabs.org/grantmetrics/programs/les_sans_pagEs/les_sans_pages_%28All%29. Attempting to browse the revisions is futile. There is also one event created by Pharos at WMNYC for which the revision browser errors out completely, but that appears to be a test. At any rate, it's already proven itself that this feature doesn't scale :(

I agree that we should probably make this a new ticket and that it should be fixed. Maybe there's something in the way we do the pagination that could help with the errors. What if we did a sort of infinite scroll type of behavior where we don't need to know how many there are? We just let the user page through until they are finished. If the API/function that provides the data lets us grab the first X number of items without needing to know how many there are, it might be a feasible band-aid. It seems like that code then wouldn't have to run the giant query to get ALL of the data.

I like that idea. Another thing we could do is to break up the data by wikis or wiki families so that we don't pile everything in one gigantic query. We can make the user see one wiki/wiki family at a time. I'm not quite sure how useful seeing all data together is anyway.

I'm not quite sure how useful seeing all data together is anyway.

Sorting by date, is the only thing. I see this as quite useful, but it may not be feasible here.

Niharika updated the task description. Aug 1 2018, 4:28 AM

@MusikAnimal I made an update to the description. Take a look.

MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.
MusikAnimal added a comment (edited). Aug 7 2018, 1:47 AM

FYI I just created T201377: Improve performance of the Edit List. To save myself from doing work that will have to be reworked later, I'm going to sidestep adding support for categories in the revision browser.

PR: https://github.com/wikimedia/grantmetrics/pull/88

Note: We probably need some sane limits just in case someone adds a huge category

Currently I'm doing everything with one query, like:

SELECT COUNT(DISTINCT(page_title)) AS edited,
    IFNULL(SUM(CASE WHEN rev_parent_id = 0 THEN 1 ELSE 0 END), 0) AS created
FROM enwiki_p.page
INNER JOIN enwiki_p.revision_userindex ON rev_page = page_id
WHERE page_namespace = 0
AND rev_timestamp BETWEEN 20170609040000 AND 20180609035959
AND rev_user_text IN ('MusikAnimal', 'Samwilson')
AND page_id IN (
    SELECT DISTINCT(cl_from)
    FROM enwiki_p.categorylinks
    INNER JOIN enwiki_p.page ON cl_from = page_id
    WHERE page_namespace = 0
    AND cl_to IN (
        SELECT cat_title
        FROM enwiki_p.category
        WHERE cat_id IN (173)
    )
)

so there is no cap on the number of pages. The version of MariaDB that we're using actually doesn't even support a LIMIT within an IN subquery.

The above query is for a period of one year, scanning Category:Living people, which contains over 870,000 pages. This is an extreme example, and it ran in 4.35 seconds:

+--------+---------+
| edited | created |
+--------+---------+
|     77 |       1 |
+--------+---------+
1 row in set (4.35 sec)
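If a cap on the number of pages does turn out to be needed, one possible workaround for the LIMIT restriction is to move the subquery into a derived table and join against it, since the restriction applies to IN subqueries and not to derived tables. The sketch below is an assumption-laden illustration, not something run against the replicas: it uses a simplified schema in an in-memory SQLite database (which does not have the restriction in the first place), and `CAPPED_QUERY` and `count_edited` are hypothetical names.

```python
import sqlite3

# The page cap lives in a derived table that is joined, instead of in an
# IN (...) subquery. ORDER BY makes the capped set deterministic.
CAPPED_QUERY = """
SELECT COUNT(DISTINCT page_title) AS edited
FROM page
INNER JOIN revision ON rev_page = page_id
INNER JOIN (
    SELECT DISTINCT cl_from
    FROM categorylinks
    WHERE cl_to = :category
    ORDER BY cl_from
    LIMIT :max_pages
) AS capped ON capped.cl_from = page_id
"""


def count_edited(conn, category, max_pages):
    """Count edited pages in `category`, scanning at most `max_pages` pages."""
    row = conn.execute(
        CAPPED_QUERY, {"category": category, "max_pages": max_pages}
    ).fetchone()
    return row[0]


# Hypothetical sample data: three pages in one category, all edited.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO revision VALUES (10, 1), (11, 2), (12, 3), (13, 1);
INSERT INTO categorylinks VALUES
    (1, 'Living_people'), (2, 'Living_people'), (3, 'Living_people');
""")

print(count_edited(conn, "Living_people", max_pages=2))
print(count_edited(conn, "Living_people", max_pages=10))
```

Note that a cap like this silently undercounts once it is hit, so the tool would probably want to tell the user when that happens.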

So I think we're okay on the number of pages? This makes me confident about recursive categories. We just need IDs, and we can fit a lot of those in the cat_id IN clause and presumably it won't be that bad. With Massviews (T194707#4449135) I am fetching the category names first, and querying against those with cl_to IN ('cat_one', 'cat_two'). That seems to be the problem. I'll have to update Massviews to use the above approach.
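For the recursive case, gathering the IDs could look roughly like the sketch below: a breadth-first walk over subcategories that stops at a configurable depth and guards against category loops. Everything here is hypothetical; `collect_category_ids` is an illustrative name, and `get_subcategory_ids` stands in for whatever lookup would really fetch the direct subcategories of a category.

```python
from collections import deque


def collect_category_ids(root_ids, get_subcategory_ids, max_depth):
    """Return the root category IDs plus all descendants up to max_depth levels down.

    get_subcategory_ids(cat_id) must return the IDs of the direct subcategories.
    """
    seen = set(root_ids)
    queue = deque((cat_id, 0) for cat_id in root_ids)
    while queue:
        cat_id, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap reached: keep this category, don't descend
        for child in get_subcategory_ids(cat_id):
            if child not in seen:  # categories can loop back on themselves
                seen.add(child)
                queue.append((child, depth + 1))
    return sorted(seen)


def in_placeholders(ids):
    """Build the '?, ?, ?' list for a parameterized `cat_id IN (...)` clause."""
    return ", ".join(["?"] * len(ids))


# Hypothetical category tree; note the 4 -> 1 loop.
SUBCATS = {1: [2, 3], 2: [4], 4: [1, 5], 5: [6]}


def lookup(cat_id):
    return SUBCATS.get(cat_id, [])


ids = collect_category_ids([1], lookup, max_depth=2)
print(ids, f"cat_id IN ({in_placeholders(ids)})")
```

The resulting list would then be bound as parameters to the `cat_id IN (...)` part of the query.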

MusikAnimal moved this task from Backlog to In progress on the Grant-Metrics board. Aug 9 2018, 8:18 PM
aezell added a comment. Aug 9 2018, 8:40 PM

This seems like good data for us to proceed with. Thanks!

MusikAnimal closed this task as Resolved. Aug 16 2018, 6:53 PM

This has been merged. There's no frontend so nothing to test QA-wise yet.

MusikAnimal moved this task from In progress to Done on the Grant-Metrics board. Aug 16 2018, 7:48 PM
MusikAnimal reopened this task as Open. Sep 26 2018, 7:26 PM
MusikAnimal moved this task from Q1 2018-19 to In Development on the Community-Tech-Sprint board.
MusikAnimal added a subscriber: jmatazzoni.

When user specifies only categories and no participants -

  • Track all edits made to all pages in that category in the specified time period
  • Note: We probably need some sane limits just in case someone adds a huge category

I completely forgot about this.

@Niharika @jmatazzoni A few questions:

  • Do we only get stats for registered users? Or do we count IP edits, too?
  • Exclude bots?
  • Do we count the number of editors, across the board, and store that as the # of participants? This is a bit weird, because it may say "50 participants" but the "Participants" section doesn't show anyone. Auto-creating the participant list would be very complicated, and in my opinion even more confusing. I'm thinking that for the precomputed stats, we show "50 editors" and not "50 participants". After all, you can't really assume everyone was participating in the event. They could include drive-by patrollers or even vandals.
  • What about number of "new editors" or "7-day retention"? This is going to be pretty expensive to calculate :(

In T194707#4620358, @MusikAnimal wrote:

When user specifies only categories and no participants ...A few questions:

  • Do we only get stats for registered users? Or do we count IP edits, too?

If there is no Participant list, then it would make sense to track IP users as well if we can.

  • Exclude bots?

Yes, exclude bots.

  • Do we count the number of editors, across the board, and store that as the # of participants? This is a bit weird, because it may say "50 participants" but the "Participants" section doesn't show anyone. Auto-creating the participant list would be very complicated, and in my opinion even more confusing. I'm thinking that for the precomputed stats, we show "50 editors" and not "50 participants". After all, you can't really assume everyone was participating in the event. They could include drive-by patrollers or even vandals.

Do not auto-create the participant list. Leave it blank. As to whether it's wrong to call these users "participants," making the distinction you suggest might be nice (in some future release), but I am concerned that we will make ourselves crazy trying to keep it straight everywhere as we expand the number of reports we're going to produce. My sense is that people who don't include a participant list are content to count everyone who contributes as being (at least potentially) involved.

  • What about number of "new editors" or "7-day retention"? This is going to be pretty expensive to calculate :(

If there is no Participant list, I think it's fine to not give figures for retention or new editors. I say this not merely because it suits our convenience. My sense is that if you're not signing people up, you are probably more interested in the content than in the users per se. E.g., content drives like Wiki Loves Monuments are not focused on recruitment, compared to other events, which specifically have a training focus.

  • It would be best, however, not to simply leave these metrics fields blank. When we omit stats in this way, can we please mark them as "na"?
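Taken together, the answers above for the categories-only case (count IP edits, exclude bots, report "na" for new editors and retention) could be sketched as follows. This is only an illustration of the rules, not the actual implementation: `categories_only_stats` and the revision dictionaries are hypothetical, and a parent ID of 0 is treated as a page creation, as in the query earlier in this task.

```python
def categories_only_stats(revisions):
    """Summarize edits for an event that has categories but no participant list.

    Each revision is a dict with the keys: user, is_bot, page, parent_id.
    """
    # Bots are excluded; IP edits are kept because there is no participant list.
    counted = [rev for rev in revisions if not rev["is_bot"]]
    return {
        "editors": len({rev["user"] for rev in counted}),
        "pages_edited": len({rev["page"] for rev in counted}),
        "pages_created": sum(1 for rev in counted if rev["parent_id"] == 0),
        # Omitted metrics are marked explicitly instead of being left blank.
        "new_editors": "na",
        "retention": "na",
    }


# Hypothetical sample data.
sample = [
    {"user": "Alice", "is_bot": False, "page": "A", "parent_id": 0},
    {"user": "192.0.2.1", "is_bot": False, "page": "A", "parent_id": 5},
    {"user": "SomeBot", "is_bot": True, "page": "B", "parent_id": 0},
    {"user": "Alice", "is_bot": False, "page": "C", "parent_id": 7},
]
stats = categories_only_stats(sample)
print(stats)
```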
jmatazzoni added a subscriber: Bluerasberry. Edited Sep 28 2018, 12:09 AM

My comment in a related ticket seems relevant to these questions. There, I suggest limiting subcategories to three levels.

Also relevant is an interesting discussion of similar issues with @Bluerasberry on the Event Tool talk page under 'Comments about New data and reports in detail' (start reading at Sadads' comment and continue through my second answer to Bluerasberry).

MusikAnimal closed this task as Resolved. Sep 28 2018, 5:58 PM
MusikAnimal moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.

Thanks Joe! This is a fair amount of work, so I have created T205734, and have replied to your comments there. I guess this can be re-closed.