Add support for Category filtering without Participants
Closed, Resolved · Public · 5 Estimated Story Points

Description

Spun off from T194707#4620358

To make Category filter more useful—especially for events with no formal signup—we will add the ability to use Category filters without a Participants list. Organizers will be able to filter events by category, by participants or by a combination of both—but they must select at least one filter for each wiki about which they want metrics (to reduce system load and keep reports to a reasonable size).

Note: filter using only the categories named; do not include nested categories.

When there are Categories but NO Participants:

  • Users must supply at least one category for each wiki about which they want results. If they don't, they'll get no results for the unfiltered wiki (and will probably see an error message, as per T216280).
  • On Wikipedias, Category filtering will apply only to main namespace pages, which means that when no Participants are supplied, no metrics about local file uploading will be reported for Wikipedias. (Otherwise, this would break our rule that either Participants or Categories must be used, in order to reduce the scope of the results.)
  • On Commons, Category filtering applies to files, as per T214744, and will work fine without Participants.
  • On Wikidata, category filtering is not available; no metrics will be produced about Wikidata unless Participants are supplied.
  • We'll need to impose sane limits on the number of pages to process. The system will count edits by all users, including IPs, but excluding bots.
  • We'll need to test to make sure the revision browser still works as expected.

When there are Participants AND Categories (continues to work as-is):

  • On the Wikipedias for which categories are supplied, the PAGES about which results are returned will be those at the intersection of Participants AND Categories.
    • Results for locally uploaded files, however, will be governed by the Participants filter only (because categories don't affect uploaded files on Wikipedias). I.e., when a Participants list is used, metrics will be produced about all files uploaded to Wikipedias, regardless of whether a category is supplied or which categories are supplied.
  • Users will also get Wikidata metrics (because the Participants list covers the minimum requirement there as well).
  • If categories are supplied on Commons, the FILES about which metrics are returned will be those at the intersection of Participants AND Categories as per T214744.
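
The "at least one filter per wiki" rule in the two lists above can be sketched as follows. This is only an illustration of the rules as written in this description, not the Event Metrics code: the function name, the argument shapes and the "wikidata" key are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical key for Wikidata; the real tool may identify wikis differently.
WIKIDATA = "wikidata"


def wikis_with_metrics(
    wikis: Iterable[str],
    has_participants: bool,
    categories_by_wiki: Dict[str, List[str]],
) -> Set[str]:
    """Return the wikis that satisfy the 'at least one filter' rule."""
    if has_participants:
        # The Participants list covers the minimum requirement on every
        # wiki, including Wikidata.
        return set(wikis)
    # Without Participants, a wiki needs at least one category, and
    # category filtering is not available on Wikidata at all.
    return {
        wiki
        for wiki in wikis
        if wiki != WIKIDATA and categories_by_wiki.get(wiki)
    }
```

For example, an event on Commons, en.wikipedia and Wikidata that has a category on Commons only and no Participants would report metrics for Commons alone.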

Event Timeline

In T205734#4825165, @MusikAnimal wrote:

...we need better error handling. In my case, the job failed, but was still marked as "started" and hence in the interface it continued to look like the job was running.

Leon, does the solution in this ticket suffice? As you can see, it announces when the job has timed out. So would that not address your problem? Or do you need something else here? E.g., is it possible to determine that the job has stalled (and why, perhaps?) and put up an error message for that?

@MusikAnimal It might make sense to have both a job_started and a job_status in the database. Then, you could have logic that says, "If a job is 'started' and has been running for X minutes, mark it as failed." That's sort of a poor man's error catcher and cleanup logic. Ideally, the job would catch a timeout and report itself as failed, but that might be easier said than done.

MusikAnimal moved this task from In Development to Ready on the Community-Tech-Sprint board.

job_started is a bad name, that was actually a boolean indicating whether or not it started. I was thinking we could rename/repurpose the job_submitted_at to be the time of the last status update. There is some existing logic to kill stale jobs, going off of when they were submitted.
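
A minimal sketch of the cleanup idea discussed above, assuming a job record that carries a status and the time of its last status update. The `Job` class, its field names and the 10-minute default are hypothetical stand-ins (the thread only says "X minutes"); this is not the existing stale-job logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Job:
    # Hypothetical record; the real table uses different column names.
    id: int
    status: str  # e.g. "queued", "started", "complete", "failed"
    status_updated_at: datetime


def fail_stale_jobs(
    jobs: List[Job],
    now: datetime,
    max_age: timedelta = timedelta(minutes=10),  # placeholder for "X minutes"
) -> List[int]:
    """Mark jobs stuck in 'started' for longer than max_age as failed.

    Returns the ids of the jobs that were changed.
    """
    failed_ids = []
    for job in jobs:
        if job.status == "started" and now - job.status_updated_at > max_age:
            job.status = "failed"
            job.status_updated_at = now
            failed_ids.append(job.id)
    return failed_ids
```

Keying the check on the last status update rather than the submission time means a long but healthy job is not killed as long as it keeps reporting progress.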

In T205734#4825165, @MusikAnimal wrote:

...we need better error handling. In my case, the job failed, but was still marked as "started" and hence in the interface it continued to look like the job was running.

Leon, does the solution in this ticket suffice? As you can see, it announces when the job has timed out. So would that not address your problem? Or do you need something else here? E.g., is it possible to determine that the job has stalled (and why, perhaps?) and put up an error message for that?

Yes! I will go off of that. There are some copy changes I would recommend, I'll comment there.

Going to move this back to Ready and work on T207776 first, which seems to be more important.

@MusikAnimal Cool. I didn't have the full picture. All good.

@jmatazzoni @Mooeypoo Did we come to a decision about whether we are working on this ticket? It's in the sprint.

In our engineering meeting and in my discussions with Joe, we agreed to continue working toward this goal. However, we will deprioritize the optimizations for this feature until we have more of the metrics being generated. The engineers will continue to build those metrics with this feature in mind so we don't have to do a lot of rework when we come to it.

Actually, this is still in the balance. Alex, Moriel and I are talking about it this morning.

@MusikAnimal

Impose sane limits on the number of pages to process. This is the trickiest part.

I think for the sake of simplicity, for now, we can consider this number to be 50k?

The problem isn't deciding on a number, but actually imposing the limit. Currently we do category tracking as a subquery, and MariaDB apparently does not support LIMIT/OFFSET within subqueries. We'll need to construct a different query that somehow gives us the data we want without putting all the category members in memory. I'm not even certain adding a LIMIT would help, but I've not done any experimentation.

If we go with this ticket, the idea would be to use these categories to collect the PageIDs -- after which all the other queries rely on PageIDs that are stored.
We will also not recurse down to any further levels; category support would mean only the given category, without the sub categories.

If that's the case, then wouldn't it be relatively straightforward to construct a separate query that collects all Page IDs from the given categories (no recursion, no need for subcategories, no need to preserve anything in memory, and LIMIT is possible) -- and then let the rest of the operation proceed once we have Page IDs stored out of that?

The rest of the operation uses the Page IDs while being agnostic of where these IDs came from anyway. The only "added" thing we'll need to do is collect participants from these Page IDs, which we already have in the system after Max's work.
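
The standalone query proposed here might look roughly like the sketch below. It runs against an in-memory SQLite table that imitates a simplified MediaWiki `categorylinks` table (`cl_from`, `cl_to`, `cl_type`), so it is an assumption-laden illustration rather than the actual Event Metrics query; the function names and sample data are made up, and the 50,000 default is the figure floated earlier in this thread. On Wikipedias a further restriction to the main namespace would be needed, which is omitted here.

```python
import sqlite3
from typing import List


def page_ids_in_categories(
    conn: sqlite3.Connection, categories: List[str], limit: int = 50_000
) -> List[int]:
    """Collect page IDs that are direct members of the given categories."""
    if not categories:
        return []
    placeholders = ", ".join("?" for _ in categories)
    sql = (
        "SELECT DISTINCT cl_from FROM categorylinks "
        f"WHERE cl_to IN ({placeholders}) "
        "AND cl_type <> 'subcat' "  # no recursion into subcategories
        "ORDER BY cl_from LIMIT ?"  # top-level query, so LIMIT is allowed
    )
    rows = conn.execute(sql, [*categories, limit]).fetchall()
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for categorylinks (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?, ?)",
        [
            (1, "A", "page"),
            (2, "A", "page"),
            (2, "B", "page"),    # page 2 is in both categories
            (3, "B", "subcat"),  # a subcategory of B, which is skipped
            (4, "C", "page"),
        ],
    )
    return conn
```

The returned IDs could then be stored and handed to the rest of the pipeline, which does not need to know they came from categories.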

@MusikAnimal This seems to me to be fairly straightforward now with the new scope. Am I missing something, or does that look like a good plan?

Hi @MusikAnimal
We're moving this (out of the Freezer and) into Release 1, so I went over it. I deleted the following items you'd included in your original writeup. With this ticket, Participants becomes a filter—a way of excluding participants—rather than the only means of adding participants. So I don't think most users will see any contradiction in having participants but not using that filter.

  • For the precomputed stats, maybe show "50 editors" and not "50 participants", as otherwise it would say 50 participants in one place, and then in the Participants section it will list 0.
  • Show "na" or "–" for new editors / retention
jmatazzoni renamed this task from [8 hours] Add support for categories without participants to Add support for categories without participants. Feb 15 2019, 10:14 PM
Mooeypoo updated the task description. (Show Details)
jmatazzoni renamed this task from Add support for categories without participants to Add support for Category filtering without Participants. Mar 7 2019, 11:10 PM

Initial findings:

  • Most of the time, categories without participants works like a breeze :)
  • If the category is very large, it can time out on fetching the number of participants. Getting page IDs and calculating other metrics seems to be fine.
  • Assuming there are a lot of pages, the slowest part of the process seems to be fetching pageviews (though this will never time out, like the SQL queries do). For this I've created T217911.

I've got code ready to create a PR, but I'd like to get https://github.com/wikimedia/eventmetrics/pull/216 merged first, which is broadly related to T205322. This will help with debugging and profiling.

@dom_walden @MusikAnimal I tested a biggish Commons category and got the "Something went wrong" error message. To reproduce this:

  • Go to the testing event "Images from Wiki Loves Monuments 2018 in Brazil" (in the Program "joe's testing events")
  • Or create an event that lasts the entire year of 2018 and uses that category.
  • Click Update
  • Expected results: get the metrics
  • Actual results: the error comes up pretty quickly—too quickly to be about the size of the category, I think, which is about 2800 images.

Subsequent note: I tried the same commons contest for the Basque country with 900 images and it worked great. Then I tried the Philippines, with 1700, and it was fine. Then I tried Switzerland, with 2800, and it failed. So is 2000 images some kind of limit?

Clue: I tried to re-enter the Brazil category, just to make sure it was properly recognized, and got the following error message.

Screen Shot 2019-03-20 at 5.39.41 PM.png (482×1 px, 96 KB)

I was going to put this into a ticket, but thought I'd run it past you guys first, in case there is some obvious issue I didn't know about. Let me know what you want to do.

Interesting. I didn't get an email for either of these errors. That's a separate problem.

The first bug is because one of the images is apparently used on Catalan Wikibooks. I won't try to explain why this throws an error, it just does. Anyway it is fixable, maybe a 1 or 2 pointer.

I cannot reproduce the second bug. The error suggests it's trying to save a duplicate category (one that already exists for that wiki). The backend is supposed to clear out any duplicate categories to prevent this error from happening, and that worked for me.

In T205734#5042566, @MusikAnimal wrote:

The first bug is because one of the images is apparently used on Catalan Wikibooks. I won't try to explain why this throws an error, it just does. Anyway it is fixable, maybe a 1 or 2 pointer.

By the "first one," I think you mean the "Something went wrong" error message I get for "Images from Wiki Loves Monuments 2018 in Brazil." Are you writing a ticket for the Catalan Wikibooks issue, or do you want me to do it? Moriel says we don't have to wait to Estimate such bugs.

What about the other event that failed for me, Images from Wiki Loves Monuments 2018 in Switzerland? That, too, fails quickly. What is going on there? Does this need a ticket? (Are you able to get into these events I link to?)

Both Wiki Loves Monuments events fail for the same reason. Feel free to create a ticket just saying those events failed. The technical reason is too complicated to explain. It's not specifically because of Catalan Wikibooks, I just found it funny that that was the wiki that caused the failure :) Anyway, doesn't matter what the reason is, those events are broken and that would be the subject of the ticket.

(Are you able to get into these events I link to?)

All of CommTech are admins on staging and production, so yes they can see any and every event, always.

@dom_walden FYI I'm moving this back to in development until T218916 is resolved.

Meant to move this to QA, now that T218916 has been merged.

QA on this might be difficult. Keep in mind that "timeouts" are not necessarily bugs, but please do still let us know about them. In production we would be notified of these incidents automatically.

At least initially, I suggest focusing testing on realistic events, and not trying to break the system with a giant category (because it will break) 😉

@MusikAnimal When attempting to include the en.wikivoyage category "Chicagoland" in an event, after I click "Save categories" I get the error:

500: Internal Server Error
The server said: An exception occurred while executing 'INSERT INTO event_category (ec_title, ec_category_id, ec_domain, ec_event_id) VALUES (?, ?, ?, ?)' with params ["Chicagoland", 61612, "en.wikivoyage", 374]: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '374-61612-en.wikivoyage' for key 'ec_event_domains'

I have not found this with any other categories so far.
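
For illustration, duplicate category rows could be dropped before the INSERT along the lines below. The column names come from the error message above, and the uniqueness key (event, category ID, domain) is inferred from the 'Duplicate entry' text, so treat it as an assumption; the function itself is hypothetical and is not the backend's actual cleanup code.

```python
from typing import Dict, List


def dedupe_event_categories(rows: List[Dict]) -> List[Dict]:
    """Keep the first row for each (event, category, domain) combination."""
    seen = set()
    unique = []
    for row in rows:
        # Assumed to mirror the 'ec_event_domains' unique key.
        key = (row["ec_event_id"], row["ec_category_id"], row["ec_domain"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique
```

This only removes duplicates within the batch being saved; rows already stored for the event would still have to be cleared or checked separately.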

I also notice that wikidata is not included in "Per-wiki metrics" table. Should it be?

Users must supply a minimum of at least one category for each wiki about which they want results. If they don't, they'll get no results for the unfiltered wiki (and will probably see an error message, as per T216280).

Event with commons, en.wiki, en.wikivoyage, en.wiktionary, wikidata.

No Participants, Categories for some. Only shows figures for wikis with categories.

On Wikipedias, Category filtering will apply only to main namespace pages, which means that when no Participants are supplied, no metrics about local file uploading will be reported for Wikipedias. (Otherwise, this would break our rule that either Participants or Categories must be used, in order to reduce the scope of the results.)

I have seen this.

Also, the sum of metrics for event with category A and event with category B is the same as event with category A + B (except where there may be overlap in categories, such as where the same user has uploaded a file to both category A and B).

On Commons, Category filtering applies to files, as per T214744, and will work fine without Participants.

Comparing figures for commons events with scripts that query the database and api, figures for number of files uploaded, page views to files uploaded, etc. match.

Files which are in more than one category do not get double counted (e.g. https://eventmetrics-dev.wmflabs.org/programs/133/events/344).

On Wikidata, category filtering is not available; no metrics will be produced about Wikidata unless Participants are supplied.

I have seen this. You can use the categories filter to search for Wikidata categories, but they have no effect.

We'll need to impose sane limits on the number of pages to process. The system will count edits by all users, including IPs, but excluding bots.

Events with categories (but not participants) will include anonymous users (i.e. IP addresses) in the "All edits" lists/reports, and they count towards the edits and bytes changed metrics (last time I looked) but not towards participants. I think this is OK?

We may or may not exclude bots, I did not look.

We'll need to test to make sure the revision browser still works as expected.

I have compared the "All edits" CSV report with scripts that do database queries of the revisions table. They match.

On the Wikipedias for which categories are supplied, the PAGES about which results are returned will be those at the intersection of Participants AND Categories.

Comparing this with the database and API, I only see a few discrepancies, which are being dealt with in other tickets. I checked the event summary figures, pages created and pages improved.

Category and Participants have same Participants = no change in results.

Participant and Category participants don't overlap = no results.

Checked that the sum of metrics for event with participant A and event with participant B is the same as the event with participant A + B.

Results for locally uploaded files, however, will be governed by the Participants filter only (because categories don't affect uploaded files on Wikipedias). I.e., when a Participants list is used, metrics will be produced about all files uploaded to Wikipedias, regardless of whether a category is supplied or which categories are supplied.

I have seen events which show uploaded files (in the revisions list) in categories which are not specified in event (e.g. https://eventmetrics-dev.wmflabs.org/programs/76/events/339).

File metrics are 0 for non-commons wikis if no participants are specified.

Users will also get Wikidata metrics (because the Participants list covers the minimum requirement there as well).

Wikidata not included in "Per-wiki metrics" table

If categories are supplied on Commons, the FILES about which metrics are returned will be those at the intersection of Participants AND Categories as per T214744.

Again, matches my own figures from the database and API. Only returning data for files uploaded.

Testing this I have also touched T210775 and T218582. Some related bugs/comments have been raised and more development work is necessary.

I also notice that wikidata is not included in "Per-wiki metrics" table. Should it be?

This appears to now be fixed.

@MusikAnimal When attempting to include the en.wikivoyage category "Chicagoland" in an event, after I click "Save categories" I get the error:

Raised as T219732.

Specification in description appears to be satisfied.

Bugs regarding metrics and other things have been raised separately. I think this can move on.