
Pin down the exact queries we'll be needing for computing metrics for events
Closed, Resolved · Public · 5 Estimated Story Points

Description

According to the specs we'll be computing these metrics for each event:

  • Participants
  • New editors
  • 7+ days retention
  • Pages created
  • Pages improved (i.e. edited)

This task is for determining the exact query we'll be doing for each of these. This might require some digging around wikimetrics.

Event Timeline

kaldari triaged this task as Medium priority. Dec 6 2017, 12:39 AM
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari set the point value for this task to 5.

Participants: Number of enrolled participants.
(Already done)


New editors: Number of participants that enrolled within the 15 days preceding the event start.

$participants = Array of participant global user IDs for the given event
$start = Start date for the given event

USE centralauth;
SELECT COUNT(*) FROM globaluser
WHERE gu_id IN (:participants)
AND gu_registration > DATE_SUB(:start, INTERVAL 15 DAY);

Using the API seems slightly faster but it doesn't accept multiple users in one request.
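
For reference, the per-user lookup would presumably be a request like this (assuming the CentralAuth meta=globaluserinfo API module, which includes the account's registration date in its response):

https://meta.wikimedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=Example&format=json

That means one request per participant, with the returned registration date compared against ($start - 15) in application code.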


7+ days retention: Number of enrolled participants who have made an edit more than seven days after the event end. Between the event start and end date + 7 days, all participants will be considered retained. The edit may be on any wiki and any namespace.

This is tricky. There's no API for this, and the centralauth DB apparently doesn't contain this data. One way to get it seems to be the guc tool. The underlying code doesn't have an API for this yet, but it does seem extensible enough to support one if we want to modify it.


Pages created: New pages created by the enrolled participants in -

  • main namespace
  • between the event dates (inclusive)
  • on given wikis

The query for computing this would need us to query the database for every wiki in the wiki list and do a revision and page table join. Instead of reinventing the wheel, I'm thinking we can make use of the XTools pages-created API for this. I don't think it would be significantly slower than implementing our own query. We'd need to do a query for each of the wikis in our list and filter the response by page creation dates that fall within our event window.


Pages improved: Number of distinct pages edited by enrolled participants in -

  • main namespace
  • between the event dates (inclusive)
  • on given wikis

Seems like another thing we can exploit XTools' wonderful API for. It accepts dates as parameters, so we can pass in the event dates. We'd need to do a query for each user on each of the wikis in our list and compile a distinct list of the pages edited by the users, which we can then return the count of.

I like the idea of using the XTools API! And I love that you called it "wonderful" :) However, here we want pages created across all the users, no? In that case it'd probably be faster and more efficient to do our own queries.

Going through the list:

New editors: Number of participants that enrolled within the 15 days preceding the event start.

Would that not include new accounts created during the event? Since they would still be "new", per se. We could do gu_registration BETWEEN DATE_SUB(:startDate, INTERVAL 15 DAY) AND :endDate (or just pass in the value for 15 days prior).
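
Putting those together, a rough sketch of the adjusted query (here :newEditorCutoff is a hypothetical placeholder for the precomputed "15 days before start" value, in the same timestamp format as gu_registration):

-- :newEditorCutoff = event start minus 15 days (hypothetical placeholder)
USE centralauth;
SELECT COUNT(*)
FROM globaluser
WHERE gu_id IN (:participants)
AND gu_registration BETWEEN :newEditorCutoff AND :endDate;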

7+ days retention: Number of enrolled participants who have made an edit more than seven days after the event end. Between the event start and end date + 7 days, all participants will be considered retained. The edit may be on any wiki and any namespace.

I suppose we'd have to loop through all the wikis we're targeting, and do something like:

SELECT COUNT(DISTINCT(rev_user))
FROM revision_userindex
WHERE rev_user_text IN (:usernames)
AND rev_timestamp > DATE_ADD(:endDate, INTERVAL 7 DAY)

At this point we will probably already have the usernames cached (from the view), but if not that's a fast query on its own.
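
One small wrinkle: since this runs once per wiki, summing the per-wiki COUNT(DISTINCT ...) values would double-count anyone who edited on more than one of the wikis. A possible variant (a rough, untested sketch; :retentionCutoff is a hypothetical placeholder for the precomputed end date + 7 days) is to return the matching users per wiki and take the union in the application:

-- run once per target wiki's replica
SELECT DISTINCT rev_user_text
FROM revision_userindex
WHERE rev_user_text IN (:usernames)
AND rev_timestamp > :retentionCutoff;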

XTools has global user edits functionality as well (example). This is intentionally not exposed as a public API because it is too slow. The GUC tool is also slow-ish, but both fetch full revision information, whereas we're just doing a COUNT, so hopefully it won't be as bad.

Pages created: New pages created by the enrolled participants

With potentially hundreds of participants, and a whole lotta wikis, it's probably better to do our own query than use XTools. We'd again loop through each wiki, doing something like:

SELECT COUNT(page_title)
FROM page
JOIN revision_userindex ON page_id = rev_page
WHERE page_namespace = 0
AND rev_parent_id = 0
AND rev_timestamp BETWEEN :startDate AND :endDate
AND rev_user_text IN (:usernames)

Pages improved: Number of distinct pages edited by enrolled participants

For this we want pages edited that are NOT any of the pages created as part of the event, right? If so, we could do the above query but SELECT page_id instead, store it temporarily, and run:

SELECT COUNT(DISTINCT(page_title))
FROM page
JOIN revision_userindex ON page_id = rev_page
WHERE page_namespace = 0
AND rev_parent_id > 0
AND rev_timestamp BETWEEN :startDate AND :endDate
AND rev_user_text IN (:usernames)
AND page_id NOT IN (:createdPagesIds)

New editors: Number of participants that enrolled within the 15 days preceding the event start.

Would that not include new accounts created during the event? Since they would still be "new", per se. We could do gu_registration BETWEEN DATE_SUB(:startDate, INTERVAL 15 DAY) AND :endDate (or just pass in the value for 15 days prior).

Oh yeah, I thought that was implicit. Sorry about that. Anyway, anyone enrolled between ($start - 15) and $end would count as new. Since this metric won't change once the event has ended, maybe we can save ourselves the bother of recomputing it after that point. The logic for this would need some more thought.

7+ days retention: Number of enrolled participants who have made an edit more than seven days after the event end. Between the event start and end date + 7 days, all participants will be considered retained. The edit may be on any wiki and any namespace.

I suppose we'd have to loop through all the wikis we're targeting, and do something like:

SELECT COUNT(DISTINCT(rev_user))
FROM revision_userindex
WHERE rev_user_text IN (:usernames)
AND rev_timestamp > DATE_ADD(:endDate, INTERVAL 7 DAY)

At this point we will probably already have the usernames cached (from the view), but if not that's a fast query on its own.

XTools has global user edits functionality as well (example). This is intentionally not exposed as a public API because it is too slow. The GUC tool is also slow-ish, but both fetch full revision information, whereas we're just doing a COUNT, so hopefully it won't be as bad.

The problem is that we want to consider the editor as retained if they edit on *any* Wikimedia project (all 800+ of them), not just the wikis that were involved in the event. We could do our own query but I figure using the guc tool would be easier. We don't have to worry about it being slow. That was a given from the start. That's why we're saving the metrics in the DB and not triggering an automatic recalculation every time someone loads the page.

Here's a snippet of my conversation with Krinkle about the guc tool yesterday (he maintains it) -

<Krinkle> The queries are optimised by grouping together wikis and using a UNION query, given they are hosted on the same server/shard.
<Krinkle> Niharika: Wanna file a task or two? Was thinking maybe one task for exposing the query through the API, and one task for adding a feature (api-only if you like) to get the latest edit only (instead of the default, which is up to 20 from each wiki).
<Krinkle> The current behaviour is to do 1 batch query per shard (7 currently), where we get a true/false from each wiki whether or not the user has edits there, and then we re-use those 7 connections to make one query per 'true' wiki to get the actual edit information. Which is 7 + N queries. But for your case, that initial true/false query could instead query the latest edit :)
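
To illustrate the batching Krinkle describes, the per-shard query would presumably look something like this (a rough sketch with hypothetical wiki1_p/wiki2_p database names, assuming both wikis are hosted on the same shard so a single connection can reach them):

-- one UNION per shard; wiki1_p/wiki2_p are hypothetical database names
SELECT 'wiki1' AS wiki, MAX(rev_timestamp) AS latest_edit
FROM wiki1_p.revision_userindex
WHERE rev_user_text = :username
UNION ALL
SELECT 'wiki2' AS wiki, MAX(rev_timestamp) AS latest_edit
FROM wiki2_p.revision_userindex
WHERE rev_user_text = :username;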

Pages created: New pages created by the enrolled participants

With potentially hundreds of participants, and a whole lotta wikis, it's probably better to do our own query than use XTools. We'd again loop through each wiki, doing something like:

SELECT COUNT(page_title)
FROM page
JOIN revision_userindex ON page_id = rev_page
WHERE page_namespace = 0
AND rev_parent_id = 0
AND rev_timestamp BETWEEN :startDate AND :endDate
AND rev_user_text IN (:usernames)

Seems fine to me. We'd probably want to do rev_user IN (:userids) instead.
Again, slowness is not a problem, so I thought using XTools would be easier. Potentially hundreds of participants and wikis, yes, but realistically speaking it seems like most events would be restricted to <2 wikis and <200 participants.

Pages improved: Number of distinct pages edited by enrolled participants

For this we want pages edited that are NOT any of the pages created as part of the event, right? If so, we could do the above query but SELECT page_id instead, store it temporarily, and run:

SELECT COUNT(DISTINCT(page_title))
FROM page
JOIN revision_userindex ON page_id = rev_page
WHERE page_namespace = 0
AND rev_parent_id > 0
AND rev_timestamp BETWEEN :startDate AND :endDate
AND rev_user_text IN (:usernames)
AND page_id NOT IN (:createdPagesIds)

We're doing rev_parent_id > 0 so new pages created should be automatically excluded, right?

7+ days retention: Number of enrolled participants who have made an edit more than seven days after the event end. Between the event start and end date + 7 days, all participants will be considered retained. The edit may be on any wiki and any namespace.

The problem is that we want to consider the editor as retained if they edit on *any* Wikimedia project (all 800+ of them), not just the wikis that were involved in the event. We could do our own query but I figure using the guc tool would be easier. We don't have to worry about it being slow. That was a given from the start. That's why we're saving the metrics in the DB and not triggering an automatic recalculation every time someone loads the page.

Ah, I see. We can loop through all the wikis, not any harder, just longer run time (but as you say that's fine). GUC gets contributions when we just want a COUNT, right? It also only does one user at a time. It'd be neat to get an API added to it, but I don't think it would help us that much.

I also want rock-solid test coverage on the statistics, and I don't think using production data in the CI build will work. Right now we have only a few test wikis, with just enough data to make assertions.

Pages created: New pages created by the enrolled participants

Seems fine to me. We'd probably want to do rev_user IN (:userids) instead.

Yeah, I thought about that, but it means we'd need to query centralauth_p.localuser in addition to globaluser (and again, we'll likely already have the usernames in the cache). I'll do some testing and choose whichever is faster!
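
For reference, a sketch of what that extra lookup might look like (assuming the centralauth_p replica tables; :globalUserIds and :dbname are hypothetical placeholders):

-- map global user IDs to a wiki's local user IDs via the attached-account names
SELECT lu_wiki, lu_local_id
FROM centralauth_p.localuser
JOIN centralauth_p.globaluser ON lu_name = gu_name
WHERE gu_id IN (:globalUserIds)
AND lu_wiki = :dbname;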

Pages improved: Number of distinct pages edited by enrolled participants

We're doing rev_parent_id > 0 so new pages created should be automatically excluded, right?

The initial revision that created the page, yes. But we'd still need to exclude all other revisions to that page, if that's what we want. I suppose here rev_parent_id > 0 just acts as a safeguard, as the page_id NOT IN (:createdPagesIds) should discount any pages they created. I haven't really tested this query, but I think it will work. There may be a better way.

7+ days retention: Number of enrolled participants who have made an edit more than seven days after the event end. Between the event start and end date + 7 days, all participants will be considered retained. The edit may be on any wiki and any namespace.

The problem is that we want to consider the editor as retained if they edit on *any* Wikimedia project (all 800+ of them), not just the wikis that were involved in the event. We could do our own query but I figure using the guc tool would be easier. We don't have to worry about it being slow. That was a given from the start. That's why we're saving the metrics in the DB and not triggering an automatic recalculation every time someone loads the page.

Ah, I see. We can loop through all the wikis, not any harder, just longer run time (but as you say that's fine). GUC gets contributions when we just want a COUNT, right? It also only does one user at a time. It'd be neat to get an API added to it, but I don't think it would help us that much.

I also want rock-solid test coverage on the statistics, and I don't think using production data in the CI build will work. Right now we have only a few test wikis, with just enough data to make assertions.

Okay, that makes sense.

Pages created: New pages created by the enrolled participants

Seems fine to me. We'd probably want to do rev_user IN (:userids) instead.

Yeah, I thought about that, but it means we'd need to query centralauth_p.localuser in addition to globaluser (and again, we'll likely already have the usernames in the cache). I'll do some testing and choose whichever is faster!

We have the user IDs in our database already, right? We can fetch them before doing this query. Should be fast.

Pages improved: Number of distinct pages edited by enrolled participants

We're doing rev_parent_id > 0 so new pages created should be automatically excluded, right?

The initial revision that created the page, yes. But we'd still need to exclude all other revisions to that page, if that's what we want. I suppose here rev_parent_id > 0 just acts as a safeguard, as the page_id NOT IN (:createdPagesIds) should discount any pages they created. I haven't really tested this query, but I think it will work. There may be a better way.

Okay, I missed a line you wrote earlier: "For this we want pages edited that are NOT any of the pages created as part of the event, right?". Hmm. I thought a page created would technically count as part of pages improved.
Even if we don't do that, it would still be easier to subtract the number of pages created from the number of pages improved than to extract out $createdPageIds and add that to the query. Then we can leave out the rev_parent_id > 0 clause too.
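
In that case the pages improved query would reduce to something like this (the same query as above, untested, with the rev_parent_id and page_id clauses dropped; the pages created count could then be subtracted from the result if we do want to exclude creations):

-- distinct mainspace pages edited by participants within the event window
SELECT COUNT(DISTINCT(page_title))
FROM page
JOIN revision_userindex ON page_id = rev_page
WHERE page_namespace = 0
AND rev_timestamp BETWEEN :startDate AND :endDate
AND rev_user_text IN (:usernames);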

In T182083#3836511, @Niharika wrote:

Pages created: New pages created by the enrolled participants

We have the user IDs in our database already, right? We can fetch them before doing this query. Should be fast.

We store global user IDs, but unfortunately there's a separate user ID for each individual wiki :(

Okay, I missed a line you wrote earlier: "For this we want pages edited that are NOT any of the pages created as part of the event, right?". Hmm. I thought a page created would technically count as part of pages improved.

I just figured it was like "number of pages created" and "other pages that were edited", but indeed going from no article to a new article is an improvement!

Even if we don't do that, it would still be easier to subtract the number of pages created from the number of pages improved than to extract out $createdPageIds and add that to the query. Then we can leave out the rev_parent_id > 0 clause too.

Bah, that would be easier!! Hehe :)