Page MenuHomePhabricator

Add recently active users count to Special:GadgetUsage
Closed, ResolvedPublic3 Story Points

Details

Related Gerrit Patches:
mediawiki/extensions/Gadgets : masterAdd column for active users to Special:GadgetUsage

Event Timeline

Quiddity created this task.Oct 28 2015, 6:57 AM
Quiddity raised the priority of this task from to Needs Triage.
Quiddity updated the task description. (Show Details)
Quiddity added a subscriber: Quiddity.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 28 2015, 6:57 AM
Niharika set Security to None.
Niharika added a subscriber: Niharika.

@Quiddity, how would you define an "active user"?

We could use $wgActiveUserDays. That's what Special:Activeusers also uses.

Ideally it would use user_touched somehow (?), because that counts anyone who logs in. (Manual:User_table#user_touched)
That way, we would include readers, and semi-active editors, who log in just to have access to specific preferences/gadgets/tools/watchlist/customizations/notifications/etc.

If that isn't possible, then $wgActiveUserDays is good.

Ideally it would use user_touched somehow (?), because that counts anyone who logs in. (Manual:User_table#user_touched)

user_touched isn't reliable because it will also update whenever someone writes on your talk page for example.

Or if you update someone's preferences, that should also trigger a user_touched update. We did this with a few million users recently.

user_touched isn't reliable because it will also update whenever someone writes on your talk page for example.

That's not too bad. It would add a few false positives, but would be including the people (non-editing readers and site-preference-users) whom we really do want to include, which $wgActiveUserEditCount wouldn't include.

Or if you update someone's preferences, that should also trigger a user_touched update. We did this with a few million users recently.

Ah, right. That would be terrible for stats purposes then. Oh well. :(

Niharika edited a custom field.
Restricted Application added a subscriber: StudiesWorld. · View Herald TranscriptNov 3 2015, 11:20 PM
Niharika claimed this task.Nov 3 2015, 11:20 PM
Niharika moved this task from Ready to In Development on the Community-Tech-Sprint board.

@NiharikaKohli: It looks like Special:ActiveUsers stores the currently active users in the querycachetwo table (where qcc_type = 'activeusers'). The qcc_title field is the username. So you'll need to join with the querycachetwo table and do a count of all the usernames that are both gadget users and activeusers. This will be a fairly complicated query though, so you may have to find someone with serious SQL kung-fu to help (like @aaron or @jcrespo).

He7d3r added a subscriber: He7d3r.Nov 5 2015, 10:32 AM
DannyH edited a custom field.Nov 10 2015, 5:41 PM

Hi @aaron, @jcrespo, (and maybe @Halfak :) we need some help here. The SQL query being run right now (to get the number of users per gadget on the wiki) is:

SELECT up_property, SUM(up_value) FROM user_properties
WHERE up_property LIKE 'gadget-%'
GROUP BY up_property

Simple enough.

Now we want to add an additional column to the output for the number of 'active users' per gadget, where an active is user is someone who's made an edit in the last X days. This list of users is computed by another special page and stored in querycachetwo table, as pointed out by @kaldari above (it stores the user name only and not the user id) and I'm unable to figure out the right query for doing that. Help?

Joining on the revision table is another option, but it's likely to be costlier?

He're what I've got. It's not terribly fast, but it works:

SELECT up_property, SUM(up_value) 
FROM user_properties
INNER JOIN (
  SELECT rev_user FROM revision_userindex
  WHERE rev_timestamp BETWEEN "201510" AND "201511"
  GROUP BY rev_user
  HAVING COUNT(rev_id) >= 5
) AS active_editor ON
  rev_user = up_user
WHERE 
  up_property LIKE 'gadget-%'
GROUP BY up_property;

Alternatively:

SELECT up_property, SUM(up_value) 
FROM user_properties
WHERE 
  up_user IN (
    SELECT rev_user FROM revision
    WHERE rev_timestamp BETWEEN "201510" AND "201511"
    GROUP BY rev_user
    HAVING COUNT(rev_id) >= 5
  ) AND
  up_property LIKE 'gadget-%'
GROUP BY up_property;

I'm not sure which version is faster. I suppose we can substitute in the querycachetwo table, but the computation of active editors in the last month is actually the fastest part of the query (5 seconds in my tests) because it uses the user_timestamp index on revision and doesn't even need to read the table.

He're what I've got. It's not terribly fast, but it works:

SELECT up_property, SUM(up_value) 
FROM user_properties
INNER JOIN (
  SELECT rev_user FROM revision_userindex
  WHERE rev_timestamp BETWEEN "201510" AND "201511"
  GROUP BY rev_user
  HAVING COUNT(rev_id) >= 5
) AS active_editor ON
  rev_user = up_user
WHERE 
  up_property LIKE 'gadget-%'
GROUP BY up_property;

Alternatively:

SELECT up_property, SUM(up_value) 
FROM user_properties
WHERE 
  up_user IN (
    SELECT rev_user FROM revision
    WHERE rev_timestamp BETWEEN "201510" AND "201511"
    GROUP BY rev_user
    HAVING COUNT(rev_id) >= 5
  ) AND
  up_property LIKE 'gadget-%'
GROUP BY up_property;

I'm not sure which version is faster. I suppose we can substitute in the querycachetwo table, but the computation of active editors in the last month is actually the fastest part of the query (5 seconds in my tests) because it uses the user_timestamp index on revision and doesn't even need to read the table.

Thanks @Halfak! This does give me a column for the active editor user count but it also takes away the column for the total user count. :( I imagine to get both I need to do a join on the user_properties again?

Sorry. You want total user count *and* active user count?

jcrespo added a comment.EditedNov 23 2015, 10:58 PM

This query has the following assumptions:

  • up_value = 0 means disabled, 1 means enabled
  • querycachetwo reliably records the active users
  • There is an existing user owning every existing user_property (do not ask me why, but this is false)
  • I will not allow this to run on real time on page load, it is too expensive for medium and large size wikis. Here it is the results on enwikivoyage:
MariaDB DBSTORE localhost enwikivoyage > SELECT up_property, SUM(up_value), count(qcc_title) 
FROM user_properties 
LEFT JOIN user ON up_user = user_id 
LEFT JOIN querycachetwo ON user_name = qcc_title AND qcc_type = 'activeusers' AND up_value = 1 
WHERE up_property LIKE 'gadget-%' 
GROUP BY up_property;
+------------------------------+---------------+------------------+
| up_property                  | SUM(up_value) | count(qcc_title) |
+------------------------------+---------------+------------------+
| gadget-Carousel              |             0 |                0 |
| gadget-DotsSyntaxHighlighter |             7 |                3 |
| gadget-JS_toolset            |            74 |                7 |
| gadget-ListingEditor         |             0 |                0 |
| gadget-ListingEditor2        |             7 |                2 |
| gadget-MapFrame              |             0 |                0 |
| gadget-RTRC                  |           109 |               15 |
| gadget-RenameWizard          |             0 |                0 |
| gadget-UTCLiveClock          |           164 |               10 |
| gadget-edittop               |            70 |                8 |
| gadget-popups                |           182 |               10 |
| gadget-vector-headanchor     |            33 |                1 |
| gadget-wikEd                 |           102 |                4 |
+------------------------------+---------------+------------------+
13 rows in set (0.07 sec)

Sorry. You want total user count *and* active user count?

Yes please!
Ideally/eventually, we hope to be able to track changes over time (monthly stat dumps to csv?), and have an indication of which gadgets are on by default, and a few other parameters, too (per my and He7d3r's musings in T21288#240675 and below).


[...] I will not allow this to run on real time on page load, it is too expensive for medium and large size wikis. Here it is the results on enwikivoyage: [...]

Re: "expensive", I saw the description at T115152 says

It would probably be an expensive query (at least on larger wikis), so it should have isExpensive() return true. See /includes/specials/SpecialMediaStatistics.php for example.

does that help at all?
Also, the data is cached for at least 24 hours, per the top-note at e.g. https://www.mediawiki.org/wiki/Special:GadgetUsage
(Mostly just in case you hadn't seen either of those. I know almost nothing of databases! :)

@jcrespo: Thanks, that's exactly what we need! A few replies to your comments...

up_value = 0 means disabled, 1 means enabled

This is correct.

querycachetwo reliably records the active users

This seems pretty much correct to me. The info is cached, but the cache gets updated every time a user makes an edit. We could check the querycache_info table first to make sure there is an entry with qci_type'= 'activeusers' (and possibly check the cache staleness as well). @aaron: Do you know of any reason we wouldn't be able to rely on querycachetwo having up to date active user data? Is there any check we should perform first?

There is an existing user owning every existing user_property (do not ask me why, but this is false)

Noted, although I'm hoping this is a rare case.

I will not allow this to run on real time on page load, it is too expensive for medium and large size wikis.

Is it OK as an expensive QueryPage (what the page is already marked as)? In theory we could skip joining against the user table if RecentChangesUpdateJob::updateActiveUsers() were changed to store the userID in the qcc_namespacetwo field. Right now it just stores 0 in all cases. @aaron: Would that be a terrible hack or a great idea? How often are the querycache fields abused for things other than their stated field names?

aaron added a comment.Nov 24 2015, 3:58 AM

querycachetwo should be OK to use. The qcc_namespacetwo abuse seems to hacky, and it would still have to be marked as expensive anyway.

I will not allow this to run on real time on page load, it is too expensive for medium and large size wikis.

Is it OK as an expensive QueryPage (what the page is already marked as)? In theory we could skip joining against the user table if RecentChangesUpdateJob::updateActiveUsers() were changed to store the userID in the qcc_namespacetwo field. Right now it just stores 0 in all cases. @aaron: Would that be a terrible hack or a great idea? How often are the querycache fields abused for things other than their stated field names?

Can we rename the table column instead (qcc_namespacetwo to qcc_userid or something)?

@NiharikaKohli: No, can't rename the table column since that table is already being used for tons of other stuff, and in most cases it is actually storing a second namespace (see https://en.wikipedia.org/wiki/Special:DisambiguationPageLinks for example). I have no idea why Special:ActiveUsers is caching results in querycachetwo instead of querycache though, as it's not making use of the 2 extra columns at all (which is the only difference in the two tables). As long as @jcrespo is OK with this running as a properly designated expensive query, I think we should go with the query that he suggested.

Excuse my lack of knowledge of the mediawiki terminology regarding Special pages. The query I sent is hackish, but I would be ok with it running with cron on terbium, but not on every page reload (Is that what you mean with "expensive query"?). It can run in real time (for each http request) for wikis ~ smaller in number of users than enwikivoyage.

jcrespo removed a subscriber: jcrespo.Nov 29 2015, 7:44 PM

@jcrespo: The QueryPage class (which is a special subclass of SpecialPage specifically tailored for executing database queries) has a method called IsExpensive(). If IsExpensive() returns true (as it is hardcoded to do in this case), and the wiki is set to run in "MiserMode" (https://www.mediawiki.org/wiki/Manual:$wgMiserMode), as all WMF wikis are, the actual database query for the special page is only run periodically and the results are normally pulled from a database cache (typically the querycache table). I don't remember what controls the expiry of expensive QueryPage caches, but I imagine @aaron does.

They don't expire, they just get rebuild when the chron runs next.

Change 256021 had a related patch set uploaded (by Niharika29):
Add column for active users to Special:GadgetUsage

https://gerrit.wikimedia.org/r/256021

Change 256021 merged by jenkins-bot:
Add column for active users to Special:GadgetUsage

https://gerrit.wikimedia.org/r/256021

kaldari triaged this task as Medium priority.Dec 7 2015, 7:56 PM
kaldari moved this task from Needs Review/Feedback to Q3 2018-19 on the Community-Tech-Sprint board.
DannyH closed this task as Resolved.Dec 8 2015, 5:53 PM
DannyH moved this task from Q3 2018-19 to Q1 2018-19 on the Community-Tech-Sprint board.
DannyH added a subscriber: DannyH.
DannyH moved this task from Epic backlog to Archive on the Community-Tech board.Dec 8 2015, 6:54 PM

As a quick benchmark for how this affects performance of the query, I ran ...

mwscript maintenance/updateSpecialPages.php --wiki=frwiktionary --only=GadgetUsage

around 01:45 UTC today (before the code change) and will try running it again tomorrow for comparison.

Today's result was:

GadgetUsage                    [QueryPage] got 74 rows in 0.77s

Running the new query on French Wiktionary took about 3 times longer:

GadgetUsage                    [QueryPage] got 74 rows in 2.67s

Running the new query on Commons took quite a while, however:

GadgetUsage                    [QueryPage] SlowTimer [3486901ms] got 113 rows in 58m 6.96s