Page MenuHomePhabricator

Implement functionality for RC page 'Experience level' filters
Closed, ResolvedPublic

Description

Research shows that new editors are particularly vulnerable to rejection. To enable reviewers to treat new users with the care they require -- and to enable reviewers to identify contributions by users at a couple of more advance levels—we will provide filtering by the following standardized "experience levels."

Newcomers
Fewer than 10 edits and 4 days of activity.

Learners
More days of activity and edits than “Newcomers” but fewer than “More experienced.” (corresponds to autoconfirmed)

Experienced users
More than 30 days of activity and 500 edits. (corresponds to extended confirmed)

Notes about the experience filters

  • The filters return Registered users only [changed 11/17].
  • These filter thresholds will, at least initially, be identical with those built into the ReviewStream feed. See T145159
  • Although two of these thresholds correspond with existing wiki user-rights definitions (see above), please refer to them internally and externally by the filter names above.

About the functionality of all new RC page filters generally

  • Like all filters in the enhanced RC page filter UI, these filters conform to a set of rules that are, in some ways, very different from the existing RC page filters. The existing filters are designed primarily to EXCLUDE selected properties. The new filters are intended to INCLUDE those properties; logically, the filters within each new group constitute a set of OR filters (each group of ORs being connected to other groups by ANDS). So, the filters in this group follow these rules:
    • To INCLUDE property A, users check the box for property A.
    • To EXCLUDE property A, the user must uncheck A and check it's complements, properties B and C.
    • If NEITHER A, B nor C are checked, then ALL are included the group has no effect.
    • If A, B and C are checked, then the result is the same: ALL experience levels are included.
  • As per T146076, searches on the RC page are meant to be bookmarkable. Please make sure your search adds query strings to the URL.

Event Timeline

jmatazzoni updated the task description. (Show Details)

The filters apply to edits by registered and unregistered users alike (it is understood that classification of IPs will be prone to various types of error).

@Catrope, @Mattflaschen-WMF: We don't maintain user accounts for IPs so we don't have 'registration date' and 'edit count' readily available to compute experience level. It's probably possible to mine this data from the revision table but that would be a heavy operation. Thoughts on how to accomplish this?

The filters apply to edits by registered and unregistered users alike (it is understood that classification of IPs will be prone to various types of error).

@Catrope, @Mattflaschen-WMF: We don't maintain user accounts for IPs so we don't have 'registration date' and 'edit count' readily available to compute experience level. It's probably possible to mine this data from the revision table but that would be a heavy operation. Thoughts on how to accomplish this?

I don't think there's much we can reasonably do for IPs, the concept of a registration date and an edit count don't make much sense for them, and tracking them would be expensive.

The reason we wanted to include IPs in these categories is that even when we made it as clear as possible in the interface that the Experience filters applied only to Registered users, testers wouldn't get it. They'd pick Newcomers and Unregistered, then not understand why they got no results. We knew classifying IPs would be inexact, but now it appears it's actually impossible.

@Catrope suggests two workarounds here:

  1. Put all IP users into one bucket. I.e., they could all arbitrarily be classified as Newcomers.
  2. Just ignore IPs in the filter, so that effectively IPs would be included in all the Experience categories (we might still add a note in the interface text to somehow explain...)

My sense is that either of these would be better than including a " "hidden" filter of making Experience filters return only to Registered. But I can't say which option is better. The newcomer filter has a variety of purposes. On one hand, we want people to support newcomers--but people don't like supporting IPs because the don't believe they're real users. On the other hand, vandalism patrollers will use the Newcomer filter because they mistrust Newcomers, just as they mistrust IPs....

Opinions? @Pginer-WMF?

Actually, having thought about it for a minute, I think option 2 above is the better choice.

@Pginer-WMF, we should discuss whether we want to mention this in the interface text. It's actually more or less the opposite of the function we'd defined at first....

I don't like option 2. I don't think it makes sense here. We ought to provide a mechanism by which someone can do the following things:

  1. Review edits by only anons
  2. Review edits by only experienced editors
  3. Review edits by anons and new editors

We ought to provide a mechanism by which someone can do the following

I think users could do all those things with option 2 by using Experienced filters and User Registration filters in combo, as follows:

  1. Review edits by only anons--click Unregistered and unclick Registered to filter for anons.
  2. Review edits by only experienced editors--click Registered and unclick Unregistered to filter for Registered only, then click Experienced.
  3. Review edits by anons and new editors -- Click for Newcomers.

I see. My mistake. Then +1

jmatazzoni added a comment.EditedNov 10 2016, 11:44 PM

What's wrong with this scheme I think is that a user could click to see Newcomers and Unregistered, and she'd think she was getting those results. But she wouldn't be. She'd just be seeing all the anons—the Newcomer filter would do nothing.

Would it make sense to do the following?

  • Make it so that if you filter for Unregistered only, then Experience filters are unavailable.
  • Add some language to the Experience filters qualifying that they only affect Registered users (people won't see it for the most part, but if the Experience filters gray out, it might help them figure out why).

What do you guys think?

Change 321744 had a related patch set uploaded (by Sbisson):
[WIP] Special:RC filter: userExpLevel

https://gerrit.wikimedia.org/r/321744

Pginer-WMF added a comment.EditedNov 16 2016, 2:03 PM

@Pginer-WMF, we should discuss whether we want to mention this in the interface text. It's actually more or less the opposite of the function we'd defined at first....

I think we can communicate interactions between filters using existing patterns such as grey-out filters/tags and the clarification popups. I'd not make filters to be unavailable since it may be problematic that options change as you try to move from A to B.

I don't think that it makes sense to artificially include IPs in a specific experience level. Between including them on all or none of the experience levels, I prefer the later. A more detailed analysis on possible solutions (using "newcomers" just as an example of one of the levels):

Consider IPs not to be in any of the experience levels This is my preferred option and I think we can provide better guidance than the one provided during initial testing (which was an easy to overlook note on parenthesis for the filter section)

  • When "newcomers" filter is selected, only registered newcomers are shown in the results. The "unregistered" filter becomes greyed out to communicate that IPs are not being included in the results.
  • When "unregistered" filter is selected, only IPs are shown. The "newcomers" (as well as the other experience filters) become greyed out to communicate that we cannot make distinctions for IPs in terms of expertise.
  • When both "unregistered" and "newcomer" are selected, the result list is empty since there is no intersection between them. Both of the tags are greyed out and a message explains these options cancel each other when hovering. "Registered", "unregistered", and all the experience filters are shown as greyed out in the panel.

Consider IPs to be in every experience level. The approach of considering IPs to be of all experience levels seems less intuitive:

  • When selecting "newcomers" users get a mix of registered newcomers and IPs of all ages. The presence of IPs in the results may suggest to filter those out (which is easy with the filters) but could also mislead the user to think those IPs are newcomers editing anonymously. This breaks the basic model of the listed elements to fulfill all the tags in the list of filters, which will be hard to signal.
  • When selecting "unregistered" filter, "newcomers" and all the other experience levels get greyed out to communicate they are not included in the results.
  • When selecting both "unregistered" and "newcomers", the "newcomers" filter is greyed out and an explanation message about the dependency will be shown when hovering the tag.
SBisson added subscribers: Marostegui, jcrespo.

Pinging @jcrespo and @Marostegui

For this ticket, we join recentchanges and user and filter on user_registration and user_editcount. As far as I can see, neither of these columns are indexed. Do you think this is problematic? If so, could you help us find a better strategy?

Here's a sample query that the code would generate (I use recentchanges.* for brevity but the code lists out the columns):

SELECT recentchanges.* 
FROM recentchanges JOIN user ON rc_user=user_id
WHERE ( user_registration > '20161112000000' AND user_editcount < 10 ) # newcomers
OR (user_registration <= '20161016000000' AND user_editcount >= 500 ) # 'more experienced' users

Thanks!

jcrespo moved this task from Triage to Backlog on the DBA board.Nov 16 2016, 5:25 PM

Pau and I discussed this and have decided to go back to making Experience filters return Registered users only. I've updated the task description above to reflect that.

@Pginer-WMF, below find my stab on what the rules would be for tags in the Active filter display area [T149391] and the filters in the Dropdown filter panel [T149452]. Please review them; once these are right I'll copy them to the relevant tasks. I also wrote two tooltips, which you can review at T149385.

Dropdown Menu Rules

  • If Unregistered is not selected and any Experience filter is selected
    • The "Unregistered" and Registered filters (in the User Registration group) becomes greyed out to communicate that they have no effect.
  • If no Experience filter is selected and the Unregistered filter is selected
    • The Experience filters are greyed out to communicate that they have no effect.
  • Once the user has selected the Unregistered filter and any combination of Experience filters at the same time, then
    • None of the selected filters is grayed out any more [@Pginer-wmf-- confirm, are none or all grayed out in this case? I could see it either way.]

Active Filter Display Area Rules

  • If the user selects the Unregistered filter and any combination of Experience filters at the same time, then
    • the tags in the Active Filter Area for all of the conflicting filters will turn gray
    • on rollover one of two tooltips will display: one tooltip for the Experience filters and one for the Unregistered filter (see T149385, under Tooltips for wording)
Pginer-WMF added a comment.EditedNov 18 2016, 10:54 AM
  • Once the user has selected the Unregistered filter and any combination of Experience filters at the same time, then
    • None of the selected filters is grayed out any more [@Pginer-wmf-- confirm, are none or all grayed out in this case? I could see it either way.]

I think the most consistent approach would be to have none greyed out. That makes it consistent with the idea that you can understand the results just by looking at the non-greyed out tags, and that removing the useless grey tags would lead to the same list of results.

If we find out that, making an exception to that is worth it to signal the special case, we can consider adding such exception (i.e., greying out both) or providing a special treatment for these kind of conflicts, but I'd start with the behaviour described above.

Pinging @jcrespo and @Marostegui
For this ticket, we join recentchanges and user and filter on user_registration and user_editcount. As far as I can see, neither of these columns are indexed. Do you think this is problematic? If so, could you help us find a better strategy?
Here's a sample query that the code would generate (I use recentchanges.* for brevity but the code lists out the columns):

SELECT recentchanges.* 
FROM recentchanges JOIN user ON rc_user=user_id
WHERE ( user_registration > '20161112000000' AND user_editcount < 10 ) # newcomers
OR (user_registration <= '20161016000000' AND user_editcount >= 500 ) # 'more experienced' users

No, that is not acceptable. It goes over 8million rows on every run:

MariaDB MARIADB s1-master enwiki > EXPLAIN SELECT recentchanges.* 
    -> FROM recentchanges JOIN user ON rc_user=user_id
    -> WHERE ( user_registration > '20161112000000' AND user_editcount < 10 ) # newcomers
    -> OR (user_registration <= '20161016000000' AND user_editcount >= 500 ) # 'more experienced' users\G
    -> \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 8478160
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: user
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.recentchanges.rc_user
         rows: 1
        Extra: Using where
2 rows in set (0.03 sec)
jcrespo added a comment.EditedNov 18 2016, 11:27 AM

If so, could you help us find a better strategy?

Use EXPLAIN and SHOW STATUS like 'Hand%'; to find a strategy that reads a minimal amount of rows per query. Test such a query on a local db install, that you should have. Compare with beta and labsdb instaces for "closer to production" experiences. Creating new indexes are not out of the table, but it should be justified (make sure, however, that those fields are in use in production- I would doubt user_editcount is updated there, but I am talking without knowing). Avoid ORs, generally speaking, they do not tend to give good opportunities for optimization, either do separate queries (which probably is the better strategy for what you want to do) or use UNION ALL. When hitting a wall with a specific problem, ask for help, I am here to help. When things are impossible to optimize, cache- maybe users can be classified before hand. Note that storing user information (even as innocent as number of edits or registering information) is handling private user information and requires all sort of oks from security and legal to avoid leaking PII.

I also wrote two tooltips, which you can review at T149385.

For the tooltips, I think it is better to refer to the effect filters have on results than the visual style applied (which is also unclear that we want to grey out the case of "experience + unregistered"). I'd focus the message to the "experience" filters since they are the ones with a less obvious dependency (e.g., on hovering "newcomer" we can show a "No results shown since this applies only to registered users which are excluded from the current filters." when registered users have been excluded by "unregistered"). That is, filters from a group which is dependent on another filter which has been excluded.

Other than that, it is complex to get into an explanation on what is causing empty results based on all possible filters: do we want to tell the user that "unregistered" and "my edits" exclude each other too?

On keeping rules general
Also, a more general comment about filter specs: I think it is really great that the behaviour is specified for each case since that simplifies later to check if the behaviour is the expected one for each ticket (e.g., when selecting "likely has problems" and "very likely have problems", then "May have problems" and "Very likely good" become greyed out).

However, we may want to also capture the general idea for such rules (e.g., when selecting a filter, other filters in the same group that are not sub-sets of the former become greyed-out). Otherwise a very literal interpretation of the spec could lead to a bunch of very specific rules that make the system harder to extend (e.g., when adding new groups, new filters or the relationships between filters change).

Developers working on these features may be already aware of this (@Mooeypoo, @SBisson) but I think it is important to make it clear since this may affect the flexibility of the system in the future. I'm not sure where is the best place to capture this, beyond the original design tickets.

If so, could you help us find a better strategy?

Use EXPLAIN and SHOW STATUS like 'Hand%'; to find a strategy that reads a minimal amount of rows per query. Test such a query on a local db install, that you should have. Compare with beta and labsdb instaces for "closer to production" experiences.

@jcrespo Thanks for your input.

I made a mistake in the sample query above. I should have included ORDER BY rc_timestamp DESC and LIMIT 30 (30 is the default but the user can select up to 500). I don't know if it changes anything since even if we want only 30 recent changes, we may have to scan a large number of users to find those with the right experience levels.

Creating new indexes are not out of the table, but it should be justified (make sure, however, that those fields are in use in production- I would doubt user_editcount is updated there, but I am talking without knowing).

user_editcount is a simple counter incremented with every edit. See User.php:4997. It is used for a few things already (auto promotions and notifications) but never for query AFAIK.

Avoid ORs, generally speaking, they do not tend to give good opportunities for optimization, either do separate queries (which probably is the better strategy for what you want to do) or use UNION ALL.

Yes, that's a good one but it would require refactoring the php side of things and make computing of ORDER BY and LIMIT more complex. I think the code that produces the conditions can be optimized quite a bit already. For instance, getting 'newcomers' or 'experienced' can be rewritten as less than 'more experienced', or getting 'newcomers' or 'more experienced' can be rewritten as not 'experienced'. While there's only 3 values we can do tricks like that.

When hitting a wall with a specific problem, ask for help, I am here to help. When things are impossible to optimize, cache- maybe users can be classified before hand.

Yes, precomputing into user_expLevel would also work as long as this column is indexed. Is there a big gain to indexing this column instead of user_registration and user_editcount?

Note that storing user information (even as innocent as number of edits or registering information) is handling private user information and requires all sort of oks from security and legal to avoid leaking PII.

Those 2 fields are already made public through the API but we have to make sure our usage of them is legit. (ping @Catrope What do we have to do? Who should we contact?)

Thanks!

For what it's worth, I've requested that an index be made on user_registration in the past (can't find the task). It seems that there's also interest in being able to query editors by when they registered via the API. See T28884: Allow sorting by creation date for list=allusers

@Pginer-WMF writes:

We may want to also capture the general idea for such rules...Otherwise a very literal interpretation of the spec could lead to a bunch of very specific rules

Yes, I had the same thought. Thanks for the reminder. As you say, I was trying to give developers and QA a checklist, so they don't have to work all the interactions out themselves. But underlining the general principles is important too.

For the tooltips, I think it is better to refer to the effect filters have on results than the visual style applied

All the tag tooltips start by saying "shown in gray" because I think it's important to make the connection between the problem and the display state -- to train the user that when they see a grayed out tag, it means something is not right.

I'd focus the message to the "experience" filters since they are the ones with a less obvious dependency

True but it's a reciprocal relationship: Unregistered is only a problem when it's combined with Experience. Both have to be grayed out so users see the conflict. And if both are grayed, the information about the problem may as well be in both.

do we want to tell the user that "unregistered" and "my edits" exclude each other too?

To be sure, we don't want to take this too far. But Unregistered + Experience is a special case, I think, were we not only know that a conflict is built in and will happen every time, but that the conflict is caused by a hidden property. So it's worth leaving plenty of clues to help users figure out the issue.

(BTW, OMG, I hadn't even thought about whether the My Edits filter applies only if the user is logged in. It's entirely possible. Arghhh. )

It was suggested on the patch that we make the user experience levels completely configurable (how many of them, their names, and their thresholds).

I think it's a good idea in theory but it adds complexity.

In the ERI project, do we envision some wikis wanting to have only 2 levels or as many as 4 or 5? Ping @Quiddity, @jmatazzoni, @Pginer-WMF

I had imagined only that people would shift the definitions of the three levels we're providing. The idea of being able to add more levels is an interesting one. I could certainly imagine people wanting them (at least one UI tester suggested a new one that already). But we could always have more of everything in this are. How much extra complexity are we talking about? I wouldn't go too far...

It was suggested on the patch that we make the user experience levels completely configurable (how many of them, their names, and their thresholds).

In the meeting today, we decided to stick with three levels with standardized names, and to make only the thresholds configurable.

From all the possible userExpLevel filter combinations, I believe experienced (the middle tier) and newcomer,moreexperienced (everything but the middle tier) generate the most complex queries. Here's how both of them behave against enwiki db on stat1003.

userExpLevel=experienced

SELECT
rc_id,rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,wl_user,wl_notificationtimestamp,page_latest,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags`,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,'0.5' AS `ores_damaging_threshold` 
FROM `recentchanges` LEFT JOIN `watchlist` ON (wl_user = '9' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `user` ON ((rc_user = user_id)) LEFT JOIN `ores_model` `ores_damaging_mdl` ON (ores_damaging_mdl.oresm_is_current = '1' AND ores_damaging_mdl.oresm_name = 'damaging') LEFT JOIN `ores_classification` `ores_damaging_cls` ON ((ores_damaging_cls.oresc_model = ores_damaging_mdl.oresm_id) AND (rc_this_oldid = ores_damaging_cls.oresc_rev) AND ores_damaging_cls.oresc_class = '1') 
WHERE rc_bot = '0' 
AND (rc_timestamp >= '20161128000000') 
AND ((((user_editcount >= 10) AND (user_registration <= 20161201151627)) AND ((user_editcount < 500) OR (user_registration > 20161105151627)))) 
AND rc_new IN ('0','1') 
ORDER BY rc_timestamp DESC 
LIMIT 50;

Output from EXPLAIN:

1+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
2| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
3+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
4| 1 | PRIMARY | recentchanges | ref | rc_timestamp,new_name_timestamp,tmp_2 | tmp_2 | 1 | const | 4132449 | Using where |
5| 1 | PRIMARY | watchlist | eq_ref | wl_user,namespace_title | wl_user | 265 | const,enwiki.recentchanges.rc_namespace,enwiki.recentchanges.rc_title | 1 | |
6| 1 | PRIMARY | user | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_user | 1 | Using where |
7| 1 | PRIMARY | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_cur_id | 1 | |
8| 1 | PRIMARY | ores_damaging_mdl | ref | oresm_version,ores_model_status | ores_model_status | 35 | const,const | 1 | Using where |
9| 1 | PRIMARY | ores_damaging_cls | ref | oresc_winner | oresc_winner | 4 | enwiki.recentchanges.rc_this_oldid | 8 | Using where |
10| 2 | DEPENDENT SUBQUERY | change_tag | ref | ct_rc_id,change_tag_rc_tag | ct_rc_id | 5 | enwiki.recentchanges.rc_id | 1 | Using index |
11+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
127 rows in set (0.02 sec)

The query runs in 0.2s on stat1003 against enwiki db.

userExpLevel=newcomer,moreexperienced

SELECT rc_id,rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,wl_user,wl_notificationtimestamp,page_latest,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags`,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,'0.5' AS `ores_damaging_threshold` 
FROM `recentchanges` LEFT JOIN `watchlist` ON (wl_user = '9' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `user` ON ((rc_user = user_id)) LEFT JOIN `ores_model` `ores_damaging_mdl` ON (ores_damaging_mdl.oresm_is_current = '1' AND ores_damaging_mdl.oresm_name = 'damaging') LEFT JOIN `ores_classification` `ores_damaging_cls` ON ((ores_damaging_cls.oresc_model = ores_damaging_mdl.oresm_id) AND (rc_this_oldid = ores_damaging_cls.oresc_rev) AND ores_damaging_cls.oresc_class = '1') 
WHERE rc_bot = '0' 
AND (rc_timestamp >= '20161128000000') 
AND (((user_editcount < 10) OR (user_registration > 20161201153145)) OR ((user_editcount >= 500) AND (user_registration <= 20161105153145))) 
AND rc_new IN ('0','1') 
ORDER BY rc_timestamp DESC 
LIMIT 50;

Output from EXPLAIN:

1+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
2| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
3+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
4| 1 | PRIMARY | recentchanges | ref | rc_timestamp,new_name_timestamp,tmp_2 | tmp_2 | 1 | const | 4131992 | Using where |
5| 1 | PRIMARY | watchlist | eq_ref | wl_user,namespace_title | wl_user | 265 | const,enwiki.recentchanges.rc_namespace,enwiki.recentchanges.rc_title | 1 | |
6| 1 | PRIMARY | user | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_user | 1 | Using where |
7| 1 | PRIMARY | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.recentchanges.rc_cur_id | 1 | |
8| 1 | PRIMARY | ores_damaging_mdl | ref | oresm_version,ores_model_status | ores_model_status | 35 | const,const | 1 | Using where |
9| 1 | PRIMARY | ores_damaging_cls | ref | oresc_winner | oresc_winner | 4 | enwiki.recentchanges.rc_this_oldid | 8 | Using where |
10| 2 | DEPENDENT SUBQUERY | change_tag | ref | ct_rc_id,change_tag_rc_tag | ct_rc_id | 5 | enwiki.recentchanges.rc_id | 1 | Using index |
11+------+--------------------+-------------------+--------+---------------------------------------+-------------------+---------+-----------------------------------------------------------------------+---------+-------------+
127 rows in set (0.08 sec)

The query runs in 0.25s on stat1003 against enwiki db.

When comparing query performance, EXPLAIN may be useful, but only when combined with Handler\_% status (actual rows read, with several strategies). EXPLAIN can be misleading in some cases. Let me give you an example on how to use it:

MariaDB [enwiki]> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

MariaDB [enwiki]> pager md5sum
PAGER set to 'md5sum'
MariaDB [enwiki]> SELECT rc_id,rc_timestamp,rc_user,rc_user_text,rc_namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,wl_user,wl_notificationtimestamp,page_latest,(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags`,ores_damaging_cls.oresc_probability AS `ores_damaging_score`,'0.5' AS `ores_damaging_threshold` 
    -> FROM `recentchanges` LEFT JOIN `watchlist` ON (wl_user = '9' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `user` ON ((rc_user = user_id)) LEFT JOIN `ores_model` `ores_damaging_mdl` ON (ores_damaging_mdl.oresm_is_current = '1' AND ores_damaging_mdl.oresm_name = 'damaging') LEFT JOIN `ores_classification` `ores_damaging_cls` ON ((ores_damaging_cls.oresc_model = ores_damaging_mdl.oresm_id) AND (rc_this_oldid = ores_damaging_cls.oresc_rev) AND ores_damaging_cls.oresc_class = '1') 
    -> WHERE rc_bot = '0' 
    -> AND (rc_timestamp >= '20161128000000') 
    -> AND (((user_editcount < 10) OR (user_registration > 20161201153145)) OR ((user_editcount >= 500) AND (user_registration <= 20161105153145))) 
    -> AND rc_new IN ('0','1') 
    -> ORDER BY rc_timestamp DESC 
    -> LIMIT 50;
b8e3b5057842ada84918e4731a4f0d50  -
50 rows in set (0.01 sec)

MariaDB [enwiki]> nopager
PAGER set to stdout
MariaDB [enwiki]> SHOW STATUS like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 325   |
| Handler_read_last          | 1     |
| Handler_read_next          | 98    |
| Handler_read_prev          | 76    |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
25 rows in set (0.00 sec)

400-500 rows were scanned, which is reasonable unless we want to go into the microsecond-performance.

Please rethink names used in this filter. "Newcomer" equals "non-autoconfirmed" and it's more or less OK (one can be a newcomer despite having autoconfirmed account). But "not-entirely-new" or "newest among autoconfirmed" as "Experienced" is at least confusing. It's admins/authors of 1000+ edits/long-term users who are considered to be experienced. And "More experienced" than "newest among autoconfirmed"... I'm not sure.

Maybe something like "Freshly registered", "Beginners", and "Apprentices"? I'm just brainstorming.

I agree with @tarlocesilion.

Fewer than 10 edits and 4 days of activity. -> Newcomer
More days of activity and edits than “Newcomers” but fewer than “More experienced.” (corresponds to autoconfirmed) -> Apprentice or Learner
More than 30 days of activity and 500 edits. (corresponds to extended confirmed) -> Experienced

You are not considered as experienced until you haven't made enough edits to have an active "oldest" link on your contributions list. I think that's how people define experienced users on wikis too.

@Trizek-WMF, yes, you're right as for that list.

But I'd recommend to avoid the term "experience". This filter isn't designed to find such users. There's a gradation of newbieness. As in child development stages, which, among others, describes "older children" and "adolescents", not "extremely young adults" and "less extremely young adults".

These terms are used in code and URL parameters only. Interface text can be discussed in T149385: Approved interface text for RC page interface elements.

But I agree that these terms can be confusing and become out of sync with whatever we settle on for the UI. Also the thresholds are expected to change per-wiki and over time as we learn more about the target audience of this project. They may not always line up with autoconfirmed and extended confirmed.

I'm tempted to find something more neutral (i.e. level 1, 2, 3) for the code so it doesn't mean anything more than whatever threshold is configured here and now.

Change 321744 merged by jenkins-bot:
Special:RC filter: userExpLevel

https://gerrit.wikimedia.org/r/321744

Checked in betalabs for three filters userExpLevel=experienced,newcomer,learner

(1) Using any of the filter options with 'Hide registered users' e.g. hideliu=1&userExpLevel=experienced will not display anything - "No changes during the given period matching these criteria." which is a correct behavior, in my opinion.

Special:RecentChanges&hideanons=1&hideliu=1&userExpLevel=newcomer,learner,experienced will display the same results as Special:RecentChanges&hideanons=0&hideliu=1.

I've read the discussion above and saw that there will be some UI based limitations on filter selection.

(2) Note for testing UI: Having more than 10 edits in less than 4 days, won't make a user "Experienced" (aka in 'learner' filter category). Both conditions of the use level definition should be fulfilled before users will go to a different category.

(3)
@SBisson

  • regarding indexing user_editcount - the cardinality is very low, so the query optimizer most likely won't use the index anyway. As far as I could see those long queries ( e.g. userExpLevel=experienced) use existing indexes and scan not too many records for the results.

@jmatazzoni I tested the logic and correctness (according to the specs) of the filters for Newcomer, Experienced, and Experienced types of users. Some additional testing will be done after UI implementation.

QA recommendation: Resolve.

Thanks for the recommendation @Etonkovidova . I have one question for you and @SBisson: The names of these filters changed at some point recently (I've just corrected them in the Description above). The names are now;

  • Newcomers
  • Learners
  • Experienced users

So my question is, are the old filter names enshrined in the code? If so, it seems like it would avoid confusion if we update the language in the code to match the user-facing language. What do you think?

Thanks for the recommendation @Etonkovidova . I have one question for you and @SBisson: The names of these filters changed at some point recently (I've just corrected them in the Description above). The names are now;

  • Newcomers
  • Learners
  • Experienced users

So my question is, are the old filter names enshrined in the code? If so, it seems like it would avoid confusion if we update the language in the code to match the user-facing language. What do you think?

The code is up to date with the latest user experience level names.

jmatazzoni closed this task as Resolved.Jan 3 2017, 5:22 PM
Mattflaschen-WMF added a comment.EditedMar 5 2017, 3:51 AM

If NEITHER A, B nor C are checked, then ALL are included.
If A, B and C are checked, then the result is the same: ALL are included.

I think this contradicts T156427: Implement the Conflict display states and messages which says, "If the user selects the Unregistered filter and any combination of Experience filters at the same time, then the filters cancel each other out, no results will be found"

Checking all of the user experience filters is one example of "select[] [...] any combination of Experience filters".

So if they check all the Experience filters, plus check only Unregistered, no results will be found.

But if they check none of the Experience filters, but check only Unregistered, results will be found.

(This is because the Experience filters are not full coverage).

Tentatively corrected in description.

Also, "all" (or listing them all like userExpLevel=newcomer,learner,experienced) does not follow "The filters return Registered users only" currently. I plan to adjust this in my next patch set:

https://en.wikipedia.beta.wmflabs.org/wiki/Special:RecentChanges?userExpLevel=all&hideliu=1