Instrument how often various filters on Special:Recentchanges are used
Closed, Resolved · Public

Description

To provide a baseline against which to compare usage of the new ORES-based filters, we should instrument filter usage early.

We would probably want something like:

The state of each boolean filter (minor, bots, unregistered, registered, myself, categorization, wikidata*, nondamaging*) for each view of Special:RecentChanges
The effective value of the namespace filter (if set), taking into account the "Invert" and "Associated namespace" checkboxes
The value of the tag filter
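
One way to read "effective value" in the namespace item is as the set of namespace IDs that a view actually shows once both checkboxes are applied. The Python sketch below illustrates that reading. It assumes MediaWiki's usual numbering, where a subject namespace has an even ID and its talk namespace is the next odd ID. The helper name `effective_namespaces` and its parameters are hypothetical and are not taken from any patch on this task.

```python
from typing import Iterable, Set


def effective_namespaces(
    selected: int,
    invert: bool,
    associated: bool,
    all_namespaces: Iterable[int],
) -> Set[int]:
    """Return the namespace IDs a view would show (hypothetical helper)."""
    chosen = {selected}
    if associated and selected >= 0:
        # Assumption: subject namespaces are even and the matching talk
        # namespace is the next odd ID, so flipping the lowest bit pairs them.
        chosen.add(selected ^ 1)
    if invert:
        # "Invert" shows everything except the chosen namespaces.
        return set(all_namespaces) - chosen
    return chosen
```

Logging the resulting set, rather than the three raw inputs, would let an analysis compare views directly without re-deriving the checkbox logic.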

This task is for adding instrumentation to RC to record this data, but @Neil_P._Quinn_WMF et al. should probably weigh in on how we can best instrument this.

Details

Subject	Repo	Branch	Lines +/-
Add "enhancedFiltersEnabled" boolean to RC filters logger	mediawiki/extensions/WikimediaEvents	master	+3 -2
Only log ChangesList filters for logged-in users	mediawiki/extensions/WikimediaEvents	master	+5 -0
Add EventLogging for Special:RecentChanges filter usage	mediawiki/extensions/WikimediaEvents	master	+55 -0


Related Objects

Mentioned Here: T158344: Add technology so we'll be able to track usage of RC Page highlighting
T146233: The RecentChangesLinked link in the (desktop) toolbox should be nofollowed
T141319: [REQUEST] Which features of RecentChanges are most often used?

Event Timeline

Catrope created this task. Aug 31 2016, 12:41 AM

Restricted Application added a project: Collaboration-Team-Triage. Aug 31 2016, 12:41 AM

Restricted Application added a subscriber: Aklapper.

Catrope edited projects, added Collab-Team-Q1-July-Sep-2016; removed Collaboration-Team-Triage. Aug 31 2016, 12:43 AM

• jmatazzoni moved this task from Backlog to Development on the Edit-Review-Improvements board. Sep 1 2016, 7:11 PM

Should be deduplicated with T141319: [REQUEST] Which features of RecentChanges are most often used?

• jmatazzoni moved this task from Untriaged to Ready for Pickup on the Collab-Team-Q1-July-Sep-2016 board. Sep 1 2016, 7:28 PM

Change 308091 had a related patch set uploaded (by Mooeypoo):
Add EventLogging for Special:RecentChanges filter usage

https://gerrit.wikimedia.org/r/308091

gerritbot added a project: Patch-For-Review. Sep 1 2016, 11:03 PM

Mooeypoo claimed this task. Sep 1 2016, 11:04 PM

Mooeypoo moved this task from Ready for Pickup to Needs Review on the Collab-Team-Q1-July-Sep-2016 board.

Change 308091 merged by jenkins-bot:
Add EventLogging for Special:RecentChanges filter usage

https://gerrit.wikimedia.org/r/308091

Catrope moved this task from Needs Review to QA Review on the Collab-Team-Q1-July-Sep-2016 board. Sep 6 2016, 4:24 PM

This isn't actionable now, but for posterity: once a beta feature changing the appearance of the RC filters exists, we should add a field for whether that beta feature is enabled to the logging schema.

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-09-13_(1.28.0-wmf.19)). Sep 6 2016, 5:00 PM

As discussed in the meeting, that commit measures per-view overrides using the URL query string.

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

This should be taken into account when analyzing the results.
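
As a rough illustration of what "per-view overrides" means here, the Python sketch below extracts only the boolean filters that appear explicitly in a request URL and ignores anything that would come from defaults or preferences. This is not the code from the commit. The parameter list is an assumption (only a few of these names are mentioned on this task), and `per_view_overrides` is a hypothetical helper.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit

# Assumed parameter names; only some of them are named in this task.
BOOLEAN_FILTERS = (
    "hideminor", "hidebots", "hideanons", "hideliu", "hidemyself",
    "hidecategorization", "hideWikibase", "hidenondamaging",
)


def per_view_overrides(url: str) -> Dict[str, bool]:
    """Return only the boolean filters explicitly present in the URL."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    overrides = {}
    for name in BOOLEAN_FILTERS:
        if name in query:
            # parse_qs returns a list per key; use the last occurrence.
            overrides[name] = query[name][-1] == "1"
    return overrides
```

A filter that is absent from the returned dictionary was not overridden for that view, so its effective state would still depend on defaults and preferences, which is exactly what this instrumentation leaves out.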

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

@Mooeypoo and @Mattflaschen-WMF - in user Preferences there is 'ORES sensitivity' setting (Low and High options) - are we interested in tracking it?

Etonkovidova moved this task from QA Review to Product Review on the Collab-Team-Q1-July-Sep-2016 board. Sep 7 2016, 10:41 PM

In T144331#2617504, @Etonkovidova wrote:

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

@Mooeypoo and @Mattflaschen-WMF - in user Preferences there is 'ORES sensitivity' setting (Low and High options) - are we interested in tracking it?

Oh, that's an interesting idea, we probably should. We can already get stats on whether more people use high or low sensitivity, but without this we couldn't tell if people with the high/low sensitivity setting use the filter more/less often.

This looks really good to me! I've been trying to derive this data from the raw webrequest logs, but this will be a much cleaner and simpler way of getting the data.

In T144331#2613574, @Mattflaschen-WMF wrote:

As discussed in the meeting, that commit measures per-view overrides using the URL query string.

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

This should be taken into account when analyzing the results.

Definitely. I'm already working on analyzing user preferences related to RC, so this will complement that.

In T144331#2617524, @Catrope wrote:

Oh, that's an interesting idea, we probably should. We can already get stats on whether more people use high or low sensitivity, but without this we couldn't tell if people with the high/low sensitivity setting use the filter more/less often.

Interesting idea. The same reasoning applies to any preference that can be overridden by a toggle on the page: the max number of days, the max number of entries, hiding minor edits, hiding categorization, and showing Wikidata entries. But maybe we're just less interested in those?

In T144331#2623326, @Neil_P._Quinn_WMF wrote:

Interesting idea. The same reasoning applies to any preference that can be overridden by a toggle on the page: the max number of days, the max number of entries, hiding minor edits, hiding categorization, and showing Wikidata entries. But maybe we're just less interested in those?

Those aren't overridden so much as that their default values are configurable, and that's already taken into account in a way that I don't remember. @Mooeypoo can you explain that?

In T144331#2623398, @Catrope wrote:

In T144331#2623326, @Neil_P._Quinn_WMF wrote:

Interesting idea. The same reasoning applies to any preference that can be overridden by a toggle on the page: the max number of days, the max number of entries, hiding minor edits, hiding categorization, and showing Wikidata entries. But maybe we're just less interested in those?

Those aren't overridden so much as that their default values are configurable, and that's already taken into account in a way that I don't remember. @Mooeypoo can you explain that?

I'm sorry, I don't understand the question. Are we talking about tracking users' preferences alongside the filters they request? Didn't we say we don't want to do that?

In T144331#2624055, @Mooeypoo wrote:

Didn't we say we don't want to do that?

Yes

What I mean is: we aren't tracking users' preferences for the default values of filters alongside the effective values they chose. That's fine. However, the hidenondamaging filter has a preference that's not about whether the filter defaults to being on or off, but that influences how it works (changes the sensitivity level). That's the one that Elena proposed we add.

• jmatazzoni edited projects, added Edit-Review-Improvements-RC-Page; removed Edit-Review-Improvements. Sep 12 2016, 10:55 PM

• jmatazzoni moved this task from Backlog to Development on the Edit-Review-Improvements-RC-Page board. Sep 13 2016, 12:09 AM

nshahquinn-wmf moved this task from Backlog to Blocked on the Contributors-Analysis board. Sep 13 2016, 5:45 PM

This is producing results now, but unfortunately they're dominated by crawlers. I'll ask the analytics people if they know of any tricks, but we should probably turn this off soon, and do something to reduce the volume before we turn it back on. This has been running on full blast (group2) for less than 24 hours and we already have 4.7 million rows.

Random idea: what if we only logged this for logged-in users?

In T144331#2644110, @Catrope wrote:

Random idea: what if we only logged this for logged-in users?

I was about to suggest the exact same thing. If we don't restrict collection to logged-in users, we should at the very least tag the data with whether or not it came from a logged-in user.

There must be a way to tell a crawler from a human user, though? Don't most crawlers "identify" themselves?

Change 311162 had a related patch set uploaded (by Catrope):
Only log ChangesList filters for logged-in users

https://gerrit.wikimedia.org/r/311162

In T144331#2644118, @Mooeypoo wrote:

In T144331#2644110, @Catrope wrote:

Random idea: what if we only logged this for logged-in users?

I was about to suggest the exact same thing. If we don't restrict collection to logged-in users, we should at the very least tag the data with whether or not it came from a logged-in user.

That's also a good idea, but because we're also receiving a LOT of data, I think it'll be better to not log data from anons in the first place, to keep the logging volume under control.

There must be a way to tell a crawler from a human user, though? Don't most crawlers "identify" themselves?

Kind of. They have User-Agent headers that contain words like BingBot or Yahoo! Slurp or blahblah Search Engine, etc. I came up with 10 words to filter out before I gave up. The Analytics people probably have a good list of crawler UAs, but if we can just drop logged-out users, why bother.
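
The keyword approach described in this comment could look roughly like the Python sketch below. It is deliberately naive: the keyword list contains only the three examples named here, `looks_like_crawler` is a hypothetical helper, and it is not the strategy the Analytics team uses.

```python
# Only the example keywords mentioned in this comment; a real list would be longer.
CRAWLER_KEYWORDS = ("bingbot", "yahoo! slurp", "search engine")


def looks_like_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match against known crawler keywords."""
    ua = user_agent.lower()
    return any(keyword in ua for keyword in CRAWLER_KEYWORDS)
```

A check like this only catches crawlers that announce themselves with a known keyword, which is why an open-ended list quickly becomes tedious to maintain.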

In T144331#2644152, @Catrope wrote:

Kind of. They have User-Agent headers that contain words like BingBot or Yahoo! Slurp or blahblah Search Engine, etc. I came up with 10 words to filter out before I gave up. The Analytics people probably have a good list of crawler UAs, but if we can just drop logged-out users, why bother.

The only reason I would say we should bother is if that data is important. Do we not want to get the data about how anonymous users (real users, not robots) use the filters on the page? If we ignore that group, will it affect our analysis? How many actual anonymous users use this page to begin with?

I think those might be valid questions we would like to answer. If the way to filter out bots is not insane, then it might be worth trying to keep this data in -- but tag filters with whether or not the user was logged in.

Change 311162 merged by jenkins-bot:
Only log ChangesList filters for logged-in users

https://gerrit.wikimedia.org/r/311162

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-09-20_(1.28.0-wmf.20)). Sep 19 2016, 8:00 PM

@Mooeypoo, the strategy Analytics uses to filter bots out of the pageview data is documented at https://meta.wikimedia.org/wiki/Research:Page_view/Tags#Spider. It's quite complex, so it's probably not worth implementing here. Fair point about legitimate anonymous editors; I may be able to look into that in my analysis.

I've added T146233: The RecentChangesLinked link in the (desktop) toolbox should be nofollowed to fix the root issue, BTW.

• jmatazzoni edited projects, added Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016); removed Collab-Team-Q1-July-Sep-2016. Oct 5 2016, 1:07 AM

• jmatazzoni moved this task from Untriaged to Product Review on the Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016) board.

@Catrope and @Mooeypoo, this task, as written, describes our need to create a baseline of the existing tools' usage so we can compare it to the new tools' usage. Do I need to update this task to clarify that we also want to instrument the new tools? Or should I make that a new task?

Also, above, Roan noted that "we should add a field for whether that beta feature is enabled to the logging schema." Moriel, is that the same as what you asked me to document in this morning's meeting, or is this idea different?

• jmatazzoni moved this task from Product Review to QA Review on the Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016) board. Dec 21 2016, 12:59 AM

In T144331#2888927, @jmatazzoni wrote:

@Catrope and @Mooeypoo, this task, as written, describes our need to create a baseline of the existing tools' usage so we can compare it to the new tools' usage. Do I need to update this task to clarify that we also want to instrument the new tools? Or should I make that a new task?

Also, above, Roan noted that "we should add a field for whether that beta feature is enabled to the logging schema." Moriel, is that the same as what you asked me to document in this morning's meeting, or is this idea different?

@jmatazzoni exactly. I'll work on this as we get closer to having a beta feature (we're currently working on the underlying code) -- I can just add another field, "betaFeatureEnabled", to the logger, with a true/false value per user.

I think that would also cover the stats you're looking for?

Change 328447 had a related patch set uploaded (by Mooeypoo):
Add "enhancedFiltersEnabled" boolean to RC filters logger

https://gerrit.wikimedia.org/r/328447

Mooeypoo moved this task from QA Review to Needs Review on the Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016) board. Dec 21 2016, 1:21 AM

@Mooeypoo, I just wanted to check that all the newly added filters in the beta are also instrumented already (and that I don't have to make this a separate ticket). I.e., not only the completely new filter groups -- the Experience, User Intent and Contribution Quality groups-- but all the new individual filters, like Logged Actions and Page Edits.

Also, there are new capabilities that are, I suppose, based on old filters--or are they? So, though we now have a "new" filter named "Human (not bot)", I imagine that when I click that, behind the scenes it's still just "hidebots=1". But that must not be the whole story. Because what a user COULDN'T do in the past was show ONLY bots. So maybe you did need to create a new filter, so you could do that sort of thing ("hidebots=0&hidehumans=1")?

I'd better stop before I get too confused, but my point is: please confirm you've made it so that we will be able to measure ALL the new filters and sub-filters, including those that are the new complements to old filters.

Everything is being logged right now by the parameters given - the parameters represent the filters, but are not mapped to groups. So, you'll see hideanons=1 if a user selected it, but you will not see which conceptual filter group this belongs to.

To be more specific, in your example, the logger will log both cases: you will see "hidebots=0&hidehumans=1" if that is an available parameter (I am not actually sure it is, but anyway), so the system logs what you actually use. It doesn't care whether you're using the "new" or the "old" system, because the parameters were implemented in the backend regardless of which system is in use. The only thing that the "new" system has that the old one doesn't is a user interface that allows you to properly pick those filters.

If a user knows what they're doing, they can use the "new" filter parameters in the "old" system. We cannot prevent that. The logger will record that as using the "new" parameter.

That is why the commit above adds another variable stating whether the user has the beta feature enabled or not. That way, you can split the results between the old and new systems without making assumptions about which filter names users do or don't use, because that's not something we can trust.

In your data, you'll see that one user used filter X and has the beta feature enabled, which means they use the new system. You'll see another user who used filter Y and does not have the beta feature enabled, meaning they use the "old" system. You may see yet another user using filter X (supposedly a 'new' filter) without the beta feature enabled, meaning they are still using the "old" system; they may just know what they are doing and have edited the URL manually, or they copy/pasted a link from the new system.

You can't tell whether a filter belongs to the new or the old system, because every filter can technically be used in both. You can only tell whether the user has the new system enabled or not.
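
To make the split concrete, here is a hedged Python sketch of how an analysis might bucket logged events by the "enhancedFiltersEnabled" flag and count which parameters appear in each bucket. It assumes each event is a flat mapping in which filter parameters sit alongside "enhancedFiltersEnabled" and "pagename". That layout and the helper name `filter_usage_by_system` are assumptions for illustration, not a description of the actual schema.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Fields that describe the view rather than a filter (assumed layout).
NON_FILTER_FIELDS = {"enhancedFiltersEnabled", "pagename"}


def filter_usage_by_system(events: Iterable[Mapping]) -> Dict[bool, Counter]:
    """Count logged filter parameters, keyed by whether the beta feature is on."""
    usage = {True: Counter(), False: Counter()}
    for event in events:
        enabled = bool(event.get("enhancedFiltersEnabled", False))
        for name in event:
            if name not in NON_FILTER_FIELDS:
                usage[enabled][name] += 1
    return usage
```

The counts are keyed on the flag alone, so a "new" parameter typed into the URL by a user of the old interface lands in the `False` bucket, matching the reasoning above.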

Does that make sense?

Sounds good.

Change 328447 merged by jenkins-bot:
Add "enhancedFiltersEnabled" boolean to RC filters logger

https://gerrit.wikimedia.org/r/328447

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2017-01-17_(1.29.0-wmf.8)). Jan 5 2017, 7:00 PM

• jmatazzoni edited projects, added Collaboration-Team-Triage (Collab-Team-Q3-Jan-Mar-2017); removed Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016). Jan 5 2017, 10:41 PM

• jmatazzoni moved this task from Untriaged to Blocked on the Collaboration-Team-Triage (Collab-Team-Q3-Jan-Mar-2017) board. Jan 5 2017, 10:46 PM

@Mooeypoo says this is done, except for Highlighting, which is covered in T158344.

Checked in betalabs eventlogging (deployment-eventlogging03.eqiad.wmflabs):

'enhancedFiltersEnabled' has been added to https://meta.wikimedia.org/wiki/Schema:ChangesListFilters
In /srv/log/eventlogging/all-events.log, each entry for RC has "enhancedFiltersEnabled": false, e.g.

{"event": {"enhancedFiltersEnabled": false, "pagename": "Recentchanges"}, "recvFrom": "deployment-cache-text04.deployment-prep.eqiad.wmflabs", "revision": 16174591, "schema": "ChangesListFilters", "seqId": 730282, "timestamp": 1487885939, "userAgent": "{\"os_minor\": null, \"os_major\": null, \"device_family\": \"Other\", \"os_family\": \"Linux\", \"wmf_app_version\": \"-\", \"browser_major\": \"55\", \"browser_family\": \"Chrome\"}", "uuid": "8bb5fcf934ab58db9b0ecf45130329cb", "webHost": "en.wikipedia.beta.wmflabs.org", "wiki": "enwiki"}

there was no 'clientValidated' error as far as I could see
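
For anyone reading such a log line programmatically, note that the "userAgent" value in the record quoted above is itself a JSON document stored as a string, so it needs a second decode. A minimal Python sketch follows, using a trimmed copy of that record. The helper name `parse_record` is hypothetical.

```python
import json

# Trimmed copy of the record quoted above (several fields omitted for brevity).
RAW_LINE = r'''{"event": {"enhancedFiltersEnabled": false, "pagename": "Recentchanges"}, "schema": "ChangesListFilters", "userAgent": "{\"os_family\": \"Linux\", \"browser_major\": \"55\", \"browser_family\": \"Chrome\"}", "wiki": "enwiki"}'''


def parse_record(line: str) -> dict:
    """Decode one log line, including the nested userAgent JSON string."""
    record = json.loads(line)
    # userAgent is serialized JSON inside the outer JSON, so decode it again.
    record["userAgent"] = json.loads(record["userAgent"])
    return record


record = parse_record(RAW_LINE)
```

After parsing, `record["event"]["enhancedFiltersEnabled"]` and `record["userAgent"]["browser_family"]` can be read as ordinary dictionary entries.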

QA recommendation: Resolve.

Etonkovidova moved this task from QA Review to Product Review on the Collaboration-Team-Triage (Collab-Team-Q3-Jan-Mar-2017) board. Feb 23 2017, 11:08 PM

• jmatazzoni closed this task as Resolved. Feb 23 2017, 11:24 PM

nshahquinn-wmf moved this task from Blocked to Radar on the Contributors-Analysis board. Mar 29 2018, 9:05 AM
