Instrument how often various filters on Special:Recentchanges are used
Closed, Resolved (Public)

Description

In order to provide a baseline to compare usage of the new ORES-based filters against, we should get instrumentation about filter usage early.

We would probably want something like:

  • The state of each boolean filter (minor, bots, unregistered, registered, myself, categorization, wikidata*, nondamaging*) for each view of Special:RecentChanges
  • The effective value of the namespace filter (if set), taking into account the "Invert" and "Associated namespace" checkboxes
  • The value of the tag filter
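The "effective value" of the namespace filter in the list above could be computed roughly as follows. This is a minimal sketch, not the code from any patch: the function name and arguments are hypothetical, and it assumes the usual MediaWiki convention that a subject namespace and its talk namespace are an even/odd pair of IDs.

```python
def effective_namespaces(selected, all_namespaces, invert=False, associated=False):
    """Return the sorted namespace IDs that a view actually shows.

    selected       -- namespace ID chosen in the filter, or None if unset
    all_namespaces -- iterable of every namespace ID on the wiki (assumed known)
    invert         -- state of the "Invert" checkbox
    associated     -- state of the "Associated namespace" checkbox
    """
    if selected is None:
        return None  # no namespace filter set, nothing to log

    chosen = {selected}
    if associated:
        # Assumption: subject/talk namespaces differ only in the lowest bit.
        chosen.add(selected ^ 1)

    if invert:
        return sorted(set(all_namespaces) - chosen)
    return sorted(chosen)
```

Logging the resolved list rather than the three raw inputs would let an analyst compare views without re-implementing the checkbox logic, at the cost of a larger field.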

This task is for adding instrumentation to RC to record this data, but @Neil_P._Quinn_WMF et al should probably weigh in on how we can best instrument this.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 308091 had a related patch set uploaded (by Mooeypoo):
Add EventLogging for Special:RecentChanges filter usage

https://gerrit.wikimedia.org/r/308091

Change 308091 merged by jenkins-bot:
Add EventLogging for Special:RecentChanges filter usage

https://gerrit.wikimedia.org/r/308091

This isn't actionable now, but for posterity: once a beta feature changing the appearance of the RC filters exists, we should add a field for whether that beta feature is enabled to the logging schema.

As discussed in the meeting, that commit measures per-view overrides using the URL query string.

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

This should be taken into account when analyzing the results.
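To illustrate what "per-view overrides using the URL query string" means in practice, here is a small sketch that extracts only the filters explicitly present in a request URL. It is not the merged code. `hidebots` and `hideanons` appear elsewhere in this task; the other parameter names and the helper name are assumptions for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative subset of boolean filter parameters (assumed names).
BOOLEAN_FILTERS = ("hideminor", "hidebots", "hideanons", "hideliu", "hidemyself")


def per_view_overrides(url):
    """Return only the boolean filters that the URL sets explicitly.

    Filters absent from the query string are omitted, so defaults and
    user preferences never show up in the result.
    """
    query = parse_qs(urlsplit(url).query)
    return {
        name: query[name][-1] == "1"  # last occurrence wins
        for name in BOOLEAN_FILTERS
        if name in query
    }
```

A view with no query string yields an empty result, which is why the logged data says nothing about what the user actually saw in that case.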

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

@Mooeypoo and @Mattflaschen-WMF - in user Preferences there is 'ORES sensitivity' setting (Low and High options) - are we interested in tracking it?

Oh, that's an interesting idea, we probably should. We can already get stats on whether more people use high or low sensitivity, but without this we couldn't tell if people with the high/low sensitivity setting use the filter more/less often.

This looks really good to me! I've been trying to derive this data from the raw webrequest logs, but this will be a much cleaner and simpler way of getting the data.

As discussed in the meeting, that commit measures per-view overrides using the URL query string.

It intentionally does not track either the default settings or changes to the RC filters that the user made in their preferences.

This should be taken into account when analyzing the results.

Definitely. I'm already working on analyzing user preferences related to RC, so this will complement that.

Oh, that's an interesting idea, we probably should. We can already get stats on whether more people use high or low sensitivity, but without this we couldn't tell if people with the high/low sensitivity setting use the filter more/less often.

Interesting idea. The same reasoning applies to any preference that can be overridden by a toggle on the page: the max number of days, the max number of entries, hiding minor edits, hiding categorization, and showing Wikidata entries. But maybe we're just less interested in those?

In T144331#2623326, @Neil_P._Quinn_WMF wrote:

Interesting idea. The same reasoning applies to any preference that can be overridden by a toggle on the page: the max number of days, the max number of entries, hiding minor edits, hiding categorization, and showing Wikidata entries. But maybe we're just less interested in those?

Those aren't overridden so much as that their default values are configurable, and that's already taken into account in a way that I don't remember. @Mooeypoo can you explain that?

I'm sorry, I don't understand the question? Are we talking about tracking users' preferences alongside the requested filters they're asking? Didn't we say we don't want to do that?

Didn't we say we don't want to do that?

Yes

What I mean is: we aren't tracking users' preferences for the default values of filters alongside the effective values they chose. That's fine. However, the hidenondamaging filter has a preference that's not about whether the filter defaults to being on or off, but that influences how it works (changes the sensitivity level). That's the one that Elena proposed we add.

This is producing results now, but unfortunately they're dominated by crawlers. I'll ask the analytics people if they know of any tricks, but we should probably turn this off soon, and do something to reduce the volume before we turn it back on. This has been running on full blast (group2) for less than 24 hours and we already have 4.7 million rows.

Random idea: what if we only logged this for logged-in users?

I was about to suggest the exact same thing. If we don't restrict collection to logged-in users, we should at the very least tag the data with whether or not it came from a logged-in user.

There must be a way to tell a crawler from a human user, though? Aren't most crawlers "identifying" themselves?

Change 311162 had a related patch set uploaded (by Catrope):
Only log ChangesList filters for logged-in users

https://gerrit.wikimedia.org/r/311162

Random idea: what if we only logged this for logged-in users?

I was about to suggest the exact same thing. If we don't restrict collection to logged-in users, we should at the very least tag the data with whether or not it came from a logged-in user.

That's also a good idea, but because we're also receiving a LOT of data, I think it'll be better to not log data from anons in the first place, to keep the logging volume under control.

There must be a way to tell a crawler from a human user, though? Aren't most crawlers "identifying" themselves?

Kind of. They have User-Agent headers that contain words like BingBot or Yahoo! Slurp or blahblah Search Engine, etc. I came up with 10 words to filter out before I gave up. The Analytics people probably have a good list of crawler UAs, but if we can just drop logged-out users, why bother.
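For reference, the keyword approach described above amounts to something like the sketch below. The keyword list contains only the three examples named in the comment, not the ten that were actually tried, and the function name is hypothetical. As noted, a list like this is easy to start and hard to complete.

```python
# Substrings taken from the examples in the comment above; a real list
# would need to be much longer and maintained over time.
CRAWLER_KEYWORDS = ("bingbot", "yahoo! slurp", "search engine")


def looks_like_crawler(user_agent):
    """Return True if the User-Agent contains a known crawler keyword."""
    ua = (user_agent or "").lower()  # treat a missing header as empty
    return any(keyword in ua for keyword in CRAWLER_KEYWORDS)
```

A crawler that does not identify itself passes this check, which is part of why dropping logged-out traffic is the simpler option.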

The only reason I would say we should bother is if that data is important. Do we not want to get the data about how anonymous users (real users, not robots) use the filters on the page? If we ignore that group, will it affect our analysis? How many actual anonymous users use this page to begin with?

I think those might be valid questions we would like to answer. If the way to filter out bots is not insane, then it might be worth trying to keep this data in -- but tag filters with whether or not the user was logged in.
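The two options under discussion (drop anonymous views entirely, or keep them but tag each event) could be sketched as one function with a switch. This is an illustration only: the `isLoggedIn` field and the `log_anonymous` flag are hypothetical and are not part of the schema discussed in this task.

```python
def build_event(filters, is_logged_in, log_anonymous=False):
    """Build the event to log for one view, or None to skip logging.

    filters       -- dict of filter name -> value for this view
    is_logged_in  -- whether the viewer is a logged-in user
    log_anonymous -- hypothetical switch: keep anonymous views, but tagged
    """
    if not is_logged_in and not log_anonymous:
        return None  # option 1: never log anonymous views

    event = dict(filters)  # copy so the caller's dict is not modified
    event["isLoggedIn"] = is_logged_in  # option 2: hypothetical tag field
    return event
```

With `log_anonymous=False` the volume stays low; with `log_anonymous=True` crawler traffic is still recorded and has to be filtered out during analysis.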

Change 311162 merged by jenkins-bot:
Only log ChangesList filters for logged-in users

https://gerrit.wikimedia.org/r/311162

@Mooeypoo, the strategy Analytics uses to filter bots out of the pageview data is documented at https://meta.wikimedia.org/wiki/Research:Page_view/Tags#Spider. It's quite complex, so it's probably not worth implementing here. Fair point about legitimate anonymous editors; I may be able to look into that in my analysis.

@Catrope and @Mooeypoo, this task, as written, describes our need to create a baseline of the existing tools' usage so we can compare it to the new tools' usage. Do I need to update this task to clarify that we also want to instrument the new tools? Or should I make that a new task?

Also, above, Roan noted that "we should add a field for whether that beta feature is enabled to the logging schema." Moriel, is that the same as what you asked me to document in this morning's meeting, or is this idea different?

@jmatazzoni exactly. I'll work on this as we get closer to having a beta feature (we're currently working on the underlying code) -- I can just add another field to the logger "betaFeatureEnabled" with a true/false per user.

I think that would also cover the stats you're looking for?

Change 328447 had a related patch set uploaded (by Mooeypoo):
Add "enhancedFiltersEnabled" boolean to RC filters logger

https://gerrit.wikimedia.org/r/328447

@Mooeypoo, I just wanted to check that all the newly added filters in the beta are also instrumented already (and that I don't have to make this a separate ticket). I.e., not only the completely new filter groups -- the Experience, User Intent and Contribution Quality groups-- but all the new individual filters, like Logged Actions and Page Edits.

Also, there are new capabilities that are, I suppose, based on old filters--or are they? So, though we now have a "new" filter named "Human (not bot)", I imagine that when I click that, behind the scenes it's still just "hidebots=1". But that must not be the whole story. Because what a user COULDN'T do in the past was show ONLY bots. So maybe you did need to create a new filter, so you could do that sort of thing ("hidebots=0&hidehumans=1")?

I'd better stop before I get too confused, but my point is, please confirm you've made it so that we will be able to measure ALL the new filters and sub-filters, including those that are the new complements to old filters?

Everything is being logged right now by the parameters given - the parameters represent the filters, but are not mapped to groups. So, you'll see hideanons=1 if a user selected it, but you will not see which conceptual filter group this belongs to.

To be more specific, in your example, the logger will log both cases: You will see "hidebots=0&hidehumans=1" if that is an available parameter (which I am not actually sure it is, but anyways) -- so the system logs what you actually use. It doesn't care if you're using "new" or "old" system because the parameters were implemented in the backend regardless of the new or old system. The only thing that the "new" system has that the old one doesn't is a user interface that allows you to properly pick those filters.

If a user knows what they're doing, they can use the "new" filter-parameters in the "old" system. We cannot prevent that. The filter will log that as using the "new" parameter.

That is why the commit above adds another variable stating whether the user has the beta feature or not. That way, you can split the results between using the old and new systems without making assumptions about what filter-names they use or not use, because that's not something we can trust.

In your data, you'll see the user used filter X and has the beta feature enabled - this means they use the new system. You'll see another user that used filter Y and has the beta feature not enabled - meaning they use the "old" system. You may see another user using filter X (supposedly a 'new' filter) but having the beta feature not enabled, meaning they are still using the "old" system, they just may know what they are doing and edited the URL manually -- or they copy/pasted a link from the new system.

You can't tell whether a filter belongs to the new or the old system, because every filter can technically be used in both. You can only tell whether the user has the new system enabled.

Does that make sense?
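For anyone analyzing the data, the split described above could look roughly like this. It is a sketch under assumptions: the events are taken to be flat dicts in which filter parameters start with "hide", which may not match the real schema. Only `enhancedFiltersEnabled` is taken from the patch title.

```python
from collections import Counter


def filter_usage_by_system(events):
    """Count active filters separately for the new and the old interface.

    The split relies only on the enhancedFiltersEnabled flag, never on
    which filter names appear, since any parameter can be used in either
    system.
    """
    usage = {True: Counter(), False: Counter()}
    for event in events:
        enabled = bool(event.get("enhancedFiltersEnabled", False))
        for name, value in event.items():
            # Assumption: filter parameters are the fields named hide*.
            if name.startswith("hide") and value:
                usage[enabled][name] += 1
    return usage
```

A "new" parameter showing up in the `False` bucket is then just a user who edited the URL or followed a copied link, as described above.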

Change 328447 merged by jenkins-bot:
Add "enhancedFiltersEnabled" boolean to RC filters logger

https://gerrit.wikimedia.org/r/328447

Checked in betalabs eventlogging (deployment-eventlogging03.eqiad.wmflabs)

{"event": {"enhancedFiltersEnabled": false, "pagename": "Recentchanges"}, "recvFrom": "deployment-cache-text04.deployment-prep.eqiad.wmflabs", "revision": 16174591, "schema": "ChangesListFilters", "seqId": 730282, "timestamp": 1487885939, "userAgent": "{\"os_minor\": null, \"os_major\": null, \"device_family\": \"Other\", \"os_family\": \"Linux\", \"wmf_app_version\": \"-\", \"browser_major\": \"55\", \"browser_family\": \"Chrome\"}", "uuid": "8bb5fcf934ab58db9b0ecf45130329cb", "webHost": "en.wikipedia.beta.wmflabs.org", "wiki": "enwiki"}
  • there was no 'clientValidated' error as far as I could see
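When reading raw events like the one above, note that `userAgent` is itself a JSON-encoded string inside the outer JSON object, so it needs a second decoding pass. A minimal sketch (the helper name is hypothetical):

```python
import json


def decode_event(raw_line):
    """Parse one raw EventLogging line, including the nested userAgent."""
    record = json.loads(raw_line)
    user_agent = record.get("userAgent")
    if isinstance(user_agent, str):
        # The userAgent field is a JSON document serialized as a string.
        record["userAgent"] = json.loads(user_agent)
    return record
```

After decoding, `record["event"]["enhancedFiltersEnabled"]` and `record["userAgent"]["browser_family"]` can be read directly.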

QA recommendation: Resolve.