
Implement data instrumentation to monitor how Nuke users filter pages to delete
Open, Needs TriagePublic

Description

Background
Administrators can use Special:Nuke (added by Extension:Nuke) to mass delete pages created on their Wikimedia project. The typical use case for this is vandalism - a user mass-created pages that need to be deleted, and this tool allows administrators to clean these pages up in a few clicks, rather than needing to delete each page individually.

When listing pages for deletion, users have a number of filtering options:

Screenshot 2024-12-05 at 15.22.17.png (232 KB): the Special:Nuke page-listing form showing its filter fields

There are currently three primary options administrators have when filtering:

  • Leave both the username and SQL LIKE fields blank: All recently created pages are returned.
  • Add a username: All pages recently created by this specific user are returned.
  • Add a pattern to the SQL LIKE filter: All recently created pages whose titles match this pattern are returned.

They can also combine both the username and SQL LIKE filters to retrieve pages matching both.
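For readers unfamiliar with SQL LIKE semantics, the filter matches wildcard patterns rather than full regular expressions: `%` matches any sequence of characters and `_` matches exactly one. A minimal sketch of how the two filters combine, using made-up page titles and usernames:

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an anchored regex:
    % matches any character sequence, _ matches exactly one character."""
    return "^" + re.escape(pattern).replace("%", ".*").replace("_", ".") + "$"

# Made-up recently created pages: (title, creator)
pages = [
    ("Spam page 1", "VandalA"),
    ("Spam page 2", "VandalB"),
    ("Legitimate article", "GoodEditor"),
]

def nuke_filter(pages, username=None, like=None):
    """Return titles matching both filters; a blank filter matches everything."""
    rx = re.compile(like_to_regex(like)) if like else None
    return [
        title for title, creator in pages
        if (username is None or creator == username)
        and (rx is None or rx.match(title))
    ]

print(nuke_filter(pages))                    # all three pages
print(nuke_filter(pages, like="Spam%"))      # both spam pages
print(nuke_filter(pages, username="VandalA", like="Spam%"))  # only VandalA's page
```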

It is currently almost impossible to know how users filtered page creations to delete them - did they use the username filter, the SQL LIKE filter, the namespace filter, and/or just target all recently created pages?

We want to know this information for two reasons. The first is performance. In response to user requests we want to both increase the length of time over which Nuke can retrieve pages for deletion (currently just 30 days; T380846), and add additional filtering options (e.g. T95797, T378488). We have found (T380846#10379277) that we are already running into performance problems when using the SQL LIKE filter, and are concerned that adding more filters will make the situation worse. This has already caused complications when attempting to increase the max age of pages to be deleted. We want to know if anyone uses the SQL LIKE filter, and if so for what reason, to see if we might be able to solve those use cases in other tools and potentially remove this filter from Nuke.

The second reason is to evaluate new features - when we add these new filtering options we would like to know how many administrators are using them so that we can evaluate their impact.

On top of measuring which filters are used, we could also use this instrumentation to directly measure Nuke's performance. By emitting another event, linked to the first by an ID, when pages are returned, we can measure how long it took to present the user with the list of pages.
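The two-event idea above can be sketched as follows; the stream names, field names, and `emit` helper are hypothetical stand-ins, not the real Metrics Platform API:

```python
import time
import uuid

EVENTS = []  # captured here in place of a real Metrics Platform submit call

def emit(stream: str, event: dict) -> None:
    """Stand-in for a Metrics Platform submit call (hypothetical)."""
    EVENTS.append((stream, event))

def run_query(filters: dict) -> list:
    """Stub for the actual recent-pages query."""
    return ["Spam page 1", "Spam page 2"]

def list_pages(filters: dict) -> list:
    """Emit one event when the listing starts and a second, linked by a
    shared ID, when results come back, so latency can be measured."""
    request_id = str(uuid.uuid4())  # links the two events
    emit("mediawiki.nuke_list", {"request_id": request_id, **filters})

    start = time.monotonic()
    results = run_query(filters)
    elapsed_ms = int((time.monotonic() - start) * 1000)

    emit("mediawiki.nuke_list_results", {
        "request_id": request_id,
        "result_count": len(results),
        "query_time_ms": elapsed_ms,
    })
    return results

list_pages({"pattern": "Spam%"})
```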

Proposed implementation
Each time a user clicks 'List pages' we want to collect the value of each field in the form. This tells us both whether a filter was used and, if so, what value it was given - in other words, whether a field is used, and how.
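A minimal sketch of what such a payload could look like, assuming illustrative field names (the eventual schema would be decided later in this task):

```python
# Hypothetical payload built when the user clicks 'List pages'; field
# names are illustrative, not the eventual schema.
def build_event(username: str, like_pattern: str, namespace: str, limit: int) -> dict:
    return {
        # Raw values (None when the field was left blank):
        "target_username": username or None,
        "pattern": like_pattern or None,
        "namespace": namespace,          # e.g. "all" or "Talk"
        "limit": limit,
        # Presence flags make "was this filter used?" a simple count:
        "used_username_filter": bool(username),
        "used_pattern_filter": bool(like_pattern),
    }

event = build_event(username="", like_pattern="Spam%", namespace="all", limit=500)
```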

Contextual attributes
We would like to collect the following optional attributes for each event:

  • mediawiki_database - so that we can evaluate differences by Wikimedia project type.
  • performer_session_id - so that we can identify individual users performing many listings, if that pattern emerges.

Data collection risk tier
Low Risk (data collection activity log form submitted).

Data retention plan
Standard Metrics Platform retention.

Event Timeline

Samwalton9-WMF renamed this task from Collect data on how Nuke users filtered pages to delete to Collect data on how Nuke users filter pages to delete.Sep 24 2024, 1:04 PM
Samwalton9-WMF added a subscriber: VirginiaPoundstone.

We discussed that this seems like the kind of thing the Metrics Platform could be used for. @VirginiaPoundstone does that seem accurate? We have a simple form with a few fields, and want to log data about which form fields were used (perhaps also what their values were) in a given query. Data volume should be relatively low since this is an admin-only form.

@Samwalton9-WMF Yes! This seems like something that would be a great use of the Metrics Platform tools. It sounds like you are interested in just gathering baseline product health data for now (as opposed to testing a couple of variations of the form).

If you haven't already, have a look at the how-to docs and let us know if you have questions; we are happy to help! You can ping us with quick questions in the Metrics Platform Slack channel.

Samwalton9-WMF renamed this task from Collect data on how Nuke users filter pages to delete to Implement data instrumentation to monitor how Nuke users filter pages to delete.Mon, Dec 9, 12:02 PM

@VirginiaPoundstone Sam W. just attended a consultation with Megan and me, and one question came up which touches on our ongoing discussion of data bagging/stuffing.

The information that Moderator Tools would like to collect with events is not supported by the base schema and would require a custom schema fragment because they'd need to collect:

  • The value used in the username / IP address filter (if any)
  • The value used in the SQL LIKE filter (if any)
  • The namespace selected (e.g. "all" or "Talk")
  • The number of pages to retrieve
  • The time it took to retrieve the pages

Our recommendation would be to follow https://wikitech.wikimedia.org/wiki/Metrics_Platform/Custom_schemas to explicitly model that data rather than stuff a JSON blob into, say, action_context, and I just wanted to verify that your team concurs with that recommendation.
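For illustration, a custom schema fragment covering the fields above might look roughly like this; the title, field names, and descriptions are assumptions for the sketch (and the $ref to the Metrics Platform base schema fragment is omitted), not a submitted schema:

```yaml
# Illustrative sketch only, not a submitted schema.
title: analytics/mediawiki/nuke/list_pages
description: Fired when an administrator lists pages for deletion on Special:Nuke.
type: object
properties:
  target_username:
    type: string
    description: Value of the username / IP address filter, if any.
  pattern:
    type: string
    description: Value of the SQL LIKE filter, if any.
  namespace:
    type: string
    description: Selected namespace, e.g. "all" or "Talk".
  limit:
    type: integer
    description: Number of pages to retrieve.
  query_time_ms:
    type: integer
    description: Time taken to retrieve the pages, in milliseconds.
```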


100% concur; thanks @mpopov!