[SPIKE] Investigate edit tag additional data storage approach
Closed, Resolved, Public

Description

Per T342189, experienced volunteers need a way to see all of the published edits an Edit Check was presented within and all of the Edit Checks that were activated within a given edit.

This way, volunteers can name patterns common among false positives and low quality/disruptive edits and propose revisions to counteract these trends.

This task involves investigating the viability of storing/layering additional metadata within the edit tags that T324733 introduced, and within other tags like them in the future.

The broader idea here is that if the approach this task is investigating proves viable, we could use the combination of edit tags and the metadata we can store "within" them to "compose" a view like Special:AbuseLog.

Open questions

  • 1. What – if any – limitations exist on what data (size, format, etc.) can be stored/associated with an edit tag?
  • 2. Assuming this data storage approach can "house" the data needed to build a view like the one T342189 describes, what – if any – performance considerations should we be aware of before moving forward with implementation?
  • 3. What – if any – adjustment(s) will we make to how Edit Check-related tags are flagged to ensure Edit Check tags don't "bleed" between discrete edits/edit sessions?

Done

  • Answers to all Open questions are documented above

Event Timeline

Notes @DLynch shared offline:

  • This ticket is useful for making it possible for volunteers to review individual edits based on the specific Edit Check(s) shown within them
  • This ticket is NOT useful for creating a dashboard that would enable the Editing Team to monitor the Edit Check system at a higher level, e.g.:
    • % of edits an Edit Check is activated within
    • Revert rate of edits any Edit Check is activated within
    • Edit completion rate of edits any Edit Check is activated within
    • etc.

The easy part: storing some data on a tag

ChangeTags can have params associated with an instance of a tag. These are stored in the DB as ct_params with a type of blob, meaning the field can hold up to 64kB of data.

It's not exposed via APIs, but exists for direct usage by other code. Current uses seem to be:

  • mw-core: used for revert-tracking -- various tags to do with reversions store a serialized EditResult
  • MassMessage: stores a hash in the massmessage-delivery tag to avoid sending duplicate messages
  • PushAll: stores some data about other wikis that a revision has been pushed to on pushall-push

All current uses store this data as JSON (via FormatJson::encode). It looks like technically you could store any string you want there, however -- ChangeTagsStore->updateTags just types it as string|null.
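As a rough illustration of the constraint above, here is a minimal Python sketch (not MediaWiki code) that serializes a params payload to JSON and checks that it fits within a BLOB. The function name `encode_tag_params` and the payload shape are hypothetical; only the JSON encoding and the roughly 64kB limit come from the notes above.

```python
import json

# A MySQL BLOB column holds at most 65,535 bytes (the "64kB" mentioned above).
BLOB_MAX_BYTES = 65_535


def encode_tag_params(params: dict) -> str:
    """Serialize params to compact JSON, rejecting payloads too large for a BLOB."""
    encoded = json.dumps(params, separators=(",", ":"), ensure_ascii=False)
    size = len(encoded.encode("utf-8"))
    if size > BLOB_MAX_BYTES:
        raise ValueError(f"params are {size} bytes; limit is {BLOB_MAX_BYTES}")
    return encoded


# Hypothetical payload shape; the field names are illustrative only.
example = encode_tag_params(
    {"checks": [{"name": "reference", "action": "accepted"}]}
)
```

Measuring the UTF-8 byte length rather than the character count matters here, since the column limit is in bytes.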

Storing this data would require some minor changes to our current tag-storage method -- we currently add tags after-save in a RecentChange_Save hook, and just use the convenient RecentChange->addTags method... which doesn't support adding this data. I don't think this would involve much of a change, however. (It might need us to fire off a separate queued update in order to save the tags after the recentchange hook has run -- I'm not sure about the exact sequencing there.)

The hard part: exposing that stored data

  1. Nothing in Special:RecentChanges is currently set up to expose this data, either to show anything about it or to search revisions based on its contents.
  2. The nature of the storage means that generically showing the ct_params contents for all tags isn't desirable -- most of it is currently just blobs of JSON data that aren't relevant to users in any way.

As such, exposing the data would need some kind of dedicated effort on our part. I think there are a few plausible paths:

  1. Set up some system of hooks that allow extensions to register tags that need special handling when they're displayed (just on Special:RecentChanges? everywhere?), and either:
    1. Add some sort of on-page display for this data
    2. Add some standardized link to a special page showing data about that tag
  2. Make a Special:RecentEditChecks page that explicitly knows about our tags, fetches recent revisions with those tags, and displays the data we know to have been attached to those edits
  3. Make a Special:RevisionEditCheck page that just takes a revisionID and shows whatever editcheck-related data might be attached to it, and let patrollers know they can use it in concert with filtering by tags on Special:RecentChanges

The latter is a lot simpler, but I could see other extensions being interested in taking advantage of a more extendable base system if we made it. GrowthExperiments, for instance, might want to encode a bunch of information about exactly what experiment was being acted on when someone makes a structured edit.
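To make the "extendable base system" idea a little more concrete, here is a minimal Python sketch of a registry in which extensions register a renderer for the tags they own, and tags without a renderer show nothing. This is a language-neutral illustration, not a MediaWiki hook design: `register_tag_renderer`, `render_tag_params` and the payload shape are all hypothetical names.

```python
import json
from typing import Callable, Dict, Optional

# Maps a tag name to a function that turns its decoded params into display text.
_renderers: Dict[str, Callable[[dict], str]] = {}


def register_tag_renderer(tag: str, renderer: Callable[[dict], str]) -> None:
    """Let an extension opt a tag in to special display handling."""
    _renderers[tag] = renderer


def render_tag_params(tag: str, raw_params: Optional[str]) -> Optional[str]:
    """Return display text for a tag's params, or None if nothing should be shown."""
    renderer = _renderers.get(tag)
    if renderer is None or raw_params is None:
        return None
    return renderer(json.loads(raw_params))


# Hypothetical registration for a single Edit Check tag.
register_tag_renderer(
    "editcheck-shown",
    lambda params: f"{len(params.get('checks', []))} check(s) shown",
)
```

Tags that never register a renderer stay hidden, which matches the point above that most existing ct_params contents are not relevant to users.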

What would we even display?

We'd need to decide on a few things. Mostly, how to usefully summarize and display what editcheck activity occurred in a revision. This could actually be quite a lot of data -- imagine someone who wrote a long article and triggered a lot of mid-edit checks, for instance. Is it useful to a patroller to know about all those, or is it just noise?

If we restricted ourselves to storing data about the pre-save checks, because we know that they're relevant to the state of the article-as-saved, we could probably show that alongside a view of the article. E.g. "<-- this reference was added in response to a check", "<-- a reference was suggested here, but rejected because: common-knowledge".

There are a lot of edge cases. It'd be easy to show "in the course of this edit, the editor was shown: 7 reference checks, 3 peacock checks, 982 paste checks", but the more detail we want to show the more we get into the weeds. (Also: privacy concerns if we try to store anything about the content that's not in the final revision...)
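The simple per-type summary is cheap to produce if the stored data is a list of check events. A minimal Python sketch, assuming a hypothetical event shape in which each event has a "name" key:

```python
from collections import Counter
from typing import Dict, List


def summarize_checks(events: List[Dict[str, str]]) -> str:
    """Count check events by name and format them, most frequent first."""
    counts = Counter(event["name"] for event in events)
    parts = [
        f"{n} {name} check{'s' if n != 1 else ''}"
        for name, n in counts.most_common()
    ]
    return ", ".join(parts)


# Illustrative data only; real events would come from the stored tag params.
events = (
    [{"name": "paste"}] * 3
    + [{"name": "reference"}] * 2
    + [{"name": "peacock"}]
)
summary = summarize_checks(events)
```

Anything richer than counts (where in the article a check fired, what the editor chose) would need more structure in each event, which is where the edge cases start.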

Implied changes to our tags

Anything that isn't a tag we reasonably think a patroller would actively want to filter Special:RecentChanges on should get rolled into just one tag. Probably called something like editcheck-shown.

Oh, and performance considerations: it depends on exactly what arrangement of tags we wind up with. The rule of thumb is that anything that's contained inside the ct_params data would be expensive to make filterable/sortable, potentially at the level where we don't want to make it available via the frontend. It'd all be available for analysis, because it's not awful if those queries take a while to run because they have to deserialize a bunch of JSON per-row.

There are MySQL functions specifically for querying things stored as JSON, but they rely on the database being set up in a way that ours isn't. There's a specific JSON column-type that they want to be used, which isn't the BLOB this data is stored as.

So if we think a very common case is going to be patrollers wanting a list of only revisions that have shown the add-reference check, it'd most likely be helpful if we had a specific tag for that check. But if we think the common flow is going to be "show me revisions where someone was shown any checks" and then they'll work through them individually, a single editcheck-shown tag would be sufficient.
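The difference between the two flows can be sketched with an in-memory SQLite table standing in for the real one. The schema below is a simplified, hypothetical stand-in (the column `ct_tag` and the payload shape are assumptions, not the actual MediaWiki schema). Filtering by tag is a plain column comparison, while filtering by something inside ct_params means decoding JSON for every candidate row.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table; not the real MediaWiki change_tag schema.
conn.execute(
    "CREATE TABLE change_tag (ct_rev_id INTEGER, ct_tag TEXT, ct_params BLOB)"
)
rows = [
    (101, "editcheck-shown", {"checks": ["reference"]}),
    (102, "editcheck-shown", {"checks": ["paste", "peacock"]}),
    (103, "other-tag", {"note": "unrelated"}),
]
conn.executemany(
    "INSERT INTO change_tag VALUES (?, ?, ?)",
    [(rev, tag, json.dumps(params).encode("utf-8")) for rev, tag, params in rows],
)


def revisions_with_tag(tag):
    """Cheap: the database can filter on the tag column directly."""
    cur = conn.execute(
        "SELECT ct_rev_id FROM change_tag WHERE ct_tag = ? ORDER BY ct_rev_id",
        (tag,),
    )
    return [row[0] for row in cur]


def revisions_with_check(tag, check):
    """Costly: every row with the tag must be fetched and its JSON decoded."""
    cur = conn.execute(
        "SELECT ct_rev_id, ct_params FROM change_tag "
        "WHERE ct_tag = ? ORDER BY ct_rev_id",
        (tag,),
    )
    return [rev for rev, raw in cur if check in json.loads(raw).get("checks", [])]
```

With a dedicated tag per check, the second function collapses into the first, at the cost of more tags to maintain.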

If we can, I'd prefer to squash everything down onto one tag. It reduces proliferation, and means we don't need to keep on making code changes every time a new check happens -- right now the VisualEditor extension needs to maintain a list of tag-names in PHP so that we can add them, which would be an annoying bottleneck for letting other extension-created checks get in on the logging. (And would be basically impossible for community-created checks to integrate with.)