
Improved edit summary data in mediawiki_history
Open, Needs Triage, Public

Description

Edit summaries contain various forms of structured data that can only be identified and manipulated with regular expressions or other string-based approaches. Having this information extracted and usable in a different format would make working with this data a lot easier.

We currently know of the following types of data in edit summaries:

  1. Hashtags. These are freeform text starting with "#" and commonly used by various campaigns (e.g. #1lib1ref, #wpwp). These can be made available as an array without the starting "#", similar to how edit tags are available in revision_tags.
  2. Structured Data on Commons (/wikibase?). These are identifiable from being wrapped in "/* … */", where the parts inside have a format of "wb[keyword1]-[keyword2]:" followed by some combination of numbers and "|".
  3. GrowthExperiments structured tasks. These have a similar format to the Structured Data on Commons summaries. They start with "growthexperiments-", followed by either "addlink" or "addimage", followed by "-summary-summary:". If the keyword is "addimage", the ":" is always followed by " 1". If the keyword is "addlink", the ":" is followed by three integers separated by "|". The first integer is the number of links added in the edit.
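
To make the three formats above concrete, here is a minimal Python sketch of how they might be pulled out of a raw summary string. The function names and both regular expressions are assumptions made for illustration, not an existing implementation; in particular, what exactly counts as a hashtag (here: "#" followed by word characters and not preceded by a word character) would need to be agreed on.

```python
import re
from typing import Optional

# Assumption: a hashtag is "#" plus word characters, not glued to a preceding word
# (so the fragment in "[[Page#Section]]" is not treated as a hashtag).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")

# Assumption: structured summaries look like "/* key:args */" with a lowercase key.
AUTOCOMMENT_RE = re.compile(r"/\*\s*([a-z0-9-]+):([^*]*?)\s*\*/")


def extract_hashtags(summary: str) -> list[str]:
    """Return hashtags without the leading "#", in order of appearance."""
    return HASHTAG_RE.findall(summary)


def parse_autocomment(summary: str) -> Optional[tuple[str, list[str]]]:
    """Return (key, params) for a "/* key:a|b|c */" summary, or None."""
    match = AUTOCOMMENT_RE.search(summary)
    if match is None:
        return None
    key, raw_params = match.group(1), match.group(2).strip()
    params = raw_params.split("|") if raw_params else []
    return key, params


print(extract_hashtags("Added citation #1lib1ref #WPWP"))
# ['1lib1ref', 'WPWP']
print(parse_autocomment("/* growthexperiments-addlink-summary-summary:3|1|0 */"))
# ('growthexperiments-addlink-summary-summary', ['3', '1', '0'])
print(parse_autocomment("/* growthexperiments-addimage-summary-summary: 1 */"))
# ('growthexperiments-addimage-summary-summary', ['1'])
```

Plain section summaries such as "/* History */" contain no "key:" part and return None. A link like "[[#Section]]" would still be picked up as a hashtag, which is one of the edge cases a real implementation would have to decide on.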

What data structure is the best fit for the latter two cases is something we can figure out as part of this task. There might also be additional types of data available in edit summaries that we can identify and add to the list of examples.
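
As a starting point for that discussion, one candidate shape is a small record holding the source, the action, the raw parameters, and any field whose meaning is known. The Python sketch below assumes the summary has already been split into a key and a parameter list; the class name, field names, and "source" labels are all hypothetical, and only the "first integer is the number of links added" rule comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

GROWTH_PREFIX = "growthexperiments-"
GROWTH_SUFFIX = "-summary-summary"


@dataclass(frozen=True)
class StructuredSummary:
    source: str                        # hypothetical label: "wikibase" or "growthexperiments"
    action: str                        # e.g. "addlink", or the full "wb..." key
    params: tuple[str, ...]            # raw "|"-separated values, kept as strings
    links_added: Optional[int] = None  # only set for "addlink" summaries


def classify(key: str, params: list[str]) -> Optional[StructuredSummary]:
    """Map an already-split (key, params) pair onto the candidate record."""
    if key.startswith(GROWTH_PREFIX) and key.endswith(GROWTH_SUFFIX):
        action = key[len(GROWTH_PREFIX):-len(GROWTH_SUFFIX)]
        links = None
        # Per the description, the first integer is the number of links added.
        if action == "addlink" and params and params[0].isdigit():
            links = int(params[0])
        return StructuredSummary("growthexperiments", action, tuple(params), links)
    if key.startswith("wb"):
        return StructuredSummary("wikibase", key, tuple(params))
    return None


record = classify("growthexperiments-addlink-summary-summary", ["3", "1", "0"])
print(record.action, record.links_added)
# addlink 3
```

Keeping the raw parameters alongside the interpreted field means that values whose meaning is not yet documented (such as the second and third "addlink" integers) are not lost.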

Event Timeline

I think these are two fairly different problems that are better discussed in separate tasks. Hashtags make sense in the edit summary, but it would be nice if they were easily searchable/filterable. I filed T323875: Turn edit summary hashtags into change tags about what I think would be a good low-effort solution. The multilingual edit summaries that Wikidata, Commons structured data and some Growth features use don't contain any interesting data worth extracting. OTOH, they don't really make sense outside of MediaWiki, so they can be disruptive for patrol tooling (e.g. NavPopups will show them as-is when inspecting a difflink). T215637: Implement translatable edit summaries / multilingual comments using comment_data seems like a potential approach for improving that; I proposed another one in T323879: Action API response field for showing "partially parsed" comments (for autocomments).

> The multilingual edit summaries that Wikidata, Commons structured data and some Growth features use don't contain any interesting data worth extracting

We have to work with those summaries and occasionally extract data from them (e.g. when reporting on image caption and title description additions/translations within the Wikipedia Android app), and it would be useful to have that data processed and restructured in the mediawiki_history dataset.

I love this task as another user of hashtags and structured summaries. If adding these fields to mediawiki_history ends up being judged not feasible, I wonder if there's an alternative solution that uses the upcoming page change stream (T308017) and enriches it with these different parsings? The data would not be available in mediawiki_history then, but you would have an event table with it that could be filtered or joined with mediawiki_history.