Edit summaries contain various forms of structured data that currently have to be identified and manipulated with regular expressions or other string-based approaches. Having this information extracted and usable in a structured format would make working with this data a lot easier.
We currently know of the following types of data in edit summaries:
- Hashtags. These are freeform text starting with "#" and commonly used by various campaigns (e.g. #1lib1ref, #wpwp). These can be made available as an array without the starting "#", similar to how edit tags are available in revision_tags.
- Structured Data on Commons (/wikibase?). These are identifiable from being wrapped in "/* … */", where the parts inside have a format of "wb[keyword1]-[keyword2]:" followed by some combination of numbers and "|".
- GrowthExperiments structured tasks. These have a similar format to the Structured Data on Commons summaries. They start with "growthexperiments-" followed by either "addlink" or "addimage", followed by "-summary-summary:". If the keyword is "addimage", the ":" is always followed by " 1". If the keyword is "addlink", the ":" is followed by three integers separated by "|". The first integer is the number of links added in the edit.
Which data structure best fits the latter two cases is something we can figure out as part of this task. There might also be additional types of data in edit summaries that we can identify and add to the list above.
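To make the formats listed above more concrete, here is a rough Python sketch of how the three types could be pulled out of a summary string. It is only an illustration built from the descriptions in this task: the function names, the returned dictionary keys, the rule that a hashtag is "#" followed by word characters, and the assumption that the wb keywords are lowercase letters are all hypothetical and would need checking against real summaries.

```python
import re

# Assumption: a hashtag is "#" followed by word characters and is not
# preceded by a word character (so "[[Page#Section]]" is skipped).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")

# Assumption: both keywords consist of lowercase letters only.
WIKIBASE_RE = re.compile(r"/\*\s*wb([a-z]+)-([a-z]+):([\d|]+)")

# Follows the description above; the surrounding "/* ... */" is not required here.
GROWTH_RE = re.compile(
    r"growthexperiments-(addlink|addimage)-summary-summary:\s*(\d+(?:\|\d+)*)"
)


def extract_hashtags(summary):
    """Return hashtags as a list without the leading '#'."""
    return HASHTAG_RE.findall(summary)


def parse_wikibase(summary):
    """Return the two keywords and the raw payload, or None if absent."""
    match = WIKIBASE_RE.search(summary)
    if match is None:
        return None
    keyword1, keyword2, payload = match.groups()
    # The payload is kept as a string because its structure is still open.
    return {"keyword1": keyword1, "keyword2": keyword2, "payload": payload}


def parse_growth_task(summary):
    """Return the task type and its integer values, or None if absent."""
    match = GROWTH_RE.search(summary)
    if match is None:
        return None
    task, raw_values = match.groups()
    values = [int(value) for value in raw_values.split("|")]
    result = {"task": task, "values": values}
    if task == "addlink":
        # Per the description, the first integer is the number of links added.
        result["links_added"] = values[0]
    return result
```

With made-up input strings that follow the descriptions above, `extract_hashtags("Added a citation #1lib1ref #wpwp")` gives `["1lib1ref", "wpwp"]`, and `parse_growth_task("/* growthexperiments-addlink-summary-summary:3|0|1 */")` gives a dictionary with `links_added` set to 3. Flat dictionaries are used here only for illustration and are not a proposal for the final data structure.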