Investigation: Add more data points to Contributions tab (editing basics)
Open, Needs Triage, Public

Description

User stories:

As an event organizer, I want to be able to collect more data points on editing outcomes, so that I have a richer and fuller summary of what was accomplished and what gaps existed in my event.

As an event participant, I want to see more data points on what I contributed or what happened in the event overall, so that I can feel motivated by the data and have a more complete picture of what was done.

Notes:

  • Words added is much more useful than characters added. We have heard in the past that characters added may be easier to implement, but we want to first investigate words added.

People to consult:

  • Talk to @Isaac about how we can get some of this data
  • Perhaps also talk to the Editing team (talk to Val)

Acceptance Criteria:

  • Investigate the feasibility of adding data points in the Contributions tab for the following:
    • Edit data
      • Number of words added (displayed in table and summary)
        • Available on: P&E Dashboard
      • Number of references added (displayed in table and summary)
        • Available on: P&E Dashboard
      • Edit summary - only text written by editors and no automatic tags (displayed in table only)
        • Not sure if this is available in any of the main tools!

Event Timeline

ifried updated the task description.
ifried added a subscriber: Isaac.

Important question: what will users be able to do with the data? E.g., sorting. This is important because if there's no sorting or anything, we don't need to store it and we can instead compute it on the fly, which would make things much simpler (and also very different).

@Daimona, I think users would probably want to sort it and, eventually, export it so it can be viewed off the wikis.

ifried renamed this task from Investigation: Add more data points to Contributions tab to Investigation: Add more data points to Contributions tab (editing basics). Oct 16 2025, 6:48 PM
ifried updated the task description.

I think that storing would be better, because:

  • It would make things easier if we want to generate reports on this data, e.g. reports to show in Superset.
  • We probably also want to show these data points in the contribution summary, and events may have many contributions, so reading stored values from the DB would be better than computing them on the fly.
  • And also sorting as @Daimona said.

Yeah, I think for the data points remaining in this task, storing makes sense (my comment was more about other data points that have since moved to other tasks). We already have the number of edits (assuming it's a count of all contributions associated with the event; the AC don't say that). Words and references might be tricky to compute, but storing them seems trivial. Edit summaries, OTOH, can increase storage size, so they're a bit less ideal to store. In core this was mitigated years ago via normalization and the introduction of a comment table, but we can't do that here due to cross-wikiness.
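
To make the "store it so we can sort and summarize" argument concrete, here is a minimal sketch in Python using an in-memory SQLite database. The table name, column names, and sample rows are all hypothetical and are not the extension's actual schema; the point is only that once the per-contribution numbers are stored, sorting and the summary totals become plain queries.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE event_contributions (
    contrib_id       INTEGER PRIMARY KEY,
    event_id         INTEGER NOT NULL,
    wiki             TEXT    NOT NULL,
    rev_id           INTEGER NOT NULL,
    words_added      INTEGER NOT NULL,
    references_added INTEGER NOT NULL
)
"""


def build_demo_db():
    """Create an in-memory DB with a few made-up contributions for event 10."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        (1, 10, "enwiki", 1001, 120, 2),
        (2, 10, "eswiki", 2002, 45, 0),
        (3, 10, "enwiki", 1003, 300, 5),
    ]
    conn.executemany(
        "INSERT INTO event_contributions VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def contributions_sorted_by_words(conn, event_id):
    """Table view: sorting is just an ORDER BY on a stored column."""
    cur = conn.execute(
        "SELECT rev_id, words_added FROM event_contributions "
        "WHERE event_id = ? ORDER BY words_added DESC",
        (event_id,),
    )
    return cur.fetchall()


def event_summary(conn, event_id):
    """Summary cards: totals come from one aggregate query."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(words_added), 0), "
        "COALESCE(SUM(references_added), 0) "
        "FROM event_contributions WHERE event_id = ?",
        (event_id,),
    )
    return cur.fetchone()
```

Computing on the fly would instead require fetching and diffing every revision on each page view, which is why it only stays simple when no sorting or aggregation is needed.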

Also for words: if we're extracting those from the wikitext, those will likely include template names and parameter names, etc. (Just something to keep in mind)
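
A small sketch of that concern, in Python with only the standard library. The regex-based template stripping is a deliberate simplification for illustration (it is not how any of the tools mentioned in this task work, and it ignores links, tables, and other markup); the sample wikitext is made up.

```python
import re

# Matches innermost {{...}} templates (no braces inside the match).
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext):
    """Repeatedly remove innermost templates so nested ones are handled."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE_RE.sub(" ", wikitext)
    return wikitext


def naive_word_count(text):
    """Whitespace split; only meaningful for whitespace-delimited languages."""
    return len(text.split())


sample = (
    "{{Infobox person|name=Ada Lovelace|birth_date=1815}} "
    "Ada was a mathematician."
)

raw_count = naive_word_count(sample)                     # counts template pieces too
prose_count = naive_word_count(strip_templates(sample))  # prose only
```

Here `raw_count` is 7 while `prose_count` is 4: the extra three "words" are fragments of the template name and its parameters.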

Also worth considering how to keep the summary growth under control (we're going to add a bunch of cards and we also don't have vertical wrapping)

Just quickly chiming in on words/references:

  • My mwedittypes library can do this. There's also a UI/API for it if you're curious to see what it looks like: https://wiki-topic.toolforge.org/diff-tagging. That's all hosted on Cloud Services and not actively maintained, though, so let me know before using it in anything live; it's fine for exploring, prototyping, etc.
  • The references are relatively straightforward. The default wikitext-based approach (what's happening in the UI above) is just counting <ref> tags that are present in the wikitext. That means it will miss some things -- e.g., see PAWS:references-wikitext-vs-html.ipynb -- but probably good enough for the use-case of analytics. You can also do it via HTML, which will be far more accurate and is implemented in the Python library (just not exposed via the UI/API). The other difference is that on the HTML side, I distinguish between references (i.e. new sources in the reflist at the bottom) and citations (i.e. in-line usage of those references). The wikitext one is really doing citations though we could adjust to capture both I think if desired.
  • Words are more complex. Two things:
    • What is considered "text" in an article: the library currently strips out references, templates, images, lists, categories, and a few other things (code). Essentially aiming for gathering the core text in the article. This could be over-written though if you all are interested in a different set of elements. Using HTML here also brings some additional flexibility -- e.g., if you wanted to count words in infoboxes but not clean-up templates for instance.
    • How do you count "words" in text: once you have the core text, there is still the challenge of counting up words. For whitespace-delimited languages like English, that's pretty trivial (split on whitespace and, because you don't care about specifics, you don't have to worry too much about cleaning up punctuation or stuff like that). For non-whitespace-delimited languages like Chinese or Thai, it's a lot trickier. We do have another library (mwtokenizer) for doing this, but it gives you the sorts of tokens you might hear discussed in the context of LLMs -- i.e. they aren't promised to be true words but instead common sequences of characters, so sometimes full words but sometimes just chunks of words. For the moment, mwedittypes just falls back to saying how many characters were changed, but I've been meaning to incorporate the mwtokenizer logic, so I'm happy to talk about that if you're interested.
  • Of note, if you go with mwedittypes, you'd get some other elements for free -- e.g., how many images were added, how many clean-up templates were removed (HTML only), if an infobox was added (HTML only), and presumably any other element that you might want to report on.
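
For anyone wanting a feel for the wikitext-based approach described above, here is a rough standalone sketch in Python. It does not use mwedittypes or mwtokenizer and is not their API: every function name, the language set, and the "net change between two revisions" simplification are assumptions made for illustration. It counts `<ref>` tags (so citations rather than distinct references), removes ref content before measuring text, and falls back to characters for languages that are not whitespace-delimited.

```python
import re

# Opening <ref> tags, including self-closing reuses like <ref name="x" />.
# Does not match <references />.
REF_OPEN_RE = re.compile(r"<ref(?:\s[^>]*)?>", re.IGNORECASE)

# Self-closing refs, or a full <ref ...>...</ref> block, for removal
# before measuring article text.
REF_BLOCK_RE = re.compile(
    r"<ref(?:\s[^>]*)?/>|<ref(?:\s[^>]*)?>.*?</ref>",
    re.IGNORECASE | re.DOTALL,
)

# Illustrative, not exhaustive: languages not delimited by whitespace.
NON_WHITESPACE_LANGS = {"zh", "ja", "th"}


def count_refs(wikitext):
    """Count <ref> tags in wikitext (in-line citations, not unique sources)."""
    return len(REF_OPEN_RE.findall(wikitext))


def text_size(wikitext, lang):
    """Return (unit, size): words if whitespace-delimited, else characters."""
    text = REF_BLOCK_RE.sub(" ", wikitext)
    if lang in NON_WHITESPACE_LANGS:
        return ("characters", len("".join(text.split())))
    return ("words", len(text.split()))


def edit_stats(old_wikitext, new_wikitext, lang):
    """Net change between two revisions; a real diff would be more precise."""
    unit, old_size = text_size(old_wikitext, lang)
    _, new_size = text_size(new_wikitext, lang)
    return {
        "citations_added": max(
            0, count_refs(new_wikitext) - count_refs(old_wikitext)
        ),
        "unit": unit,
        "size_added": max(0, new_size - old_size),
    }
```

Example usage with made-up revisions:

```python
old = "Ada was a mathematician."
new = (
    "Ada was a mathematician.<ref>Some book, 1999.</ref> "
    'She wrote the first program.<ref name="b" />'
)
stats = edit_stats(old, new, "en")
# stats == {"citations_added": 2, "unit": "words", "size_added": 5}
```

Because this compares totals rather than diffing, an edit that removes 10 words and adds 10 different ones would report 0 words added, which is one reason a proper diff-based tool is preferable for anything beyond prototyping.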