Investigation: Add more data points to Contributions tab (editing basics)
Closed, ResolvedPublic

Description

User stories:

As an event organizer, I want to be able to collect more data points on editing outcomes, so that I have a richer and fuller summary of what was accomplished and what gaps existed in my event.

As an event participant, I want to see more data points on what I contributed or what happened in the event overall, so that I can feel motivated by the data and have a more complete picture of what was done.

Notes:

  • Words added is much more useful than characters added. We have heard in the past that characters added may be easier to implement, but we want to first investigate words added.

People to consult:

  • Talk to @Isaac about how we can get some of this data
  • Perhaps also talk to the Editing team (talk to Val)

Acceptance Criteria:
  • Investigate the feasibility of adding data points in the Contributions tab for the following:
    • Edit data
      • Number of words added (displayed in table and summary)
        • Available on: P&E Dashboard
      • Number of references added (displayed in table and summary)
        • Available on: P&E Dashboard
      • Edit summary - only text written by editors and no automatic tags (displayed in table only)
        • Not sure if this is available in any of the main tools!

Event Timeline

ifried updated the task description.
ifried added a subscriber: Isaac.

Important question: what will users be able to do with the data? E.g., sorting. This is important because if there's no sorting or anything, we don't need to store it and we can instead compute it on the fly, which would make things much simpler (and also very different).

@Daimona, I think users would probably want to sort it and, eventually, export it so it can be viewed off the wikis.

ifried renamed this task from Investigation: Add more data points to Contributions tab to Investigation: Add more data points to Contributions tab (editing basics). (Oct 16 2025, 6:48 PM)
ifried updated the task description.

I think that storing would be better, because:

  • It would make things easier if we want to generate reports on this data, e.g., reports to show in Superset
  • We probably also want to show these data points in the contribution summary, and events may have many contributions, so reading stored values from the DB would be better than computing them on the fly.
  • And also sorting as @Daimona said.

Yeah, I think for the data points remaining in this task, storing makes sense (my comment was more about other data points that have since moved to other tasks). We already have the number of edits (assuming it's a count of all contributions associated with the event; the AC don't say that). Words and references might be tricky to compute, but storing them seems trivial. Edit summaries, OTOH, can increase storage size, so they are a bit less ideal to store. In core this was mitigated years ago via normalization and the introduction of a comment table, but we can't do that here due to cross-wikiness.

Also for words: if we're extracting those from the wikitext, those will likely include template names and parameter names, etc. (Just something to keep in mind)

Also worth considering how to keep the summary growth under control (we're going to add a bunch of cards and we also don't have vertical wrapping)

Just quickly chiming in on words/references:

  • My mwedittypes library can do this. There's also a UI/API for it if you're curious to see what it looks like: https://wiki-topic.toolforge.org/diff-tagging. That's all hosted on Cloud Services and not actively maintained though so let me know before using it in anything live but fine for exploring, prototyping etc.
  • The references are relatively straightforward. The default wikitext-based approach (what's happening in the UI above) is just counting <ref> tags that are present in the wikitext. That means it will miss some things -- e.g., see PAWS:references-wikitext-vs-html.ipynb -- but probably good enough for the use-case of analytics. You can also do it via HTML, which will be far more accurate and is implemented in the Python library (just not exposed via the UI/API). The other difference is that on the HTML side, I distinguish between references (i.e. new sources in the reflist at the bottom) and citations (i.e. in-line usage of those references). The wikitext one is really doing citations though we could adjust to capture both I think if desired.
  • Words are more complex. Two things:
    • What is considered "text" in an article: the library currently strips out references, templates, images, lists, categories, and a few other things (code). Essentially aiming for gathering the core text in the article. This could be overridden though if you all are interested in a different set of elements. Using HTML here also brings some additional flexibility -- e.g., if you wanted to count words in infoboxes but not clean-up templates for instance.
    • How do you count "words" in text: once you have the core text, there is still the challenge of counting up words. For whitespace-delimited languages like English, that's pretty trivial (split on whitespace and because you don't care about specifics, you don't have to worry too much about cleaning up punctuation or stuff like that). For non-whitespace-delimited languages like Chinese or Thai, it's a lot trickier. We do have another library (mwtokenizer) for doing this but it's giving you the sorts of tokens you might hear discussed in the context of LLMs -- i.e. they aren't promised to be true words but instead common sequences-of-characters so sometimes full words but sometimes just chunks of words. For the moment, mwedittypes just falls back to saying how many characters were changed but I've been meaning to incorporate in the mwtokenizer logic so happy to talk about that if you're interested.
  • Of note, if you go with mwedittypes, you'd get some other elements for free -- e.g., how many images were added, how many clean-up templates were removed (HTML only), if an infobox was added (HTML only), and presumably any other element that you might want to report on.
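
As a rough illustration of the wikitext-based approach described above (this is not the mwedittypes implementation), the sketch below counts opening <ref> tags in the parent and current wikitext and takes the difference, and counts words by splitting on whitespace. The function names and the regular expression are assumptions made for illustration. The result is a net change rather than a strict "added" count, references produced by templates are missed, and whitespace splitting is only meaningful for whitespace-delimited languages.

```python
import re

# Matches an opening <ref> tag, with or without attributes, including the
# self-closing reuse form <ref name="x" />. It deliberately does not match
# </ref> or <references />. Illustrative pattern, not taken from any library.
REF_TAG = re.compile(r"<ref(?:\s[^>]*)?/?>", re.IGNORECASE)


def count_ref_tags(wikitext: str) -> int:
    """Count in-line <ref> usages (citations) in raw wikitext."""
    return len(REF_TAG.findall(wikitext))


def count_words(text: str) -> int:
    """Naive word count: split on any run of whitespace."""
    return len(text.split())


def net_refs_changed(parent_wikitext: str, current_wikitext: str) -> int:
    """Net change in <ref> tags between a parent revision and its child."""
    return count_ref_tags(current_wikitext) - count_ref_tags(parent_wikitext)
```

For a page creation, passing an empty string as the parent wikitext makes the net change equal to the full count of the new revision.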

Number of words added (displayed in table and summary)
The P&E Dashboard "solves" this problem in a brute-force, approximate way: using research that only applies to English Wikipedia, an average bytes-per-word ratio has been derived and applied bluntly to get an estimated number. This could be a very low-effort first draft for us.
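
A minimal sketch of that kind of estimate, assuming the only input is the byte delta of an edit. The function name is hypothetical and the ratio is a caller-supplied parameter; the actual constant used by the P&E Dashboard is not stated in this task and is not reproduced here.

```python
def estimate_words_added(bytes_added: int, bytes_per_word: float) -> int:
    """Estimate words added from a byte delta using an average ratio.

    bytes_per_word is an assumption supplied by the caller; a ratio derived
    from English text will not transfer well to other languages or scripts.
    """
    if bytes_per_word <= 0:
        raise ValueError("bytes_per_word must be positive")
    # Edits that remove content are reported as zero words added.
    return max(0, round(bytes_added / bytes_per_word))
```

For example, with an illustrative ratio of 6 bytes per word, an edit that adds 600 bytes would be reported as roughly 100 words.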

Defining what a “word” is seems to be an extremely difficult problem, and some languages don’t really map to the concept well (Chinese being a good example).

The PHP Intl extension appears to have decent locale-aware word boundaries which allow us to do things like:

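    $words = IntlBreakIterator::createWordInstance( 'zh' );
    // Note: this sample string is Japanese, although the locale above is 'zh'.
    $words->setText( '最適なツール' );
    $count = 0;
    // Count only boundaries that end a real word (WORD_NONE marks spaces/punctuation).
    foreach ( $words as $offset ) {
        if ( IntlBreakIterator::WORD_NONE !== $words->getRuleStatus() ) {
            $count++;
        }
    }
    printf( "%u words", $count );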

This gives the expected response, but a Chinese speaker would need to confirm it. Since the Intl extension is a wrapper around the ICU APIs, I expect it to be at least a defensible answer.

Number of references added (displayed in table and summary)
The P&E Dashboard uses ORES to get this information. Given that ORES is being deprecated, I have looked into its replacement, Lift Wing. If you request the extended output, the Lift Wing language-agnostic article quality API does list the number of sources (unique references at end of article) and refs (specific citations in the article) in an article: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_language_agnostic_articlequality_prediction.

We would need to get the parent revision and compare the output to get the number of added references.
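
The comparison step could be sketched as follows, assuming the per-revision counts have already been fetched into plain dictionaries. The keys "sources" and "refs" are placeholders taken from the wording above, not confirmed field names of the Lift Wing response, and the function name is hypothetical.

```python
def reference_deltas(parent_counts: dict, current_counts: dict,
                     keys=("sources", "refs")) -> dict:
    """Return current minus parent for each count; missing keys count as 0.

    A negative value means references were removed by the edit. For a page
    creation there is no parent revision, so pass an empty dict.
    """
    return {
        key: current_counts.get(key, 0) - parent_counts.get(key, 0)
        for key in keys
    }
```

Keeping the signed value, rather than clamping at zero, leaves room to report references removed as well as added.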

Alternatively, Isaac has built a Python library (https://pypi.org/project/mwparserfromhtml/) which can extract the same data if we wanted to “roll our own” solution.
There is also an experimental tool that will provide further diff data in case we need to expand our offering at a later date (see https://wiki-topic.toolforge.org/diff-tagging for the UI or https://edit-types.wmcloud.org/docs), but this is not yet production ready.

I would suggest Lift Wing as the most stable solution.

Edit summary - only text written by editors and no automatic tags (displayed in table only)
The revisions API will provide the edit summaries for ~50 arbitrary revision IDs at a time. We should batch queries to this endpoint and not request too many at once: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=12345%7C23456&rvprop=comment&format=json&formatversion=2
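
A sketch of the batching, assuming a batch size of 50 as mentioned above. The helper names are hypothetical; the query parameters are the ones from the example URL. Only URL construction is shown, with no HTTP request.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # Assumed per-request limit, per the "~50" figure above.


def batched(revids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` revision IDs."""
    for start in range(0, len(revids), size):
        yield revids[start:start + size]


def revisions_url(revids):
    """Build the query URL that returns edit summaries for one batch."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(revid) for revid in revids),
        "rvprop": "comment",
        "format": "json",
        "formatversion": "2",
    }
    # urlencode percent-encodes the "|" separator as %7C.
    return f"{API_URL}?{urlencode(params)}"
```

Calling `[revisions_url(batch) for batch in batched(all_revids)]` would then give one URL per batch of revision IDs.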

Notes from discussion with Michelle on Dec 3:

  • Order of implementation ease: 1) edit summary, 2) references, 3) word count
  • For word count, we will need to see if the PHP Intl extension is available - next steps will be to investigate if it is available
  • As for edit summary and references, they can become tickets

> The PHP Intl extension appears to have decent locale-aware word boundaries [...]

Good find! I agree that whatever we end up using should be run by native speakers of various languages for a quick check. If the intl solution works (and I'd hope it does), that'd be great. (Also, a small error is probably tolerable)

> Number of references added (displayed in table and summary)
> the Lift Wing language-agnostic article quality API does list the number of sources (unique references at end of article) and refs (specific citations in the article) in an article
> [...]
> Alternatively, Isaac has built a Python library https://pypi.org/project/mwparserfromhtml/ which can extract the same data if we wanted to “roll our own” solution
> [...]
> I would suggest Lift Wing as the most stable solution

It strikes me as a bit odd that from within MediaWiki, with the wikitext parser and everything available via PHP API, we need to use an external tool to parse wikitext. Now, it is true that, last time I checked, the wikitext parser didn't seem to expose, or collect, any metadata with statistics for the number of parser tags used. But then this means that those external tools are approximating and not counting the exact number (given also T407026#11295332); in that case, maybe we still don't need an external tool and can just do the approximation ourselves?

Still, I would maybe try reaching out to the Content-Transform-Team to inquire on the feasibility of collecting and exposing the metadata. If not possible (or not easy enough), then we could do the approximation.

> Edit summary - only text written by editors and no automatic tags (displayed in table only)
> The revisions API will provide the edit summaries for ~50 arbitrary revision IDs at a time. We should not request too many and batch any queries to this endpoint: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=12345%7C23456&rvprop=comment&format=json&formatversion=2

We have these available on the RevisionRecord object in PHP without having to make an HTTP request, but yeah, these are easily available. Extra storage is my only (minor) concern here, but as long as DBAs are made aware it should be fine, I believe.

FYI that there are some conversations starting around productionizing data about edit types that would be stored on our cluster in T410940#11441043. References and word counts would be two of those edit types (alongside lots of other things). Might not make sense to combine the efforts but might still make sense to have some discussions to understand what overlap exists and make sure the data aligns well at least if different counting approaches are taken.

> It strikes me as a bit odd that from within MediaWiki, with the wikitext parser and everything available via PHP API, we need to use an external tool to parse wikitext. Now, it is true that, last time I checked, the wikitext parser didn't seem to expose, or collect, any metadata with statistics for the number of parser tags used. But then this means that those external tools are approximating and not counting the exact number (given also T407026#11295332); in that case, maybe we still don't need an external tool and can just do the approximation ourselves?

I'd definitely agree that you don't want to start from the wikitext if you can avoid it. The LiftWing API is gathering the counts from the HTML so the only logic is defining the particular HTML attributes that define what a reference is based on my explorations and the specs. In that sense, hopefully essentially as close to counting as you can get given that I don't know if there's any canonical definition for "reference" anywhere. You'll find some additional implementations from the WMDE folks in this codebase for generating reference metrics based on the Enterprise HTML snapshots (code) -- I haven't compared to see just how similar/different they are.

The investigation looks good to me, thanks @MHorsey-WMF

> It strikes me as a bit odd that from within MediaWiki, with the wikitext parser and everything available via PHP API, we need to use an external tool to parse wikitext. Now, it is true that, last time I checked, the wikitext parser didn't seem to expose, or collect, any metadata with statistics for the number of parser tags used. But then this means that those external tools are approximating and not counting the exact number (given also T407026#11295332); in that case, maybe we still don't need an external tool and can just do the approximation ourselves?

While I see where you're coming from, given someone else has already done the work, is that not just reinventing the wheel a little bit?

> I would maybe try reaching out to the Content-Transform-Team to inquire on the feasibility of collecting and exposing the metadata.

Response from content transformation in Slack:

"we don't collect that metadata for the number of parser tags used (pretty sure it would have the potential to be messy).
the folks from Enterprise might have something around these lines (I'm looking at https://enterprise.wikimedia.com/blog/parsed-references-with-scoring-models/) but that'd be generated from the Parsoid HTML (as far as I know), not the parser tag usage itself"

My suggestion is that we stick with the Lift Wing plan (or attempt something ourselves).

Further comments from Subbu: on edits, when you have Parsoid HTML available, you can count the # of nodes with mw:Extension/ref typeof
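
Subbu's suggestion could be sketched like this with Python's standard-library HTML parser, assuming the Parsoid HTML for a revision is already available as a string. The class and function names are hypothetical. The `typeof` attribute is treated as a space-separated token list, so `mw:Extension/references` (the reference list itself) is not counted.

```python
from html.parser import HTMLParser

REF_TYPEOF = "mw:Extension/ref"


class RefNodeCounter(HTMLParser):
    """Counts elements whose typeof attribute includes mw:Extension/ref."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        typeof = dict(attrs).get("typeof") or ""
        if REF_TYPEOF in typeof.split():
            self.count += 1


def count_ref_nodes(parsoid_html: str) -> int:
    parser = RefNodeCounter()
    parser.feed(parsoid_html)
    parser.close()
    return parser.count
```

The number of references added by an edit would then be the count for the new revision minus the count for its parent revision.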

>> It strikes me as a bit odd that from within MediaWiki, with the wikitext parser and everything available via PHP API, we need to use an external tool to parse wikitext. Now, it is true that, last time I checked, the wikitext parser didn't seem to expose, or collect, any metadata with statistics for the number of parser tags used. But then this means that those external tools are approximating and not counting the exact number (given also T407026#11295332); in that case, maybe we still don't need an external tool and can just do the approximation ourselves?

> While I see where you're coming from, given someone else has already done the work, is that not just reinventing the wheel a little bit?

If it were just a matter of "the exact feature we need is available outside MW but not inside of it (despite having more low-level access from inside)", then it would be really weird, but as you say, we could still use the external source to avoid reinventing the wheel.

However, my understanding is that the external tool in question simply does an approximation, something akin to counting the number of matches of a regexp on the wikitext. In that regard, I'd like to just copy (and adapt if needed) the regexp rather than going through HTTP just for that.

>> I would maybe try reaching out to the Content-Transform-Team to inquire on the feasibility of collecting and exposing the metadata.

> Response from content transformation in Slack:

> "we don't collect that metadata for the number of parser tags used (pretty sure it would have the potential to be messy).
> the folks from Enterprise might have something around these lines (I'm looking at https://enterprise.wikimedia.com/blog/parsed-references-with-scoring-models/) but that'd be generated from the Parsoid HTML (as far as I know), not the parser tag usage itself"

> My suggestion is that we stick with the Lift Wing plan (or attempt something ourselves).

Thanks for looking into this! Makes sense to me.

> Further comments from Subbu: on edits, when you have Parsoid HTML available, you can count the # of nodes with mw:Extension/ref typeof

Is this something we can get from e.g. the ParserOutput or another public API? We're not doing it on edit but later on (in a job).

> However, my understanding is that the external tool in question simply does an approximation, something akin to counting the number of matches of a regexp on the wikitext. In that regard, I'd like to just copy (and adapt if needed) the regexp rather than going through HTTP just for that.

See my comment above (T407026#11442305) but the LiftWing quality API does actually do an exact count based on Parsoid tag attributes (what Subbu recommends with /ref typeof is how refs works below). Obviously welcome to fetch the HTML and extract yourself but the API gives you two relevant feature counts in its extended output:

Apologies for the ambiguous terminology (sources/refs/citations) -- there's no agreed-upon definition for these things unfortunately and they tend to be used interchangeably.

>> However, my understanding is that the external tool in question simply does an approximation, something akin to counting the number of matches of a regexp on the wikitext. In that regard, I'd like to just copy (and adapt if needed) the regexp rather than going through HTTP just for that.

> See my comment above (T407026#11442305) but the LiftWing quality API does actually do an exact count based on Parsoid tag attributes (what Subbu recommends with /ref typeof is how refs works below).

Thank you! I had forgotten about that.

> Obviously welcome to fetch the HTML and extract yourself but the API gives you two relevant feature counts in its extended output:

I don't think we would need these now, so I would rather have a pure MW implementation as that would simplify things. Nonetheless, this also seems to be something useful to have in MW itself. With that being said though, my question above remains, i.e. how to get the Parsoid HTML in MW (I'm sure it's possible, but I've never done it before).

@Daimona

> how to get the Parsoid HTML in MW

We can get the Parsoid markup from the REST API. You can "roll your own" by getting the revision wikitext and feeding it to Parsoid directly to get the output, but that's not trivial and not recommended.
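
A sketch of fetching the HTML for a revision over HTTP, assuming the MediaWiki core REST endpoint path `/w/rest.php/v1/revision/{id}/html`; the path, the function names, and the timeout are assumptions to verify against the REST API documentation for the target wiki. This is an external-client illustration in Python, not what the extension code inside MediaWiki would look like.

```python
from urllib.request import Request, urlopen


def revision_html_url(wiki_base: str, revid: int) -> str:
    """Build the (assumed) REST URL for the HTML of a single revision."""
    return f"{wiki_base.rstrip('/')}/w/rest.php/v1/revision/{int(revid)}/html"


def fetch_revision_html(wiki_base: str, revid: int, user_agent: str) -> str:
    """Fetch the HTML for one revision; raises on HTTP or network errors."""
    request = Request(
        revision_html_url(wiki_base, revid),
        # Wikimedia sites expect a descriptive User-Agent from API clients.
        headers={"User-Agent": user_agent},
    )
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

The returned string can be passed to whatever counts the `mw:Extension/ref` nodes, once for the revision and once for its parent.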

> @Daimona

>> how to get the Parsoid HTML in MW

> We can get the Parsoid markup from the REST API. You can "roll your own" by getting the revision wikitext and feeding it to Parsoid directly to get the output, but that's not trivial and not recommended.

That looks a bit problematic and seems to suggest that some functionality is missing, but if it's the recommended way, we can do it. No other questions from me!

Note: We probably want to get article quality in the future as well, and @Isaac shared: "If you want articlequality, let me know. The model is on LiftWing and also has the reference data in it, so it might not even require another API call depending on how you all handle the reference counting."

This investigation is complete. For now, we have decided to prioritize adding number of references added/removed. Thank you for this work!