Investigate how to put 'Words added' into 'Event Summary' and 'Pages Created' reports
Open, Needs Triage, Public

Description

Organizers, their sponsors and partners want to know how much work was done during an event. "Bytes changed" has been a metric for this, but it is imperfect in many ways. "Words added" would be a way to judge how much writing was accomplished, but this metric is problematic.

This metric will be used, to begin with, in the Event Summary (T205561 and T206692) and Pages Created (T206058 and T205502) reports. The metric will appear in different forms in these two reports:

  • In Pages Created and Improved reports: These reports list the articles created and improved, so the metric will show the net change in words for each article. Use a minus sign to indicate a negative change (possible for Pages Improved, though not for Pages Created).
  • In Event Summary reports: here, the metric will count the net change in words added during the event—essentially a sum of the figures above.
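The relationship between the two figures can be sketched as follows. This is only an illustration, not code from the reports themselves: the `PageStat` record and its field names are hypothetical, and the word counts are assumed to come from whichever counting method is eventually chosen.

```python
from dataclasses import dataclass


@dataclass
class PageStat:
    """Hypothetical per-page record; field names are assumptions."""
    title: str
    words_before: int  # 0 for a page created during the event
    words_after: int


def net_words(stat: PageStat) -> int:
    """Net change in words for one page (may be negative for improved pages)."""
    return stat.words_after - stat.words_before


def format_net(n: int) -> str:
    """Render the figure with a leading minus sign for negative changes."""
    return f"{n:,}"


def event_total(stats: list[PageStat]) -> int:
    """Event Summary figure: the sum of the per-page net changes."""
    return sum(net_words(s) for s in stats)


pages = [
    PageStat("Created page", 0, 350),
    PageStat("Improved page", 500, 380),
]
for page in pages:
    print(page.title, format_net(net_words(page)))
print("Event total:", format_net(event_total(pages)))
```

Because the summary is a plain sum, a page that shrank during the event reduces the event total.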

It is understood that this figure may be easier to derive for some languages and scripts than it is for others where, for example, established bytes-to-words or characters-to-words conversion rates may not be readily available. It's also understood that the figure may not be accurate—since it may be difficult to separate wikitext from content, for example.

Various methods have been discussed for arriving at this metric. Each has pros and cons. The task here is to figure out how we might most feasibly derive the metric for the most languages, and to give a sense of the level of effort this would require. Accuracy may not be a paramount value: Organizers I spoke with recently (in a session at WikiConference North America) expressed a willingness to tolerate a high level of inaccuracy in order to get the metric.

Event Timeline

For this I recommend we use the XTools API or steal the same code. It works by grabbing the HTML (so wikitext is ignored), looking inside the body of the article, and counting words separated by a space. It's relatively fast (~100ms for most articles), and seems to be fairly accurate for languages written in Latin script. I have not done any thorough testing on languages in other scripts.
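A minimal sketch of that idea, assuming the rendered HTML of the article body has already been fetched as a string. This is not the XTools implementation; the class and function names are made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words_in_html(html: str) -> int:
    """Count whitespace-separated tokens in the text of an HTML fragment."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Joining chunks with a space keeps adjacent block elements apart, at the
    # cost of splitting a word that has inline markup in the middle of it.
    return len(" ".join(parser.chunks).split())


print(count_words_in_html("<p>Hello <b>wiki</b> world</p>"))
print(count_words_in_html("<p>维基百科是自由的百科全书</p>"))
```

The second call illustrates the caveat about scripts: text written without spaces between words comes back as a single token, so a whitespace split would need a language-aware tokenizer there.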

@MusikAnimal has suggested a method for calculating Words Added that sounds much more accurate than the "Bytes changed" conversion we discussed earlier.

@Niharika mentions that the "Bytes changed" conversion may be a little more accurate than we at first supposed. Do we think one of these will work? And for how many languages?

Quite the contrary, I don't think it will be accurate at all. From what I can tell, it's just simpler to do because we're not parsing the wikitext.
However, this is the way the Dashboard does it, afaik. I have no clue if they even do it for multiple languages. I think Leon's proposal could work well too. :)

The task here is to figure out how to include a figure for Words Added

I think this task needs to say more on the work we're expecting from this. Is this an investigation? (How complex it is to do this and what's the accuracy for a few given examples?) or does this task actually involve adding the metric in?

jmatazzoni renamed this task from Put 'Words added' into 'Event Summary' and 'Pages Created' reports to Investigate hot to put 'Words added' into 'Event Summary' and 'Pages Created' reports. Oct 26 2018, 10:14 PM
jmatazzoni updated the task description.
jmatazzoni renamed this task from Investigate hot to put 'Words added' into 'Event Summary' and 'Pages Created' reports to Investigate hot to put 'Words added' into 'Event Summary' and 'Pages Created' reportsw. Oct 30 2018, 6:59 PM
jmatazzoni renamed this task from Investigate hot to put 'Words added' into 'Event Summary' and 'Pages Created' reportsw to Investigate how to put 'Words added' into 'Event Summary' and 'Pages Created' reports.

From a user on wiki

In English (I can't speak for other languages), word count is a very traditional metric for writing. University students get asked to write 1000-word essays, etc. So I am in favour of newbies in particular (and their employers where relevant) getting that metric, because it's something they understand and may be comfortable including in their own reporting (particularly institutions). Byte count is something we can always do easily, but it is less meaningful to people. I think it's OK just to report byte count for languages where word count isn't meaningful or is computationally expensive. I don't think any word count needs to be perfect (we are all well trained by Microsoft Word to accept without question whatever number of words it claims to be in a document), so I would be happy with using an average number of bytes per word in that language as a quick-and-dirty metric, which is trivial to calculate, if it would be too computationally expensive to do it "properly".
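The quick-and-dirty conversion described above could look roughly like this. The rate in the table is a placeholder chosen for the example, not a measured value, and the function and table names are hypothetical. Returning `None` for a language without a rate leaves room for the byte-count fallback suggested in the comment.

```python
from typing import Optional

# Placeholder rates for illustration only; real values would have to be
# measured per language (and would differ with script and encoding).
AVG_BYTES_PER_WORD = {
    "en": 6.0,
}


def estimate_words(bytes_changed: int, lang: str) -> Optional[int]:
    """Estimate net words changed from net bytes changed.

    Returns None when no conversion rate is known for the language, so the
    caller can fall back to reporting the byte count instead.
    """
    rate = AVG_BYTES_PER_WORD.get(lang)
    if rate is None:
        return None
    return round(bytes_changed / rate)


for lang, delta in [("en", 1200), ("en", -300), ("xx", 1200)]:
    words = estimate_words(delta, lang)
    if words is None:
        print(f"{lang}: {delta} bytes (no word estimate available)")
    else:
        print(f"{lang}: about {words} words")
```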

PHP does have a word count function. Sure, it won't skip wikitext but it would be consistently wrong and maybe that would suffice?

http://php.net/manual/en/function.str-word-count.php

Here's another comment from a user who has a similar perspective about accuracy.

'Wikidata claims added': Some kind of ballpark figure would be more interesting than nothing at all. While our emphasis is on learning to edit, and not on Wikidata, it is sometimes interesting for people to see they have an edit count on Wikidata without editing it directly, just from connecting articles on language wikis. While I am not a big fan of 'edit-count-itis' either, an edit count or something similar might give some minimal information.