Track how many paragraphs usually are involved in an edit conflict
Closed, Declined · Public

Description

Motivation
In order to better adapt the design of the edit conflict resolution screen, it would be great to know how many paragraphs usually are involved in an edit conflict. So we want to be able to see both mean and median numbers, and ideally a graph showing the distribution.

Task
Track the number of paragraphs that need resolution per edit conflict.
This can be a "one time" report, computed over the data received so far via event logging.

Please note the number of conflicts (n) that the mean/median are based on.
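
A minimal sketch of the requested summary, assuming the per-conflict paragraph counts have already been computed somehow. The function name summarize and the input values are hypothetical and are not real measurements:

```python
from collections import Counter
from statistics import mean, median


def summarize(paragraph_counts):
    """Return n, mean, median and a count -> frequency table."""
    if not paragraph_counts:
        raise ValueError("need at least one conflict")
    # Sort the frequency table by paragraph count so it can be plotted directly.
    distribution = dict(sorted(Counter(paragraph_counts).items()))
    return {
        "n": len(paragraph_counts),
        "mean": mean(paragraph_counts),
        "median": median(paragraph_counts),
        "distribution": distribution,
    }


# Hypothetical values, for illustration only (not real measurements).
example = [1, 1, 2, 3, 1, 5]
summary = summarize(example)
```

Reporting n next to the mean and median keeps the sample size visible, and the distribution table is the input a histogram would need.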

Lea_WMDE moved this task from Proposed to Todo on the WMDE-QWERTY-Team board.

Change 395733 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/TwoColConflict@master] WIP Eventlogging for conflicts

https://gerrit.wikimedia.org/r/395733

Addshore claimed this task · Dec 7 2017, 5:30 PM
Restricted Application added a project: User-Addshore. · Dec 7 2017, 5:30 PM

The patch adds event logging which should provide us with the information that we need to then calculate this.

@GoranSMilovanovic might be able to help us out once we have the data.

Addshore moved this task from Backlog to Needs Review on the User-Addshore board · Dec 11 2017, 4:06 PM

Change 395733 merged by jenkins-bot:
[mediawiki/extensions/TwoColConflict@master] Eventlogging for conflicts

https://gerrit.wikimedia.org/r/395733

Change 398042 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/TwoColConflict@wmf/1.31.0-wmf.12] Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

@Addshore @Tobi_WMDE_SW What remains to be done here? - If there's an eventlogging table (SQL or Hive) that now needs to receive analytics and reporting, please let me know.

Lea_WMDE moved this task from Incoming to Analytics Tasks on the TCB-Team board · Dec 19 2017, 9:05 AM

Change 398042 abandoned by Addshore:
Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

Change 398042 restored by Addshore:
Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

Change 398042 abandoned by Addshore:
Eventlogging for conflicts

Reason:
Will just wait for the train to run this week.

https://gerrit.wikimedia.org/r/398042

Addshore added a comment · Edited · Jan 12 2018, 10:31 AM

Data can be found in event logging in TwoColConflictConflict_17520555

https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict

Addshore removed Addshore as the assignee of this task · Jan 12 2018, 10:31 AM
Addshore added a subscriber: Addshore.
Lea_WMDE updated the task description · Jan 12 2018, 10:36 AM
GoranSMilovanovic added a comment · Edited · Jan 14 2018, 11:45 AM

@Addshore @Lea_WMDE @Tobi_WMDE_SW

There is insufficient data to complete this task.

Explanation:

The description of the task is: "In order to better adapt the design of the edit conflict resolution screen, it would be great to know how many paragraphs usually are involved in an edit conflict", and specifically:

"Track the number of paragraphs that need resolution per edit conflict, split by whether the conflict was resolved or not.
This can be a "one time" information, over the data received by the past 4 months"

In T181704#3896365, @Addshore says:

Data can be found in event logging in TwoColConflictConflict_17520555
https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict

However, nothing in Schema:TwoColConflictConflict tells us

  • the number of paragraphs involved in a conflict (needed to answer "it would be great to know how many paragraphs usually are involved in an edit conflict");
  • whether the conflict was resolved or not (needed to "split by whether the conflict was resolved or not", as provided in the task description).

This is the output of describe log.TwoColConflictConflict_17520555; from analytics-slave.eqiad.wmnet:

+---------------------------+---------------+------+-----+---------+-------+
| Field                     | Type          | Null | Key | Default | Extra |
+---------------------------+---------------+------+-----+---------+-------+
| id                        | int(11)       | NO   | PRI | NULL    |       |
| uuid                      | char(32)      | YES  | UNI | NULL    |       |
| dt                        | datetime      | YES  | MUL | NULL    |       |
| timestamp                 | varchar(14)   | YES  | MUL | NULL    |       |
| userAgent                 | varchar(1024) | YES  |     | NULL    |       |
| webHost                   | varchar(1024) | YES  |     | NULL    |       |
| wiki                      | varchar(1024) | YES  |     | NULL    |       |
| event_baseRevisionId      | bigint(20)    | YES  |     | NULL    |       |
| event_editCount           | bigint(20)    | YES  |     | NULL    |       |
| event_isAnon              | tinyint(1)    | YES  |     | NULL    |       |
| event_pageNs              | bigint(20)    | YES  |     | NULL    |       |
| event_parentRevisionId    | bigint(20)    | YES  |     | NULL    |       |
| event_textUser            | varchar(1024) | YES  |     | NULL    |       |
| event_twoColConflictShown | tinyint(1)    | YES  |     | NULL    |       |
+---------------------------+---------------+------+-----+---------+-------+

The semantics of the respective fields are provided on the Schema:TwoColConflictConflict page.

Please advise.

@Addshore If the idea is to go and parse the content of the revisions in order to figure the number of paragraphs there, that's not going to work from Graphite, if you agree.

However, nothing in Schema:TwoColConflictConflict tells us

  • the number of paragraphs involved in a conflict (needed to answer "it would be great to know how many paragraphs usually are involved in an edit conflict");

This all needs to be computed from the data in the table.

The table contains the revision ID for both the current text on the page (conflicting text) and the user provided text.
These are the 2 texts that are conflicting and we should be able to figure out the paragraphs that conflict from this.
You could actually use the mediawiki api to compare the text and the revision https://en.wikipedia.org/w/api.php?action=help&modules=compare
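
A minimal sketch of how such a comparison request could be built, assuming the compare module's fromrev and torev parameters and that each wiki's API is served at /w/api.php on its host, as in the URL above. The helper name compare_url and the revision IDs are hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode


def compare_url(web_host, from_rev, to_rev):
    """Build a MediaWiki action=compare URL for two revision IDs."""
    query = urlencode({
        "action": "compare",
        "fromrev": from_rev,   # older revision ID
        "torev": to_rev,       # newer revision ID
        "format": "json",
    })
    # Assumption: the API endpoint is /w/api.php on the given host.
    return f"https://{web_host}/w/api.php?{query}"


# Hypothetical revision IDs, for illustration only.
url = compare_url("en.wikipedia.org", 123, 456)
```

The host is a parameter because the logged events come from more than one wiki, so each pair of revisions has to be compared on the wiki it belongs to.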

  • whether the conflict was resolved or not (needed to "split by whether the conflict was resolved or not", as provided in the task description).

The table contains the revid, which we can figure out the page from. We also have the user and the timestamp and we will have to figure this out on a best guess basis.
Unless we want to try and implement further tracking.

@Addshore If the idea is to go and parse the content of the revisions in order to figure the number of paragraphs there, that's not going to work from Graphite, if you agree.

I don't quite see where graphite comes into this?

From the description:

This can be a "one time" information, over the data received by the past 4 months

I see this more as a one time report.

GoranSMilovanovic added a comment · Edited · Jan 15 2018, 12:19 PM

@Addshore "I don't quite see where graphite comes into this? I see this more as a one time report." - Sorry, discard, my bad, I tend to see all TwoColConflict tasks as Grafana related, a consequence of spending too much time there anyways.

@Addshore Please let me know when we can have a Google Hangouts session on this task.

@Lea_WMDE @Tobi_WMDE_SW This is going to take some time to study before it can be resolved, if it can be resolved. At the moment, I cannot clearly see a solution for this even in principle; it seems to me like a research, "let's try and see if this can be done" kind of task. So, as soon as I can get my hands on this, I will try to develop what we need, but there are no guarantees that I will succeed in doing so. Let me illustrate, please.

For example, the following

The table contains the revid, which we can figure out the page from. We also have the user and the timestamp and we will have to figure this out on a best guess basis.

as a guide on how to answer to

whether the conflict was resolved or not (need to "split by whether the conflict was resolved or not", as provided in the task description).

does not sound promising at all. So, we have the revid, we have the user, and we have the timestamp, and that would imply that we can somehow know if the conflict was resolved or not? I am not sure. Maybe yes, maybe not. It will take some time before I understand how the information provided combines to determine whether the conflict was resolved or not. I will do my best, but from the description as such, I don't even see a sketch of a general approach.

As for counting the number of paragraphs involved in a conflict:

This all needs to be computed from the data in the table.
The table contains the revision ID for both the current text on the page (conflicting text) and the user provided text.
These are the 2 texts that are conflicting and we should be able to figure out the paragraphs that conflict from this.
You could actually use the mediawiki api to compare the text and the revision https://en.wikipedia.org/w/api.php?action=help&modules=compare

this sounds more promising. It looks like it can be solved by querying the MediaWiki API, fetching the HTML of the difference between the two revisions, and counting the number of paragraphs there.

@Addshore For the following two fields in Schema:TwoColConflictConflict:

  • baseRevisionId
  • parentRevisionId

could you please indicate which one maps to (a) current text on the page (conflicting text), and which one to (b) the user provided text? Thanks. I guess parentRevisionId == (a) and baseRevisionId == (b).

Stay in touch.

split by whether the conflict was resolved or not.

So for this it might be worth sending this task back to the dev side and trying to add some further eventlogging to report when we believe edit conflicts have been solved.

This would likely have to use one of the post-save MediaWiki hooks, but could probably include the username, revid of the new edit, and timestamp (and possibly details of the previous conflict that happened, but I don't think that would actually still exist at that point).

@Addshore For the following two fields in Schema:TwoColConflictConflict:

baseRevisionId
parentRevisionId
could you please indicate which one maps to (a) current text on the page (conflicting text), and which one to (b) the user provided text? Thanks. I guess parentRevisionId == (a) and baseRevisionId == (b).

Neither of these refers to the user provided text, as that has not yet been saved and cannot have an ID.

baseRevision is "Current version when the edit was started", so the original base that the user was editing from.
parentRevision is "the ID of the newer revision to which we have rebased this page", so the conflicting text that has been saved since the user started editing.

GoranSMilovanovic added a comment · Edited · Feb 11 2018, 9:01 PM

@Addshore So, let's see: "Track the number of paragraphs that need resolution per edit conflict, split by whether the conflict was resolved or not." - I think that the first part of the question can be answered, while I am not sure about the second ("split") part.

Please let me know whether or not you agree that (A) solves the first part of the question:

(A) "Track the number of paragraphs that need resolution per edit conflict"

  • collect all event_parentRevisionId and event_baseRevisionId from log.TwoColConflictConflict_17520555 for the last four months;
  • fetch both revisions and the differences between them from the mediawiki API;
  • search for these differences in the parentRevision revision text and count the number of paragraphs in which the differences occur.

(B) "... split by whether the conflict was resolved or not."

  • in order to provide for this, one must be able to match the event_parentRevisionId with the information on whether the respective revision has resulted in a conflict being resolved or not - and I am not the one who can say with certainty how that match could be made.

Lea_WMDE updated the task description · Feb 12 2018, 11:13 AM
Lea_WMDE updated the task description.
Lea_WMDE updated the task description · Feb 12 2018, 11:15 AM

@Addshore Thank you for the webHost field.

@Addshore Thank you for the webHost field.

The webHost is automatically added to all event logging records :)

I thought you could choose whether to include it in a particular schema or not? Thanks anyway; for some reason (hint: which MediaWiki API to contact?) I needed that field badly yesterday.

GoranSMilovanovic added a comment · Edited · Feb 15 2018, 3:15 AM

@Tobi_WMDE_SW @Addshore @Lea_WMDE This is going to be difficult, if at all possible, and I am thinking of offering you some proxy measures here.

Causes of difficulty: fetching revisions and inspecting them is trivial. However, the differences between revisions are delivered as JSON that sometimes cannot be parsed (i.e. it causes Error: lexical error: invalid character inside string. errors in R; after removing the invalid characters, R reports back things like premature EOF etc.). Many revision comparisons simply return a prima facie error code ("There is no revision with ID..."). Many are simply trivial, having * as their whole content. Finally, the difference is returned as HTML in a JSON blob, while the content of the revisions is pure wikitext. And in the end, it is not even conceptually clear how to separate the elements of the difference, even if parsed correctly, so that they can be looked up in the paragraphs of the revisions.

In other words, it looks like I would need to program some serious parsing machinery (and it is questionable whether I would even know how to program it) to do this.

My conclusion: I cannot develop a consistent methodology for this.

My suggestion: let me offer you a proxy measure from what we can have. Namely, I can count the number of paragraphs in the baseRevision and parentRevision, and then take their average (because, if you think thoroughly about what the user can do there, you will understand that none of the two numbers is more indicative of the number of conflicting paragraphs than the other).

Please let me know what you think.

Hi @GoranSMilovanovic thanks for the update! I don't think the number of paragraphs in the revisions is really helpful for me. Let me take this to the dev team's meeting on Tuesday (which you could join in person if you wanted to :) ) and see if they can send you some better data.

Many revision comparisons simply return a prima facie error code ("There is no revision with ID...").

Could you provide an example?

Which api module are you using?

@Addshore wrote:

You could actually use the mediawiki api to compare the text and the revision https://en.wikipedia.org/w/api.php?action=help&modules=compare

You don't need to fetch the revisions at all, you can make mediawiki do the comparison for you.

For example, something like this:

@Lea_WMDE It would be great if I could join the dev team meeting, could you please make me an invite? Thanks.

@Addshore That is exactly what I am doing. You've probably forgotten about the procedure that I proposed to solve this (and that I think you agreed upon):

  1. Get baseRevisionID and parentRevisionID; compare with the mediawiki API compare module to get the difference;
  2. Fetch the content of baseRevisionID;
  3. Break up the content of baseRevisionID into paragraphs;
  4. Look up what parts of the difference are found in what baseRevisionID paragraphs and count the unique number of edited paragraphs.
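
Steps 3 and 4 above could be sketched roughly as follows. This is an illustration in Python rather than R, and it assumes that paragraphs are separated by blank lines and that the changed fragments have already been extracted from the diff as plain strings. The function names are hypothetical:

```python
import re


def split_paragraphs(wikitext):
    """Split wikitext into paragraphs on blank lines (an assumed definition)."""
    return [p.strip() for p in re.split(r"\n\s*\n", wikitext) if p.strip()]


def count_touched_paragraphs(base_text, changed_fragments):
    """Count the distinct base paragraphs that contain any changed fragment."""
    paragraphs = split_paragraphs(base_text)
    touched = {
        index
        for index, paragraph in enumerate(paragraphs)
        for fragment in changed_fragments
        if fragment and fragment in paragraph
    }
    return len(touched)


# Hypothetical text and fragments, for illustration only.
base = "Alpha one.\n\nBeta two.\n\nGamma three."
touched = count_touched_paragraphs(base, ["Beta", "three"])
```

Note that a plain substring lookup can only find fragments that exist in the base revision, so text that was only added in the newer revision would not be counted this way.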

The data set attached to a forthcoming e-mail (uuids are present, so no sharing here) is to compare with my complaints in T181704#3974553.

The metadata (i.e. the semantics of the columns) are forthcoming too.

Following today's catch-up between @Addshore and @GoranSMilovanovic on the Technical Wishlist task, the conclusion for this one is:

  • it will be solved by a direct textual comparison of the base and the parent revision from R, bypassing the mediawiki API compare module.
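
A minimal sketch of what such a direct comparison might look like, written in Python with difflib rather than R, purely for illustration. It assumes paragraphs are separated by blank lines and counts the paragraphs that differ between the base and the parent revision. The function names and sample texts are hypothetical:

```python
import difflib
import re


def split_paragraphs(wikitext):
    """Split wikitext into paragraphs on blank lines (an assumed definition)."""
    return [p.strip() for p in re.split(r"\n\s*\n", wikitext) if p.strip()]


def count_changed_paragraphs(base_text, parent_text):
    """Count paragraphs replaced, inserted or deleted between two texts."""
    base = split_paragraphs(base_text)
    parent = split_paragraphs(parent_text)
    matcher = difflib.SequenceMatcher(a=base, b=parent, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A non-equal block spans i2 - i1 base and j2 - j1 parent paragraphs.
            changed += max(i2 - i1, j2 - j1)
    return changed


# Hypothetical revision texts, for illustration only.
base = "Intro.\n\nSecond paragraph.\n\nThird paragraph."
parent = "Intro.\n\nSecond paragraph, edited.\n\nThird paragraph.\n\nNew ending."
changed = count_changed_paragraphs(base, parent)
```

Because the user's unsaved text is not available, a count like this only describes what changed between the two saved revisions, so it remains a proxy for the paragraphs that actually needed resolution.
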
Addshore moved this task from Needs Review to Watching on the User-Addshore board · Feb 21 2018, 3:41 PM
Lea_WMDE changed the task status from Open to Stalled · Apr 12 2018, 8:54 AM
Lea_WMDE closed this task as Declined · Apr 27 2018, 7:59 AM

Comparing the effort this would take with the effect it would have, this does not make sense for us right now.

Addshore moved this task from Watching to Done ✔️ on the User-Addshore board · May 3 2018, 8:31 AM