
How often are new editors involved in edit conflicts
Closed, Resolved · Public

Description

Find out how many of the edit conflicts were shown to users with

  • 0
  • 1-10
  • 11-100
  • 101-200
  • >=201

edits

  • Show both the concrete numbers and percentages, as well as the number of edit conflicts and the time period this is based on (see the sketch below).
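For illustration, a minimal sketch of the kind of query that could produce these buckets, assuming the EventLogging table log.TwoColConflictConflict_17520555 and its event_editCount field (both appear later in this task); the time window would still need a WHERE clause on the timestamp field, and anonymous users (event_isAnon) may need separate handling:

select
  case
    when event_editCount = 0 then '0'
    when event_editCount <= 10 then '1-10'
    when event_editCount <= 100 then '11-100'
    when event_editCount <= 200 then '101-200'
    else '>=201'
  end as edit_bucket,
  count(*) as conflicts,
  -- percentage relative to all conflicts in the same table/time window
  round(100.0 * count(*) / (select count(*) from log.TwoColConflictConflict_17520555), 1) as percentage
from log.TwoColConflictConflict_17520555
group by edit_bucket;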

Event Timeline

Change 397762 had a related patch set uploaded (by Andrew-WMDE; owner: Andrew-WMDE):
[mediawiki/extensions/TwoColConflict@master] How often are new editors involved in edit conflicts

https://gerrit.wikimedia.org/r/397762

Change 395733 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/TwoColConflict@master] Eventlogging for conflicts

https://gerrit.wikimedia.org/r/395733

Change 397762 abandoned by Andrew-WMDE:
How often are new editors involved in edit conflicts

Reason:
Covered in: https://gerrit.wikimedia.org/r/#/c/395733

https://gerrit.wikimedia.org/r/397762

Change 395733 merged by jenkins-bot:
[mediawiki/extensions/TwoColConflict@master] Eventlogging for conflicts

https://gerrit.wikimedia.org/r/395733

Change 398042 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/TwoColConflict@wmf/1.31.0-wmf.12] Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

Change 398042 abandoned by Addshore:
Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

Change 398042 restored by Addshore:
Eventlogging for conflicts

https://gerrit.wikimedia.org/r/398042

Change 398042 abandoned by Addshore:
Eventlogging for conflicts

Reason:
Will just wait for the train to run this week.

https://gerrit.wikimedia.org/r/398042

@Addshore What is the status of the metrics that we need for this task?

This will again be using data from https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict which provides you with the editCount for the user.

@Lea_WMDE @Addshore

The description of the task says this should be adding additional rows to the Grafana dashboard.

I am not sure whether https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict goes to Graphite, but I don't think so.

So, I can get this from the Schema:TwoColConflictConflict, but I don't think I can serve this on Grafana. Please advise.

> I am not sure whether https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict goes to Graphite, but I don't think so.

No, the event logging data is not in Graphite in any way.
However, you can process the event logging data and send derived numbers into Graphite.

For example, here is a random script scraping a number from Twitter and sending it to Graphite: https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/social/twitter.php#L24
The call that script actually makes to get the data into Graphite: https://github.com/wikimedia/analytics-wmde-scripts/blob/master/lib/WikimediaGraphite.php#L10
The protocol is defined at https://graphite.readthedocs.io/en/latest/feeding-carbon.html#the-plaintext-protocol
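For reference, a carbon plaintext message is a single line per metric, sent to the carbon receiver (TCP port 2003 by default); the metric path below is a made-up example, not an existing one:

daily.mediawiki.twocolconflict.newEditorConflicts 42 1518768000

i.e. metric path, value, and Unix timestamp.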

@Lea_WMDE While writing the event logging for this task we assumed that this data would be shown alongside whatever dashboard is created for the paragraphs ticket.
If this is not the case, then we can probably make another change to the extension to track this data directly on the MediaWiki side of things and send it straight to Graphite.

@Addshore let's estimate the work to bring it onto the original Grafana board ("mediawiki-twocolconflict") tomorrow. If I forget, please remind me :)

@Addshore Off topic: Interesting, you're not using the Twitter Search API to get the numbers you need (e.g. number of followers, likes, retweets and similar)?

> @Addshore Off topic: Interesting, you're not using the Twitter Search API to get the numbers you need (e.g. number of followers, likes, retweets and similar)?

No, apparently not.

> @Addshore let's estimate the work to bring it onto the original Grafana board ("mediawiki-twocolconflict") tomorrow. If I forget, please remind me :)

Okay! This should only be a small task, "3"?

@Addshore @Lea_WMDE @Tobi_WMDE_SW

In order to complete this task we must be able to perform the following:

(1) assume some user A, who has made, say, < 100 edits;
(2) count the number of (2a) page views and (2b) resolved conflicts, and (2c) calculate the resulting percentage, for both TwoColConflict and the current resolution page, accounted for by user A;
(3) then iterate over users A, B, C, etc. in order to get the data for aggregation -> Grafana.

Now: 2a we can get from the https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict table; however, we will not get 2b from the MariaDB replicas, and I am pretty sure that we will not find 2b in Graphite either, which is probably in line with the following comment by @Addshore: T181704#3915924

@Lea_WMDE @Addshore @Tobi_WMDE_SW

The results are shared on Google Drive.

@Addshore You are missing some IDs in https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict. I don't know if this is relevant for you at all, but if it is indicative of anything, I can let you know exactly which IDs are skipped.

> @Addshore You are missing some IDs in https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict. I don't know if this is relevant for you at all, but if it is indicative of anything, I can let you know exactly which IDs are skipped.

Which IDs?

> @Lea_WMDE @Addshore @Tobi_WMDE_SW
> The results are shared on Google Drive.

Looking at the data, I see no reason why we can't publish this; it might be better on a wiki page somewhere instead of a locked-down Google Drive.

@Addshore So, when the data were fetched from https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict, max(id) was 4777, and the following IDs were skipped: 133, 2332, 2433, 4523. As a control for this result, this morning: max(id) = 4816, but count(id) = 4812. Hope this helps.

> @Addshore So, when the data were fetched from https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict, max(id) was 4777, and the following IDs were skipped: 133, 2332, 2433, 4523. As a control for this result, this morning: max(id) = 4816, but count(id) = 4812. Hope this helps.

Interesting, it might be worth poking the Analytics team about this.

MediaWiki-extensions-EventLogging Hey, please check out T182008#3966356.

Background:

To replicate, compare the following results:

select max(id) from log.TwoColConflictConflict_17520555;

+---------+
| max(id) |
+---------+
|    4824 |
+---------+

select count(id) from log.TwoColConflictConflict_17520555;

+-----------+
| count(id) |
+-----------+
|      4820 |
+-----------+

The following IDs were skipped: 133, 2332, 2433, 4523.

Even with the missing IDs, is the data reliable? If yes, then yay! Great work, Goran! This looks really interesting!
The only other thing that is confusing to me: you write there were 4773 observations in the data, but I would have expected a roughly 15 times higher number, with roughly 2000 edit conflicts per day and a month of observation data. This can't all be anonymous users, can it?

(And for everybody who cannot see the results just yet, in case they are already trustworthy: 21% of edit conflicts are shown to users with < 10 edits (n=2532, looking at roughly Jan 2018).)

@Lea_WMDE

> The only other thing that is confusing to me: you write there were 4773 observations in the data, but I would have expected a roughly 15 times higher number, with roughly 2000 edit conflicts per day and a month of observation data. This can't all be anonymous users, can it?

The following, run from stat1005:

mysql:research@analytics-slave.eqiad.wmnet [log]> select count(*) from log.TwoColConflictConflict_17520555;

results in:

+----------+
| count(*) |
+----------+
|     5109 |
+----------+
1 row in set (0.00 sec)

That is all the data there is in our log database on TwoColConflict.

The missing IDs, given that there are four of them, imply four missing data points.

Besides that, the integrity of this data set should be perfect; I do not remember having encountered any trouble there, nothing looked suspicious, all nice and tidy.

@Addshore Can you think of a reason why we have 5109 data points for a one-month time frame, and not 2000*30?

> @Addshore Can you think of a reason why we have 5109 data points for a one-month time frame, and not 2000*30?

Why are we expecting 2000*30?

mysql:research@analytics-slave.eqiad.wmnet [log]> select DATE_FORMAT(timestamp, '%d %m %Y') as date, count(*) as count from log.TwoColConflictConflict_17520555 group by date order by timestamp;
+------------+-------+
| date       | count |
+------------+-------+
| 03 01 2018 |     1 |
| 04 01 2018 |    45 |
| 05 01 2018 |    85 |
| 06 01 2018 |    93 |
| 07 01 2018 |   106 |
| 08 01 2018 |    99 |
| 09 01 2018 |   101 |
| 10 01 2018 |   175 |
| 11 01 2018 |   125 |
| 12 01 2018 |   127 |
| 13 01 2018 |   110 |
| 14 01 2018 |   111 |
| 15 01 2018 |   140 |
| 16 01 2018 |   137 |
| 17 01 2018 |   143 |
| 18 01 2018 |   140 |
| 19 01 2018 |   146 |
| 20 01 2018 |    84 |
| 21 01 2018 |   102 |
| 22 01 2018 |   171 |
| 23 01 2018 |   145 |
| 24 01 2018 |   116 |
| 25 01 2018 |   173 |
| 26 01 2018 |    88 |
| 27 01 2018 |    95 |
| 28 01 2018 |    99 |
| 29 01 2018 |   117 |
| 30 01 2018 |   203 |
| 31 01 2018 |   141 |
| 01 02 2018 |   133 |
| 02 02 2018 |   117 |
| 03 02 2018 |    87 |
| 04 02 2018 |    95 |
| 05 02 2018 |   116 |
| 06 02 2018 |   136 |
| 07 02 2018 |   158 |
| 08 02 2018 |   128 |
| 09 02 2018 |   106 |
| 10 02 2018 |    78 |
| 11 02 2018 |    91 |
| 12 02 2018 |   125 |
| 13 02 2018 |   118 |
| 14 02 2018 |   110 |
| 15 02 2018 |   125 |
| 16 02 2018 |    68 |
+------------+-------+
45 rows in set (0.01 sec)

Note there is an interesting trend, with fewer edit conflicts on the weekend :)

I should have written 30 * 1500, but what I mean is: we know we have way more edit conflicts per day (roughly 1500), and we looked at > 30 days of data. If you look at https://grafana.wikimedia.org/dashboard/db/mediawiki-edit-conflicts?refresh=1m&orgId=1&var-interval=1d&from=now-1M%2FM&to=now-1M%2FM, it is a bit weird to have only ~5000 data points.

Interesting, so there is https://meta.wikimedia.org/wiki/Schema:EditConflict and the log table for that schema seems to have many more entries than the TwoColConflict log table.

For example for this month so far:

| 01 02 2018 |  1817 |
| 02 02 2018 |  1633 |
| 03 02 2018 |  1717 |
| 04 02 2018 |  1642 |
| 05 02 2018 |  1798 |
| 06 02 2018 |  1898 |
| 07 02 2018 |  1848 |
| 08 02 2018 |  1880 |
| 09 02 2018 |  1605 |
| 10 02 2018 |  1702 |
| 11 02 2018 |  1696 |
| 12 02 2018 |  1797 |
| 13 02 2018 |  1951 |
| 14 02 2018 |  1841 |
| 15 02 2018 |  1741 |
| 16 02 2018 |   892 |
+------------+-------+
91 rows in set (0.19 sec)
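A query of the same shape as the per-day one above would produce this; assuming the EditConflict log table is log.EditConflict_8860941 (the name appears later in this task), that would be:

select DATE_FORMAT(timestamp, '%d %m %Y') as date, count(*) as count from log.EditConflict_8860941 group by date order by timestamp;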

Both sets of event logging use the "EditPageBeforeConflictDiff" MediaWiki hook to add their logging.
Something is fishy...

So, it looks like we cannot record the text of the edit conflict in event logging, at least not with the default auto-created schema, as there is a limit (varchar(1024))...

mysql:research@analytics-slave.eqiad.wmnet [log]> describe log.TwoColConflictConflict_17520555;
+---------------------------+---------------+------+-----+---------+-------+
| Field                     | Type          | Null | Key | Default | Extra |
+---------------------------+---------------+------+-----+---------+-------+
| id                        | int(11)       | NO   | PRI | NULL    |       |
| uuid                      | char(32)      | YES  | UNI | NULL    |       |
| dt                        | datetime      | YES  | MUL | NULL    |       |
| timestamp                 | varchar(14)   | YES  | MUL | NULL    |       |
| userAgent                 | varchar(1024) | YES  |     | NULL    |       |
| webHost                   | varchar(1024) | YES  |     | NULL    |       |
| wiki                      | varchar(1024) | YES  |     | NULL    |       |
| event_baseRevisionId      | bigint(20)    | YES  |     | NULL    |       |
| event_editCount           | bigint(20)    | YES  |     | NULL    |       |
| event_isAnon              | tinyint(1)    | YES  |     | NULL    |       |
| event_pageNs              | bigint(20)    | YES  |     | NULL    |       |
| event_parentRevisionId    | bigint(20)    | YES  |     | NULL    |       |
| event_textUser            | varchar(1024) | YES  |     | NULL    |       |
| event_twoColConflictShown | tinyint(1)    | YES  |     | NULL    |       |
+---------------------------+---------------+------+-----+---------+-------+
14 rows in set (0.00 sec)

mysql:research@analytics-slave.eqiad.wmnet [log]> select length(event_textUser) as mLen from log.TwoColConflictConflict_17520555 ORDER BY mLen DESC limit 1;
+------+
| mLen |
+------+
|  880 |
+------+
1 row in set (0.00 sec)

So it looks like in all of the cases where the text is longer than the 1024-character limit, the entry is not created at all.

I'm surprised that this didn't get spotted by some logging somewhere in the event logging setup, but perhaps the system doesn't log too thoroughly.

It could also be that the Hadoop cluster, which also holds event logging data, has the full data set for our schema.

@Addshore I want to take care of T182011 first, given that T180571 is probably done now. Then I will get back to you on this. Thanks for spotting the event-logging-related constraints; I'll also have a look at whether the data in Hadoop look any different.

  • This will be solved from the EditConflict schema, in combination with joins across other relevant SQL tables.
  • Why this will take some time: the log.EditConflict_8860941 table is centralized and found on the analytics-slave.eqiad.wmnet server, while the user tables from which we need to fetch the number of user edits are on the analytics-store.eqiad.wmnet server and organized per wiki. (Hint: the previously used log.TwoColConflictConflict_17520555 table included the number of user edits, while the current log.EditConflict_8860941 table does not.) This implies that I need to fetch the data from log.EditConflict_8860941 first, and then orchestrate SQL calls to the respective user tables from R, choosing the database on analytics-store.eqiad.wmnet according to what is found in the wiki field of the log.EditConflict_8860941 table in every SQL query, after first grouping the uuids per wiki. A sketch of this two-step lookup follows below.
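A rough sketch of that two-step lookup, with heavy assumptions: the user-identifier field in the EditConflict events is hypothetical (it is not shown in this thread), enwiki stands in for whichever wiki a group of uuids points at, and user_editcount is the standard MediaWiki user table field:

-- step 1, on analytics-slave.eqiad.wmnet: conflicts, the wiki they occurred on,
-- and some user identifier (field name hypothetical)
select wiki, event_user /* hypothetical field */ from log.EditConflict_8860941;

-- step 2, on analytics-store.eqiad.wmnet, repeated once per wiki found in step 1
select user_id, user_editcount from enwiki.user where user_id in (/* users from step 1 */);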

@GoranSMilovanovic you can query either wmf_raw.mediawiki_user or wmf.mediawiki_user_history and wmf.mediawiki_history in Hive/Spark to get what you need.

The mediawiki_<<name>> tables in the wmf_raw database are straight imports from MediaWiki, performed monthly. The best part is that they have all wikis together, and you can join them to the EventLogging data in Hadoop (let me know if you don't know about JsonRefine; it could be that there's already a table sitting on top of your data). The _history tables in the wmf database are processed for analytics purposes; documentation is here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits
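A hedged sketch of the wmf_raw route in Hive, assuming the usual wiki_db and snapshot partition columns on these monthly imports (to be verified against the actual table definition; the snapshot value is a placeholder):

select wiki_db, user_id, user_editcount
from wmf_raw.mediawiki_user
where snapshot = '2018-01' -- monthly import partition (assumed value)
  and wiki_db = 'enwiki';  -- or drop this predicate to scan all wikis at once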

@Milimetric Thank you, Dan. JsonRefine is what is described here, I guess?

yes, @GoranSMilovanovic, that's it. But my main point was about the mediawiki tables, definitely look at those before you try something else, they're worth your time.

@Milimetric Will do. I've already made some use of the wmf.mediawiki_history table, but now I see there are plenty of new tables there already.

However, as the latest snapshot of wmf.mediawiki_history is 2018-01, I will have to complete this task in SQL and R. It's not too demanding (~150,000 rows to analyze).

@Lea_WMDE @Addshore The report is now updated. The split by whether TwoColConflict was shown or not is no longer possible, because log.EditConflict_8860941 does not include the respective field. @Milimetric Thanks for the hint; in fact, I have used wmf.mediawiki_history to collect the user edit counts.

@GoranSMilovanovic awesome, thanks! This is really interesting :) I post the results here as well, so they are visible, but we should make a nice writeup of it as well.
You write that there was a considerable number of users with 0 revisions. Can we split this up as well? I'll add it to the task description now, but if you prefer a separate ticket, let me know!

[Screenshot attachment: Bildschirmfoto 2018-02-23 um 12.40.49.png, 145 KB]

@Lea_WMDE No need for a new ticket; I will split the same report into two sections (including vs. not including users with 0 revisions).

N.B. Can you please check with someone from the team working on this extension whether it is possible at all for a user to have (A) no edits at all, and still (B) be registered as involved in an edit conflict? Theoretically - but this is just me guessing - that could happen at the time of a user's first edit (e.g. the first attempt to make an edit not being recorded as a revision because it instantly triggered an edit conflict).

@GoranSMilovanovic cool! I updated the task description once again since I realized it was ambiguous (but I think you are doing this already anyway). I will discuss with people what it could mean that people without an edit seem to be part of edit conflicts so often. I can tell you already now that yes, if you run into an edit conflict, you have not yet created a persisting revision. Only the result of the edit conflict resolution would be an edit, but at the moment we count the edits, this has not happened yet.

@Lea_WMDE The report is updated again (Section A1: Find out how many of the edit conflicts were shown to users after removing users with zero revisions from the data set).

@GoranSMilovanovic cool, thanks! Could you also add a "0 edits" column to the A table?

@Lea_WMDE I guess you mean "add a row for the '0 edits' category". Update ready.