Page MenuHomePhabricator

Remove bugs from Analysis who is affected by edit conflicts.
Closed, ResolvedPublic

Description

Motivation
We want to make sure user ids really refer to the same users. We also want to see if users encounter edit conflicts in the very first edit they ever made.

Todo

  • join on User_Text (globally unique user name or IP address, so make sure to filter out not logged in users)
  • take all user ids that encountered an edit conflict in the past 3 months, and join them with mediawiki history-number of revisions ever made (cumulative edits). This will also include the 0 edit cases for edit conflicts.
  • take out the bot edits on the edit conflict side (using is_bot flag from the logging schema)

Notes
Should be done by the end of May 25th

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

@Lea_WMDE

Before I present the results here, the following must be taken into consideration:

  • We use all available data on edit conflicts and the latest 2018-04 snapshot of the wmf.mediawiki_history table.
  • Bots are filtered from both data sets before performing any join operations.
  • We look only for edits in the wmf.mediawiki_history Hadoop table of those users who have ever encountered an edit conflict.
  • The data upon which the result is based are partial because I still can't get Hive to understand the Chinese characters in user names. Because querying by HiveQL in Hadoop is performed in batches (this is necessary), and the Chinese user names go along with some other user names in the same batch, the data are not missing Chinese users only. Approximately 4,000 user edit data points out of approx. 35,000 users who were found in the edit conflicts data set are missing. I will see to fix this, however, at this point I still have no idea how. NOTE: This implies that the G0 category (see below) results are inflated.
  • We still find a lot of users who (a) are present in the edit conflicts data set, but (b) not in the wmf.mediawiki_history table from were we fetch their edit counts. These are the only two possible explanations for this that I can think of: (1) the wmf.mediawiki_history table is badly engineered (this is not the first and neither the sole problem that I have encountered in my work with this table) and simply does not encompass all users, or (2) the user names from the edit conflict data set (log database, SQL) do not match the user names in the wmf.mediawiki_history table. In any case, this is something I cannot help around.

Finally, the result:

  EditsGroup TotalConflicts Average Median Percentage
1 G0                  24196    2.69      1      21.5 
2 G1                   9331    1.43      1       8.31
3 G2                  11829    1.90      1      10.5 
4 G3                   3327    2.32      1       2.96
5 G4                  63666    5.38      2      56.7

Where G0 is 0 Edits or not found in the wmf.mediawiki_history table, G1 is 1 - 10 edits, G2 is 11 - 100 edits, G3 is 101 - 200 edits, and G4 is > 200 edits.

As of T195350: I strongly suggest not going for this until we understand why we do not find all the usernames from the log SQL database table on edit conflicts in the wmf.mediawiki_history table in Hadoop. Suggestion, if you agree: we ping the members of the Analytics-Engineering team (who manage the wmf.mediawiki_history) to see what do they think.

Hi @GoranSMilovanovic , yes please ping the people who know most about the table, so we can finally resolve the mystery.

Vvjjkkii renamed this task from Remove bugs from Analysis who is affected by edit conflicts. to ygcaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii removed GoranSMilovanovic as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: GoranSMilovanovic; removed: Aklapper.
CommunityTechBot renamed this task from ygcaaaaaaa to Remove bugs from Analysis who is affected by edit conflicts..Jul 2 2018, 3:56 PM
CommunityTechBot assigned this task to GoranSMilovanovic.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: Aklapper; removed: GoranSMilovanovic.

@Lea_WMDE I have performed another re-work of this in order to make sure that we have tried everything we could before addressing the data engineering of the wmf.mediawiki_history table. Let me briefly summarize the situation for you before I contact @Milimetric for advise:

  • My utmost concern was related to some issues that I have encountered while trying to fetch the data on users whose event_user_text in wmf.mediawiki_history contains special characters. In order to make sure that we are not missing any user revision counts because of this problem in relation to special characters:
  • the revision counts of all users who have ever made a revision, and are self-made, are not bots, and not anonymous were collected and joined with the Edit Conflicts dataset.

Result: still we discover users who have encountered edit conflicts while having 0 (zero) revisions.

@Milimetric I hope you will find some time to provide an advise in relation to the dilemma that I am facing in relation to wmf.mediawiki_history. The problem will be described in an e-mail to you and @Lea_WMDE. I will need to share some private data (user names) with you, so please forgive me for I will ask you to confirm that you both have an NDA signed in that forthcoming e-mail. Thank you.

@Lea_WMDE

By allowing user to be not created by self ( removing event_user_is_created_by_self = true from the HiveQL query) and encompassing user revisions that were subsequently deleted (removing revision_is_deleted = false from the HiveQL query) I was able to lower the critical percent of those who have encountered an edit conflict without having any revisions to 2.08% only:

EditsGroup TotalConflicts Average Median Percentage
G0                   2232    1.33      1       2.08
G1                   8024    1.22      1       7.48
G2                   9372    1.44      1       8.74
G3                   2900    1.81      1       2.7 
G4                  84710    5.00      2      79.0

Once again, the EditGroups are the following:
G0 means 0 Edits or not found in the wmf.mediawiki_history table, G1 is 1 - 10 edits, G2 is 11 - 100 edits, G3 is 101 - 200 edits, and G4 is > 200 edits.

The HiveQL query that was run against the wmf.mediawiki_history table is the following:

SELECT wiki_db, event_user_id, count(*) as revision_count
  FROM wmf.mediawiki_history
  WHERE event_entity = 'revision' 
  AND event_type = 'create'
  AND event_user_is_bot_by_name = false
  AND event_user_is_anonymous = false
  AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')
  AND snapshot = '2018-06' 
  GROUP BY wiki_db, event_user_id;

Please let me know whether this result now presents what we were looking for. The critical question to address is, of course, do we accept as possible the fact that 2.08% users have encountered an edit conflict and never made a revision. Thank you.

@GoranSMilovanovic coudl you add the full query here? Thanks!

@Lea_WMDE

By allowing user to be not created by self ( removing event_user_is_created_by_self = true from the HiveQL query) and encompassing user revisions that were subsequently deleted (removing revision_is_deleted = false from the HiveQL query) I was able to lower the critical percent of those who have encountered an edit conflict without having any revisions to 2.08% only:

EditsGroup TotalConflicts Average Median Percentage
G0                   2232    1.33      1       2.08
G1                   8024    1.22      1       7.48
G2                   9372    1.44      1       8.74
G3                   2900    1.81      1       2.7 
G4                  84710    5.00      2      79.0

Once again, the EditGroups are the following:
G0 means 0 Edits or not found in the wmf.mediawiki_history table, G1 is 1 - 10 edits, G2 is 11 - 100 edits, G3 is 101 - 200 edits, and G4 is > 200 edits.

Is this data from the EditConflict schema? (Or if it is from TwoColConflictConflict, does that schema capture all edit conflicts, or only those where the new two-column feature was used?)

The HiveQL query that was run against the wmf.mediawiki_history table is the following:

SELECT wiki_db, event_user_id, count(*) as revision_count
  FROM wmf.mediawiki_history
  WHERE event_entity = 'revision' 
  AND event_type = 'create'
  AND event_user_is_bot_by_name = false
  AND event_user_is_anonymous = false
  AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')
  AND snapshot = '2018-06' 
  GROUP BY wiki_db, event_user_id;

I understand that to exclude bots as much as possible, one also needs to look at the account's historical user group ( event_user_groups_historical, cf. documentation).

@GoranSMilovanovic thanks for the query! Talking to @Tbayer the day before yesterday, we realized that by changing from user_text to user_id + wiki, we are changing the meaning of the results: What we want is the number of edits a user have made on any of the wikis, not the number of edits per wiki. With grouping by user_id and wiki, we count the same user multiple times. So either we go back to user_text (and deal with the fact that renamings are blurring the results), or we need to group the user_id+wiki results by the global user id of the user.

@Lea_WMDE Got it. I also think wiki + user_id necessarily multiplies the users in the analysis. I'm on it, I just need to fix something for Wiktionary analytics first.

@Lea_WMDE Ok, let's wrap this up.

@Tbayer As of the following suggestion

I understand that to exclude bots as much as possible, one also needs to look at the account's historical user group (event_user_groups_historical, cf. documentation).

  • I've tried to modify the query as suggested and tested with already existing edit conflicts data (schema: log.EditConflict_8860941 in order to compare: no changes in the result at all.
  • Upon updating the edit conflicts data set, and keeping event_user_groups_historical in the query as suggested, the percent of those who have encountered an edit conflict without having any revisions jumped from 2.08% to 4.04%; this could also be a consequence of the fact that while new data has arrived to log.EditConflict_8860941, the latest available wmf.mediawiki_history table snapshot is still 2018-06.

So the event_user_groups_historical field doesn't help much, but I will keep in the query since it doesn't hurt to have it there.

@Lea_WMDE

As of the following:

So either we go back to user_text (and deal with the fact that renamings are blurring the results), or we need to group the user_id+wiki results by the global user id of the user.

We cannot use any global user ids to perform matches: the wmf.mediawiki_history table uses only local (i.e. wiki-specific) user ids. Even if we decide to skip the wmf.mediawiki_history table and go for SQL revision tables solely, we could not do it because the revision schema also does not store any global user ids (putting aside the fact that it would take forever to do this via SQL; that is the reason why we have wmf.mediawiki_history in the Data Lake in the first place). So,

  • either we keep the results that we have and decide to live with the noise in the data produced by the fact that the wiki + user_id composite multiplies the users, OR
  • we go back to using user_text and decide to accept any noise in the data that the uncertainty related to changes there brings about.

@Lea_WMDE Your call.

@GoranSMilovanovic there is another table (whose name I don't know, @WMDE-Fisch can you help?), where the local IDs per wiki are matched to the global user ID. But I would also be ok with living with the disadvantages of user_text, they are by far less distorting than counting every person per wiki as a new person. So for the sake of speed, let's go with user_text :)

@Lea_WMDE So we're back to the initial solution. Will do.
@WMDE-Fisch I would really love to learn about that table. Even if we don't make use of it now, we will in the future to try to avoid problems similar to those encountered here. Thank you!

Summary of our talk right now: We keep everything we have right now. The only thing to do is change from local user id + wiki to user_text!

@Lea_WMDE

The only thing to do is change from local user id + wiki to user_text!

Here we go:

  EditsGroup TotalConflicts Average Median Percentage
1 G0                   4030    1.47      1       4.13
2 G1                   3798    1.22      1       3.89
3 G2                   5887    1.49      1       6.03
4 G3                   2213    1.81      1       2.27
5 G4                  81639    5.11      2       83.7
  • G0 means 0 Edits or not found in the wmf.mediawiki_history table
  • G1 is 1 - 10 edits
  • G2 is 11 - 100 edits
  • G3 is 101 - 200 edits
  • G4 is > 200 edits.