
Schema:Edit seems to incorrectly set users as anonymous {lion}
Closed, Resolved · Public

Description

It seems more users are tagged as anonymous in the Schema:Edit instrumentation than are actually anonymous. Of all events, 78% have user.class = IP. Of all saveSuccess events, 57% have user.class = IP. I looked quickly to see if these percentages change over time, and I didn't see any significant change.

UPDATE: Analytics needs to analyze in depth exactly what's happening in the data, as the Editing team does not know where to look for problems. The code they use seems simple, and some unexpected user behavior might explain this data (for example, a lot of people logging in during their edit).

Event Timeline

Milimetric assigned this task to Krenair.
Milimetric raised the priority of this task to Needs Triage.
Milimetric updated the task description.
Milimetric added a project: VisualEditor.
Restricted Application added a subscriber: Aklapper. Mar 13 2015, 3:19 AM
Krenair set Security to None. Mar 13 2015, 6:44 PM
Krenair edited subscribers, added: Jdforrester-WMF; removed: Jdforrester-PERSONAL.

Dan, is this still an issue? From our IRC conversation I got the impression that this was a misunderstanding.

Sorry for the late reply. The misunderstanding was that user.class=IP and user.id=0 did not seem to match at first. They do, in fact, match.

However, the issue is that 78% of events have user.class=IP and 57% of saveSuccess events have user.class=IP. This was still the case the last time I checked the data, though instrumentation problems mean some events were not making it into the database. I would consider this issue resolved if we can determine that we are correctly labeling users as "Anonymous", because 57% of all successful saves with Visual Editor seems far too high to all of our collective intuition. I will edit the description of the issue to make it less confusing.

Milimetric updated the task description. Mar 23 2015, 3:28 PM

I think it's unlikely that we would say we are definitely doing it correctly. I can think of nothing wrong with the code at the moment. In both cases, we just do this:

		if ( mw.user.isAnon() ) {
			event['user.class'] = 'IP';
		}

Are some of those users changing whether they are logged in or out halfway through the session? How many users do that? I think it's possible.
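One way to picture the mid-session scenario: the check runs each time an event is built, so two events from the same editing session could carry different values if the login state changes in between. The sketch below only illustrates that assumption. `makeUser` and `buildEvent` are hypothetical stand-ins, not the real `mw.user` object or the actual logging code, and the `'init'` action name is a placeholder.

```javascript
// Hypothetical stand-in for mw.user; only isAnon() is modeled.
function makeUser( loggedIn ) {
	return {
		loggedIn: loggedIn,
		isAnon: function () {
			return !this.loggedIn;
		}
	};
}

// Mirrors the check quoted above: the event is tagged 'IP' only if
// the user is anonymous at the moment the event is built.
function buildEvent( user, action ) {
	var event = { action: action };
	if ( user.isAnon() ) {
		event[ 'user.class' ] = 'IP';
	}
	return event;
}

var user = makeUser( false );
// Built while logged out ('init' is a placeholder action name).
var initEvent = buildEvent( user, 'init' );
// Assumed scenario: the user logs in before saving.
user.loggedIn = true;
var saveEvent = buildEvent( user, 'saveSuccess' );

console.log( initEvent[ 'user.class' ] ); // 'IP'
console.log( saveEvent[ 'user.class' ] ); // undefined
```

If this happens in practice, the share of events tagged IP would be higher than the share of saves tagged IP, which is at least the direction of the 78% versus 57% gap.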

Milimetric renamed this task from Schema:Edit seems to incorrectly set users as anonymous. to Schema:Edit seems to incorrectly set users as anonymous {lion}. Apr 7 2015, 10:32 PM
Milimetric updated the task description.
Milimetric added a project: Analytics-Kanban.
Halfak added a subscriber: Halfak. Apr 10 2015, 9:37 PM

I just checked the rates at which we see weirdness. It looks like about 0.8% of edits saved by registered editors have saveSuccess events associated with user.class = "IP". 99.5% of revisions saved by logged-out users do have user.class = "IP".

> select rev_user = 0, sum(`event_user.id` = "IP"), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" GROUP BY 1;
+--------------+-----------------------------+----------+
| rev_user = 0 | sum(`event_user.id` = "IP") | COUNT(*) |
+--------------+-----------------------------+----------+
|            0 |                         628 |    82642 |
|            1 |                       27063 |    27198 |
+--------------+-----------------------------+----------+
2 rows in set, 1 warning (4 min 39.80 sec)

I also noticed that the user_id stored in the revision table sometimes doesn't match the user.id in the schema. This happens 1.5% of the time when the saved edit has rev_user != 0. It happens substantially less often when the user who saved an edit was logged out.

> select rev_user = 0, sum(rev_user != `event_user.id`), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" GROUP BY 1;
+--------------+----------------------------------+----------+
| rev_user = 0 | sum(rev_user != `event_user.id`) | COUNT(*) |
+--------------+----------------------------------+----------+
|            0 |                             1244 |    82642 |
|            1 |                              135 |    27198 |
+--------------+----------------------------------+----------+
2 rows in set (5 min 28.93 sec)
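For reference, the percentages can be recomputed from the raw counts in the two result tables above. This is only a worked-arithmetic sketch: the object and function names are made up for illustration, and "matched IP" means rows where the query's `event_user.id` = "IP" comparison was true.

```javascript
// Counts copied from the two result tables above
// (enwiki, saveSuccess, 20150401 to 20150402).
// "registered" is the rev_user = 0 -> 0 row, "anonymous" the -> 1 row.
var registered = { total: 82642, matchedIP: 628, idMismatch: 1244 };
var anonymous = { total: 27198, matchedIP: 27063, idMismatch: 135 };

// Format part/whole as a percentage with two decimals.
function percent( part, whole ) {
	return ( 100 * part / whole ).toFixed( 2 ) + '%';
}

var rates = {
	registeredMatchedIP: percent( registered.matchedIP, registered.total ),
	anonymousMatchedIP: percent( anonymous.matchedIP, anonymous.total ),
	registeredIdMismatch: percent( registered.idMismatch, registered.total ),
	anonymousIdMismatch: percent( anonymous.idMismatch, anonymous.total )
};

console.log( rates );
// registeredMatchedIP: '0.76%', anonymousMatchedIP: '99.50%',
// registeredIdMismatch: '1.51%', anonymousIdMismatch: '0.50%'
```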

These rates are low, so they probably won't be a show-stopper for the upcoming test, but it would be good to know what causes them.

When looking for stuff like this, it's good to keep in mind that visual editor and wikitext instrumentation are very different, and to look at them separately. Here are the numbers for just VE:

mysql:research@analytics-store.eqiad.wmnet [log]> select rev_user = 0, sum(`event_user.id` = "IP"), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" and event_editor = 'visualeditor' GROUP BY 1;
+--------------+-----------------------------+----------+
| rev_user = 0 | sum(`event_user.id` = "IP") | COUNT(*) |
+--------------+-----------------------------+----------+
|            0 |                           0 |      796 |
|            1 |                           0 |       81 |
+--------------+-----------------------------+----------+
2 rows in set, 1 warning (6.29 sec)

mysql:research@analytics-store.eqiad.wmnet [log]> select rev_user = 0, sum(rev_user != `event_user.id`), COUNT(*) from Edit_11448630 INNER JOIN enwiki.revision ON rev_id = `event_page.revid` WHERE wiki = 'enwiki' AND timestamp BETWEEN "20150401" AND "20150402" AND event_action = "saveSuccess" and event_editor = 'visualeditor' GROUP BY 1;
+--------------+----------------------------------+----------+
| rev_user = 0 | sum(rev_user != `event_user.id`) | COUNT(*) |
+--------------+----------------------------------+----------+
|            0 |                              305 |      796 |
|            1 |                               81 |       81 |
+--------------+----------------------------------+----------+
2 rows in set (47.68 sec)

Indeed. Thanks @Milimetric. Here's a query that summarizes the problem.

> SELECT
    ->   event_editor,
    ->   rev_user = 0, 
    ->   sum(`event_user.id` = "IP"), 
    ->   COUNT(*) 
    -> FROM Edit_11448630 
    -> INNER JOIN enwiki.revision ON rev_id = `event_page.revid` 
    -> WHERE wiki = 'enwiki' AND 
    ->       timestamp BETWEEN "20150401" AND "20150402" AND 
    ->       event_action = "saveSuccess"
    -> GROUP BY 1,2;

+--------------+--------------+-----------------------------+----------+
| event_editor | rev_user = 0 | sum(`event_user.id` = "IP") | COUNT(*) |
+--------------+--------------+-----------------------------+----------+
| visualeditor |            0 |                           0 |      796 |
| visualeditor |            1 |                           0 |       81 |
| wikitext     |            0 |                         627 |    81654 |
| wikitext     |            1 |                       27033 |    27087 |
+--------------+--------------+-----------------------------+----------+
4 rows in set, 1 warning (8 min 57.37 sec)
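
The per-editor split is easier to read as rates. A small sketch over the four rows above (the `rows` structure and field names are illustrative only, not part of any schema):

```javascript
// Rows copied from the summary table above.
// revUserIsZero: true means the revision was saved while logged out.
var rows = [
	{ editor: 'visualeditor', revUserIsZero: false, matchedIP: 0, total: 796 },
	{ editor: 'visualeditor', revUserIsZero: true, matchedIP: 0, total: 81 },
	{ editor: 'wikitext', revUserIsZero: false, matchedIP: 627, total: 81654 },
	{ editor: 'wikitext', revUserIsZero: true, matchedIP: 27033, total: 27087 }
];

// Share of rows in each group where the "IP" comparison was true.
var summary = rows.map( function ( row ) {
	return {
		editor: row.editor,
		loggedOut: row.revUserIsZero,
		matchedIP: ( 100 * row.matchedIP / row.total ).toFixed( 2 ) + '%'
	};
} );

summary.forEach( function ( s ) {
	console.log( s.editor, s.loggedOut ? 'logged out' : 'registered', s.matchedIP );
} );
```

Wikitext behaves as expected (about 0.77% for registered and 99.80% for logged-out revisions), while the comparison is never true for visualeditor rows in either group.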

Krenair removed Krenair as the assignee of this task. Apr 15 2015, 1:02 AM
Krenair added a subscriber: Krenair.

Note: Discussed at the weekly Editing triage on 2015-05-13 but not accepted or rejected whilst we follow up with the Data Research and Analytics Engineering teams to check on the status of this issue.

Halfak closed this task as Resolved. May 13 2015, 11:19 PM
Halfak claimed this task.

Upon re-review with DAndreescu, it looks like this field is set appropriately. See my worklog here: https://meta.wikimedia.org/wiki/Research_talk:VisualEditor%27s_effect_on_newly_registered_editors/Work_log/2015-04-29

Thanks Halfak, the numbers are crazy, but crazy good! :)

kevinator moved this task from Next Up to Done on the Analytics-Kanban board. May 19 2015, 7:36 PM
Jdforrester-WMF moved this task from Bug Fixes to Q4 on the VisualEditor board. Jun 17 2015, 11:30 PM