Page MenuHomePhabricator

mediawiki_history datasets have null user_text for IP edits
Closed, ResolvedPublic3 Story Points

Description

For IP editors, the user_text fields are set to null. They should actually contain the IP address; the row needs to contain the IP address somewhere (since that is the user's name), and this is logical place.

select
    event_user_text,
    count(*) as edits
from wmf.mediawiki_history
where
    event_user_id = 0 and
    snapshot = "2018-09" and
    event_timestamp >= "2018-08"
group by event_user_text

  event_user_text    edits
0            None  4096169

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptOct 12 2018, 9:27 PM

The event_user_text field in wmf.mediawiki_history is what can be called the "current" value of the username, by opposition to the value it had at the time the edit was made which is stored in event_user_text_historical. Since IPs are stored at edit-time and we don't build user-history for IPs, the values are stored in `event_user_text_historical:

select count(distinct event_user_text) from wmf.mediawiki_history where snapshot='2018-09' and event_entity = 'revision' and event_user_is_anonymous and SUBSTR(event_timestamp, 0, 7) >= '2018-08'
                    
count(DISTINCT event_user_text)
                              0


select count(distinct event_user_text_historical) from wmf.mediawiki_history where snapshot='2018-09' and event_entity = 'revision' and event_user_is_anonymous and SUBSTR(event_timestamp, 0, 7) >= '2018-08'
                    
count(DISTINCT event_user_text_historical
                                   1286565

Something else to notice @Neil_P._Quinn_WMF : Don't forget to filter by event_entity, or you'll get more than edit-counts!

JAllemandou closed this task as Declined.Oct 15 2018, 7:40 AM

The event_user_text field in wmf.mediawiki_history is what can be called the "current" value of the username, by opposition to the value it had at the time the edit was made which is stored in event_user_text_historical. Since IPs are stored at edit-time and we don't build user-history for IPs,

I have to confess that I don't quite understand this statement. (MediaWiki itself is most definitely building contribution histories for IPs too, and providing them under [[Special:Contributions]].)
But be that as it may - shouldn't the documentation of mediawiki_history be updated in that case? At present it claims that the event_user_text field contains the "Current username or IP address of the of the [sic] user that caused the event".

the values are stored in `event_user_text_historical:

...

But be that as it may - shouldn't the documentation of mediawiki_history be updated in that case?

Done !

Neil_P._Quinn_WMF reopened this task as Open.EditedNov 16 2018, 1:51 AM

Hey @JAllemandou!

Thank you very much for updating the documentation!

Sorry to weigh in so late, but why can't we simply copy the event_user_text_historical to event_user_text for IP editors at the end of the reconstruction process? It doesn't seem like it would be hard to implement, and it's annoying and hard to learn that the way to get the canonical name for any user is not event_user_text like you'd expect, but rather coalesce(event_user_text, event_user_text_historical).

@Neil_P._Quinn_WMF I agree with that solution, seems fine to me. By the way, we might be having some trouble with user_text due to the actor normalization in mediawiki dbs, and we haven't had time to check thoroughly.

Hi @Neil_P._Quinn_WMF ,
While I understand the usage frustration, keeping the IPs in event_user_text_historical is for me a matter of data correctness.
Doing it represents nothing in term of code change, but I'd rather not do it to keep the semantics of event_user_text and event_user_text_historical valid and similar for anonymous and non-anonymous edits.
Happy to continue the discussion and to get other opnions weighting :)

Neil_P._Quinn_WMF added a comment.EditedNov 16 2018, 7:13 PM

Hi @Neil_P._Quinn_WMF ,
While I understand the usage frustration, keeping the IPs in event_user_text_historical is for me a matter of data correctness.
Doing it represents nothing in term of code change, but I'd rather not do it to keep the semantics of event_user_text and event_user_text_historical valid and similar for anonymous and non-anonymous edits.

Hmm. I don't actually understand this. To me, from a data semantics point of view, the current setup implies that IP users never have a current name—as if the username has been hidden and MediaWiki no longer attributes their edits to any name.

In reality, IP users do have a current name, which you'll see on every history page. It just so happens that unlike the name of a registered user, an IP user's name cannot be changed, so the current and historical values will always be the same.

So, as I see it, copying the IP address to event_user_text is more correct than the current setup.

I hear your point and it makes a lot of sense. I think our views differ in the notion of current name. In my world a current name is associated to events only when we're sure those events have been made by the same person/account. In my world the event_user_text field references single users. I however understand that taken from a purely name-changing perspective, an IP has the same value before and now. My concern lay in misrepresentations of IP changes in time: If you use an IP as current-name for an edit, you might be tempted to consider other edits made by that IPs as belonging to the same user - Which is false (but true in the case of non-anonymous edits).

fdans moved this task from Incoming to Blocked on the Analytics board.Nov 19 2018, 5:26 PM
fdans added a subscriber: fdans.

We won't be making any changes in mediawiki history in the near term since we're redefining the way we sqoop data.

Neil_P._Quinn_WMF added a comment.EditedNov 27 2018, 2:51 AM

I hear your point and it makes a lot of sense. I think our views differ in the notion of current name. In my world a current name is associated to events only when we're sure those events have been made by the same person/account. In my world the event_user_text field references single users. I however understand that taken from a purely name-changing perspective, an IP has the same value before and now. My concern lay in misrepresentations of IP changes in time: If you use an IP as current-name for an edit, you might be tempted to consider other edits made by that IPs as belonging to the same user - Which is false (but true in the case of non-anonymous edits).

Thanks for explaining your thinking, @JAllemandou! 😁

I understand now where you are coming from, but I honestly find that interpretation very unintuitive (on top of the usability problem with null values): before reading your comment, it never occurred to me that might be the reason for omitting event_user_text for IPs. I doubt anyone will understand from the null values that those events cannot be reliably attributed; instead they'll probably just be confused (particularly when the core MediaWiki databases have always used user_text to mean user name or IP address).

Moreover, this approach of using semantics to prevent data users from grouping edits by IP address implies that that approach is incorrect and should basically never be done. But a lot of Analytics products use that exact approach: for example, Wikistats 2 by default includes anonymous editors (i.e. distinct IP addresses) in the total, and the Geoeditors dataset did so until I suggested using the user agent as well.

So, overall, I'm definitely still for adding the IP address (whenever the blocking work is completed, of course).

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board.Jan 3 2019, 9:27 PM

I hear your point and it makes a lot of sense. I think our views differ in the notion of current name. In my world a current name is associated to events only when we're sure those events have been made by the same person/account. In my world the event_user_text field references single users.

I understand that perspective, but the problem is that is incompatible with the perspective that has been taken for the last decade and half by the MediaWiki software (which e.g. groups edits from the same IP on the same special page) , the editing community, analysts, and the general public (example).

I however understand that taken from a purely name-changing perspective, an IP has the same value before and now. My concern lay in misrepresentations of IP changes in time: If you use an IP as current-name for an edit, you might be tempted to consider other edits made by that IPs as belonging to the same user - Which is false (but true in the case of non-anonymous edits).

So it boils down to protecting data analysts who access mediawiki_history from that fallacy. Assuming it isn't enough to rely on the assumption that they already possess this knowledge (which, again, is common to volunteer editors etc.), I think this goal could be achieved by adding a note to the table's documentation, rather than preventing them from applying the existing standard definition.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
odf.createOrReplaceTempView("mwh_old")

// New datasource - normally the problem is solved in here
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
df.createOrReplaceTempView("mwh")

// Reproduce old
 spark.sql("""
     | select
     |     event_user_text,
     |     count(*) as edits
     | from mwh_old
     | where
     |     event_user_id = 0 and
     |     event_timestamp >= "2018-08"
     | group by event_user_text
     | """).show(10, false)
+---------------+--------+                                                      
|event_user_text|edits   |
+---------------+--------+
|null           |17206148|
+---------------+--------+


// Check new
spark.sql("""
     | select
     |     event_user_text,
     |     count(*) as edits
     | from mwh
     | where
     |     event_user_id = 0 and
     |     event_timestamp >= "2018-08"
     | group by event_user_text
     | """).show(10, false)
+--------------------------------------+-----+                                  
|event_user_text                       |edits|
+--------------------------------------+-----+
|112.248.13.26                         |28   |
|2A01:598:A90C:6D70:E1CE:1FF7:CE94:1255|4    |
|109.184.74.88                         |3    |
|2003:EB:C3D4:CD7:60E5:2BE0:B4A8:8377  |1    |
|213.26.251.61                         |1    |
|104.246.37.74                         |26   |
|140.120.220.3                         |1    |
|182.1.67.130                          |1    |
|39.37.181.27                          |4    |
|2A01:CB1D:80FC:DA00:D4BB:1FD9:BE1D:B6C|34   |
+--------------------------------------+-----+
Nuria set the point value for this task to 3.Tue, May 14, 8:37 PM
Nuria closed this task as Resolved.Tue, May 14, 8:45 PM