
Add caused_by_user_text to mediawiki_page_history
Closed, Resolved · Public · 1 Story Points

Description

Re. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history

Add a new field, caused_by_user_text, which would contain the IP address when a page is created by an IP editor. IPs can and do create articles in all wikis except English Wikipedia, and there are many other types of pages that IPs create in English Wikipedia.

Event Timeline

Halfak created this task.Jun 11 2017, 7:48 PM
Restricted Application added a subscriber: Aklapper.Jun 11 2017, 7:48 PM
Nuria added a subscriber: Nuria.EditedJun 12 2017, 3:46 PM

Is this about listing the IP in the page create events that have it?

Nuria triaged this task as Low priority.Jun 12 2017, 3:50 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
fdans moved this task from Backlog (Later) to Incoming on the Analytics board.Nov 2 2017, 3:44 PM
fdans moved this task from Incoming to Dashiki on the Analytics board.Nov 6 2017, 5:04 PM
Milimetric moved this task from Dashiki to Incoming on the Analytics board.Apr 2 2018, 3:33 PM
fdans moved this task from Incoming to Backlog (Later) on the Analytics board.Apr 12 2018, 5:08 PM
Milimetric moved this task from Done to Ready to Deploy on the Analytics-Kanban board.
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/page_history/snapshot=2019-03").createOrReplaceTempView("mwph")

spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+                                           
|caused_by_user_text       |c       |
+--------------------------+--------+
|null                      |54186484|
|Lsjbot                    |17517304|
|Research Bot              |16512521|
|TuanminhBot               |11023605|
|Meta-Wiki Welcome         |8611228 |
|Sk!dbot                   |7342837 |
|Wikimedia Commons Welcome |7243842 |
|GZWDer (flood)            |6897283 |
|Dcirovicbot               |6656209 |
|Bot-Jagwar                |5984240 |
|Fæ                        |4883049 |
|QuickStatementsBot        |4423838 |
|Maintenance script        |2967554 |
|New user message          |2678820 |
|Wikinews Welcome          |2599072 |
|Welcoming Bot             |2527432 |
|Loveless                  |2404348 |
|Panoramio upload bot      |2312489 |
|MediaWiki message delivery|1889428 |
|Liangent-bot              |1743545 |
+--------------------------+--------+

spark.sql("select caused_by_event_type, count(1) as c from mwph where caused_by_user_text is null group by caused_by_event_type order by c desc limit 20").show(20, false)
+--------------------+--------+                                                 
|caused_by_event_type|c       |
+--------------------+--------+
|create              |54186396|
|delete              |84      |
|restore             |4       |
+--------------------+--------+

As expected, a lot of user_text info is not available for create events.

Nuria raised the priority of this task from Low to Normal.Mon, Apr 22, 5:20 PM
Nuria added a comment.Mon, Apr 22, 5:24 PM

Pinging Product-Analytics to get feedback for column name. "caused_by_user_text" seems a bit obscure (but maybe it's just me). How about "caused_by_additional_info"?

@Nuria: The caused_by_user_text field contains the event-performer user_text, so additional_info is not accurate enough IMO. We could use a complex structure for caused_by, given that we have user_id, user_text and event_type, but I'm not sure if it makes things easier.
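
For illustration only, the two shapes being weighed here could look roughly like the sketch below. It is plain Scala with no Spark dependency, and every name in it (FlatRow, NestedRow, CausedBy, CausedByShapes, pageId) is hypothetical rather than taken from the actual mediawiki_page_history schema; only the three caused_by parts (user_id, user_text, event_type) come from the comment above.

```scala
// Hypothetical sketch: flat caused_by_* columns versus one nested caused_by structure.
// None of these type or field names are the real table schema.
final case class CausedBy(
  userId: Option[Long],     // caused_by_user_id
  userText: Option[String], // caused_by_user_text (username or IP)
  eventType: String         // caused_by_event_type, e.g. "create"
)

// Flat layout: one column per caused_by attribute.
final case class FlatRow(
  pageId: Long,
  causedByUserId: Option[Long],
  causedByUserText: Option[String],
  causedByEventType: String
)

// Nested layout: the same three attributes grouped under a single field.
final case class NestedRow(pageId: Long, causedBy: CausedBy)

object CausedByShapes {
  // Converts the flat layout to the nested one without losing information.
  def nest(row: FlatRow): NestedRow =
    NestedRow(
      row.pageId,
      CausedBy(row.causedByUserId, row.causedByUserText, row.causedByEventType)
    )
}
```

Both layouts carry the same information; the nested one mainly changes how queries address the fields (caused_by.user_text instead of caused_by_user_text).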

Nuria added a comment.Tue, Apr 23, 6:24 PM

"caused_by_user_additional_text"?

Hey folks. I've been following this task, but I might not have the full context, so take what I say with a grain of salt that is appropriately sized.

"user_text" is a common term in MediaWiki data. It means "This field either contains a username or an IP address". It's often used in cases where denormalizing the username of a registered user is advantageous (e.g. the revision table) and it would be wasteful to have a separate field for IP.

When I read the term "caused_by_user_text", I assume it is related to a similar field nearby called "caused_by_user" or "caused_by_user_id" that contains either the registered user's ID number or 0 in case of an anon. If I were to come across a field called "caused_by_additional_info" or "caused_by_user_additional_text", I would not know what to expect in this field because I don't see that language used elsewhere in MediaWiki data. If the goal is to follow the common patterns of "user_text" (IP or username) with this field, then my recommendation is to call it "<something>_user_text".
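
A minimal sketch of that convention, in plain Scala with no Spark dependency: given an ID and a user_text value, decide whether the performer was registered or anonymous. The names (Performer, Registered, Anonymous, Unknown, UserText.performer) are hypothetical, and treating a missing ID the same as 0 is an assumption made for the example, not something the table definition is known to guarantee.

```scala
// Hypothetical sketch of the user_id / user_text convention described above.
sealed trait Performer
final case class Registered(userId: Long, userName: String) extends Performer
final case class Anonymous(ip: String) extends Performer
case object Unknown extends Performer

object UserText {
  // A positive ID with text means a registered user (the text is the username).
  // Text with an ID of 0 or no ID is treated as an anon (the text is the IP);
  // handling a missing ID this way is an assumption.
  // No text at all means the performer cannot be determined.
  def performer(userId: Option[Long], userText: Option[String]): Performer =
    (userId, userText) match {
      case (Some(id), Some(name)) if id > 0 => Registered(id, name)
      case (_, Some(text))                  => Anonymous(text)
      case _                                => Unknown
    }
}
```

With this reading, a row carrying ID 0 and user_text "192.0.2.1" is classified as Anonymous("192.0.2.1"), and a row with no user_text at all (like the null rows counted earlier in this task) falls through to Unknown.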

Nuria added a comment.Wed, Apr 24, 3:46 PM

I see, +1 to the naming then, if this is an existing MediaWiki convention.

Nuria closed this task as Resolved.Tue, May 14, 8:34 PM
Nuria set the point value for this task to 1.