Page MenuHomePhabricator

Add caused_by_user_text to mediawiki_page_history
Closed, ResolvedPublic1 Estimated Story Points

Description

Re. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history

Add a new field cused_by_user_text which would contain the IP address in the case that a page is created by an IP editor. IPs can and do create articles in all wikis except English Wikipedia and there are many other types of pages that IPs create in English Wikipedia.

Event Timeline

Is this about listing the IP in the page create events that have it?

Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/page_history/snapshot=2019-03").createOrReplaceTempView("mwph")

spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+                                           
|caused_by_user_text       |c       |
+--------------------------+--------+
|null                      |54186484|
|Lsjbot                    |17517304|
|Research Bot              |16512521|
|TuanminhBot               |11023605|
|Meta-Wiki Welcome         |8611228 |
|Sk!dbot                   |7342837 |
|Wikimedia Commons Welcome |7243842 |
|GZWDer (flood)            |6897283 |
|Dcirovicbot               |6656209 |
|Bot-Jagwar                |5984240 |
|Fæ                        |4883049 |
|QuickStatementsBot        |4423838 |
|Maintenance script        |2967554 |
|New user message          |2678820 |
|Wikinews Welcome          |2599072 |
|Welcoming Bot             |2527432 |
|Loveless                  |2404348 |
|Panoramio upload bot      |2312489 |
|MediaWiki message delivery|1889428 |
|Liangent-bot              |1743545 |
+--------------------------+--------+

spark.sql("select caused_by_event_type, count(1) as c from mwph where caused_by_user_text is null group by caused_by_event_type order by c desc limit 20").show(20, false)
+--------------------+--------+                                                 
|caused_by_event_type|c       |
+--------------------+--------+
|create              |54186396|
|delete              |84      |
|restore             |4       |
+--------------------+--------+

As expected, as lot of user_text info is not available for create events.

Nuria raised the priority of this task from Low to Medium.Apr 22 2019, 5:20 PM

Pinging Product-Analytics to get feedback for column name. "caused_by_user_text" seems a bit obscure (but maybe it's just me). How about "caused_by_additional_info"?

@Nuria: The caused_by_user_text field contains the event-performer user_text so additional_infois not accurate enough IMO. We could use a complex structure for caused_by given that we have user_id, user_text and event_type, but I'm not sure if it makes things easier.

"caused_by_user_additional_text"?

Hey folks. I've been following this task, but I might not have the full context, so take what I say with a grain of salt that is appropriately sized.

"user_text" is a common term in MediaWiki data. It means "This field either contains a username or an IP address". It's often used in cases where denormalizing the username of a registered user is advantageous (e.g. the revision table) and it would be wasteful to have a separate field for IP.

When I read the term "caused_by_user_text", I assume it is related to a similar field nearby called "caused_by_user" or "caused_by_user_id" that contains either the registered user's ID number or 0 in case of an anon. If I were to come across a field called "caused_by_additional_info" or "caused_by_user_additional_text", I would not know what to expect in this field because I don't see that language used elsewhere in MediaWiki data. If the goal is to follow the common patterns of "user_text" (IP or username) with this field, then my recommendation is to call it "<something>_user_text".

I see, +1 to naming then if this is some existing media wiki convention.

Nuria set the point value for this task to 1.