
Add user_is_bot_by to MediaWiki history
Closed, Resolved | Public | 3 Story Points

Description

This is the remaining work to do from T178591.

Event Timeline

mforns moved this task from Deprioritized to Data Quality on the Analytics board. Mar 25 2019, 4:35 PM
mforns triaged this task as Normal priority.
Nuria added a comment. Mar 25 2019, 7:34 PM

Moving @Milimetric's comment:

I remember discussing this recently, and the idea we had then was to have a single field, something like bot_detected_by, which would be a list of name-regex, group, etc. We figured this would make queries easier to write and allow the values to be more explicit without making the field name itself longer.

+1 on my end; this seems a more concise way to express the same thing.

Here is the definition we agreed on with @Milimetric: remove the user_is_bot_by_name boolean field and add two new fields, user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. For now they can contain two different values: group (when the user is in the bot group) and name_regex (when the username contains bot). Using an array also leaves room for possible new detection methods (machine learning?).
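To make the shape of the new field concrete, here is a minimal sketch in plain Scala of how such an array could be derived for one user. It is not the refinery implementation: the object name, the function signature, and the case-insensitive "bot" regex are assumptions made for illustration. The labels name and group follow the values that appear in the query output on this task; the definition above refers to the first one as name_regex.

```scala
// Hypothetical sketch: names, signature, and regex are illustrative assumptions,
// not the code merged in analytics/refinery/source.
object BotDetection {
  // Labels as they appear in the event_user_is_bot_by query output on this task.
  val ByName = "name"
  val ByGroup = "group"

  // Assumption: "the username contains bot", matched case-insensitively.
  private val botNamePattern = "(?i)bot".r

  /** Returns every detection method that flags the user; an empty Seq means not flagged. */
  def userIsBotBy(userName: String, userGroups: Seq[String]): Seq[String] = {
    val byName =
      if (botNamePattern.findFirstIn(userName).isDefined) Seq(ByName) else Seq.empty[String]
    val byGroup =
      if (userGroups.contains("bot")) Seq(ByGroup) else Seq.empty[String]
    byName ++ byGroup
  }
}
```

With this sketch, a user named ExampleBot in the bot group yields Seq("name", "group"), while a user in no bot group and without "bot" in the name yields an empty sequence. A substring match like this one also flags names such as "robotnik", which is one reason to keep the detection method explicit in the field.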

JAllemandou added a project: Analytics-Kanban.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou renamed this task from Add user_is_bot_by_group to MediaWiki history to Add user_is_bot_by to MediaWiki history.

Change 504025 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

Change 504025 merged by jenkins-bot:
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+
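As a quick sanity check on the breakdown above, the counts can be tallied in plain Scala. This is only a sketch: the object and value names are hypothetical, the numbers are copied from the snapshot=2019-03 output, and None stands for the null row. Note that rows where the field is null form their own bucket, separate from the empty array, so a condition comparing size(event_user_is_bot_by) to 0 would not match them.

```scala
// Hypothetical helper for reading the table above; names are illustrative only.
object BotByTally {
  // (value of event_user_is_bot_by, event count); None represents the null row.
  val counts: Seq[(Option[Seq[String]], Long)] = Seq(
    (Some(Seq("name")), 306912801L),
    (Some(Seq.empty[String]), 2597239764L),
    (None, 491818452L),
    (Some(Seq("group")), 169490265L),
    (Some(Seq("name", "group")), 1289385512L)
  )

  // All events in the snapshot.
  val total: Long = counts.map(_._2).sum

  // Events flagged by at least one detection method (non-empty array).
  val flagged: Long = counts.collect { case (Some(m), c) if m.nonEmpty => c }.sum

  // Events with an empty array, i.e. not flagged by any method.
  val notFlagged: Long = counts.collect { case (Some(m), c) if m.isEmpty => c }.sum

  // Events where the field is null.
  val unknown: Long = counts.collect { case (None, c) => c }.sum

  // Events flagged by a specific method, whatever else is in the array.
  def flaggedBy(method: String): Long =
    counts.collect { case (Some(m), c) if m.contains(method) => c }.sum
}
```

For these numbers, total is 4854846794, of which 1765788578 events are flagged by at least one method, 2597239764 have an empty array, and 491818452 are null.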

// Note: to exclude bot users in a query, add the condition size(event_user_is_bot_by) = 0 to the where clause
spark.sql("select wiki_db, count(1) as c from mwh where size(event_user_is_bot_by) = 0 and event_entity = 'revision' group by wiki_db order by c desc").show(20, false)

+------------+---------+
|wiki_db     |c        |
+------------+---------+
|enwiki      |583505798|
|wikidatawiki|300098655|
|commonswiki |228429726|
|dewiki      |133938472|
|frwiki      |98025463 |
|eswiki      |63657679 |
|ruwiki      |63216554 |
|itwiki      |52196362 |
|jawiki      |41966746 |
|zhwiki      |32220875 |
|plwiki      |31284653 |
|nlwiki      |30559338 |
|ptwiki      |29380203 |
|shwiki      |23139895 |
|hewiki      |17659564 |
|svwiki      |17066470 |
|ukwiki      |15221952 |
|enwiktionary|15133289 |
|metawiki    |14403222 |
|huwiki      |13358970 |
+------------+---------+

Change 504025 merged by Fdans:
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

Nuria closed this task as Resolved. Tue, May 14, 8:35 PM
Nuria set the point value for this task to 3.