Add user_is_bot_by to MediaWiki history
Closed, Resolved | Public | 3 Estimated Story Points

Description

This is the remaining work to do from T178591.

Event Timeline

mforns triaged this task as Medium priority. (Mar 25 2019, 4:35 PM)
mforns moved this task from Deprioritized to Data Quality on the Analytics board.

Moving @Milimetric's comment:

I remember discussing this recently, and the idea we had then was to have a single field, something like bot_detected_by, which would be a list of name-regex, group, etc. We figured this would make queries easier to write and allow the values to be more explicit without making the field name itself longer.

+1 on my end, seems a more concise way to express the same thing

Here is the definition we agreed on with @Milimetric: remove the user_is_bot_by_name boolean field and add 2 new fields, user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. For now they can contain 2 different values: group (when the user is in the bot group) and name_regex (when the username contains bot). Using an array also leaves room for possible new detection methods later (machine learning?).
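To make the array semantics concrete, here is a minimal sketch of how such a field could be populated. It is not the refinery implementation: the object and function names are hypothetical, the name check is assumed to be a plain case-insensitive substring match on "bot", and the labels "name" and "group" follow the query output posted later in this task rather than the name_regex label mentioned above.

```scala
// Sketch only: a simplified model of the user_is_bot_by array, not the refinery code.
object BotDetection {
  // Assumption: the name check is a case-insensitive substring match on "bot".
  private def detectedByName(userName: String): Boolean =
    userName.toLowerCase.contains("bot")

  // Assumption: group membership is passed in as a plain list of group names.
  private def detectedByGroup(userGroups: Seq[String]): Boolean =
    userGroups.contains("bot")

  // Returns every method that flagged the user; an empty list means "not detected as a bot".
  def userIsBotBy(userName: String, userGroups: Seq[String]): Seq[String] = {
    val byName = if (detectedByName(userName)) Seq("name") else Seq.empty[String]
    val byGroup = if (detectedByGroup(userGroups)) Seq("group") else Seq.empty[String]
    byName ++ byGroup
  }
}
```

With this model, a user named ExampleBot in the bot group gets both labels, while a user with neither signal gets an empty list. A new detection method would only add one more label to the list, with no schema change.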

JAllemandou renamed this task from Add user_is_bot_by_group to MediaWiki history to Add user_is_bot_by to MediaWiki history. (Apr 15 2019, 2:02 PM)
JAllemandou claimed this task.
JAllemandou added a project: Analytics-Kanban.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 504025 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

Change 504025 merged by jenkins-bot:
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+

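// Note: to exclude bot users in a query, add the condition size(event_user_is_bot_by) = 0.
// Rows where event_user_is_bot_by is null do not satisfy this condition either, so they are excluded as well.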
spark.sql("select wiki_db, count(1) as c from mwh where size(event_user_is_bot_by) = 0 and event_entity = 'revision' group by wiki_db order by c desc").show(20, false)

+------------+---------+
|wiki_db     |c        |
+------------+---------+
|enwiki      |583505798|
|wikidatawiki|300098655|
|commonswiki |228429726|
|dewiki      |133938472|
|frwiki      |98025463 |
|eswiki      |63657679 |
|ruwiki      |63216554 |
|itwiki      |52196362 |
|jawiki      |41966746 |
|zhwiki      |32220875 |
|plwiki      |31284653 |
|nlwiki      |30559338 |
|ptwiki      |29380203 |
|shwiki      |23139895 |
|hewiki      |17659564 |
|svwiki      |17066470 |
|ukwiki      |15221952 |
|enwiktionary|15133289 |
|metawiki    |14403222 |
|huwiki      |13358970 |
+------------+---------+

Change 504025 merged by Fdans:
[analytics/refinery/source@master] Update mediawiki-history user bot fields

https://gerrit.wikimedia.org/r/504025

Nuria set the point value for this task to 3.