Description
This is the remaining work from T178591.
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Update mediawiki-history user bot fields | analytics/refinery/source | master | +166 -109 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | None | T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics |
| Resolved | None | T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days |
| Resolved | odimitrijevic | T130256 Wikistats 2.0. |
| Resolved | None | T143924 Replacing standard edit metrics in dashiki with data from new edit data depot |
| Resolved | JAllemandou | T152035 Productionize Edit History Reconstruction and Extraction |
| Declined | None | T153923 vet edit data on the data lake |
| Resolved | JAllemandou | T178591 Feedback on hive table mediawiki_history by Erik Z |
| Resolved | JAllemandou | T221824 Mediawiki History Release - 2019-04 snapshot |
| Resolved | JAllemandou | T219177 Add user_is_bot_by to MediaWiki history |
Event Timeline
Moving @Milimetric's comment:
I remember discussing this recently, and the idea we had then was to have a single field, something like bot_detected_by, which would be a list of name-regex, group, etc. We figured this would make queries easier to write and allow the values to be more explicit without making the field name itself longer.
+1 on my end, seems a more concise way to express the same thing
Here is the definition we agreed on with @Milimetric: remove the user_is_bot_by_name boolean field and add two new fields, user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. For now they can contain two values: group (the user is in the bot group) and name_regex (the username contains bot). Using an array also leaves room for new detection methods later (machine learning?).
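To make the definition above concrete, here is a minimal sketch in plain Scala of how such an array could be filled. It is not the code from the patch: the object name, the function signature and the case-insensitive "contains bot" pattern are all assumptions made for illustration. The value names follow the comment above (group, name_regex); note that the sample output further down shows the name-based value as name.

```scala
// Illustration only: the actual logic lives in analytics/refinery/source.
object BotDetection {
  // Hypothetical pattern for "the username contains bot" (case-insensitive is an assumption).
  private val BotNamePattern = "(?i)bot".r

  /** Returns every method by which the user is detected as a bot; empty means "not detected". */
  def userIsBotBy(userName: String, userGroups: Seq[String]): Seq[String] = {
    val byName: Seq[String] =
      if (BotNamePattern.findFirstIn(userName).isDefined) Seq("name_regex") else Seq.empty
    val byGroup: Seq[String] =
      if (userGroups.contains("bot")) Seq("group") else Seq.empty
    byName ++ byGroup
  }
}
```

With this shape, a new detection method only adds one more possible value to the array, and neither the schema nor existing queries need to change.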
Change 504025 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update mediawiki-history user bot fields
Change 504025 merged by jenkins-bot:
[analytics/refinery/source@master] Update mediawiki-history user bot fields
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")
spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c |
+--------------------+----------+
|[name] |306912801 |
|[] |2597239764|
|null |491818452 |
|[group] |169490265 |
|[name, group] |1289385512|
+--------------------+----------+
// Note: to exclude bot users from a query, add `and size(event_user_is_bot_by) = 0` to the WHERE clause
spark.sql("select wiki_db, count(1) as c from mwh where size(event_user_is_bot_by) = 0 and event_entity = 'revision' group by wiki_db order by c desc").show(20, false)
+------------+---------+
|wiki_db |c |
+------------+---------+
|enwiki |583505798|
|wikidatawiki|300098655|
|commonswiki |228429726|
|dewiki |133938472|
|frwiki |98025463 |
|eswiki |63657679 |
|ruwiki |63216554 |
|itwiki |52196362 |
|jawiki |41966746 |
|zhwiki |32220875 |
|plwiki |31284653 |
|nlwiki |30559338 |
|ptwiki |29380203 |
|shwiki |23139895 |
|hewiki |17659564 |
|svwiki |17066470 |
|ukwiki |15221952 |
|enwiktionary|15133289 |
|metawiki |14403222 |
|huwiki |13358970 |
+------------+---------+
Change 504025 merged by Fdans:
[analytics/refinery/source@master] Update mediawiki-history user bot fields
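One detail of the `size(event_user_is_bot_by) = 0` filter used above: it only keeps rows where the array is present and empty, so rows where the field is null are dropped as well, because in Spark SQL the size of a null array is never 0. The sketch below models that behaviour in plain Scala, without Spark. The `Event` case class and the helper names are hypothetical, and `None` stands in for SQL NULL.

```scala
// Minimal model of the event_user_is_bot_by column; None stands for SQL NULL.
final case class Event(wikiDb: String, userIsBotBy: Option[Seq[String]])

object BotFilter {
  // Mirrors `size(event_user_is_bot_by) = 0`: true only for a present, empty array.
  def isNonBot(e: Event): Boolean = e.userIsBotBy.exists(_.isEmpty)

  // Mirrors `array_contains(event_user_is_bot_by, method)`.
  def detectedBy(e: Event, method: String): Boolean =
    e.userIsBotBy.exists(_.contains(method))
}

object BotFilterDemo {
  // Made-up sample rows covering the shapes seen in the grouped output above.
  val events: Seq[Event] = Seq(
    Event("enwiki", Some(Seq.empty)),
    Event("enwiki", Some(Seq("name"))),
    Event("dewiki", Some(Seq("name", "group"))),
    Event("dewiki", None)
  )

  val nonBotCount: Int = events.count(BotFilter.isNonBot)
  val groupBotCount: Int = events.count(e => BotFilter.detectedBy(e, "group"))
}
```

In the actual table, `array_contains(event_user_is_bot_by, 'group')` would be the analogous way to select a single detection method.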