This is the remaining work to do from T178591.
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
analytics/refinery/source | master | +166 -109 | Update mediawiki-history user bot fields
Status | Assigned | Task
---|---|---
Resolved | None | T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics
Resolved | None | T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days
Resolved | odimitrijevic | T130256 Wikistats 2.0.
Resolved | None | T143924 Replacing standard edit metrics in dashiki with data from new edit data depot
Resolved | JAllemandou | T152035 Productionize Edit History Reconstruction and Extraction
Declined | None | T153923 vet edit data on the data lake
Resolved | JAllemandou | T178591 Feedback on hive table mediawiki_history by Erik Z
Resolved | JAllemandou | T221824 Mediawiki History Release - 2019-04 snapshot
Resolved | JAllemandou | T219177 Add user_is_bot_by to MediaWiki history
Event Timeline
Moving @Milimetric's comment:
I remember discussing this recently, and the idea we had then was to have a single field, something like bot_detected_by, which would be a list of detection methods (name-regex, group, etc.). We figured this would make queries easier to write and allow the values to be more explicit without making the field name itself longer.
+1 on my end, seems a more concise way to express the same thing
Here is the definition we agreed on with @Milimetric: removal of the user_is_bot_by_name boolean field and addition of 2 new fields: user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. They can contain 2 different values as of now: group (when the user is in the bot group) and name_regex (if the username contains bot). Having an array also leaves room for possible new detection methods (machine learning?).
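To make the definition above concrete, here is a minimal plain-Scala sketch of how such an array could be derived for one user. It is not the code from the merged patch: the object name BotDetection, the function userIsBotBy, and the case-insensitive "bot" pattern are assumptions made for illustration, and the labels simply follow the wording of the comment above.

```scala
object BotDetection {
  // Assumption: the comment only says "the username contains bot";
  // the real regex used by the job may be stricter or case-sensitive.
  private val BotNamePattern = "(?i)bot".r

  /** Returns the list of methods by which a user is flagged as a bot.
    * An empty list means no method flagged the user.
    */
  def userIsBotBy(userName: String, userGroups: Set[String]): Seq[String] = {
    // "group": the user belongs to the bot group.
    val byGroup =
      if (userGroups.contains("bot")) Seq("group") else Seq.empty[String]
    // "name_regex": the username matches the (hypothetical) bot pattern.
    val byName =
      if (Option(userName).exists(n => BotNamePattern.findFirstIn(n).isDefined)) Seq("name_regex")
      else Seq.empty[String]
    byGroup ++ byName
  }
}
```

With this shape, adding a new detection method later only means appending another label to the list, which is the point of using an array rather than one boolean per method.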
Change 504025 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update mediawiki-history user bot fields
Change 504025 merged by jenkins-bot:
[analytics/refinery/source@master] Update mediawiki-history user bot fields
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)

+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+

// Note: To remove bot-users in a query, you should use and size(event_user_is_bot_by) = 0

spark.sql("select wiki_db, count(1) as c from mwh where size(event_user_is_bot_by) = 0 and event_entity = 'revision' group by wiki_db order by c desc").show(20, false)

+------------+---------+
|wiki_db     |c        |
+------------+---------+
|enwiki      |583505798|
|wikidatawiki|300098655|
|commonswiki |228429726|
|dewiki      |133938472|
|frwiki      |98025463 |
|eswiki      |63657679 |
|ruwiki      |63216554 |
|itwiki      |52196362 |
|jawiki      |41966746 |
|zhwiki      |32220875 |
|plwiki      |31284653 |
|nlwiki      |30559338 |
|ptwiki      |29380203 |
|shwiki      |23139895 |
|hewiki      |17659564 |
|svwiki      |17066470 |
|ukwiki      |15221952 |
|enwiktionary|15133289 |
|metawiki    |14403222 |
|huwiki      |13358970 |
+------------+---------+
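One thing worth spelling out about the size(event_user_is_bot_by) = 0 filter is how it treats the null rows visible in the first result. The sketch below is a plain-Scala model of the predicate, not Spark code: None stands in for SQL NULL, and the names BotFilterModel, isNotBot and detectedBy are hypothetical. It relies on the fact that a comparison against size(...) of a NULL array is never true in Spark SQL, so null rows are dropped along with flagged bots; whether that is desirable depends on what null represents in this table, which this task does not state.

```scala
object BotFilterModel {
  // One row's event_user_is_bot_by value; None models SQL NULL.
  type BotBy = Option[Seq[String]]

  // Models `size(event_user_is_bot_by) = 0`:
  // true only for a non-null, empty array.
  def isNotBot(botBy: BotBy): Boolean = botBy.exists(_.isEmpty)

  // Models `array_contains(event_user_is_bot_by, method)`:
  // true only for a non-null array holding the given label.
  def detectedBy(botBy: BotBy, method: String): Boolean =
    botBy.exists(_.contains(method))
}
```

For example, of the five distinct values in the first result, only [] passes isNotBot, while detectedBy(..., "group") matches both [group] and [name, group].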
Change 504025 merged by Fdans:
[analytics/refinery/source@master] Update mediawiki-history user bot fields