In T220507#5129134, @Neil_P._Quinn_WMF wrote:

I agree with the overall philosophy of being very explicit and precise in this dataset, but I do still wonder if it's necessary to provide registration timestamps from both user and logging. It's a big deal that those timestamps differ for 37% of users, but how big are those differences? If it's usually just a matter of seconds, then it doesn't seem necessary to pay the complexity cost.

Apr 23 2019, 9:58 AM · Analytics-Kanban, Analytics

JAllemandou added a comment to T167608: Add caused_by_user_text to mediawiki_page_history.

@Nuria: The caused_by_user_text field contains the event-performer user_text so additional_infois not accurate enough IMO. We could use a complex structure for caused_by given that we have user_id, user_text and event_type, but I'm not sure if it makes things easier.

Apr 23 2019, 8:38 AM · Analytics-Kanban, Analytics

JAllemandou added a comment to T221460: Remove dead code from refinery/oozie folders.

In T221460#5128400, @fdans wrote:

We have to remove mediawiki_history_druid

Apr 23 2019, 8:30 AM · Patch-For-Review, Analytics-Kanban, Analytics

JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping driving this :)

Apr 23 2019, 8:24 AM · Analytics-Radar, WikiCite, Dumps-Generation, Wikidata

JAllemandou added a comment to T221482: Identify imported revisions in mediawiki_history.

Super good idea and good presentation of the difficulty :)
Maybe one day ;)

Apr 23 2019, 8:22 AM · Data-Engineering, Product-Analytics

JAllemandou added a comment to T212172: Provide feature parity between the wiki replicas and the Analytics Data Lake.

Regarding the quick-lookups, I suggest using spark in shell mode (whether in python or in scala):

Extract the subset of data you're after and register it as a temporary table (spark.sql("SELECT * from wmf.mediawiki_history WHERE snapshot = '2019-03' and wiki_db = 'mywiki' and page_title = 'a title'").createOrReplaceTempView("myview")
Cache the view for fast access: spark.table("myview").cache()
Access the data as needed: spark.sql("SELECT count(1) from myview").show()

With the above solution, the first access (reading data and caching the table) takes some time (a few minutes max I'd say) then other requests to myview are subsecond.

Apr 23 2019, 8:19 AM · Epic, Analytics, Product-Analytics

JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

I'm not working on this (yet?) - Seems event related.

Apr 23 2019, 7:30 AM · Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Event-Platform, Growth-Team-Filtering, Analytics-Kanban, Growth-Team, Product-Analytics, Analytics

Apr 22 2019

nshahquinn-wmf awarded T219177: Add user_is_bot_by to MediaWiki history a Love token.

Apr 22 2019, 8:10 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

Apr 19 2019

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

The user part of this task is in testing with the datasource located at hdfs:///user/joal/wmf/data/wmf/mediawiki/history/snaphsot=2019-03 and hdfs:///user/joal/wmf/data/wmf/mediawiki/user_history/snaphsot=2019-03 (along with a bunch of other changes).

Apr 19 2019, 3:33 PM · Analytics-Kanban, Analytics

JAllemandou moved T221460: Remove dead code from refinery/oozie folders from Next Up to In Code Review on the Analytics-Kanban board.

Apr 19 2019, 3:31 PM · Patch-For-Review, Analytics-Kanban, Analytics

JAllemandou claimed T221460: Remove dead code from refinery/oozie folders.

Apr 19 2019, 3:30 PM · Patch-For-Review, Analytics-Kanban, Analytics

JAllemandou created T221460: Remove dead code from refinery/oozie folders.

Apr 19 2019, 3:29 PM · Patch-For-Review, Analytics-Kanban, Analytics

JAllemandou added a comment to T205594: mediawiki_history missing page events.

Checking improvements in new datsource.

// Current datasource - normally the problem is present in here
spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh_old")

Apr 19 2019, 12:03 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics

JAllemandou moved T211950: Add partial blocks to mediawiki history tables from Next Up to Ready to Deploy on the Analytics-Kanban board.

Apr 19 2019, 11:20 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics

JAllemandou claimed T211950: Add partial blocks to mediawiki history tables.

Apr 19 2019, 11:19 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics

JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+

event_user_is_bot_by

+--------------------+----------+

[name]	306912801
[]	2597239764
null	491818452
[group]	169490265
[name, group]	1289385512

+--------------------+----------+

Apr 19 2019, 11:15 AM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou moved T210844: Generate article recommendations in Hadoop for use in production from Ready to Deploy to In Code Review on the Analytics-Kanban board.

Apr 19 2019, 11:09 AM · Analytics-Radar, Article-Recommendation

JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from In Progress to Ready to Deploy on the Analytics-Kanban board.

Apr 19 2019, 10:20 AM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou added a comment to T167608: Add caused_by_user_text to mediawiki_page_history.

spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+

caused_by_user_text

+--------------------------+--------+

null	54186484
Lsjbot	17517304
Research Bot	16512521
TuanminhBot	11023605
Meta-Wiki Welcome	8611228
Sk!dbot	7342837
Wikimedia Commons Welcome	7243842
GZWDer (flood)	6897283
Dcirovicbot	6656209
Bot-Jagwar	5984240
Fæ	4883049
QuickStatementsBot	4423838
Maintenance script	2967554
New user message	2678820
Wikinews Welcome	2599072
Welcoming Bot	2527432
Loveless	2404348
Panoramio upload bot	2312489
MediaWiki message delivery	1889428
Liangent-bot	1743545

+--------------------------+--------+

Apr 19 2019, 10:18 AM · Analytics-Kanban, Analytics

JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

Data is available in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Apr 19 2019, 10:11 AM · Analytics-Kanban, Patch-For-Review, Analytics

JAllemandou moved T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history from Next Up to Ready to Deploy on the Analytics-Kanban board.

Apr 19 2019, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

JAllemandou claimed T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Apr 19 2019, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

JAllemandou added a project to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history: Analytics-Kanban.

Apr 19 2019, 9:58 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Apr 19 2019, 9:56 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics

JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Hi @Neil_P._Quinn_WMF, sorry for the big comment above - Do you mind having a look and confirming this looks ok for you? Many thanks :)

Apr 19 2019, 9:50 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val oudf = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")

Apr 19 2019, 9:49 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

Apr 17 2019

JAllemandou added a comment to T161149: Provide edit tags in the Data Lake edit data.

Hi @Neil_P._Quinn_WMF - Test data is available :)
Here is an example of accessing it in scala-spark2:

val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
history.where("event_entity = 'revision' and wiki_db = 'enwiki' and revision_tags is not null and size(revision_tags) > 0").select("event_timestamp", "revision_id", "revision_tags").show(100, false)

Apr 17 2019, 4:06 PM · Analytics-Kanban, Analytics

JAllemandou added a comment to T211950: Add partial blocks to mediawiki history tables.

Hi @nettrom_WMF - I have a test dataset for you that include this data (example in scala-spark2:

val user_history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
user_history..where("caused_by_event_type = 'alterblocks' and wiki_db = 'itwiki' and start_timestamp like '2019-01%' and source_log_params['sitewide'] = 'false' and source_log_params['7::restrictions'] is not null").select("source_log_params").show(100, false)

Apr 17 2019, 3:58 PM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics

JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Actually it'll not finish - We just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry for that :( Hopefully you'll still have enough logs.

Apr 17 2019, 2:23 PM · Analytics-Radar, Article-Recommendation

JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach in term of cluster usage is not sustainable. Particularly, generating the pageviews datasets for the top 50 languages should be done in a single pass, and written in partitioned folders (using partitionBy). Currently, there are 50 runs each reading the whole pageview needed data multiple times - This represents reading 50 *2 *1Tb of data. Doing it in one pass will save a lot of IOs as well as a lot of time :)

Apr 17 2019, 2:06 PM · Analytics-Radar, Article-Recommendation

JAllemandou added a comment to T206894: Set up automated email to report completion of mediawiki_history snapshot and Druid loading.

@Neil_P._Quinn_WMF : Indeed the validation job failed for expected reasons (a higher than usual group-bot-removal, a dimension against which our validation is not very stable). We restarted the job manually with higher threshold: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0171576-181112144035577-oozie-oozi-C
In this job, the email was configured to be sent to multiple emails, but it seems it failed (I didn't received it either): https://hue.wikimedia.org/oozie/list_oozie_workflow_action/0171577-181112144035577-oozie-oozi-W%40send_success_email/?coordinator_job_id=0171576-181112144035577-oozie-oozi-C.
Let's double check again next month if the email gets sent.

Apr 17 2019, 12:55 PM · Analytics-Kanban, Analytics, Contributors-Analysis, Product-Analytics

Apr 16 2019

JAllemandou created T221114: Improve mediawiki-user-history bot-by-name regex.

Apr 16 2019, 5:18 PM · Data-Engineering-Icebox, Analytics

Apr 15 2019

JAllemandou renamed T219177: Add user_is_bot_by to MediaWiki history from Add user_is_bot_by_group to MediaWiki history to Add user_is_bot_by to MediaWiki history.

Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from Next Up to In Progress on the Analytics-Kanban board.

Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou claimed T219177: Add user_is_bot_by to MediaWiki history.

Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

Here is the definition we agred on with @Milimetric: removal of user_is_bot_by_name boolean field and addition of 2 new fields: user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. They can contain 2 different values as of now: group (when the user is in bot group) and name_regex (if the username contains bot). Having an array also allows us for possible new methods (machine learning?).

Apr 15 2019, 2:02 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

I did a quick analysis using Spark on user data after @nettrom_WMF comment:

import com.databricks.spark.avro._
// user table data
val u = spark.read.avro("/wmf/data/raw/mediawiki/tables/user/snapshot=2019-03").
        select("wiki_db", "user_id", "user_registration")
// logging table data
val l = spark.read.avro("/wmf/data/raw/mediawiki/tables/logging/snapshot=2019-03").
        where("log_type = 'newusers' and log_user is not null and log_user > 0").
        select("wiki_db", "log_user", "log_timestamp")
// joined data on wiki_db and user_id
val j = u.join(l, u("wiki_db") === l("wiki_db") && u("user_id") === l("log_user")).cache()

Apr 15 2019, 8:01 AM · Analytics-Kanban, Analytics

Apr 12 2019

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Thanks for your comment @nettrom_WMF - I should have explained the plan more thoroughtly.
In the next changes for mediawiki-history, we will add fields for pages and users, ending up in having pageCreationTimestamp and pageFirstEditTimestamp coherent by page_id for each page-event, and similarly for users.
the precise definition of how those values is as follow:

pageCreationTimestamp - Timestamp of the page-create event in logging table if it exists, null otherwise.
pageFirstEditTimestamp - Timestamp of the oldest revision associated to the page (by page_id), whether in revision or archive table.
userCreationTimestamp - oldest from user_registration (in user-table) and user-create event in logging table (if both exist, otherwise the existing one if only one exist, otherwise null).
userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.

Apr 12 2019, 11:24 AM · Analytics-Kanban, Analytics

Apr 11 2019

JAllemandou moved T220111: Refactor druid data deletion script from In Progress to In Code Review on the Analytics-Kanban board.

Apr 11 2019, 1:38 PM · Analytics-Kanban, Analytics

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation this end of week to have it tested next week :)

Apr 11 2019, 10:11 AM · Analytics-Kanban, Analytics

Apr 9 2019

JAllemandou updated the task description for T220507: Decide: start_timestamp for mediawiki history.

Apr 9 2019, 2:36 PM · Analytics-Kanban, Analytics

Apr 8 2019

JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for, that's great :)

Apr 8 2019, 12:26 PM · Shape Expressions Sprint 5, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), User-Michael, WMDE-Analytics-Engineering, Wikidata

JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

I did a quick analysis over request-patterns: 94% of edits-per-page requests made on April 4th were on a timespan of more than 1 year with mostly daily granularity. I have made the following patch to restrict timespan of per-page requests to 1 year: https://gerrit.wikimedia.org/r/c/analytics/aqs/+/502198

Apr 8 2019, 11:35 AM · Patch-For-Review, Analytics-Kanban, Analytics

Apr 5 2019

JAllemandou updated the task description for T220111: Refactor druid data deletion script.

Apr 5 2019, 4:48 PM · Analytics-Kanban, Analytics

JAllemandou renamed T220111: Refactor druid data deletion script from Fix druid-public drop-snapshot script to Refactor druid data deletion script.

Apr 5 2019, 4:46 PM · Analytics-Kanban, Analytics

JAllemandou claimed T220111: Refactor druid data deletion script.

Apr 5 2019, 11:59 AM · Analytics-Kanban, Analytics

JAllemandou moved T220111: Refactor druid data deletion script from Next Up to In Progress on the Analytics-Kanban board.

Apr 5 2019, 11:59 AM · Analytics-Kanban, Analytics

Apr 4 2019

JAllemandou moved T220012: Enable pagecount-ez cron on stats boxes from Next Up to Done on the Analytics-Kanban board.

Apr 4 2019, 3:58 PM · Analytics, Analytics-Kanban

JAllemandou created T220111: Refactor druid data deletion script.

Apr 4 2019, 1:59 PM · Analytics-Kanban, Analytics

JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Reading about this - Would delayed data be interesting? This information is accessible in hadoop :)

Apr 4 2019, 10:44 AM · Shape Expressions Sprint 5, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), User-Michael, WMDE-Analytics-Engineering, Wikidata

JAllemandou added a comment to T204965: Create report for "articles with most contributors" in Wikistats2.

Apr 4 2019, 10:39 AM · Data-Engineering, Patch-For-Review, Data-Engineering-Wikistats

JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

In T219910#5083908, @Nuria wrote:

@JAllemandou the title of this graph looks like it needs changing?

https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-1d%2Fd&to=now-1d%2Fd