
Apr 25 2019

JAllemandou added a parent task for T221114: Improve mediawiki-user-history bot-by-name regex: T221828: Mediawiki-history release - Backlog.
Apr 25 2019, 8:23 AM · Data-Engineering-Icebox, Analytics
JAllemandou added a subtask for T221828: Mediawiki-history release - Backlog: T221482: Identify imported revisions in mediawiki_history.
Apr 25 2019, 8:22 AM · Data-Engineering-Icebox, Analytics
JAllemandou added a parent task for T221482: Identify imported revisions in mediawiki_history: T221828: Mediawiki-history release - Backlog.
Apr 25 2019, 8:22 AM · Data-Engineering, Product-Analytics
JAllemandou created T221828: Mediawiki-history release - Backlog.
Apr 25 2019, 8:20 AM · Data-Engineering-Icebox, Analytics
JAllemandou added subtasks for T221825: Mediawiki-history release - Snapshot 2019-06: T221482: Identify imported revisions in mediawiki_history, T221114: Improve mediawiki-user-history bot-by-name regex, T218130: Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text.
Apr 25 2019, 8:18 AM · Analytics-Kanban, Analytics
JAllemandou added a parent task for T218130: Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:18 AM · Data-Engineering-Icebox, Analytics
JAllemandou added a parent task for T221482: Identify imported revisions in mediawiki_history: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:18 AM · Data-Engineering, Product-Analytics
JAllemandou added a parent task for T221114: Improve mediawiki-user-history bot-by-name regex: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:18 AM · Data-Engineering-Icebox, Analytics
JAllemandou added a parent task for T190434: Issues with page deleted dates on data lake : T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Patch-For-Review, Analytics, Analytics-Kanban
JAllemandou added subtasks for T221825: Mediawiki-history release - Snapshot 2019-06: T221338: Many revision events in mediawiki_history have missing page and namespace information, T218824: A few alterblocks events have event_timestamps from before 2001, T190434: Issues with page deleted dates on data lake , T214490: page_creation_timestamp not always correct in mediawiki_history, T205594: mediawiki_history missing page events, T211627: Mediawiki history has no data on IP blocks.
Apr 25 2019, 8:17 AM · Analytics-Kanban, Analytics
JAllemandou added a parent task for T211627: Mediawiki history has no data on IP blocks: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Data-Engineering, Anti-Harassment, Product-Analytics
JAllemandou added a parent task for T205594: mediawiki_history missing page events: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou added a parent task for T214490: page_creation_timestamp not always correct in mediawiki_history: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a parent task for T218824: A few alterblocks events have event_timestamps from before 2001: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Data-Engineering, Product-Analytics
JAllemandou added a parent task for T221338: Many revision events in mediawiki_history have missing page and namespace information: T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:17 AM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Product-Analytics
JAllemandou claimed T218824: A few alterblocks events have event_timestamps from before 2001.
Apr 25 2019, 8:14 AM · Data-Engineering, Product-Analytics
JAllemandou claimed T221338: Many revision events in mediawiki_history have missing page and namespace information.
Apr 25 2019, 8:13 AM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Product-Analytics
JAllemandou moved T190434: Issues with page deleted dates on data lake from In Code Review to In Progress on the Analytics-Kanban board.
Apr 25 2019, 8:11 AM · Patch-For-Review, Analytics, Analytics-Kanban
JAllemandou moved T214490: page_creation_timestamp not always correct in mediawiki_history from In Code Review to In Progress on the Analytics-Kanban board.
Apr 25 2019, 8:11 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou moved T205594: mediawiki_history missing page events from In Code Review to In Progress on the Analytics-Kanban board.
Apr 25 2019, 8:11 AM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou added a subtask for T221824: Mediawiki History Release - 2019-04 snapshot: T220456: Many small wikis missing from mediawiki_history dataset.
Apr 25 2019, 8:08 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a parent task for T220456: Many small wikis missing from mediawiki_history dataset: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 8:08 AM · Patch-For-Review, Analytics-Kanban, Analytics-Data-Quality, Analytics, Product-Analytics
JAllemandou updated subscribers of T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 8:06 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou created T221825: Mediawiki-history release - Snapshot 2019-06.
Apr 25 2019, 8:05 AM · Analytics-Kanban, Analytics
JAllemandou moved T221824: Mediawiki History Release - 2019-04 snapshot from Next Up to In Code Review on the Analytics-Kanban board.
Apr 25 2019, 8:02 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou claimed T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 8:02 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou updated the task description for T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 8:01 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a parent task for T161149: Provide edit tags in the Data Lake edit data: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:54 AM · Analytics-Kanban, Analytics
JAllemandou added a parent task for T178587: Update wikimedia-history revision data with deleted field (and find it a new name?): T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:54 AM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou added subtasks for T221824: Mediawiki History Release - 2019-04 snapshot: T167608: Add caused_by_user_text to mediawiki_page_history, T206883: mediawiki_history datasets have null user_text for IP edits, T161149: Provide edit tags in the Data Lake edit data, T178587: Update wikimedia-history revision data with deleted field (and find it a new name?), T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history, T219177: Add user_is_bot_by to MediaWiki history, T211950: Add partial blocks to mediawiki history tables, T213603: Coordinate work on minor changes for Edit Data Quality.
Apr 25 2019, 7:54 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a parent task for T213603: Coordinate work on minor changes for Edit Data Quality: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:54 AM · Patch-For-Review, Analytics-Kanban
JAllemandou added a parent task for T167608: Add caused_by_user_text to mediawiki_page_history: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:54 AM · Analytics-Kanban, Analytics
JAllemandou added a parent task for T211950: Add partial blocks to mediawiki history tables: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:53 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a parent task for T206883: mediawiki_history datasets have null user_text for IP edits: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:53 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a parent task for T219177: Add user_is_bot_by to MediaWiki history: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:53 AM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou added a parent task for T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history: T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:53 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou moved T213603: Coordinate work on minor changes for Edit Data Quality from In Progress to Ready to Deploy on the Analytics-Kanban board.
Apr 25 2019, 7:53 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T219484: Fix mediawiki-history-checker after field rename from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Apr 25 2019, 7:52 AM · Patch-For-Review, Analytics-Kanban
JAllemandou created T221824: Mediawiki History Release - 2019-04 snapshot.
Apr 25 2019, 7:49 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics

Apr 23 2019

JAllemandou added a comment to T94019: Generate RDF from JSON.

The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization.

Apr 23 2019, 4:30 PM · Patch-Needs-Improvement, [DEPRECATED] wdwb-tech, Wikidata
JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.
In T220507#5129134, @Neil_P._Quinn_WMF wrote:

I agree with the overall philosophy of being very explicit and precise in this dataset, but I do still wonder if it's necessary to provide registration timestamps from both user and logging. It's a big deal that those timestamps differ for 37% of users, but how big are those differences? If it's usually just a matter of seconds, then it doesn't seem necessary to pay the complexity cost.

Apr 23 2019, 9:58 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T167608: Add caused_by_user_text to mediawiki_page_history.

@Nuria: The caused_by_user_text field contains the event-performer user_text, so additional_info is not accurate enough IMO. We could use a complex structure for caused_by given that we have user_id, user_text and event_type, but I'm not sure it makes things easier.

Apr 23 2019, 8:38 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T221460: Remove dead code from refinery/oozie folders.

We have to remove mediawiki_history_druid

Apr 23 2019, 8:30 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping drive this :)

Apr 23 2019, 8:24 AM · Analytics-Radar, WikiCite, Dumps-Generation, Wikidata
JAllemandou added a comment to T221482: Identify imported revisions in mediawiki_history.

Super good idea and good presentation of the difficulty :)
Maybe one day ;)

Apr 23 2019, 8:22 AM · Data-Engineering, Product-Analytics
JAllemandou added a comment to T212172: Provide feature parity between the wiki replicas and the Analytics Data Lake.

Regarding the quick-lookups, I suggest using spark in shell mode (whether in python or in scala):

  • Extract the subset of data you're after and register it as a temporary table: spark.sql("SELECT * from wmf.mediawiki_history WHERE snapshot = '2019-03' and wiki_db = 'mywiki' and page_title = 'a title'").createOrReplaceTempView("myview")
  • Cache the view for fast access: spark.table("myview").cache()
  • Access the data as needed: spark.sql("SELECT count(1) from myview").show()

With the above solution, the first access (reading data and caching the table) takes some time (a few minutes max I'd say) then other requests to myview are subsecond.

Apr 23 2019, 8:19 AM · Epic, Analytics, Product-Analytics
JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

I'm not working on this (yet?) - Seems event related.

Apr 23 2019, 7:30 AM · Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Event-Platform, Growth-Team-Filtering, Analytics-Kanban, Growth-Team, Product-Analytics, Analytics

Apr 22 2019

nshahquinn-wmf awarded T219177: Add user_is_bot_by to MediaWiki history a Love token.
Apr 22 2019, 8:10 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics

Apr 19 2019

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

The user part of this task is in testing with the datasource located at hdfs:///user/joal/wmf/data/wmf/mediawiki/history/snaphsot=2019-03 and hdfs:///user/joal/wmf/data/wmf/mediawiki/user_history/snaphsot=2019-03 (along with a bunch of other changes).

Apr 19 2019, 3:33 PM · Analytics-Kanban, Analytics
JAllemandou moved T221460: Remove dead code from refinery/oozie folders from Next Up to In Code Review on the Analytics-Kanban board.
Apr 19 2019, 3:31 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T221460: Remove dead code from refinery/oozie folders.
Apr 19 2019, 3:30 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T221460: Remove dead code from refinery/oozie folders.
Apr 19 2019, 3:29 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T205594: mediawiki_history missing page events.

Checking improvements in new datasource.

// Current datasource - normally the problem is present in here
spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh_old")
Apr 19 2019, 12:03 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou moved T211950: Add partial blocks to mediawiki history tables from Next Up to Ready to Deploy on the Analytics-Kanban board.
Apr 19 2019, 11:20 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou claimed T211950: Add partial blocks to mediawiki history tables.
Apr 19 2019, 11:19 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+

Apr 19 2019, 11:15 AM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou moved T210844: Generate article recommendations in Hadoop for use in production from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Apr 19 2019, 11:09 AM · Analytics-Radar, Article-Recommendation
JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from In Progress to Ready to Deploy on the Analytics-Kanban board.
Apr 19 2019, 10:20 AM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou added a comment to T167608: Add caused_by_user_text to mediawiki_page_history.

spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+
|caused_by_user_text       |c       |
+--------------------------+--------+
|null                      |54186484|
|Lsjbot                    |17517304|
|Research Bot              |16512521|
|TuanminhBot               |11023605|
|Meta-Wiki Welcome         |8611228 |
|Sk!dbot                   |7342837 |
|Wikimedia Commons Welcome |7243842 |
|GZWDer (flood)            |6897283 |
|Dcirovicbot               |6656209 |
|Bot-Jagwar                |5984240 |
|                          |4883049 |
|QuickStatementsBot        |4423838 |
|Maintenance script        |2967554 |
|New user message          |2678820 |
|Wikinews Welcome          |2599072 |
|Welcoming Bot             |2527432 |
|Loveless                  |2404348 |
|Panoramio upload bot      |2312489 |
|MediaWiki message delivery|1889428 |
|Liangent-bot              |1743545 |
+--------------------------+--------+

Apr 19 2019, 10:18 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

Data is available in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Apr 19 2019, 10:11 AM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou moved T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history from Next Up to Ready to Deploy on the Analytics-Kanban board.
Apr 19 2019, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou claimed T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.
Apr 19 2019, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a project to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history: Analytics-Kanban.
Apr 19 2019, 9:58 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Apr 19 2019, 9:56 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Hi @Neil_P._Quinn_WMF, sorry for the big comment above - Do you mind having a look and confirming this looks OK to you? Many thanks :)

Apr 19 2019, 9:50 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val oudf = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
Apr 19 2019, 9:49 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

Apr 17 2019

JAllemandou added a comment to T161149: Provide edit tags in the Data Lake edit data.

Hi @Neil_P._Quinn_WMF - Test data is available :)
Here is an example of accessing it in scala-spark2:

val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
history.where("event_entity = 'revision' and wiki_db = 'enwiki' and revision_tags is not null and size(revision_tags) > 0").select("event_timestamp", "revision_id", "revision_tags").show(100, false)
Apr 17 2019, 4:06 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T211950: Add partial blocks to mediawiki history tables.

Hi @nettrom_WMF - I have a test dataset for you that includes this data (example in scala-spark2):

val user_history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
user_history.where("caused_by_event_type = 'alterblocks' and wiki_db = 'itwiki' and start_timestamp like '2019-01%' and source_log_params['sitewide'] = 'false' and source_log_params['7::restrictions'] is not null").select("source_log_params").show(100, false)
Apr 17 2019, 3:58 PM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Actually it'll not finish - We just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry for that :( Hopefully you'll still have enough logs.

Apr 17 2019, 2:23 PM · Analytics-Radar, Article-Recommendation
JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach in terms of cluster usage is not sustainable. In particular, generating the pageview datasets for the top 50 languages should be done in a single pass and written to partitioned folders (using partitionBy). Currently, there are 50 runs, each reading the whole needed pageview data multiple times - this represents reading 50 * 2 * 1 TB of data. Doing it in one pass will save a lot of IO as well as a lot of time :)
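
A minimal sketch of the single-pass idea, for illustration only: it assumes Spark is available, and the object name, function name and column names (language, page_title, view_count) are hypothetical and not taken from the recommender's code. The input is read once, filtered to the wanted languages, aggregated, and written with partitionBy so that each language lands in its own sub-folder.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

object SinglePassPageviews {
  // Aggregates views per (language, page_title) for all requested languages
  // in one pass over the input, then writes one sub-folder per language
  // (language=<value>) under outputPath.
  // Column names are assumptions made for this sketch.
  def writePerLanguage(pageviews: DataFrame, languages: Seq[String], outputPath: String): Unit = {
    pageviews
      .where(col("language").isin(languages: _*))
      .groupBy("language", "page_title")
      .agg(sum("view_count").as("view_count"))
      .write
      .partitionBy("language")
      .mode("overwrite")
      .parquet(outputPath)
  }
}
```

A job for a single language can then read only its own folder (outputPath/language=<value>) instead of scanning the full pageview data again.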

Apr 17 2019, 2:06 PM · Analytics-Radar, Article-Recommendation
JAllemandou added a comment to T206894: Set up automated email to report completion of mediawiki_history snapshot and Druid loading.

@Neil_P._Quinn_WMF : Indeed the validation job failed for expected reasons (a higher than usual group-bot-removal, a dimension against which our validation is not very stable). We restarted the job manually with a higher threshold: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0171576-181112144035577-oozie-oozi-C
In this job, the email was configured to be sent to multiple addresses, but it seems it failed (I didn't receive it either): https://hue.wikimedia.org/oozie/list_oozie_workflow_action/0171577-181112144035577-oozie-oozi-W%40send_success_email/?coordinator_job_id=0171576-181112144035577-oozie-oozi-C.
Let's double check again next month if the email gets sent.

Apr 17 2019, 12:55 PM · Analytics-Kanban, Analytics, Contributors-Analysis, Product-Analytics

Apr 16 2019

JAllemandou created T221114: Improve mediawiki-user-history bot-by-name regex.
Apr 16 2019, 5:18 PM · Data-Engineering-Icebox, Analytics

Apr 15 2019

JAllemandou renamed T219177: Add user_is_bot_by to MediaWiki history from Add user_is_bot_by_group to MediaWiki history to Add user_is_bot_by to MediaWiki history.
Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from Next Up to In Progress on the Analytics-Kanban board.
Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou claimed T219177: Add user_is_bot_by to MediaWiki history.
Apr 15 2019, 2:03 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

Here is the definition we agreed on with @Milimetric: removal of the user_is_bot_by_name boolean field and addition of 2 new fields: user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. They can contain 2 different values as of now: group (when the user is in the bot group) and name_regex (if the username contains bot). Having an array also leaves room for possible new methods (machine learning?).
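
As an illustration of that definition, here is a small plain-Scala sketch. The object name, function name and regex are hypothetical (the comment only says "the username contains bot"), and this is not the actual mediawiki-history code.

```scala
object BotBy {
  // Hypothetical pattern: case-insensitive "contains bot".
  private val botNamePattern = "(?i).*bot.*".r

  // Returns the methods by which a user is flagged as a bot.
  // The value names "group" and "name_regex" follow the definition above;
  // an empty result means no method flagged the user.
  def userIsBotBy(userText: String, userGroups: Seq[String]): Seq[String] = {
    val byName =
      if (botNamePattern.pattern.matcher(userText).matches()) Seq("name_regex") else Seq.empty[String]
    val byGroup =
      if (userGroups.contains("bot")) Seq("group") else Seq.empty[String]
    byName ++ byGroup
  }
}
```

With an array, a new detection method would only add one more possible value rather than one more boolean column.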

Apr 15 2019, 2:02 PM · Patch-For-Review, Analytics-Kanban, Data-Engineering-Wikistats, Analytics
JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

I did a quick analysis using Spark on user data after @nettrom_WMF's comment:

import com.databricks.spark.avro._
// user table data
val u = spark.read.avro("/wmf/data/raw/mediawiki/tables/user/snapshot=2019-03").
        select("wiki_db", "user_id", "user_registration")
// logging table data
val l = spark.read.avro("/wmf/data/raw/mediawiki/tables/logging/snapshot=2019-03").
        where("log_type = 'newusers' and log_user is not null and log_user > 0").
        select("wiki_db", "log_user", "log_timestamp")
// joined data on wiki_db and user_id
val j = u.join(l, u("wiki_db") === l("wiki_db") && u("user_id") === l("log_user")).cache()
Apr 15 2019, 8:01 AM · Analytics-Kanban, Analytics

Apr 12 2019

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Thanks for your comment @nettrom_WMF - I should have explained the plan more thoroughly.
In the next changes for mediawiki-history, we will add fields for pages and users, so that pageCreationTimestamp and pageFirstEditTimestamp are coherent by page_id for each page-event, and similarly for users.
The precise definition of those values is as follows:

  • pageCreationTimestamp - Timestamp of the page-create event in logging table if it exists, null otherwise.
  • pageFirstEditTimestamp - Timestamp of the oldest revision associated to the page (by page_id), whether in revision or archive table.
  • userCreationTimestamp - oldest of user_registration (in user-table) and user-create event in logging table (if both exist, otherwise the existing one if only one exists, otherwise null).
  • userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.
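
The userCreationTimestamp rule above can be sketched in plain Scala as follows. This is illustrative only: the names are hypothetical, Option stands in for nullable values, and both timestamps are assumed to be strings in the same zero-padded format (such as MediaWiki's yyyyMMddHHmmss), so that string order matches time order.

```scala
object CreationTimestamps {
  // Oldest of the two optional timestamps: if both are defined the smaller
  // one wins, if only one is defined it is returned, and None (standing in
  // for null) is returned when neither exists.
  def userCreationTimestamp(
      userRegistration: Option[String],
      loggingCreate: Option[String]
  ): Option[String] =
    (userRegistration.toSeq ++ loggingCreate.toSeq).sorted.headOption
}
```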
Apr 12 2019, 11:24 AM · Analytics-Kanban, Analytics

Apr 11 2019

JAllemandou moved T220111: Refactor druid data deletion script from In Progress to In Code Review on the Analytics-Kanban board.
Apr 11 2019, 1:38 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation at the end of this week to have it tested next week :)

Apr 11 2019, 10:11 AM · Analytics-Kanban, Analytics

Apr 9 2019

JAllemandou updated the task description for T220507: Decide: start_timestamp for mediawiki history.
Apr 9 2019, 2:36 PM · Analytics-Kanban, Analytics

Apr 8 2019

JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :)

Apr 8 2019, 12:26 PM · Shape Expressions Sprint 5, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), User-Michael, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

I did a quick analysis over request-patterns: 94% of edits-per-page requests made on April 4th were on a timespan of more than 1 year with mostly daily granularity. I have made the following patch to restrict timespan of per-page requests to 1 year: https://gerrit.wikimedia.org/r/c/analytics/aqs/+/502198

Apr 8 2019, 11:35 AM · Patch-For-Review, Analytics-Kanban, Analytics

Apr 5 2019

JAllemandou updated the task description for T220111: Refactor druid data deletion script.
Apr 5 2019, 4:48 PM · Analytics-Kanban, Analytics
JAllemandou renamed T220111: Refactor druid data deletion script from Fix druid-public drop-snapshot script to Refactor druid data deletion script.
Apr 5 2019, 4:46 PM · Analytics-Kanban, Analytics
JAllemandou claimed T220111: Refactor druid data deletion script.
Apr 5 2019, 11:59 AM · Analytics-Kanban, Analytics
JAllemandou moved T220111: Refactor druid data deletion script from Next Up to In Progress on the Analytics-Kanban board.
Apr 5 2019, 11:59 AM · Analytics-Kanban, Analytics

Apr 4 2019

JAllemandou moved T220012: Enable pagecount-ez cron on stats boxes from Next Up to Done on the Analytics-Kanban board.
Apr 4 2019, 3:58 PM · Analytics, Analytics-Kanban
JAllemandou created T220111: Refactor druid data deletion script.
Apr 4 2019, 1:59 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Reading about this - Would delayed data be interesting? This information is accessible in hadoop :)

Apr 4 2019, 10:44 AM · Shape Expressions Sprint 5, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), User-Michael, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T204965: Create report for "articles with most contributors" in Wikistats2.
Apr 4 2019, 10:39 AM · Data-Engineering, Patch-For-Review, Data-Engineering-Wikistats
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.
Apr 4 2019, 7:29 AM · Patch-For-Review, Analytics-Kanban, Analytics

Apr 3 2019

JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

Shall I send a PR updating rate-limiting in restbase for edits-per-page requests to 10?
https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L1551

Apr 3 2019, 5:17 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T215550: Test sqooping from the new dedicated labsdb host from Ready to Deploy to Done on the Analytics-Kanban board.
Apr 3 2019, 2:20 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

I think the problem experienced yesterday could have been prevented by T189623.
Data backup:

Apr 3 2019, 12:32 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

We prefer the ISO-8601 strings for serialization everywhere.

Apr 3 2019, 12:17 PM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Analytics, Product-Analytics

Mar 28 2019

JAllemandou moved T219484: Fix mediawiki-history-checker after field rename from Next Up to In Code Review on the Analytics-Kanban board.
Mar 28 2019, 8:16 AM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T219484: Fix mediawiki-history-checker after field rename.
Mar 28 2019, 8:16 AM · Patch-For-Review, Analytics-Kanban
JAllemandou created T219484: Fix mediawiki-history-checker after field rename.
Mar 28 2019, 8:12 AM · Patch-For-Review, Analytics-Kanban

Mar 27 2019

JAllemandou moved T167608: Add caused_by_user_text to mediawiki_page_history from In Progress to Ready to Deploy on the Analytics-Kanban board.
Mar 27 2019, 4:28 PM · Analytics-Kanban, Analytics