JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (218 w, 4 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Fri, Apr 19

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

The user part of this task is in testing with the datasource located at hdfs:///user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03 and hdfs:///user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03 (along with a bunch of other changes).

Fri, Apr 19, 3:33 PM · Analytics-Kanban, Analytics
JAllemandou moved T221460: Remove dead code from refinery/oozie folders from Next Up to In Code Review on the Analytics-Kanban board.
Fri, Apr 19, 3:31 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T221460: Remove dead code from refinery/oozie folders.
Fri, Apr 19, 3:30 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T221460: Remove dead code from refinery/oozie folders.
Fri, Apr 19, 3:29 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T205594: mediawiki_history missing page events.

Checking improvements in new datasource.

// Current datasource - normally the problem is present in here
spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh_old")
Fri, Apr 19, 12:03 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou moved T211950: Add partial blocks to mediawiki history tables from Next Up to Ready to Deploy on the Analytics-Kanban board.
Fri, Apr 19, 11:20 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou claimed T211950: Add partial blocks to mediawiki history tables.
Fri, Apr 19, 11:19 AM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

spark.sql("select event_user_is_bot_by, count(1) as c from mwh group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+


Fri, Apr 19, 11:15 AM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T210844: Generate article recommendations in Hadoop for use in production from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Fri, Apr 19, 11:09 AM · Patch-For-Review, Analytics-Kanban, Article-Recommendation, Research, Analytics
JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from In Progress to Ready to Deploy on the Analytics-Kanban board.
Fri, Apr 19, 10:20 AM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou added a comment to T167608: Add caused_by_user_text to mediawiki_page_history.

spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+
|caused_by_user_text       |c       |
+--------------------------+--------+
|null                      |54186484|
|Lsjbot                    |17517304|
|Research Bot              |16512521|
|TuanminhBot               |11023605|
|Meta-Wiki Welcome         |8611228 |
|Sk!dbot                   |7342837 |
|Wikimedia Commons Welcome |7243842 |
|GZWDer (flood)            |6897283 |
|Dcirovicbot               |6656209 |
|Bot-Jagwar                |5984240 |
|                          |4883049 |
|QuickStatementsBot        |4423838 |
|Maintenance script        |2967554 |
|New user message          |2678820 |
|Wikinews Welcome          |2599072 |
|Welcoming Bot             |2527432 |
|Loveless                  |2404348 |
|Panoramio upload bot      |2312489 |
|MediaWiki message delivery|1889428 |
|Liangent-bot              |1743545 |
+--------------------------+--------+

Fri, Apr 19, 10:18 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

Data is available in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Fri, Apr 19, 10:11 AM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou moved T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history from Next Up to Ready to Deploy on the Analytics-Kanban board.
Fri, Apr 19, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou claimed T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.
Fri, Apr 19, 9:59 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a project to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history: Analytics-Kanban.
Fri, Apr 19, 9:58 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

Fri, Apr 19, 9:56 AM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Hi @Neil_P._Quinn_WMF, sorry for the big comment above - Do you mind having a look and confirming this looks ok for you? Many thanks :)

Fri, Apr 19, 9:50 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou added a comment to T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val oudf = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
Fri, Apr 19, 9:49 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics

Wed, Apr 17

JAllemandou added a comment to T161149: Provide edit tags in the Data Lake edit data.

Hi @Neil_P._Quinn_WMF - Test data is available :)
Here is an example of accessing it in scala-spark2:

val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
history.where("event_entity = 'revision' and wiki_db = 'enwiki' and revision_tags is not null and size(revision_tags) > 0").select("event_timestamp", "revision_id", "revision_tags").show(100, false)
Wed, Apr 17, 4:06 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T211950: Add partial blocks to mediawiki history tables.

Hi @nettrom_WMF - I have a test dataset for you that includes this data (example in scala-spark2):

val user_history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
user_history.where("caused_by_event_type = 'alterblocks' and wiki_db = 'itwiki' and start_timestamp like '2019-01%' and source_log_params['sitewide'] = 'false' and source_log_params['7::restrictions'] is not null").select("source_log_params").show(100, false)
Wed, Apr 17, 3:58 PM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Actually it'll not finish - We just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry for that :( Hopefully you'll still have enough logs.

Wed, Apr 17, 2:23 PM · Patch-For-Review, Analytics-Kanban, Article-Recommendation, Research, Analytics
JAllemandou added a comment to T210844: Generate article recommendations in Hadoop for use in production.

Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach is not sustainable in terms of cluster usage. In particular, generating the pageviews datasets for the top 50 languages should be done in a single pass and written to partitioned folders (using partitionBy). Currently, there are 50 runs, each reading the whole needed pageview data multiple times - this represents reading 50 * 2 * 1 TB of data. Doing it in one pass will save a lot of I/O as well as a lot of time :)

Wed, Apr 17, 2:06 PM · Patch-For-Review, Analytics-Kanban, Article-Recommendation, Research, Analytics
JAllemandou added a comment to T206894: Set up automated email to report completion of mediawiki_history snapshot and Druid loading.

@Neil_P._Quinn_WMF : Indeed the validation job failed for expected reasons (a higher than usual group-bot-removal, a dimension against which our validation is not very stable). We restarted the job manually with a higher threshold: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0171576-181112144035577-oozie-oozi-C
In this job, the email was configured to be sent to multiple addresses, but it seems it failed (I didn't receive it either): https://hue.wikimedia.org/oozie/list_oozie_workflow_action/0171577-181112144035577-oozie-oozi-W%40send_success_email/?coordinator_job_id=0171576-181112144035577-oozie-oozi-C.
Let's double check again next month if the email gets sent.

Wed, Apr 17, 12:55 PM · Patch-For-Review, Analytics-Kanban, Analytics, Contributors-Analysis, Product-Analytics

Tue, Apr 16

JAllemandou created T221114: Improve mediawiki-user-history bot-by-name regex.
Tue, Apr 16, 5:18 PM · Analytics

Mon, Apr 15

JAllemandou renamed T219177: Add user_is_bot_by to MediaWiki history from Add user_is_bot_by_group to MediaWiki history to Add user_is_bot_by to MediaWiki history.
Mon, Apr 15, 2:03 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T219177: Add user_is_bot_by to MediaWiki history from Next Up to In Progress on the Analytics-Kanban board.
Mon, Apr 15, 2:03 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou claimed T219177: Add user_is_bot_by to MediaWiki history.
Mon, Apr 15, 2:03 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou added a comment to T219177: Add user_is_bot_by to MediaWiki history.

Here is the definition we agreed on with @Milimetric: removal of the user_is_bot_by_name boolean field and addition of 2 new fields: user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. They can contain 2 different values as of now: group (when the user is in the bot group) and name_regex (if the username contains bot). Having an array also leaves room for possible new methods (machine learning?).
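
As a rough illustration of the definition above, here is a minimal plain-Scala sketch of how such an array could be built. It is not the refinery implementation: the object name, the function signature and the pattern are hypothetical. The pattern simply encodes "the username contains bot", which may differ from the regex actually used.

```scala
object BotBySketch {
  // Hypothetical pattern: case-insensitive "contains bot".
  // The real bot-by-name regex in refinery-source may be stricter.
  private val botNamePattern = "(?i).*bot.*".r.pattern

  // Returns the list of reasons for which a user is considered a bot.
  // An empty result means no bot signal was found.
  def userIsBotBy(userText: Option[String], userGroups: Seq[String]): Seq[String] = {
    val byName = userText.exists(name => botNamePattern.matcher(name).matches())
    val byGroup = userGroups.contains("bot")
    Seq(
      if (byName) Some("name_regex") else None,
      if (byGroup) Some("group") else None
    ).flatten
  }
}
```

With this sketch, a user named Lsjbot that is also in the bot group gets both values, while a user with no matching name and no bot group gets an empty array.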

Mon, Apr 15, 2:02 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

I did a quick analysis using Spark on user data after @nettrom_WMF comment:

import com.databricks.spark.avro._
// user table data
val u = spark.read.avro("/wmf/data/raw/mediawiki/tables/user/snapshot=2019-03").
        select("wiki_db", "user_id", "user_registration")
// logging table data
val l = spark.read.avro("/wmf/data/raw/mediawiki/tables/logging/snapshot=2019-03").
        where("log_type = 'newusers' and log_user is not null and log_user > 0").
        select("wiki_db", "log_user", "log_timestamp")
// joined data on wiki_db and user_id
val j = u.join(l, u("wiki_db") === l("wiki_db") && u("user_id") === l("log_user")).cache()
Mon, Apr 15, 8:01 AM · Analytics-Kanban, Analytics

Fri, Apr 12

JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Thanks for your comment @nettrom_WMF - I should have explained the plan more thoroughly.
In the next changes for mediawiki-history, we will add fields for pages and users, ending up with pageCreationTimestamp and pageFirstEditTimestamp coherent by page_id for each page-event, and similarly for users.
The precise definition of those values is as follows:

  • pageCreationTimestamp - Timestamp of the page-create event in logging table if it exists, null otherwise.
  • pageFirstEditTimestamp - Timestamp of the oldest revision associated to the page (by page_id), whether in revision or archive table.
  • userCreationTimestamp - oldest of user_registration (in user-table) and user-create event in logging table (if both exist; otherwise the existing one if only one exists; otherwise null).
  • userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.
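
The userCreationTimestamp rule above can be sketched in plain Scala as follows. This is only an illustration under assumptions: the object and function names are hypothetical, java.sql.Timestamp is assumed as the value type, and Option stands in for nullable fields.

```scala
import java.sql.Timestamp

object UserCreationSketch {
  // Oldest of the two candidates when both exist,
  // the existing one when only one exists, None (null) otherwise.
  def userCreationTimestamp(
      userRegistration: Option[Timestamp],
      userCreateEvent: Option[Timestamp]
  ): Option[Timestamp] =
    (userRegistration.toSeq ++ userCreateEvent.toSeq)
      .sortBy(_.getTime)
      .headOption
}
```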
Fri, Apr 12, 11:24 AM · Analytics-Kanban, Analytics

Thu, Apr 11

JAllemandou moved T220111: Refactor druid data deletion script from In Progress to In Code Review on the Analytics-Kanban board.
Thu, Apr 11, 1:38 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T220507: Decide: start_timestamp for mediawiki history.

Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation this end of week to have it tested next week :)

Thu, Apr 11, 10:11 AM · Analytics-Kanban, Analytics

Tue, Apr 9

JAllemandou updated the task description for T220507: Decide: start_timestamp for mediawiki history.
Tue, Apr 9, 2:36 PM · Analytics-Kanban, Analytics

Mon, Apr 8

JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Some queries are computed using hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :)

Mon, Apr 8, 12:26 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Shape Expressions Sprint 5, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

I did a quick analysis over request-patterns: 94% of edits-per-page requests made on April 4th were on a timespan of more than 1 year with mostly daily granularity. I have made the following patch to restrict timespan of per-page requests to 1 year: https://gerrit.wikimedia.org/r/c/analytics/aqs/+/502198
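
For readers unfamiliar with this kind of guard, here is a minimal sketch of a timespan check in plain Scala. It is not the AQS patch linked above: the names and the exact limit (366 days, to allow for leap years) are assumptions chosen for illustration.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object TimespanCheckSketch {
  // Assumed limit: one year, counted generously as 366 days.
  val MaxDays: Long = 366L

  // Accepts a request only if the end date is not before the start date
  // and the requested span does not exceed the limit.
  def isAllowed(start: LocalDate, end: LocalDate): Boolean = {
    val days = ChronoUnit.DAYS.between(start, end)
    days >= 0 && days <= MaxDays
  }
}
```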

Mon, Apr 8, 11:35 AM · Patch-For-Review, Analytics-Kanban, Analytics

Fri, Apr 5

JAllemandou updated the task description for T220111: Refactor druid data deletion script.
Fri, Apr 5, 4:48 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou renamed T220111: Refactor druid data deletion script from Fix druid-public drop-snapshot script to Refactor druid data deletion script.
Fri, Apr 5, 4:46 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T220111: Refactor druid data deletion script.
Fri, Apr 5, 11:59 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T220111: Refactor druid data deletion script from Next Up to In Progress on the Analytics-Kanban board.
Fri, Apr 5, 11:59 AM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Apr 4

JAllemandou moved T220012: Enable pagecount-ez cron on stats boxes from Next Up to Done on the Analytics-Kanban board.
Thu, Apr 4, 3:58 PM · Analytics, Analytics-Kanban
JAllemandou created T220111: Refactor druid data deletion script.
Thu, Apr 4, 1:59 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T218901: Track number of Wikidata edits by namespace.

Reading about this - Would delayed data be interesting? This information is accessible in hadoop :)

Thu, Apr 4, 10:44 AM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Shape Expressions Sprint 5, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T204965: Create report for "articles with most contributors" in Wikistats2.
Thu, Apr 4, 10:39 AM · Patch-For-Review, Analytics-Wikistats, Analytics-Kanban, Analytics
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.
Thu, Apr 4, 7:29 AM · Patch-For-Review, Analytics-Kanban, Analytics

Wed, Apr 3

JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

Shall I send a PR updating rate-limiting in restbase for edits-per-page requests to 10?
https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L1551

Wed, Apr 3, 5:17 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T215550: Test sqooping from the new dedicated labsdb host from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Apr 3, 2:20 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T219910: AQS alerts due to big queries issued to Druid for the edit API.

I think the problem experienced yesterday could have been prevented by T189623.
Data backup:

Wed, Apr 3, 12:32 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

We prefer the ISO-8601 strings for serialization everywhere.

Wed, Apr 3, 12:17 PM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics

Thu, Mar 28

JAllemandou moved T219484: Fix mediawiki-history-checker after field rename from Next Up to In Code Review on the Analytics-Kanban board.
Thu, Mar 28, 8:16 AM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T219484: Fix mediawiki-history-checker after field rename.
Thu, Mar 28, 8:16 AM · Patch-For-Review, Analytics-Kanban
JAllemandou created T219484: Fix mediawiki-history-checker after field rename.
Thu, Mar 28, 8:12 AM · Patch-For-Review, Analytics-Kanban

Wed, Mar 27

JAllemandou moved T167608: Add caused_by_user_text to mediawiki_page_history from In Progress to Ready to Deploy on the Analytics-Kanban board.
Wed, Mar 27, 4:28 PM · Analytics-Kanban, Analytics
JAllemandou moved T206883: mediawiki_history datasets have null user_text for IP edits from In Progress to Ready to Deploy on the Analytics-Kanban board.
Wed, Mar 27, 4:28 PM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou moved T161149: Provide edit tags in the Data Lake edit data from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Mar 27, 4:27 PM · Analytics-Kanban, Analytics

Tue, Mar 26

JAllemandou created T219326: Update grouped-wiki files for sqoop .
Tue, Mar 26, 8:54 PM · Analytics
JAllemandou added a comment to T209655: Copy Wikidata dumps to HDFs.

Most of the complicated things already exist for this to work (equivalent of rsync for HDFS, spark job converting wikidata json dumps to parquet).
I wanted for T216160 to be settled before moving into productionization (having the same date for the various dumps we handle simplifies quite a bit), and it takes time.

Tue, Mar 26, 7:41 PM · Wikidata, Research, Analytics

Sat, Mar 23

JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

Looks like we have a winner:

  • page_is_deleted
  • revision_is_deleted_by_page_deletion
  • revision_deleted_parts
  • revision_deleted_parts_are_suppressed

@Neil_P._Quinn_WMF and @Milimetric - Confirmation?

Sat, Mar 23, 9:00 AM · Analytics-Kanban, Patch-For-Review, Analytics

Mar 22 2019

JAllemandou added a comment to T178591: Feedback on hive table mediawiki_history by Erik Z.

the bot_by_name and bot_by_group terminology is the one we decided to use a while ago - Let's see if others have opinion :)

Mar 22 2019, 5:20 PM · Analytics, Analytics-Wikistats
JAllemandou added a comment to T178591: Feedback on hive table mediawiki_history by Erik Z.

@Nuria: we set user_is_bot_by_name using a regex: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/user/UserEventBuilder.scala#L27
This represents users having a name that looks like a bot.
user_is_bot_by_group would be the equivalent of WHERE array_contains(user_groups, 'bot'), meaning the user is flagged as bot in mediawiki groups system.

Mar 22 2019, 4:59 PM · Analytics, Analytics-Wikistats
JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

And after saying it, and saying it again, I'm actually ok to go with revision_is_page_deleted, even though I think it is confusing (sorry for the hard-no above...).
It looks like every solution we've encountered is confusing to some extent, so one confusion or another...
My increasing confusion order is:

  1. page_is_deleted, revision_is_deleted, revision_hidden_parts and revision_hidden_parts_suppressed
  2. page_is_deleted, revision_is_archived, revision_deleted_parts and revision_deleted_parts_suppressed
  3. page_is_deleted, revision_deleted_by_page_deletion, revision_deleted_parts and revision_deleted_parts_suppressed
  4. page_is_deleted, revision_is_page_deleted, revision_deleted_parts and revision_deleted_parts_suppressed
Mar 22 2019, 4:49 PM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou added a comment to T213603: Coordinate work on minor changes for Edit Data Quality.

The validation checks above are done without the patch for page-history refactor (https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/493390) because it still generates erroneous data.
Not tested: page events are joined by id, explicit page-create events are used.

Mar 22 2019, 11:14 AM · Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T213603: Coordinate work on minor changes for Edit Data Quality.

Data check details:

Mar 22 2019, 11:12 AM · Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T178591: Feedback on hive table mediawiki_history by Erik Z.

The only thing remaining here that has not been worked on is a field about being a bot or not, instead of relying on the groups.
We have user_is_anonymous and user_is_bot_by_name - We could add user_is_bot_by_group.
@Milimetric, good for you?

Mar 22 2019, 10:43 AM · Analytics, Analytics-Wikistats
JAllemandou added a comment to T185342: Wikistats 2: New Pages split by editor type wrongly claims no anonymous users create pages.

Yes, there is a patch to fix this in the base data: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494
Not sure when this will get deployed, and some more checks are needed on wikistats to ensure the change is visible.

Mar 22 2019, 10:41 AM · Analytics, Analytics-Wikistats

Mar 21 2019

JAllemandou moved T215550: Test sqooping from the new dedicated labsdb host from In Progress to In Code Review on the Analytics-Kanban board.
Mar 21 2019, 5:06 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated subscribers of T218824: A few alterblocks events have event_timestamps from before 2001.

Thanks @Neil_P._Quinn_WMF and @matmarex :)
This error is indeed related to wrong filtering of weird block-expirations.
A patch fixing this and more should come in the relatively near future, as there are other issues with user-alterblocks events.

Mar 21 2019, 8:19 AM · Analytics, Analytics-Data-Quality, Product-Analytics

Mar 20 2019

JAllemandou added a comment to T178587: Update wikimedia-history revision data with deleted field (and find it a new name?).

I disagree! I think that the confusing similarity between revision deletion and page deletion is part of reality, so we should not try to hide it in this dataset or introduce our own unique terminology which confuses things further. revision_is_deleted naturally seems like the equivalent of rev_deleted in the source data set, referring to revision deletion. How about revision_is_page-deleted? That clarifies that we're talking about the feature called "page deletion", not about the page having the status "deleted".

Mar 20 2019, 8:55 PM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou added a comment to T218758: Improve speed and reliability of Yarn's Resource Manager failover.

So we have a history of 4 days, more or less, at any given time. Not sure if this also translates in what logs are retained and for how long (edit: seems to be 3h for yarn.nodemanager.log.retain-seconds's default).

Mar 20 2019, 1:26 PM · Patch-For-Review, Analytics, Analytics-Kanban

Mar 19 2019

JAllemandou updated subscribers of T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history.

Thanks Neil for having raised this.
I have found 3 issues:

  • One is that there was an inconsistency between user create events start_timestamp and the user_registration_timestamp, leading to users not being correctly linked to other events (this is the case for S7w4j9 and SheriffsIsInTown in the given examples above). See this patch: https://gerrit.wikimedia.org/r/#/c/497604/
  • The second is that by approximating user registration with its first-edit when registration is undefined, we miss the opportunity to link the user to its real create event happening long before the actual first edit (example of Rovack above). We should make an explicit distinction between registration (either defined in user-page or through create-event), and user first-edit timestamp (such a distinction is already coming to pages, it makes sense to add it for users as well).
  • Finally the user first-edit date was computed using revisions and not archive, leading to some archive rows not correctly attached to the user if before the first revision (corrected but not yet deployed: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494).
Mar 19 2019, 8:47 PM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Data-Quality, Product-Analytics
JAllemandou renamed T218130: Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text from Update mediawiki-history subgraph-partitioner so that it uses page_id for pages to Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text.
Mar 19 2019, 8:20 PM · Analytics
JAllemandou added a comment to T144100: Pageview dumps incorrectly formatted, need to escape special characters.

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

+1, these changes are to happen on pageviewdefinition java classes rather than HIVE.

We should confirm that white-space chars are not accepted in regular page titles before adding this to the main function. If they are accepted, keeping them in parquet format is probably a good idea, and the filtering should happen in hive. If not accepted, ok to remove them at page-title extraction :)

Mar 19 2019, 5:13 PM · Analytics-Kanban, Patch-For-Review, good first bug, Datasets-General-or-Unknown, Security, Analytics
JAllemandou added a comment to T144100: Pageview dumps incorrectly formatted, need to escape special characters.

Reading HIVE-5672, the root bug has been fixed

Mar 19 2019, 9:02 AM · Analytics-Kanban, Patch-For-Review, good first bug, Datasets-General-or-Unknown, Security, Analytics
JAllemandou added a comment to T144100: Pageview dumps incorrectly formatted, need to escape special characters.

thanks for working on this @awight :)

Mar 19 2019, 8:37 AM · Analytics-Kanban, Patch-For-Review, good first bug, Datasets-General-or-Unknown, Security, Analytics
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

Absolutely right @Ottomata - We wanted to use Timestamps type to facilitate applying functions, but it was not feasible because of Parquet, so we went for the string format that allowed those functions.

Mar 19 2019, 8:24 AM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics

Mar 15 2019

JAllemandou added a comment to T214897: data for analyzing and visualizing the identifier landscape of Wikidata.

Hey @GoranSMilovanovic - I don't have a good understanding of what you're after, but having read pairs and contingency table above, maybe this Spark function could be helpful: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/DataFrameStatFunctions.html

Mar 15 2019, 11:22 PM · WMDE-Analytics-Engineering, User-GoranSMilovanovic, Wikidata
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

Hi Folks, I'll try to provide more info on Hive Timestamps and related formats.
There are (at least) two considerations when dealing with Hive datasets: the metastore (schema handler) and the file format (how data is actually stored/retrieved). In classical datastores, there is no such distinction, and therefore Timestamp type being available means usable. In hive, the Timestamp type is available in the metastore and in default-supported file formats, but with the version of Hive we have there is a conversion problem with Parquet (https://issues.apache.org/jira/browse/HIVE-9482). This is the reason why we use String and not Timestamp in Mediawiki-history.

Mar 15 2019, 8:02 AM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics

Mar 14 2019

JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles.

Mar 14 2019, 8:19 AM · Patch-For-Review, WikiCite, Analytics, Dumps-Generation, Wikidata

Mar 12 2019

JAllemandou updated the task description for T213603: Coordinate work on minor changes for Edit Data Quality.
Mar 12 2019, 5:14 PM · Patch-For-Review, Analytics-Kanban
JAllemandou created T218130: Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text.
Mar 12 2019, 5:14 PM · Analytics
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

Nice! I don't see flaws in this approach @Ottomata, but before testing I don't want to say it'll work for all cases (timestamps work for us in hive, but not in parquet for instance).
Now how we move forward depending on the upgrade time is another thing.

Mar 12 2019, 1:56 PM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics
JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

@Ottomata : I don't think Hive timestamp-functions will parse ISO by default. All functions doc gives examples without the 'T'.
Also, the above test with hive is also true in Spark.
When we upgrade to newer hive, we will be able to use hive timestamps, meaning we won't have strings anymore, and therefore format will not be that important because transformation will be done once at refine stage. For now, using format without 'T' facilitates usage a lot IMO.
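
To make the format difference concrete, here is a small plain-Scala sketch. It uses java.sql.Timestamp.valueOf purely as a stand-in for a parser that expects the SQL format (yyyy-MM-dd HH:mm:ss); it is not a claim about how Hive parses timestamps internally. The object and function names are hypothetical.

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampFormatSketch {
  // SQL-style format without the 'T' separator, rendered in UTC.
  private val sqlFormat =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC)

  // Converts an ISO-8601 instant (e.g. 2019-03-11T00:10:45Z) to the SQL format.
  def isoToSql(iso: String): String = sqlFormat.format(Instant.parse(iso))

  // True when the string is accepted by a SQL-format timestamp parser.
  def parsesAsSqlTimestamp(s: String): Boolean = Try(Timestamp.valueOf(s)).isSuccess
}
```

In this sketch the ISO string is rejected by the SQL-format parser, while its converted form is accepted. That is the usability gap described above for string-encoded timestamps.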

Mar 12 2019, 1:35 PM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics
JAllemandou updated the task description for T213603: Coordinate work on minor changes for Edit Data Quality.
Mar 12 2019, 10:43 AM · Patch-For-Review, Analytics-Kanban

Mar 11 2019

JAllemandou added a comment to T212529: Standardize datetimes/timestamps in the Data Lake.

I have an opinion and some information here.
We use string-encoded timestamps in mediawiki-history because our version of hive doesn't support timestamps in parquet (see https://issues.apache.org/jira/browse/HIVE-6384).
The reason we chose the SQL format instead of the ISO one is because Hive UDF work as-is for the former, not the latter:

select dt from event.navigationtiming where year = 2019 and month = 3 and day = 11 and hour = 0 limit 1;
2019-03-11T00:10:45Z
Mar 11 2019, 7:58 PM · MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, Analytics, Product-Analytics
JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Following up on this: another viable solution to get monthly-coherence between dumps is to force a dump on the 1st of the month ... I'm not sure the idea is better.
@ArielGlenn - How do we proceed to try moving forward (in either direction) ?

Mar 11 2019, 7:35 PM · Patch-For-Review, WikiCite, Analytics, Dumps-Generation, Wikidata
JAllemandou added a comment to T189044: Mediawiki History: moves counted twice in Revision.

IMO this issue is not data-quality as in a problem in the dataset generation, but rather a problem of data-semantics and how we interpret our data. By this I mean the effort on data quality we made this quarter is not related to this issue.

Mar 11 2019, 6:30 PM · Analytics

Mar 8 2019

JAllemandou claimed T215550: Test sqooping from the new dedicated labsdb host.
Mar 8 2019, 5:23 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T215550: Test sqooping from the new dedicated labsdb host from Next Up to In Progress on the Analytics-Kanban board.
Mar 8 2019, 5:23 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T216105: yearly labels in wikistats say 2017 from Ready to Deploy to Done on the Analytics-Kanban board.
Mar 8 2019, 2:52 PM · Patch-For-Review, Analytics, Analytics-Kanban

Mar 7 2019

JAllemandou added a comment to T211950: Add partial blocks to mediawiki history tables.

Hi folks - Thanks again for quick answers - My superbad I looked at the wrong. I confirm data is available.
A first step toward having it available in mediawiki-history is already in CR (https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/493012). This will (should) make the logParams available in the page-history table, with all its information.
I'll keep this ticket open to further improve how we handle detailed data.

Mar 7 2019, 7:25 PM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T211950: Add partial blocks to mediawiki history tables.

Hi @nettrom_WMF,
The ipblocks_restrictions table has been sqooped on the cluster since this month.
However, I think the logging table doesn't contain detailed historical information on partial blocks. This will prevent us from rebuilding historical (and therefore more interesting) information on partial blocks.
Can you talk to your team to see if more detailed logging could be developed?
Cheers
Joseph

Mar 7 2019, 6:14 PM · Analytics-Kanban, Product-Analytics, Anti-Harassment, Analytics
JAllemandou added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

Thanks for the comments and the announcement @Bstorm . For the record, we have experienced problems with hewiki and eswiki, so I think the changes have been applied for those two.

Mar 7 2019, 5:20 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
JAllemandou added a comment to T217821: Investigate duplication of strings in wb_terms table for wikidatawiki.

Exact analysis run on 2018-12-06:

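// Wikidata entities converted to parquet (2018-10-01 dump)
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
val base_rdd = df.select("labels", "descriptions", "aliases").rdd
// Flatten every label, description and alias into a single RDD of strings
val strings = base_rdd.flatMap(r => {
  r.getMap[String,String](0).values ++            // labels: one string per language
  r.getMap[String,String](1).values ++            // descriptions: one string per language
  r.getMap[String,Seq[String]](2).values.flatten  // aliases: several strings per language
})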
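
The excerpt above stops once the strings are extracted. As a hedged illustration only, and not the analysis actually run in this comment, here is one way duplication could be measured once such a collection of strings exists. It works on an in-memory `Seq` rather than an RDD, and the names `StringDuplication`, `Stats` and `duplicatedShare` are hypothetical.

```scala
// Hypothetical helper for illustration; not the code used in the ticket.
object StringDuplication {
  final case class Stats(total: Long, distinct: Long) {
    // Share of occurrences that repeat an already-seen string.
    def duplicatedShare: Double =
      if (total == 0) 0.0 else (total - distinct).toDouble / total
  }

  // Counts all strings and the number of distinct values among them.
  def stats(strings: Seq[String]): Stats =
    Stats(strings.size.toLong, strings.distinct.size.toLong)
}
```

On an RDD, a comparable sketch would use `strings.count()` and `strings.distinct().count()` for the two counts.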
Mar 7 2019, 9:35 AM · User-Addshore, Wikidata
JAllemandou added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

Hi @Bstorm , given the merged patches above, I'm assuming comment-data is fully moved toward the comment table, and that we can (must) remove references to the comment-related fields in the data we gather from labsdb. Can you confirm? Many thanks :)

Mar 7 2019, 9:04 AM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)

Mar 6 2019

JAllemandou updated the task description for T217792: Add wikitech (labswiki) to the sqoop list.
Mar 6 2019, 6:34 PM · Analytics
JAllemandou added a subtask for T167973: Move wikitech and labstestwiki to s5: T217792: Add wikitech (labswiki) to the sqoop list.
Mar 6 2019, 6:32 PM · wikitech.wikimedia.org, DBA
JAllemandou added parent tasks for T217792: Add wikitech (labswiki) to the sqoop list: T167973: Move wikitech and labstestwiki to s5, T171570: Rename database labswiki to wikitech.
Mar 6 2019, 6:32 PM · Analytics
JAllemandou added a subtask for T171570: Rename database labswiki to wikitech: T217792: Add wikitech (labswiki) to the sqoop list.
Mar 6 2019, 6:32 PM · DBA, wikitech.wikimedia.org
JAllemandou created T217792: Add wikitech (labswiki) to the sqoop list.
Mar 6 2019, 6:31 PM · Analytics

Mar 5 2019

JAllemandou moved T205594: mediawiki_history missing page events from In Progress to In Code Review on the Analytics-Kanban board.
Mar 5 2019, 4:24 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou moved T214490: page_creation_timestamp not always correct in mediawiki_history from In Progress to In Code Review on the Analytics-Kanban board.
Mar 5 2019, 4:24 PM · Analytics-Kanban, Product-Analytics, Analytics-Data-Quality, Analytics