
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (262 w, 4 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Tue, Feb 18

JAllemandou moved T238432: Fix non MapReduce execution of GeoCode UDF from Paused to In Code Review on the Analytics-Kanban board.
Tue, Feb 18, 11:58 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T245496: Fix wikitext-generation jobs (use 0.0.114 jar) from Next Up to In Code Review on the Analytics-Kanban board.
Tue, Feb 18, 11:35 AM · Analytics-Kanban, Analytics
JAllemandou claimed T245496: Fix wikitext-generation jobs (use 0.0.114 jar).
Tue, Feb 18, 11:34 AM · Analytics-Kanban, Analytics
JAllemandou created T245496: Fix wikitext-generation jobs (use 0.0.114 jar).
Tue, Feb 18, 11:34 AM · Analytics-Kanban, Analytics
JAllemandou renamed T209655: Copy Wikidata dumps to HDFS + parquet from Copy Wikidata dumps to HDFS to Copy Wikidata dumps to HDFS + parquet.
Tue, Feb 18, 11:33 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou added a comment to T238858: Make history and current wikitext available in hadoop.

It was done a few days ago :)
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_current exists as well.

Tue, Feb 18, 8:33 AM · Analytics-Kanban, Analytics

Mon, Feb 17

JAllemandou moved T244707: Productionize item_page_link table from In Progress to In Code Review on the Analytics-Kanban board.
Mon, Feb 17, 10:03 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T245453: Fix webrequest host normalization.

I tried to validate the approach of using normalized_host.project and normalized_host.project_family instead of pageview_info[project] for @Ladsgroup...

Mon, Feb 17, 8:18 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T245453: Fix webrequest host normalization from Next Up to In Code Review on the Analytics-Kanban board.
Mon, Feb 17, 6:00 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T245453: Fix webrequest host normalization.
Mon, Feb 17, 5:58 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T245453: Fix webrequest host normalization.
Mon, Feb 17, 5:58 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T245373: Optimization tips and feedback.

About the pageview query, @Nuria is right: partition predicates need to use single partition fields with simple definitions (the same as what you do in the comments). Otherwise the engine cannot work out which partitions it needs to read, so it reads all of them and then filters based on their values.
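As a hedged illustration (the table and queries below are assumptions, not the actual query from this task), a predicate on single partition fields lets the engine prune partitions, while an expression over them does not:

-- example table; assumes year/month/day partition fields
select count(1) from wmf.pageview_hourly
where year = 2020 and month = 2 and day = 17;            -- partitions are pruned

select count(1) from wmf.pageview_hourly
where concat(year, '-', month, '-', day) = '2020-2-17';  -- all partitions are read, then filtered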

Mon, Feb 17, 7:12 AM · Analytics

Thu, Feb 13

JAllemandou renamed T245151: Remove tranquility and banner-impressions streaming from refinery-job from Remove transui;ity and banner-impression from refinery-job to Remove tranquility and banner-impressions streaming from refinery-job.
Thu, Feb 13, 2:14 PM · Analytics-Kanban, Analytics
JAllemandou moved T245151: Remove tranquility and banner-impressions streaming from refinery-job from Next Up to In Code Review on the Analytics-Kanban board.
Thu, Feb 13, 2:13 PM · Analytics-Kanban, Analytics
JAllemandou set Final Story Points to 1 on T245151: Remove tranquility and banner-impressions streaming from refinery-job.
Thu, Feb 13, 2:13 PM · Analytics-Kanban, Analytics
JAllemandou claimed T245151: Remove tranquility and banner-impressions streaming from refinery-job.
Thu, Feb 13, 2:12 PM · Analytics-Kanban, Analytics
JAllemandou created T245151: Remove tranquility and banner-impressions streaming from refinery-job.
Thu, Feb 13, 2:12 PM · Analytics-Kanban, Analytics
JAllemandou created T245126: Delete raw events for mediawiki_<event>, the refined data is kept indefinitely.
Thu, Feb 13, 9:45 AM · Analytics
JAllemandou added a comment to T244889: HDFS space usage steadily increased over the past three months.

After a (not so) quick audit, the usage growth is due to the problem described in T245124.

Thu, Feb 13, 9:42 AM · Analytics-Kanban, Analytics

Wed, Feb 12

JAllemandou added a comment to T245040: Request for instructions for using DataGrip in the Kerberos paradigm.

Thinking about the future: if we decide Presto is the way to go for analysts, Airpal seems a good candidate.
If Hive is really needed, I found yanagishima.

Wed, Feb 12, 7:52 PM · Product-Analytics, Analytics

Tue, Feb 11

JAllemandou added a comment to T243934: Unify puppet roles for stat and notebook hosts.

Things to consider IMO while doing this work: how do we separate regularly running jobs so that

  • Hosts can handle them (computation/RAM)
  • They are not impacted by user jobs
  • There is (at least some) separation of concerns
  • We understand the setup (at least to some extent)
Tue, Feb 11, 1:39 PM · Analytics
JAllemandou added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

Hi @ArielGlenn, I tested reading yowiki-20200101-pages-articles-multistream.xml.bz2 successfully :)
Thanks for the heads-up :)

Tue, Feb 11, 8:33 AM · Dumps-Generation
JAllemandou added a comment to T244807: Mediawiki History Dumps - Possible parsing issue.

Hi @leila ,
I have checked some rows, and there doesn't seem to be any parsing error: comment values are the same in the MySQL, sqooped and processed data for the rows I checked.
Also, when looking at revisions that happened on the same day as the one holding the comment you mention, that comment pattern is recurrent:

spark.sql("select event_timestamp, revision_id, event_comment from wmf.mediawiki_history where snapshot = '2020-01' and wiki_db = 'yowiki' and event_entity = 'revision' and date(event_timestamp) = '2012-04-04'").show(1000, false)
Tue, Feb 11, 8:23 AM · Analytics, Analytics-Wikistats

Mon, Feb 10

JAllemandou moved T244707: Productionize item_page_link table from Next Up to In Progress on the Analytics-Kanban board.
Mon, Feb 10, 10:22 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T244707: Productionize item_page_link table.
Mon, Feb 10, 10:20 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T244707: Productionize item_page_link table.
Mon, Feb 10, 10:19 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from In Progress to Ready to Deploy on the Analytics-Kanban board.
Mon, Feb 10, 10:14 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from In Code Review to In Progress on the Analytics-Kanban board.
Mon, Feb 10, 8:10 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from Done to In Code Review on the Analytics-Kanban board.
Mon, Feb 10, 8:10 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from In Code Review to Done on the Analytics-Kanban board.
Mon, Feb 10, 8:10 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics

Fri, Feb 7

JAllemandou closed T244504: Request for database on hadoop user space as Resolved.
Fri, Feb 7, 1:04 PM · Analytics
JAllemandou added a comment to T244484: Issues querying table in Hive.

Nothing to add to what Andrew said. Adding partitions whose folders do not follow the naming convention needs to be done 'manually' (it can be done through a script, but with explicit single-partition commands).
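For illustration only (the database and table names below are placeholders, not the actual ones), an explicit single-partition command pointing a partition at its folder looks like:

-- my_db.my_events is a hypothetical table name
alter table my_db.my_events add if not exists
  partition (year=2019, month=12, day=2, hour=16)
  location '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16';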

Fri, Feb 7, 1:01 PM · Analytics

Thu, Feb 6

JAllemandou closed T244292: Database creation in Hive as Resolved.
Thu, Feb 6, 6:31 PM · Analytics
JAllemandou added a comment to T244484: Issues querying table in Hive.

Hi again @EYener,
Partitions in Hive are a SQL representation of folders. Adding a partition only tells Hive that it should look into a folder to find the files related to the partition's values (for instance, information relative to partition (year=2019, month=12, day=02, hour=16) is in the folder /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16). This scheme allows the engine to limit the amount of data read by queries on big tables.
Something else to know: the default way for Hive to encode partition information in folder names is field=value (for instance /my/table/year=2020/month=12/...), but Hive also allows linking partitions to paths that are not field=value structured (as in the first example).
Finally, adding a partition doesn't give you any information about data size, as it only links partition values to a folder. The easiest way to get some information on data size without querying is to look at the folder size. For instance, on a stat machine:

hdfs dfs -du -s -h /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
10.9 M  32.8 M  /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
Thu, Feb 6, 6:26 PM · Analytics
JAllemandou claimed T244292: Database creation in Hive.
Thu, Feb 6, 6:13 PM · Analytics
JAllemandou added a comment to T244292: Database creation in Hive.

Hi @EYener,
You can set up the database yourself: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database
Also, if you're new to Hive, I suggest you have a look at the page above and related pages :)
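The wiki page above has the authoritative steps; as a minimal sketch (assuming, hypothetically, a database named after your shell username):

-- hypothetical database name; follow the wiki page for the recommended setup
create database if not exists your_username;
use your_username;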

Thu, Feb 6, 6:13 PM · Analytics
JAllemandou added a comment to T244504: Request for database on hadoop user space.

Hi @jkumalah,
you should actually be able to create the table yourself. I took the opportunity to write https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database .
You can also come and ping us on the IRC channel #wikimedia-analytics :)

Thu, Feb 6, 6:08 PM · Analytics
JAllemandou added a comment to T244484: Issues querying table in Hive.
  • About the query:
use cps;
show partitions centralnoticebannerhistory20191202;
Thu, Feb 6, 2:51 PM · Analytics

Wed, Feb 5

JAllemandou moved T243427: Mediawiki history public release: tsv format is not correctly parsable from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Feb 5, 9:58 PM · Analytics-Kanban, Analytics
JAllemandou moved T241375: The guava error still persists in data quality bundles from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Feb 5, 9:58 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T238858: Make history and current wikitext available in hadoop from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Feb 5, 9:58 PM · Analytics-Kanban, Analytics
JAllemandou moved T243426: Mediawiki history documentation for public dataset release from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Feb 5, 9:58 PM · Analytics-Kanban, Analytics
JAllemandou added a subtask for T186559: Provide data dumps in the Analytics Data Lake: T238858: Make history and current wikitext available in hadoop.
Wed, Feb 5, 9:53 PM · Analytics
JAllemandou added a parent task for T238858: Make history and current wikitext available in hadoop: T186559: Provide data dumps in the Analytics Data Lake.
Wed, Feb 5, 9:53 PM · Analytics-Kanban, Analytics
JAllemandou moved T243832: Fix hdfs-rsync`prune-empty-dirs` feature from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Feb 5, 8:36 PM · Analytics-Kanban, Analytics
JAllemandou renamed T238858: Make history and current wikitext available in hadoop from Update wikitext-processing on hadoop various aspects to Make history and current wikitext available in hadoop.
Wed, Feb 5, 8:34 PM · Analytics-Kanban, Analytics
JAllemandou triaged T243832: Fix hdfs-rsync`prune-empty-dirs` feature as High priority.
Wed, Feb 5, 5:59 PM · Analytics-Kanban, Analytics
JAllemandou moved T243832: Fix hdfs-rsync`prune-empty-dirs` feature from Incoming to Ops Week on the Analytics board.
Wed, Feb 5, 5:59 PM · Analytics-Kanban, Analytics
JAllemandou created T244380: Convert siteinfo dumps from json to parquet.
Wed, Feb 5, 5:23 PM · Analytics
JAllemandou claimed T238858: Make history and current wikitext available in hadoop.
Wed, Feb 5, 5:23 PM · Analytics-Kanban, Analytics
JAllemandou moved T238858: Make history and current wikitext available in hadoop from Next Up to Ready to Deploy on the Analytics-Kanban board.
Wed, Feb 5, 5:17 PM · Analytics-Kanban, Analytics
JAllemandou moved T243832: Fix hdfs-rsync`prune-empty-dirs` feature from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Feb 5, 5:16 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T242844: Release data from a public health related research conducted by WMF and formal collaborators.

Moving the zika-research folder to one-off is a good idea, as the latter already contains research datasets, as you noticed. Keeping it as-is is also not problematic. I'll let you decide @Miriam :)

Wed, Feb 5, 1:20 PM · Security, Privacy Engineering, Privacy, Research, Analytics

Tue, Feb 4

JAllemandou moved T243426: Mediawiki history documentation for public dataset release from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Tue, Feb 4, 2:25 PM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T242844: Release data from a public health related research conducted by WMF and formal collaborators.

Hi @Miriam :)
From stat1007, directories are synced from /srv/published/datasets to https://analytics.wikimedia.org/published/datasets/
There is more in the web URL than in the folder as data is synced from various places to the same output URL.
I think creating a research subfolder, and then related folders for your project, should be OK.
@Ottomata anything I missed?

Tue, Feb 4, 2:18 PM · Security, Privacy Engineering, Privacy, Research, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from In Progress to In Code Review on the Analytics-Kanban board.
Tue, Feb 4, 11:03 AM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics

Tue, Jan 28

JAllemandou added a subtask for T209655: Copy Wikidata dumps to HDFS + parquet: T243832: Fix hdfs-rsync`prune-empty-dirs` feature.
Tue, Jan 28, 2:42 PM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou added a parent task for T243832: Fix hdfs-rsync`prune-empty-dirs` feature: T209655: Copy Wikidata dumps to HDFS + parquet.
Tue, Jan 28, 2:42 PM · Analytics-Kanban, Analytics
JAllemandou moved T209655: Copy Wikidata dumps to HDFS + parquet from Next Up to In Progress on the Analytics-Kanban board.
Tue, Jan 28, 2:41 PM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou claimed T209655: Copy Wikidata dumps to HDFS + parquet.
Tue, Jan 28, 2:41 PM · Analytics-Kanban, Patch-For-Review, Research-Backlog, Wikidata, Analytics
JAllemandou moved T243832: Fix hdfs-rsync`prune-empty-dirs` feature from Next Up to In Code Review on the Analytics-Kanban board.
Tue, Jan 28, 1:53 PM · Analytics-Kanban, Analytics
JAllemandou claimed T243832: Fix hdfs-rsync`prune-empty-dirs` feature.
Tue, Jan 28, 1:53 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T243832: Fix hdfs-rsync`prune-empty-dirs` feature.

PR sent: https://github.com/wikimedia/hdfs-tools/pull/7

Tue, Jan 28, 1:52 PM · Analytics-Kanban, Analytics
JAllemandou created T243832: Fix hdfs-rsync`prune-empty-dirs` feature.
Tue, Jan 28, 12:47 PM · Analytics-Kanban, Analytics

Jan 24 2020

JAllemandou moved T243426: Mediawiki history documentation for public dataset release from In Progress to In Code Review on the Analytics-Kanban board.
Jan 24 2020, 5:22 PM · Analytics-Kanban, Analytics
JAllemandou moved T243427: Mediawiki history public release: tsv format is not correctly parsable from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 24 2020, 5:22 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T158889: Explore getting articles ranked by sitelinks before anything else.

is this now possible via spark?

Hi @Iflorez - while my understanding of this is not precise enough, I can already say that the Wikidata content is not yet present on the cluster in a productionized way.
So the computation is possibly feasible, but a productionized version of it is not, yet.

Jan 24 2020, 8:46 AM · Recommendation-API

Jan 23 2020

JAllemandou moved T243427: Mediawiki history public release: tsv format is not correctly parsable from In Progress to In Code Review on the Analytics-Kanban board.
Jan 23 2020, 10:21 AM · Analytics-Kanban, Analytics
JAllemandou moved T243426: Mediawiki history documentation for public dataset release from Next Up to In Progress on the Analytics-Kanban board.
Jan 23 2020, 10:21 AM · Analytics-Kanban, Analytics

Jan 22 2020

JAllemandou renamed T243427: Mediawiki history public release: tsv format is not correctly parsable from Mediawiki history public release: tsv format needs to quote every field to Mediawiki history public release: tsv format is not correctly parsable.
Jan 22 2020, 7:25 PM · Analytics-Kanban, Analytics
JAllemandou moved T243427: Mediawiki history public release: tsv format is not correctly parsable from Next Up to In Progress on the Analytics-Kanban board.
Jan 22 2020, 7:24 PM · Analytics-Kanban, Analytics
JAllemandou moved T238432: Fix non MapReduce execution of GeoCode UDF from In Progress to Paused on the Analytics-Kanban board.
Jan 22 2020, 7:24 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T241375: The guava error still persists in data quality bundles.

I suggested in an email using a Parquet table to handle DataQuality values, and I think that would indeed help.
To read value-separated text files, Hive uses org.apache.hadoop.mapred.TextInputFormat, a subclass of org.apache.hadoop.mapred.FileInputFormat.
TextInputFormat implements only a small number of methods; in particular, it does not override the methods from FileInputFormat that contain the failing StopWatch. Those methods are getSplits (the one we see the failure in) and listStatus.
On the other hand, org.apache.parquet.hadoop.ParquetInputFormat is also a subclass of FileInputFormat, but it overrides the two methods above, so it shouldn't fail :)
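As a rough sketch (the table name and columns below are invented, not the actual data-quality schema), declaring the table as Parquet makes Hive read it through a Parquet input format instead of TextInputFormat:

-- hypothetical table; the real schema lives in the data-quality jobs
create table my_db.data_quality_stats (
  metric string,
  value  double
)
partitioned by (year int, month int, day int, hour int)
stored as parquet;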

Jan 22 2020, 7:12 PM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 21 2020

JAllemandou added a comment to T243089: Spike. Try to ML models distributted in jupyter notebooks with dask.

First step done: dask is running on our cluster.

Jan 21 2020, 8:58 PM · Analytics
JAllemandou added a comment to T243241: Some xml-dumps files don't follow BZ2 'correct' definition.

Patch merged upstream; it will be released as part of commons-compress 1.20.
I will follow up to try to force the use of that version once it is published.

Jan 21 2020, 1:38 PM · Analytics-Kanban, Dumps-Generation, Analytics
JAllemandou added a comment to T243239: Unable to access Hive from notebook1003.

@dr0ptp4kt Thanks for making me see things correctly :)
I think you need to kdestroy from the ssh session to clean that session, then kdestroy and kinit again in a notebook terminal to start fresh.

Jan 21 2020, 12:56 PM · Analytics
JAllemandou added a comment to T243241: Some xml-dumps files don't follow BZ2 'correct' definition.

In the meantime, might you be able to use bzip2 to decompress and recompress the problem file(s) so that you can get your import into hadoop done?

Jan 21 2020, 8:38 AM · Analytics-Kanban, Dumps-Generation, Analytics
JAllemandou updated subscribers of T243239: Unable to access Hive from notebook1003.

@Reedy: the Kerberos ticket needs to be created in the notebook environment, using a notebook terminal page. See https://wikitech.wikimedia.org/wiki/SWAP#Kerberos.

Jan 21 2020, 7:58 AM · Analytics

Jan 20 2020

JAllemandou moved T243241: Some xml-dumps files don't follow BZ2 'correct' definition from Next Up to Paused on the Analytics-Kanban board.
Jan 20 2020, 8:49 PM · Analytics-Kanban, Dumps-Generation, Analytics
JAllemandou claimed T243241: Some xml-dumps files don't follow BZ2 'correct' definition.
Jan 20 2020, 8:49 PM · Analytics-Kanban, Dumps-Generation, Analytics
JAllemandou created T243241: Some xml-dumps files don't follow BZ2 'correct' definition.
Jan 20 2020, 8:49 PM · Analytics-Kanban, Dumps-Generation, Analytics

Jan 17 2020

JAllemandou added a project to T238858: Make history and current wikitext available in hadoop: Analytics-Kanban.
Jan 17 2020, 12:55 PM · Analytics-Kanban, Analytics
JAllemandou triaged T242015: Fix sqoop after changes as High priority.
Jan 17 2020, 12:51 PM · Analytics-Kanban, Analytics
JAllemandou moved T242015: Fix sqoop after changes from Incoming to Ops Week on the Analytics board.
Jan 17 2020, 12:51 PM · Analytics-Kanban, Analytics

Jan 15 2020

JAllemandou updated subscribers of T241375: The guava error still persists in data quality bundles.

Git tells me that we bumped guava from 12 to 18 when adding the jsonschema loader.
IIRC the com.github.java-json-tools - json-schema-core dependency needs jackson-coreutils, which in turn needs guava, and @Ottomata and I experienced errors with version 12.

Jan 15 2020, 1:35 PM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 13 2020

JAllemandou moved T238326: Make hdfs-rsync process sub-folders recursively from In Code Review to Done on the Analytics-Kanban board.
Jan 13 2020, 4:04 PM · Analytics-Kanban, Analytics

Jan 10 2020

JAllemandou moved T238326: Make hdfs-rsync process sub-folders recursively from Done to In Code Review on the Analytics-Kanban board.
Jan 10 2020, 5:18 PM · Analytics-Kanban, Analytics
JAllemandou moved T238326: Make hdfs-rsync process sub-folders recursively from In Code Review to Done on the Analytics-Kanban board.
Jan 10 2020, 8:21 AM · Analytics-Kanban, Analytics

Jan 9 2020

JAllemandou added a comment to T240934: Enable encryption in Spark 2.4 by default.

Latest development on my end:

  • Oozie worked with Luca's patch above
  • spark-submit with python worked as well
  • pyspark2 still fails
Jan 9 2020, 7:16 PM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 8 2020

JAllemandou moved T232659: Add dimensions for Project type & language to Edits_hourly from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2020, 8:50 PM · Analytics-Kanban, Analytics, Patch-For-Review, Product-Analytics (Kanban)
JAllemandou moved T242015: Fix sqoop after changes from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2020, 8:45 PM · Analytics-Kanban, Analytics
JAllemandou moved T238360: Hourly Feature extraction for bot detection from webrequest from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2020, 8:45 PM · Analytics-Kanban, Analytics
JAllemandou moved T236895: ArticlePlaceholder dashboard stopped tracking page views from Next Up to In Code Review on the Analytics-Kanban board.
Jan 8 2020, 8:45 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics-Kanban, User-Ladsgroup, Patch-For-Review, Analytics, wikidata-tech-focus, Wikidata, ArticlePlaceholder
JAllemandou added a project to T236895: ArticlePlaceholder dashboard stopped tracking page views: Analytics-Kanban.
Jan 8 2020, 8:00 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics-Kanban, User-Ladsgroup, Patch-For-Review, Analytics, wikidata-tech-focus, Wikidata, ArticlePlaceholder
JAllemandou added a comment to T236895: ArticlePlaceholder dashboard stopped tracking page views.

The patch merged by @Nuria had a bug. I commented on the already-merged patch with a solution. For the moment, the job is not started.

Jan 8 2020, 8:00 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics-Kanban, User-Ladsgroup, Patch-For-Review, Analytics, wikidata-tech-focus, Wikidata, ArticlePlaceholder
JAllemandou moved T232659: Add dimensions for Project type & language to Edits_hourly from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 8 2020, 7:11 PM · Analytics-Kanban, Analytics, Patch-For-Review, Product-Analytics (Kanban)
JAllemandou moved T242015: Fix sqoop after changes from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 8 2020, 1:13 PM · Analytics-Kanban, Analytics

Jan 7 2020

JAllemandou added a comment to T242015: Fix sqoop after changes.

@Nuria: No email was sent as I discovered the problem before the SLA limit.
The current limit for mediawiki-history jobs is 39 days (31 + 8), since sqooping used to take a lot longer.
We could probably set it to 34 to be more reactive (sqoop has regularly finished in less than 24 hours in recent months).

Jan 7 2020, 10:56 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T240891: Add python urllib_kerberos package to analytics clients.

My use-case is in https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/http.py, in which we currently use urllib.
I'll try to update the file to use requests.

Jan 7 2020, 8:37 AM · User-Elukey, Analytics
JAllemandou added a comment to T241194: Reconciling page creations in Turnilo.

No problem adding page_is_redirect. My only concern is that it should be documented and understood for what it is: page_is_redirect is present on every event, but its value only reflects the current (latest) revision of the page.

Jan 7 2020, 8:10 AM · Analytics, Product-Analytics