JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (251 w, 3 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF)

Recent Activity

Thu, Dec 5

JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

I looked at data from https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/wmcs/ and I have 2 comments:

  • I find it misleading that wikis_by_wmcs_edits.tsv contains percents without mentioning it in the filename, while the other two files contain absolute edit values and percents as their names indicate.
  • The first day of the month is used to label the whole month's data: 2019-11-01 means data from 2019-11-01 to 2019-11-30.
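
As a minimal sketch of the labeling convention described above, the snippet below expands a first-of-month label into the date range it stands for. The function name `month_range` is hypothetical and is not part of Reportupdater.

```python
from datetime import date, timedelta
from typing import Tuple


def month_range(label: str) -> Tuple[date, date]:
    """Return the first and last day covered by a first-of-month label."""
    first = date.fromisoformat(label)
    if first.day != 1:
        raise ValueError("label must be the first day of a month")
    # Day 28 exists in every month; adding 4 days always lands in the next month.
    next_month_first = (first.replace(day=28) + timedelta(days=4)).replace(day=1)
    return first, next_month_first - timedelta(days=1)


start, end = month_range("2019-11-01")
print(start.isoformat(), end.isoformat())
```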
Thu, Dec 5, 8:16 AM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou moved T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop from Done to In Code Review on the Analytics-Kanban board.
Thu, Dec 5, 8:01 AM · Analytics-Kanban, Analytics

Wed, Dec 4

JAllemandou moved T239471: Sqoop wikidata terms tables into hadoop from In Code Review to Done on the Analytics-Kanban board.
Wed, Dec 4, 7:19 PM · Analytics-Kanban, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics, Wikidata
JAllemandou moved T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article from Next Up to In Code Review on the Analytics-Kanban board.
Wed, Dec 4, 7:19 PM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T209655: Copy Wikidata dumps to HDFS.

New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page.

hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x   - analytics joal          0 2019-12-04 18:31 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202
Wed, Dec 4, 7:13 PM · Research-Backlog, Wikidata, Analytics
JAllemandou claimed T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article.
Wed, Dec 4, 5:43 PM · Analytics-Kanban, Analytics
JAllemandou created T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article.
Wed, Dec 4, 5:43 PM · Analytics-Kanban, Analytics

Tue, Dec 3

JAllemandou moved T239471: Sqoop wikidata terms tables into hadoop from Ready to Deploy to Done on the Analytics-Kanban board.
Tue, Dec 3, 2:53 PM · Analytics-Kanban, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics, Wikidata
JAllemandou moved T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop from Ready to Deploy to Done on the Analytics-Kanban board.
Tue, Dec 3, 2:53 PM · Analytics-Kanban, Analytics

Mon, Dec 2

JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Hi @srishakatux - New data is available in table wmf.editors_daily for month '2019-11'. Apart from the table name, I think your queries should work as-is. Let me know!

Mon, Dec 2, 2:49 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou moved T238855: Add bot and change_type dimensions to geoeditors-daily from Ready to Deploy to Done on the Analytics-Kanban board.
Mon, Dec 2, 2:48 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T239591: Update mediawiki-history to use new Multi-Content-Revision tables.
Mon, Dec 2, 10:45 AM · Core Platform Team, Analytics
JAllemandou created T239591: Update mediawiki-history to use new Multi-Content-Revision tables.
Mon, Dec 2, 10:43 AM · Core Platform Team, Analytics
JAllemandou created T239589: Change sqoop project list config so that content sqoop doesn't fail.
Mon, Dec 2, 10:38 AM · Analytics
JAllemandou moved T237271: Create a script to ease the Oozie work while enabling kerberos in Hadoop from Ready to Deploy to Done on the Analytics-Kanban board.
Mon, Dec 2, 10:17 AM · Analytics-Kanban, Analytics
JAllemandou moved T239471: Sqoop wikidata terms tables into hadoop from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Mon, Dec 2, 9:21 AM · Analytics-Kanban, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics, Wikidata

Fri, Nov 29

JAllemandou moved T238432: Fix non MapReduce execution of GeoCode UDF from In Code Review to In Progress on the Analytics-Kanban board.
Fri, Nov 29, 7:50 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T239471: Sqoop wikidata terms tables into hadoop from Next Up to In Code Review on the Analytics-Kanban board.
Fri, Nov 29, 4:57 PM · Analytics-Kanban, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics, Wikidata
JAllemandou added a project to T239471: Sqoop wikidata terms tables into hadoop: Analytics-Kanban.
Fri, Nov 29, 4:57 PM · Analytics-Kanban, User-Addshore, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Analytics, Wikidata
JAllemandou updated the task description for T238432: Fix non MapReduce execution of GeoCode UDF.
Fri, Nov 29, 1:21 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou renamed T238432: Fix non MapReduce execution of GeoCode UDF from Add MaxMind DB files on an-coord1001 for hive local-jobs using UDF to succeed to Fix non MapReduce execution of GeoCode UDF.
Fri, Nov 29, 1:20 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T238432: Fix non MapReduce execution of GeoCode UDF from In Progress to In Code Review on the Analytics-Kanban board.
Fri, Nov 29, 1:20 PM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Nov 28

JAllemandou moved T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Thu, Nov 28, 5:49 PM · Analytics-Kanban, Analytics
JAllemandou moved T238855: Add bot and change_type dimensions to geoeditors-daily from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Thu, Nov 28, 5:49 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Hi @srishakatux :)
Code is ready (see https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/552510/), but won't get merged before next week.
I'll backfill the December month next week, then let you know so that you can test and update the queries.
Something to note: I have renamed the table you're using, as it's not geoeditors-only related, but more broadly stores information on editors. It's now editors_daily (instead of geoeditors_daily).
I'll ping you when the change has been deployed and the data is available :)

Thu, Nov 28, 5:40 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Wed, Nov 27

JAllemandou added a comment to T101013: Log Wikidata Query Service queries to the event gate infrastructure.

Does this being closed mean we can access data on kafka?

Wed, Nov 27, 5:30 PM · Discovery-Search (Current work), Analytics, Event-Platform, Wikidata-Query-Service, Wikidata, Discovery
JAllemandou awarded T212824: notebook/stat server(s) running out of memory a Love token.
Wed, Nov 27, 5:12 PM · Analytics-Kanban, Product-Analytics, User-Elukey, Operations, Analytics
JAllemandou moved T238855: Add bot and change_type dimensions to geoeditors-daily from In Progress to In Code Review on the Analytics-Kanban board.
Wed, Nov 27, 5:00 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop from Next Up to In Code Review on the Analytics-Kanban board.
Wed, Nov 27, 5:00 PM · Analytics-Kanban, Analytics

Tue, Nov 26

JAllemandou added a comment to T212824: notebook/stat server(s) running out of memory.

@elukey: We should apply the same treatment for stat1007 :)

Tue, Nov 26, 10:05 PM · Analytics-Kanban, Product-Analytics, User-Elukey, Operations, Analytics
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Adding action-type to the base table makes the results extremely similar. The very small difference is due to the imperfect user data we join to in order to gather bot information. I hope this is ok for you @srishakatux and @bd808.

  • Data from Bryan's script:
wiki_db        wmcs_edits   total_edits   wmcs_percent
commonswiki        573263       4157574         13.79%
enwiki             414401       5851360          7.08%
wikidatawiki     12216608      19628856         62.24%
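
The percent column above is wmcs_edits divided by total_edits. A minimal sketch that recomputes it from the figures quoted above (the helper name `wmcs_percent` is made up for illustration):

```python
# (wiki_db, wmcs_edits, total_edits), copied from the figures above.
rows = [
    ("commonswiki", 573263, 4157574),
    ("enwiki", 414401, 5851360),
    ("wikidatawiki", 12216608, 19628856),
]


def wmcs_percent(wmcs_edits: int, total_edits: int) -> str:
    """Share of edits coming from WMCS, formatted with two decimals."""
    return f"{100 * wmcs_edits / total_edits:.2f}%"


for wiki_db, wmcs_edits, total_edits in rows:
    print(wiki_db, wmcs_percent(wmcs_edits, total_edits))
```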
Tue, Nov 26, 2:06 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Thanks for the command @bd808. We're adding the change_type dimension to the underlying dataset, allowing for more use-cases (wmcs included).
About bot-filtering: since we're dealing with edits and not pageviews, we use a different approach. We use mediawiki user-groups: a user in the bot group is flagged in the user_is_bot_by dimension with the value group. We also use a regexp (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/user/UserEventBuilder.scala#L24): a user whose username matches it is flagged with the value name in the column (that column is an array, so a user can have both values). Indeed detection is fragile :)
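
A minimal sketch of the two-way flagging described above. The pattern below is a hypothetical, simplified stand-in, not the regexp from UserEventBuilder.scala, and the function only mirrors the name of the user_is_bot_by column for readability.

```python
import re
from typing import List

# ASSUMPTION: simplified illustrative pattern, not the real one from refinery-source.
BOT_NAME_PATTERN = re.compile(r"bot\b", re.IGNORECASE)


def user_is_bot_by(user_name: str, user_groups: List[str]) -> List[str]:
    """Return the list of reasons a user is considered a bot (possibly empty)."""
    flags = []
    if "bot" in user_groups:
        flags.append("group")
    if BOT_NAME_PATTERN.search(user_name):
        flags.append("name")
    return flags


print(user_is_bot_by("ExampleBot", ["bot"]))  # flagged both ways
print(user_is_bot_by("Abbot", []))            # name-only match: a likely false positive
```

The second call shows why name-based detection is fragile: any username that happens to end in "bot" is flagged.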

Tue, Nov 26, 12:37 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou renamed T238855: Add bot and change_type dimensions to geoeditors-daily from Add bot edits to geoeditors-daily to Add bot and change_type dimensions to geoeditors-daily.
Tue, Nov 26, 12:28 PM · Patch-For-Review, Analytics-Kanban, Analytics

Fri, Nov 22

JAllemandou moved T238855: Add bot and change_type dimensions to geoeditors-daily from Next Up to In Code Review on the Analytics-Kanban board.
Fri, Nov 22, 5:02 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T238855: Add bot and change_type dimensions to geoeditors-daily.
Fri, Nov 22, 5:02 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated subscribers of T232671: Use Reportupdater for WMCS edits queries.

Analysis of non-edit cuc_type for 3 big wikis:

spark.sql("""
  select
    cuc_type,
    wiki_db,
    count(1) as c
  from wmf_raw.mediawiki_private_cu_changes
  where month = '2019-10'
    and wiki_db in ('enwiki', 'commonswiki', 'wikidatawiki')
  group by cuc_type, wiki_db
  order by cuc_type, wiki_db
""").show(100, false)
Fri, Nov 22, 2:31 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a comment to T234333: Import siteinfo dumps onto HDFS.

New task: T238858

Fri, Nov 22, 9:49 AM · Analytics-Kanban, Analytics

Thu, Nov 21

JAllemandou created T238858: Update wikitext-processing on hadoop various aspects.
Thu, Nov 21, 7:31 PM · Analytics
JAllemandou updated the task description for T238855: Add bot and change_type dimensions to geoeditors-daily.
Thu, Nov 21, 7:23 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T238855: Add bot and change_type dimensions to geoeditors-daily.
Thu, Nov 21, 7:22 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a subtask for T232671: Use Reportupdater for WMCS edits queries: T238855: Add bot and change_type dimensions to geoeditors-daily.
Thu, Nov 21, 7:18 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a parent task for T238855: Add bot and change_type dimensions to geoeditors-daily: T232671: Use Reportupdater for WMCS edits queries.
Thu, Nov 21, 7:18 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Task created: https://phabricator.wikimedia.org/T238855

Thu, Nov 21, 7:17 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou created T238855: Add bot and change_type dimensions to geoeditors-daily.
Thu, Nov 21, 7:17 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

Hi @srishakatux, thanks a lot for having double checked the data. The discrepancy is due to bot-edits being removed from geoeditors data. I'm very sorry not to have pinpointed that earlier :(
I'm assuming bot edits are important for your metric (a lot of bots are run from wmcs IIUC).
I think the best idea would be to add bots-edits to the geoeditors dataset, as it would be valuable for other analysis as well. We currently flag bots in 2 ways: by group (when the user is in the BOT group), and by name, when its name matches this regexp.
We (analytics) need to make a decision on how we want to add that info to the table, and then your requests should just work as is without difference, except for that one other thing: I think Bryan's script is counting more than edits as it doesn't filter for cuc_type (see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type).
Let's confirm the above plan works for you @srishakatux .

Thu, Nov 21, 7:10 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Tue, Nov 19

JAllemandou added a comment to T236698: Logging level of cassandra should be warning or error but not debug.

I'm glad I double checked. Unfortunately the above 2 patches will probably not be enough. I've done a quick analysis on logs from a mediarequest backfilling job (~100Gb logs):

val df = spark.read.text("/var/log/hadoop-yarn/apps/analytics/logs/application_1573208467349_68436")
Tue, Nov 19, 7:02 PM · Patch-For-Review, Analytics-Kanban, Analytics

Mon, Nov 18

JAllemandou renamed T238400: Evaluate possible replacements for Camus: Gobblin, Marmaray, etc. from Evaluate possible replacements for Camus: Gobblin, Marmaryan, etc. to Evaluate possible replacements for Camus: Gobblin, Marmaray, etc..
Mon, Nov 18, 4:41 PM · Event-Platform, Analytics
JAllemandou moved T237271: Create a script to ease the Oozie work while enabling kerberos in Hadoop from In Progress to Ready to Deploy on the Analytics-Kanban board.
Mon, Nov 18, 3:58 PM · Analytics-Kanban, Analytics
JAllemandou moved T238326: Make hdfs-rsync process sub-folders recursively from Next Up to In Progress on the Analytics-Kanban board.
Mon, Nov 18, 3:57 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T238400: Evaluate possible replacements for Camus: Gobblin, Marmaray, etc..

Another brand new tool we could have a look at: https://github.com/uber/marmaray

Mon, Nov 18, 10:35 AM · Event-Platform, Analytics

Fri, Nov 15

JAllemandou created T238432: Fix non MapReduce execution of GeoCode UDF.
Fri, Nov 15, 6:18 PM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Nov 14

JAllemandou created T238326: Make hdfs-rsync process sub-folders recursively.
Thu, Nov 14, 1:25 PM · Analytics-Kanban, Analytics
JAllemandou created T238304: Make hdfs-cleaner resilient to in-flight files deletion.
Thu, Nov 14, 9:28 AM · Analytics-Kanban, Analytics

Nov 7 2019

JAllemandou closed T237579: Wikistats data discrepancy for India page views from hive data pull as Invalid.
Nov 7 2019, 11:03 AM · Analytics-Wikistats, Analytics
JAllemandou added a comment to T237579: Wikistats data discrepancy for India page views from hive data pull .

Hi @Iflorez,
Data presented in Wikistats comes from AQS endpoints (see here for a list of pageviews related endpoints and their description).
The wikistats URL you have pasted above calls https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/mr.wikipedia.org/mobile-app/all-agents/monthly/2017100100/2019110700 to retrieve data.

Nov 7 2019, 11:03 AM · Analytics-Wikistats, Analytics
JAllemandou added a comment to T233661: Publish tls related info to webrequest via varnish.

Done! Example:

spark2-shell --master yarn --driver-memory 4G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=32 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
Nov 7 2019, 9:11 AM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou added a comment to T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available.

Done! Hiding the awful turnilo link under this.

Nov 7 2019, 9:02 AM · Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 7 2019, 8:59 AM · Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T233661: Publish tls related info to webrequest via varnish from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 7 2019, 8:53 AM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic

Nov 6 2019

JAllemandou added a comment to T237043: Keep edit counts in separate database table and update on edit.

My main concern about using analytics infra for serving page-count is the update delay. We update the data once every month, so for many pages the data will drift. Depending on the purpose of the count, maybe it's not critical, but I can't speak to that. If we decide to try to use Druid, let's discuss this in a meeting :)

Nov 6 2019, 2:06 PM · Core Platform Team Workboards (Green), MediaWiki-REST-API, CPT Initiatives (Core REST API in PHP)
JAllemandou moved T237271: Create a script to ease the Oozie work while enabling kerberos in Hadoop from Next Up to In Progress on the Analytics-Kanban board.
Nov 6 2019, 1:34 PM · Analytics-Kanban, Analytics
JAllemandou moved T233661: Publish tls related info to webrequest via varnish from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Nov 6 2019, 1:34 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Nov 6 2019, 1:34 PM · Analytics-Kanban, observability, Operations, Analytics, Traffic

Nov 5 2019

JAllemandou moved T233661: Publish tls related info to webrequest via varnish from In Progress to In Code Review on the Analytics-Kanban board.
Nov 5 2019, 1:03 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available from Next Up to In Code Review on the Analytics-Kanban board.
Nov 5 2019, 1:03 PM · Analytics-Kanban, observability, Operations, Analytics, Traffic

Nov 4 2019

JAllemandou added a comment to T237072: Correct namespace zero editor counts on geoeditors_monthly table on hive and druid.

Makes sense - Thanks for the explanation @mforns :)

Nov 4 2019, 7:51 PM · Product-Analytics, Analytics, Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T233661: Publish tls related info to webrequest via varnish.

I checked for message length in one day of webrequest, and we top at 4916 bytes.
I think Kafka will be fine as per message-size, and I assume it'll also be fine on global data growth.
Same for hadoop.
Let's make this happen.

Nov 4 2019, 7:51 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou added a comment to T233661: Publish tls related info to webrequest via varnish.

Thanks for the fast loop over format @Nuria and @BBlack.
Indeed having a single field named TLS formatted as described above is best (we could even go for smaller keys, but it makes it less human readable).
Hadoop will then process that into an explicit map, and we'll load them in turnilo as separate fields.
Something to keep in mind: Turnilo is sampled (1 over 128), so for instances with very small probability of appearance, full data processing in Spark will be needed.
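
A minimal sketch of why rare values need the full data. It assumes each request is kept independently with probability 1/128, which is a simplification of how the sampled dataset is actually built; both function names are hypothetical.

```python
SAMPLING_RATE = 128  # "1 over 128", as stated above


def estimate_total(sampled_count: int) -> int:
    """Scale a count observed in the sampled data back to an estimated total."""
    return sampled_count * SAMPLING_RATE


def probability_never_sampled(true_count: int) -> float:
    """Chance that none of `true_count` requests appears in the sample.

    ASSUMPTION: independent, uniform sampling of each request.
    """
    return (1 - 1 / SAMPLING_RATE) ** true_count


print(estimate_total(10))
print(round(probability_never_sampled(50), 3))
```

Under that assumption, a value seen only 50 times in the full data has roughly a two-in-three chance of not showing up in the sample at all.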

Nov 4 2019, 12:46 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic

Nov 1 2019

JAllemandou added a comment to T237072: Correct namespace zero editor counts on geoeditors_monthly table on hive and druid.

Great plan @Milimetric! One clarification: We only have geoeditors_daily up to 2019-09, so we can backfill from there but no further, and we need to nullify up to 2019-08.

hdfs dfs -ls /wmf/data/wmf/mediawiki_private/geoeditors_daily
Found 1 items
drwxr-x---   - analytics analytics-privatedata-users          0 2019-10-02 08:44 /wmf/data/wmf/mediawiki_private/geoeditors_daily/month=2019-09

Side note: This retention period feels small!! Is the deletion scheme deleting more than expected?

Nov 1 2019, 7:40 PM · Product-Analytics, Analytics, Patch-For-Review, Analytics-Kanban
JAllemandou renamed T237047: Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient) from Add data-purge for processed mediawiki_wikitext_history to Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient).
Nov 1 2019, 12:35 PM · Analytics
JAllemandou updated subscribers of T237047: Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient).

After reviewing data-deletion scripts, wikitext_history snapshots are deleted, but 6 of them are kept.
See https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L128 and https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots.
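
To make the "6 kept, 3 would be sufficient" difference concrete, here is a minimal sketch of keep-the-latest-N retention. It is not the logic of refinery-drop-mediawiki-snapshots; the function name and the snapshot labels are hypothetical.

```python
from typing import List


def snapshots_to_drop(snapshots: List[str], keep: int) -> List[str]:
    """Return the snapshots to delete so that only the `keep` most recent remain."""
    ordered = sorted(snapshots)  # YYYY-MM labels sort chronologically as strings
    if keep <= 0:
        return ordered
    return ordered[:-keep]


# Hypothetical snapshot labels, for illustration only.
existing = ["2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09"]
print(snapshots_to_drop(existing, keep=6))
print(snapshots_to_drop(existing, keep=3))
```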

Nov 1 2019, 12:34 PM · Analytics
JAllemandou added a comment to T236687: Check Avro as potential better file format for wikitext-history.

Table is filled with data converted manually, new job seems configured correctly. We'll know about it when dumps are fully imported.

Nov 1 2019, 9:38 AM · Analytics-Kanban, Analytics
JAllemandou moved T236687: Check Avro as potential better file format for wikitext-history from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 1 2019, 9:37 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T234333: Import siteinfo dumps onto HDFS.

Checked logs from this morning, they look good (nothing to import yet, but no error)

Nov 1 2019, 9:32 AM · Analytics-Kanban, Analytics
JAllemandou moved T234333: Import siteinfo dumps onto HDFS from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 1 2019, 9:32 AM · Analytics-Kanban, Analytics

Oct 31 2019

JAllemandou created T237047: Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient).
Oct 31 2019, 7:33 PM · Analytics
JAllemandou updated subscribers of T233661: Publish tls related info to webrequest via varnish.

Some maths about datasize increase using approximated ratios for values TLS-Version, Key-Exchange, Auth and Cipher from https://grafana.wikimedia.org/d/000000458/tls-ciphersuite-explorer?orgId=1. I have made assumptions that we need full-cipher, and took values from cipher as an approximation.

Oct 31 2019, 7:29 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T234333: Import siteinfo dumps onto HDFS from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 31 2019, 5:15 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T234591: Make job to backfill data from mediacounts into mediarequests tables in cassandra so as to have historical mediarequest data .

I think we should start backfilling from 2019 backwards so as to have a "continuous" dataset. For a live api that (in theory) can be queried, that probably makes more sense than having data ranges in 2015 and 2019 and nothing in between.

+1. Let's fix the missing days and start backfilling backward.

Oct 31 2019, 3:28 PM · Analytics-Kanban, Multimedia, Analytics, Tool-Pageviews
JAllemandou moved T233661: Publish tls related info to webrequest via varnish from Next Up to In Progress on the Analytics-Kanban board.
Oct 31 2019, 3:17 PM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
JAllemandou moved T236985: Understand why SQL string pattern matching differ from Hive to Spark from Next Up to Done on the Analytics-Kanban board.
Oct 31 2019, 3:07 PM · Analytics-Kanban, Analytics
JAllemandou claimed T236985: Understand why SQL string pattern matching differ from Hive to Spark.
Oct 31 2019, 3:07 PM · Analytics-Kanban, Analytics
JAllemandou added a project to T236985: Understand why SQL string pattern matching differ from Hive to Spark: Analytics-Kanban.
Oct 31 2019, 3:06 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T236985: Understand why SQL string pattern matching differ from Hive to Spark.

Doc updated: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Use_RLIKE_instead_of_LIKE_when_working_with_multi-line_text
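
As an analogy only (this is Python, not Hive or Spark), the sketch below shows how a LIKE pattern translated to a regular expression can stop matching multi-line text when `.` is not allowed to match newlines. That this is the mechanism behind the Hive behaviour is my reading of the linked material, so treat it as an assumption; the helper names are hypothetical and LIKE escape characters are ignored.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex: % becomes .*, _ becomes a single dot."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(text: str, pattern: str, dotall: bool) -> bool:
    """Emulate LIKE with or without letting wildcards span line breaks."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(like_to_regex(pattern), text, flags) is not None


text = "first line\n#REDIRECT [[Target]]"
print(like(text, "%REDIRECT%", dotall=False))  # wildcards stop at the newline
print(like(text, "%REDIRECT%", dotall=True))   # wildcards span lines
```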

Oct 31 2019, 3:05 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T236985: Understand why SQL string pattern matching differ from Hive to Spark.

Got nerd-snipped here, but I found it: https://issues.apache.org/jira/browse/HIVE-22008

Oct 31 2019, 2:45 PM · Analytics-Kanban, Analytics
JAllemandou moved T236687: Check Avro as potential better file format for wikitext-history from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 31 2019, 2:25 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T236687: Check Avro as potential better file format for wikitext-history.

Created T236985 about spark/hive like difference

Oct 31 2019, 2:18 PM · Analytics-Kanban, Analytics
JAllemandou created T236985: Understand why SQL string pattern matching differ from Hive to Spark.
Oct 31 2019, 9:07 AM · Analytics-Kanban, Analytics

Oct 30 2019

JAllemandou added a comment to T236895: ArticlePlaceholder dashboard stopped tracking page views.

I think this problem could be related to T226730 (preventing most Special:XXX pages to be flagged as pageviews).

Oct 30 2019, 3:01 PM · Analytics, wikidata-tech-focus, Wikidata-Campsite, Wikidata, ArticlePlaceholder
JAllemandou moved T236687: Check Avro as potential better file format for wikitext-history from In Progress to In Code Review on the Analytics-Kanban board.
Oct 30 2019, 12:10 PM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T233661: Publish tls related info to webrequest via varnish.

Thanks @Vgutierrez - I think representing those values in a map (or an array) is probably the easiest and most flexible way.
@elukey : Will you teach me varnish_kafka so that I can get those values from VCL_Log into a new field?

Oct 30 2019, 9:38 AM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic

Oct 29 2019

JAllemandou added a comment to T236687: Check Avro as potential better file format for wikitext-history.

I have generated avro files for the 2019-09 dumps and ran quite a few queries on them, with a limited amount of RAM needed per executor for spark, and without issue on Hive. Parquet read-optimizations were definitely what was causing issues here.
I did however hit an interesting finding:

Oct 29 2019, 8:46 PM · Analytics-Kanban, Analytics
JAllemandou moved T236687: Check Avro as potential better file format for wikitext-history from Next Up to In Progress on the Analytics-Kanban board.
Oct 29 2019, 7:42 PM · Analytics-Kanban, Analytics
JAllemandou moved T234333: Import siteinfo dumps onto HDFS from In Progress to In Code Review on the Analytics-Kanban board.
Oct 29 2019, 7:42 PM · Analytics-Kanban, Analytics
JAllemandou merged task T217350: Partition event-data daily instead of hourly (for sanitized data) into T236794: Find a strategy to mitigate small-files handling for long-term kept events.
Oct 29 2019, 3:01 PM · Analytics
JAllemandou merged T217350: Partition event-data daily instead of hourly (for sanitized data) into T236794: Find a strategy to mitigate small-files handling for long-term kept events.
Oct 29 2019, 3:01 PM · Analytics
JAllemandou added a project to T236794: Find a strategy to mitigate small-files handling for long-term kept events: Analytics.
Oct 29 2019, 3:01 PM · Analytics
JAllemandou created T236794: Find a strategy to mitigate small-files handling for long-term kept events.
Oct 29 2019, 2:13 PM · Analytics

Oct 28 2019

JAllemandou added a project to T236687: Check Avro as potential better file format for wikitext-history: Analytics-Kanban.
Oct 28 2019, 1:38 PM · Analytics-Kanban, Analytics
JAllemandou claimed T236687: Check Avro as potential better file format for wikitext-history.
Oct 28 2019, 1:38 PM · Analytics-Kanban, Analytics