
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (243 w, 5 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Today

JAllemandou added a comment to T234188: Taxonomy of new user reading patterns.

I looked at the code and have some comments, but not that many given the complexity of the analysis :) Good job @MGerlach!

Mon, Oct 14, 8:09 PM · Analytics, Research
JAllemandou updated the task description for T235448: Add Balinese wikipedia to analytics setup.
Mon, Oct 14, 7:41 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T232671: Use Reportupdater for WMCS edits queries.

The config file I have is here: https://github.com/srish/wmcs-edits/blob/master/config.yaml.

One comment on the value you picked for lag in the config. By default, Reportupdater runs a query with month granularity after the month is done. For instance it will run the query for the month of 2019-11 on 2019-12-01, with start_date = 2019-11-01 and end_date = 2019-12-01 (inclusive start_date, exclusive end_date). The lag parameter is the time you add to the default run date, to wait for data availability (or for any other reason). In our case, data is copied from the mysql databases onto the hadoop cluster on the 1st of the month, and in past months this has very regularly taken just a few hours. I therefore suggest using a lag of 1 day, also noting that rerunning reportupdater queries is just a matter of removing the file with incorrect data.
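As a sketch of the date arithmetic above (the helper and its names are hypothetical, not Reportupdater's actual code):

```python
from datetime import date, timedelta

def monthly_run_date(year, month, lag_days=1):
    """Day the report for (year, month) would run: the first day
    after the month ends, plus a lag for data availability."""
    first_of_next = date(year + (month == 12), month % 12 + 1, 1)
    return first_of_next + timedelta(days=lag_days)

# The 2019-11 report covers [2019-11-01, 2019-12-01) and, with a
# 1-day lag, runs on 2019-12-02.
assert monthly_run_date(2019, 11) == date(2019, 12, 2)
```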

Mon, Oct 14, 7:07 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou moved T235448: Add Balinese wikipedia to analytics setup from Next Up to In Code Review on the Analytics-Kanban board.
Mon, Oct 14, 6:29 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T235448: Add Balinese wikipedia to analytics setup.
Mon, Oct 14, 6:25 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T235448: Add Balinese wikipedia to analytics setup.
Mon, Oct 14, 6:22 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T235448: Add Balinese wikipedia to analytics setup.
Mon, Oct 14, 6:22 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a project to T235260: Analytics Access for Grant: Analytics-Kanban.
Mon, Oct 14, 3:32 PM · SRE-Access-Requests, Operations, Analytics-Kanban, Analytics
JAllemandou triaged T235260: Analytics Access for Grant as High priority.
Mon, Oct 14, 3:32 PM · SRE-Access-Requests, Operations, Analytics-Kanban, Analytics
JAllemandou moved T235260: Analytics Access for Grant from Incoming to Ops Week on the Analytics board.
Mon, Oct 14, 3:32 PM · SRE-Access-Requests, Operations, Analytics-Kanban, Analytics
JAllemandou triaged T235268: HivePartition (refinery::Hive.py) does not allow partition values to have dots (.) as High priority.
Mon, Oct 14, 3:31 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T235268: HivePartition (refinery::Hive.py) does not allow partition values to have dots (.) from Incoming to Operational Excellence on the Analytics board.
Mon, Oct 14, 3:31 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou triaged T235269: MediaWiki history dumps have some events in 2025 as High priority.
Mon, Oct 14, 3:30 PM · Analytics-Kanban, Analytics
JAllemandou moved T235269: MediaWiki history dumps have some events in 2025 from Incoming to Ops Week on the Analytics board.
Mon, Oct 14, 3:30 PM · Analytics-Kanban, Analytics
JAllemandou claimed T235269: MediaWiki history dumps have some events in 2025.
Mon, Oct 14, 3:30 PM · Analytics-Kanban, Analytics
JAllemandou closed T235278: browser dashboards not updated since 09/29 as Declined.
Mon, Oct 14, 3:30 PM · Analytics
JAllemandou triaged T235283: Add partition pruning for wmf.browser_general and interlanguage as High priority.
Mon, Oct 14, 3:29 PM · Analytics-Kanban, Analytics
JAllemandou moved T235283: Add partition pruning for wmf.browser_general and interlanguage from Operational Excellence to Ops Week on the Analytics board.
Mon, Oct 14, 3:29 PM · Analytics-Kanban, Analytics
JAllemandou moved T235283: Add partition pruning for wmf.browser_general and interlanguage from Incoming to Operational Excellence on the Analytics board.
Mon, Oct 14, 3:29 PM · Analytics-Kanban, Analytics
JAllemandou claimed T235283: Add partition pruning for wmf.browser_general and interlanguage.
Mon, Oct 14, 3:28 PM · Analytics-Kanban, Analytics
JAllemandou renamed T235283: Add partition pruning for wmf.browser_general and interlanguage from Add partition pruning for wmf.browser_general to Add partition pruning for wmf.browser_general and interlanguage.
Mon, Oct 14, 3:28 PM · Analytics-Kanban, Analytics
JAllemandou claimed T235409: Update mediawiki-history dumper to use project in file names.
Mon, Oct 14, 3:26 PM · Analytics-Kanban, Analytics
JAllemandou triaged T235409: Update mediawiki-history dumper to use project in file names as High priority.
Mon, Oct 14, 3:26 PM · Analytics-Kanban, Analytics
JAllemandou moved T235409: Update mediawiki-history dumper to use project in file names from Incoming to Ops Week on the Analytics board.
Mon, Oct 14, 3:26 PM · Analytics-Kanban, Analytics
JAllemandou triaged T235418: Add metadata to puppet about kerberos accounts as Normal priority.
Mon, Oct 14, 3:24 PM · Operations, Analytics
JAllemandou moved T235418: Add metadata to puppet about kerberos accounts from Incoming to Operational Excellence on the Analytics board.
Mon, Oct 14, 3:24 PM · Operations, Analytics
JAllemandou created T235409: Update mediawiki-history dumper to use project in file names.
Mon, Oct 14, 8:00 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T235269: MediaWiki history dumps have some events in 2025.
spark.sql("""
  SELECT wiki_db, event_entity, event_type, COUNT(1) AS c
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-09'
    AND event_timestamp > '2020-01-01 00:00:00'
  GROUP BY wiki_db, event_entity, event_type
  ORDER BY c DESC
  LIMIT 100""").show(100, false)
Mon, Oct 14, 7:59 AM · Analytics-Kanban, Analytics

Fri, Oct 11

JAllemandou added a comment to T222253: Upgrade Spark to 2.4.x.

There are a couple of jobs I'd like to check (mediawiki-history, checker, mobile-app-session jobs and wikidata jobs). If we can't run them in yarn, I'll run smaller versions in local mode.

Fri, Oct 11, 7:16 AM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-Cluster

Thu, Oct 10

JAllemandou updated subscribers of T232671: Use Reportupdater for WMCS edits queries.

Hi @srishakatux - Please excuse me again for another delayed answer - I do hope my personal issues are over now :)

Thu, Oct 10, 1:24 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a comment to T234684: Superset not able to load a reading dashboard .

Here is my understanding of the different things at play here:

  • Caching is done by the broker per query, per segment. This means that for a given set of query parameters (excluding interval boundaries), the cache stores/uses results for this query for every segment required by the query interval.
  • When using a line chart and a groupBy with a single dimension, Superset issues 2 queries: the first gathers the list of values to use (topN over the given time period, with ALL granularity), and the second gets split-by-time results for the precomputed list of values (topN over the given time period, with the selected granularity and an OR filter for the values list). It is interesting to notice that when using groupBy with 2 dimensions, Superset issues a single groupBy query, since the double-query optimization can only be done over one dimension.
  • When the query for a long time period fails, it fails on the second query (the one with the time split and value filter). The first one, which precomputes values, is always successful (at least in my tests). Something else to notice is that it is this first query that Superset displays when using "View query".
  • Preloading the segment-result cache for a long time period (4 years for instance) means running the second query for a small-enough time period (1 year worked in our case), allowing the query to succeed and therefore the cache to be loaded for the queried segments; then reissuing the query for a longer time period (2 years for instance), and so on, up to having cached data for most of the needed segments. I confirm that this strategy works: I have managed to get a 4-year chart in Superset.
  • The gotcha here is that this strategy cannot usually be transferred to Superset: when asking Superset for a 1-year chart, then a 2-year chart, it issues 2 queries each time, and the list of values parameterizing the second query will likely vary from the 1-year chart to the 2-year chart, meaning no caching is actually done for the 2-year chart when querying for the 1-year chart, as the query parameters are not the same. I overcame this limitation and managed to draw a 4-year chart in Superset by issuing manual queries to Druid with the correct parameters.
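The two-phase pattern described above can be sketched as follows (a hypothetical Python sketch; payload fields are simplified relative to real Superset/Druid queries):

```python
def values_query(datasource, dimension, interval, limit=10):
    # Phase 1: topN over the whole period with ALL granularity,
    # only to pick which dimension values to chart.
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "granularity": "all",
        "intervals": interval,
        "threshold": limit,
        "metric": "sum_view_count",
    }

def timeseries_query(datasource, dimension, interval, values, granularity):
    # Phase 2: the time-split query, filtered to the values precomputed
    # in phase 1 (this is the query that fails on long periods).
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "granularity": granularity,
        "intervals": interval,
        "threshold": len(values),
        "metric": "sum_view_count",
        "filter": {"type": "in", "dimension": dimension, "values": values},
    }
```

Because the phase-2 filter depends on the phase-1 result for that exact interval, two charts over different intervals generally produce different phase-2 queries, which is why the cache warmed for one interval does not help the other.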
Thu, Oct 10, 11:43 AM · Analytics-Kanban, Analytics

Thu, Oct 3

JAllemandou added a comment to T226663: Develop a tool or integrate feature in existing one to visualize WMCS edits data.

Understood @bd808 - Thanks for the answer :)

Thu, Oct 3, 5:45 PM · Analytics, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a comment to T209655: Copy Wikidata dumps to HDFs.

This is done, @GoranSMilovanovic.
Raw data is here: /user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902 and parquet data is here: /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902

Thu, Oct 3, 9:57 AM · Research-Backlog, Wikidata, Analytics
JAllemandou added a comment to T226663: Develop a tool or integrate feature in existing one to visualize WMCS edits data.

@Milimetric @JAllemandou Hello! Thank you both for your help. This project is a learning project for me :) So, for the pieces that are feasible for folks outside your team, I would love to give them a shot. I might be a bit slow and reach out to you via IRC with questions, if that is okay with you.
Perhaps setting up Reportupdater (I have a task for it already here: T232671), or the configuration file that needs to be created on Meta for generating the dashboard.

Thu, Oct 3, 8:22 AM · Analytics, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Tue, Oct 1

JAllemandou created T234333: Import siteinfo dumps onto HDFS.
Tue, Oct 1, 2:29 PM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T226663: Develop a tool or integrate feature in existing one to visualize WMCS edits data.

Hi @srishakatux - I'm sorry for not answering sooner, I've been sick the whole past week :(
Let's synchronize on how to generate the data.
In my mind the plan is to:

  1. have the network_origin column added to geoeditors-daily (T233504 and associated CR https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538613/)
  2. Create a reportupdater query generating the aggregated data with desired dimensions every month, storing results in a new file per month
Tue, Oct 1, 8:55 AM · Analytics, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Mon, Sep 30

JAllemandou moved T226663: Develop a tool or integrate feature in existing one to visualize WMCS edits data from Next Up to In Progress on the Analytics-Kanban board.
Mon, Sep 30, 4:27 PM · Analytics, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
JAllemandou added a comment to T232123: Parse wikidumps and extract redirect information for 1 small wiki, romanian .

Hi @MGerlach and @leila - the kids and I have been sick for almost all of last week, which explains my slow answers.
I have spent time trying to get a precise answer to the memory issue, but couldn't get as precise as I would have expected :(

Mon, Sep 30, 3:40 PM · Research, Analytics

Mon, Sep 23

JAllemandou added a comment to T233504: add whether an edit happened on cloud VPS to geoeditors-daily dataset .

Provided 2 patches based on the request above as examples. More discussion is probably needed depending on the required retention.
Example results for enwiki for 2019-08:

Mon, Sep 23, 1:24 PM · Patch-For-Review, Analytics-Kanban, Analytics, Cloud-Services, Developer-Advocacy (Jul-Sep 2019)
JAllemandou added a comment to T214093: Modern Event Platform: Schema Guidelines and Conventions.

Hi @Ottomata - I like the annotations subobject - Explicit for the win. I however have no good reason not to use the other solution.

Mon, Sep 23, 12:17 PM · Analytics-Kanban, CPT Initiatives (Modern Event Platform (TEC2)), Analytics, Better Use Of Data, Patch-For-Review, Product-Analytics, Goal, Services (watching), Analytics-EventLogging, Event-Platform
JAllemandou added a comment to T232123: Parse wikidumps and extract redirect information for 1 small wiki, romanian .

Hi @MGerlach,
Awesome results :)

Mon, Sep 23, 12:14 PM · Research, Analytics

Thu, Sep 19

JAllemandou added a comment to T212854: Upgrade ua parser to latest version for both java and python.

Documentation updated:

Thu, Sep 19, 9:38 AM · Patch-For-Review, Analytics-Kanban, Analytics

Wed, Sep 18

JAllemandou moved T231589: Mediarequests: Add endpoint for agreggated counts per file type per project from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Sep 18, 7:34 PM · Analytics-Kanban, Services (watching), Analytics
JAllemandou moved T232858: Add cassandra loading job for mediarequests per referer from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Sep 18, 7:34 PM · Analytics-Kanban, Services (watching), Analytics
JAllemandou moved T212854: Upgrade ua parser to latest version for both java and python from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Sep 18, 7:34 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T208612: Release edit data lake data as a public json dump /mysql dump, other? from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Sep 18, 7:34 PM · Patch-For-Review, Analytics-Kanban, Research-Backlog, Analytics
JAllemandou moved T232857: Add mediarequests per referer endpoint to AQS from Ready to Deploy to Done on the Analytics-Kanban board.
Wed, Sep 18, 7:34 PM · Patch-For-Review, Analytics-Kanban, Services (watching), Analytics
JAllemandou created T233215: ConfirmEdit seemingly erroneously enabled for some users on wikitech.
Wed, Sep 18, 1:22 PM · wikitech.wikimedia.org, ConfirmEdit (CAPTCHA extension), Operations
JAllemandou moved T232857: Add mediarequests per referer endpoint to AQS from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Sep 18, 12:10 PM · Patch-For-Review, Analytics-Kanban, Services (watching), Analytics
JAllemandou moved T232858: Add cassandra loading job for mediarequests per referer from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Sep 18, 12:10 PM · Analytics-Kanban, Services (watching), Analytics
JAllemandou moved T208612: Release edit data lake data as a public json dump /mysql dump, other? from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Sep 18, 12:10 PM · Patch-For-Review, Analytics-Kanban, Research-Backlog, Analytics
JAllemandou moved T212854: Upgrade ua parser to latest version for both java and python from In Progress to Ready to Deploy on the Analytics-Kanban board.
Wed, Sep 18, 12:10 PM · Patch-For-Review, Analytics-Kanban, Analytics

Tue, Sep 17

JAllemandou added a comment to T232123: Parse wikidumps and extract redirect information for 1 small wiki, romanian .

This is a great finding!
This file contains not only redirects, but also every other alias that might be useful :)
Awesome

Tue, Sep 17, 5:55 PM · Research, Analytics

Sep 13 2019

JAllemandou added a comment to T232843: Can we add ORES data so it can be easily retrieved per revision present on mediawiki history?.

+1 to that :)

Sep 13 2019, 8:02 PM · Analytics
JAllemandou added a comment to T232843: Can we add ORES data so it can be easily retrieved per revision present on mediawiki history?.

I mildly disagree with @Nuria on ORES scores - I think it could be very cool to have them (some models only, one model version only), maybe in a separate table, or in different dumps. As for parsoid, I don't really understand what that is.
Edited for typos.

Sep 13 2019, 4:00 PM · Analytics
JAllemandou added a comment to T212854: Upgrade ua parser to latest version for both java and python.

After yesterday's talk in tasking, here is the plan:

Sep 13 2019, 10:31 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou renamed T212854: Upgrade ua parser to latest version for both java and python from Upgrade n ua parser to latest version for both java and python to Upgrade ua parser to latest version for both java and python.
Sep 13 2019, 9:27 AM · Patch-For-Review, Analytics-Kanban, Analytics

Sep 12 2019

JAllemandou awarded T232707: Requesting access to analytics cluster for Martin Gerlach a Stroopwafel token.
Sep 12 2019, 4:10 PM · Analytics, Operations, SRE-Access-Requests

Sep 11 2019

kzimmerman awarded T232382: Discrepancies in Superset Pageview Data a Party Time token.
Sep 11 2019, 8:17 PM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou added a comment to T225211: Change event.mediawiki_revision_score schema to use map types.

I recall that as well @Ottomata - My note was purely theoretical :) I'm fully in favor of having a map keyed by model names.

Sep 11 2019, 8:08 PM · Analytics-Kanban, Patch-For-Review, Analytics, Event-Platform, Scoring-platform-team
JAllemandou added a comment to T225211: Change event.mediawiki_revision_score schema to use map types.

I like the idea of having the model name as map key. The only limitation I can think of is that only one model version can be reported for a revision, unless we put an array (or a version map!) as the map value ... which seems overkill.
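A minimal sketch of the map-keyed-by-model-name shape under discussion, with illustrative field names (this is not the final schema):

```python
# One entry per model; because the map key is the model name, a given
# event can only carry one version per model (the limitation noted above).
scores = {
    "damaging": {
        "model_version": "0.5.0",
        "prediction": ["false"],
        "probability": {"false": 0.9, "true": 0.1},
    },
}

# Consumers look scores up by model name instead of scanning a list.
damaging = scores["damaging"]
```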

Sep 11 2019, 7:41 PM · Analytics-Kanban, Patch-For-Review, Analytics, Event-Platform, Scoring-platform-team

Sep 10 2019

JAllemandou set the point value for T231856: Cleanup refinery artifacts folder from unneeded jars to 5.
Sep 10 2019, 6:59 PM · Analytics-Kanban, Analytics
JAllemandou set the point value for T232382: Discrepancies in Superset Pageview Data to 3.
Sep 10 2019, 6:59 PM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou set the point value for T228883: mediawiki-history-wikitext-coord job fails every month to 5.
Sep 10 2019, 6:59 PM · Analytics, Analytics-Kanban
JAllemandou moved T212854: Upgrade ua parser to latest version for both java and python from Next Up to In Progress on the Analytics-Kanban board.
Sep 10 2019, 6:59 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T231856: Cleanup refinery artifacts folder from unneeded jars from In Progress to In Code Review on the Analytics-Kanban board.
Sep 10 2019, 6:58 PM · Analytics-Kanban, Analytics
JAllemandou moved T228883: mediawiki-history-wikitext-coord job fails every month from In Progress to Paused on the Analytics-Kanban board.
Sep 10 2019, 6:58 PM · Analytics, Analytics-Kanban
JAllemandou added a comment to T228883: mediawiki-history-wikitext-coord job fails every month .

Snapshot 2019-07 was manually fixed using a manually uncompressed version of the problematic file.
Leaving this task open, paused, in case a similar problem occurs in the future.

Sep 10 2019, 6:58 PM · Analytics, Analytics-Kanban
JAllemandou added a comment to T232382: Discrepancies in Superset Pageview Data.

Doc updated here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Usage_notes

Sep 10 2019, 6:45 PM · Analytics-Kanban, Product-Analytics, Analytics
fdans awarded T232382: Discrepancies in Superset Pageview Data a Hungry Hippo token.
Sep 10 2019, 1:12 PM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou added a comment to T228883: mediawiki-history-wikitext-coord job fails every month .

Update with investigation results so far:
August job failed for dewiki only, with a decompression error:

java.lang.ArrayIndexOutOfBoundsException: 18002
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:732)
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:803)
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:506)
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:335)
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:425)
	at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:485)
	at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:504)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.wikimedia.wikihadoop.newapi.ByteMatcher.readUntilMatch(ByteMatcher.scala:116)
	at org.wikimedia.wikihadoop.newapi.ByteMatcher.readUntilMatch(ByteMatcher.scala:110)
	at org.wikimedia.wikihadoop.newapi.MediawikiXMLRevisionInputFormat$MediawikiXMLRevisionRecordReader.readNextRevision(MediawikiXMLRevisionInputFormat.scala:256)
	at org.wikimedia.wikihadoop.newapi.MediawikiXMLRevisionInputFormat$MediawikiXMLRevisionRecordReader.nextKeyValue(MediawikiXMLRevisionInputFormat.scala:208)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1836)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I have tracked down the error to this precise file: /wmf/data/raw/mediawiki/xmldumps/pages_meta_history/20190801/dewiki/dewiki-20190801-pages-meta-history6.xml-p9471825p9637791.bz2. None of the other dewiki files cause problems.

Sep 10 2019, 9:13 AM · Analytics, Analytics-Kanban
elukey awarded T232382: Discrepancies in Superset Pageview Data a Like token.
Sep 10 2019, 8:20 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou moved T232382: Discrepancies in Superset Pageview Data from Next Up to In Code Review on the Analytics-Kanban board.
Sep 10 2019, 8:11 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou claimed T232382: Discrepancies in Superset Pageview Data.
Sep 10 2019, 8:11 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou added a comment to T232382: Discrepancies in Superset Pageview Data.

Thanks for reporting @kzimmerman :)
The reason for the discrepancy is the aggregation function of the Druid query generated when using Superset metric aggregation. When choosing the view_count metric in a Superset chart and then picking the SUM aggregation function, the aggregation is managed by Superset and uses floatSum, which leads to inaccuracies because big numbers are summed with 32 bits instead of 64. The predefined SUM(view_count) should be picked instead: it is manually defined to use doubleSum.
Druid queries run from a stat machine show different results using floatSum vs doubleSum:

  • doubleSum (correct - matches turnilo and wikistats2)
curl -X POST -H 'content-type: application/json' -d '
{
  "queryType": "timeseries",
  "dataSource": "pageviews_daily",
  "aggregations": [
    {
      "fieldName": "view_count",
      "fieldNames": [
        "view_count"
      ],
      "type": "doubleSum",
      "name": "sum_view_count"
    }
  ],
  "granularity": "all",
  "postAggregations": [],
  "intervals": "2019-07-01/2019-08-01",
  "filter": {
    "type": "selector",
    "dimension": "agent_type",
    "value": "user"
  }
}
' http://druid1001.eqiad.wmnet:8082/druid/v2/?pretty
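The float32 rounding effect is easy to reproduce without Druid: this sketch emulates 32-bit accumulation with Python's struct module (the counts are made up for illustration):

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def f32_sum(values):
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)  # per-row rounding, floatSum-style
    return acc

# Once the accumulator passes 2^25, float32 can no longer represent
# +1 increments: the sum silently stops growing.
daily = [16_777_216.0, 16_777_216.0] + [1.0] * 100
print(sum(daily))      # 33554532.0 (64-bit, doubleSum-style, correct)
print(f32_sum(daily))  # 33554432.0 (32-bit, floatSum-style, 100 views lost)
```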
Sep 10 2019, 8:10 AM · Analytics-Kanban, Product-Analytics, Analytics

Sep 6 2019

JAllemandou placed T217848: Sqoop: remove cuc_comment and join to comment table up for grabs.
Sep 6 2019, 7:01 AM · Analytics
JAllemandou moved T217848: Sqoop: remove cuc_comment and join to comment table from In Progress to Paused on the Analytics-Kanban board.
Sep 6 2019, 7:01 AM · Analytics
JAllemandou claimed T232171: event_user_id is always NULL for anonymous edits in Mediawiki History table.
Sep 6 2019, 7:00 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou moved T232171: event_user_id is always NULL for anonymous edits in Mediawiki History table from Next Up to In Code Review on the Analytics-Kanban board.
Sep 6 2019, 7:00 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou added a project to T232171: event_user_id is always NULL for anonymous edits in Mediawiki History table: Analytics-Kanban.
Sep 6 2019, 7:00 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou added a comment to T232171: event_user_id is always NULL for anonymous edits in Mediawiki History table.

Good point @nettrom_WMF, there is indeed a discrepancy between the doc and reality, introduced with the database change of using a normalized table for user information (the actor table) instead of storing all user information (mostly user_id and user_text) in the main tables (revision, logging, etc). With this change, the user_id of anonymous users is always null in mediawiki_history (as in the original database), meaning you can't rely on it as a proxy for deleted revisions. You should use revision_deleted_parts instead, even if it's not as easy (sorry :S). Updating the docs right now.

Sep 6 2019, 6:57 AM · Analytics-Kanban, Product-Analytics, Analytics

Sep 5 2019

JAllemandou awarded T232123: Parse wikidumps and extract redirect information for 1 small wiki, romanian a Love token.
Sep 5 2019, 4:51 PM · Research, Analytics
JAllemandou updated subscribers of T231856: Cleanup refinery artifacts folder from unneeded jars.

Question for @Ottomata and @Nuria : Do we prefer to move old jars to new ones and get rid of every jar older than version X, or do we get rid of currently-unused jars only (which already represents a huge win)?

Sep 5 2019, 12:54 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T231856: Cleanup refinery artifacts folder from unneeded jars.

Audit of versioned jar files needed (non-versioned files should always be there):

  • In puppet repo:
jar                        defined in
camus-wmf-0.1.0-wmf9.jar   modules/profile/manifests/analytics/refinery/job/camus.pp and test/camus.pp
refinery-camus-0.0.90.jar  modules/profile/manifests/analytics/refinery/job/camus.pp and test/camus.pp
refinery-job-0.0.83.jar    modules/profile/manifests/analytics/refinery/job/druid_load.pp
refinery-job-0.0.94.jar    modules/profile/manifests/analytics/refinery/job/refine.pp
refinery-job-0.0.97.jar    modules/profile/manifests/analytics/refinery/job/test/refine.pp
Sep 5 2019, 12:52 PM · Analytics-Kanban, Analytics
JAllemandou moved T217848: Sqoop: remove cuc_comment and join to comment table from Next Up to In Progress on the Analytics-Kanban board.
Sep 5 2019, 8:18 AM · Analytics
JAllemandou added a comment to T217848: Sqoop: remove cuc_comment and join to comment table.

Changing sqoop to join a base table to the comment one in the sqoop SQL query has been tested for mediawiki-history and has led to unacceptable performance (too long).
We now sqoop the comment and actor tables from the analytics production replicas for acceptable performance, and join the base tables to the actor and comment ones in Spark for mediawiki-history.
I suggest using the same approach for cu_changes, meaning waiting for the actor and comment tables to be present, and joining to them.
Waiting for comments before starting to work in that direction.

Sep 5 2019, 8:18 AM · Analytics
JAllemandou moved T228883: mediawiki-history-wikitext-coord job fails every month from Next Up to In Progress on the Analytics-Kanban board.
Sep 5 2019, 7:48 AM · Analytics, Analytics-Kanban

Sep 4 2019

JAllemandou moved T231856: Cleanup refinery artifacts folder from unneeded jars from Next Up to In Progress on the Analytics-Kanban board.
Sep 4 2019, 5:46 PM · Analytics-Kanban, Analytics
JAllemandou moved T231002: Refactor quenename into HQL hive2 action oozie jobs from Ready to Deploy to Done on the Analytics-Kanban board.
Sep 4 2019, 5:46 PM · Analytics-Kanban, Analytics
JAllemandou moved T231787: Correct oozie jobs parameterization from Ready to Deploy to Done on the Analytics-Kanban board.
Sep 4 2019, 5:46 PM · Analytics-Kanban, Analytics
JAllemandou moved T228747: Review all the oozie coordinators/bundles in Refinery to add alerting when missing from Ready to Deploy to Done on the Analytics-Kanban board.
Sep 4 2019, 5:46 PM · Analytics-Kanban, Analytics, Wikimedia-Portals
JAllemandou moved T215655: Generate edit totals by country by month/year from Ready to Deploy to Done on the Analytics-Kanban board.
Sep 4 2019, 5:46 PM · Patch-For-Review, Analytics-Kanban, Analytics

Sep 3 2019

JAllemandou added a comment to T231874: Rename oozie edit_hourly job.

Could we change the oozie jobs and the hdfs path, and keep the druid name?

Sep 3 2019, 1:39 PM · Analytics
JAllemandou created T231874: Rename oozie edit_hourly job.
Sep 3 2019, 1:24 PM · Analytics
JAllemandou created T231856: Cleanup refinery artifacts folder from unneeded jars.
Sep 3 2019, 8:32 AM · Analytics-Kanban, Analytics

Sep 2 2019

JAllemandou created T231828: Change HDFS balancer threshold.
Sep 2 2019, 3:57 PM · Analytics-Kanban, Analytics
JAllemandou moved T215655: Generate edit totals by country by month/year from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 2 2019, 3:06 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T228747: Review all the oozie coordinators/bundles in Refinery to add alerting when missing from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 2 2019, 3:06 PM · Analytics-Kanban, Analytics, Wikimedia-Portals
JAllemandou moved T231787: Correct oozie jobs parameterization from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 2 2019, 3:06 PM · Analytics-Kanban, Analytics
JAllemandou moved T231787: Correct oozie jobs parameterization from Next Up to In Code Review on the Analytics-Kanban board.
Sep 2 2019, 8:44 AM · Analytics-Kanban, Analytics