JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (191 w, 6 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Mon, Oct 15

JAllemandou created T207012: Refactor refinery/oozie/mediawiki/load to use a loop instead of repeating code.
Mon, Oct 15, 11:19 AM · Analytics
JAllemandou closed T206883: mediawiki_history datasets have null user_text for IP edits as Declined.
Mon, Oct 15, 7:40 AM · Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

The event_user_text field in wmf.mediawiki_history holds what can be called the "current" value of the username, as opposed to the value it had at the time the edit was made, which is stored in event_user_text_historical. Since IPs are stored at edit time and we don't build user history for IPs, their values are stored in event_user_text_historical:

select count(distinct event_user_text) from wmf.mediawiki_history where snapshot='2018-09' and event_entity = 'revision' and event_user_is_anonymous and SUBSTR(event_timestamp, 0, 7) >= '2018-08'
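
A minimal sketch of the distinction described above, assuming each event is available as a plain dictionary keyed by the column names. The helper name editor_name and the choice to return the current name for registered users are illustrative assumptions, not part of the dataset:

```python
def editor_name(event):
    """Pick the user text that is expected to be populated for an event.

    Assumption: for anonymous (IP) edits only the edit-time value in
    event_user_text_historical is filled in, while registered users
    also carry a "current" name in event_user_text.
    """
    if event["event_user_is_anonymous"]:
        return event["event_user_text_historical"]
    return event["event_user_text"]


# Hypothetical sample rows, for illustration only.
ip_edit = {
    "event_user_is_anonymous": True,
    "event_user_text": None,
    "event_user_text_historical": "192.0.2.10",
}
registered_edit = {
    "event_user_is_anonymous": False,
    "event_user_text": "NewName",
    "event_user_text_historical": "OldName",
}
```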
Mon, Oct 15, 7:40 AM · Product-Analytics, Analytics-Data-Quality, Analytics

Wed, Oct 10

JAllemandou created T206687: Cleanup refinery artifact folder from old jars.
Wed, Oct 10, 6:21 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T170606: Add Accept header to webrequest logs.

@Ottomata - After standup please, as today is kids-day for me :)

Wed, Oct 10, 8:14 AM · Patch-For-Review, Analytics-Kanban, Operations, Traffic, Services (blocked), Analytics

Tue, Oct 9

JAllemandou added a comment to T153821: wikitech.wikimedia.org missing from pageviews API.

Right, I get it now :)
We discussed with the team and our plan is to change how we detect/filter pageviews from a domain perspective. We'll update the doc as needed when we're done refactoring.
In the meantime, we're going to add wikitech.wikimedia.org to the regex, providing pageviews for wikitech site.

Tue, Oct 9, 3:33 PM · Analytics, Pageviews-API, wikitech.wikimedia.org, Cloud-Services
JAllemandou added a comment to T153821: wikitech.wikimedia.org missing from pageviews API.

Why is this not documented on the wiki creation page?

I don't understand what the 'wiki creation page' is, but I think the current doc is correct about wikitech not being a pageview (https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters).

Tue, Oct 9, 8:41 AM · Analytics, Pageviews-API, wikitech.wikimedia.org, Cloud-Services
JAllemandou added a comment to T153821: wikitech.wikimedia.org missing from pageviews API.

wikitech is not part of the projects to account for in PageviewDefinition code (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L77). Easy to change if needed.

Tue, Oct 9, 7:50 AM · Analytics, Pageviews-API, wikitech.wikimedia.org, Cloud-Services

Fri, Oct 5

JAllemandou moved T164020: Use spark to split webrequest on tags from Paused to In Progress on the Analytics-Kanban board.
Fri, Oct 5, 4:01 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T202490: Automate XML-to-parquet transformation for XML dumps (oozie job) from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Fri, Oct 5, 3:56 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T202489: Copy monthly XML files from public-dumps to HDFS from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Fri, Oct 5, 3:56 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou added a comment to T206279: Hive join fails when using a HiveServer2 client.

Some more info on the issue as I understand it: given the size of the data involved and the fact that hive.auto.convert.join is set to true by default, Hive converts the join into a map-side join, and even more, into a LOCAL map-side join. This means that the join happens locally on the client instead of in a cluster container. When using the hive client, local memory can be configured using the HADOOP_HEAPSIZE environment variable. However, when using hue or beeline, local actually means in HiveServer2.
The solution provided above (SET hive.auto.convert.join=false;) will work in any case, but it could be interesting to try to raise HiveServer2 available memory and see if it solves the issue.
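
To illustrate why memory matters here, the following is a simplified sketch of the general idea behind a map-side (hash) join: the smaller table is loaded entirely into the memory of the process doing the join, and the larger table is streamed against it. It is not Hive's implementation; the function name and the dictionary-based rows are assumptions made for the example:

```python
def map_side_join(small_rows, big_rows, key):
    """Join two row collections on `key` by holding the small side in memory."""
    # The whole small table is held in a hash map. This is the part that
    # consumes memory in whichever process runs the join.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # The big table is streamed row by row and probed against the hash map.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


small = [{"id": 1, "name": "a"}]
big = [{"id": 1, "v": 10}, {"id": 2, "v": 20}, {"id": 1, "v": 30}]
joined = list(map_side_join(small, big, "id"))
```

If the "small" side does not fit in the memory of the process holding the hash map, the join fails, which is consistent with the behavior described in the comment.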

Fri, Oct 5, 7:35 AM · Analytics-Cluster, Analytics, Contributors-Analysis, Product-Analytics

Mon, Oct 1

Milimetric awarded T205457: Why do dumps and pageview api have slightly different counts? a Insectivore token.
Mon, Oct 1, 2:31 PM · Analytics-Kanban

Fri, Sep 28

JAllemandou moved T203175: Add endpoints to RESTBase for new WKS2 endpoints from Ready to Deploy to Done on the Analytics-Kanban board.
Fri, Sep 28, 4:02 PM · Analytics-Kanban

Thu, Sep 27

JAllemandou moved T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics from Ready to Deploy to Done on the Analytics-Kanban board.
Thu, Sep 27, 8:38 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T203403: Correct end point documentation about parameter specification on AQS api docs from Ready to Deploy to Done on the Analytics-Kanban board.
Thu, Sep 27, 8:38 PM · Analytics-Kanban
JAllemandou moved T202490: Automate XML-to-parquet transformation for XML dumps (oozie job) from In Progress to In Code Review on the Analytics-Kanban board.
Thu, Sep 27, 8:38 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T203258: Add project-families to AQS for additive-metrics from Ready to Deploy to Done on the Analytics-Kanban board.
Thu, Sep 27, 8:32 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T205457: Why do dumps and pageview api have slightly different counts? from Next Up to Done on the Analytics-Kanban board.
Thu, Sep 27, 1:46 PM · Analytics-Kanban
JAllemandou claimed T205457: Why do dumps and pageview api have slightly different counts?.
Thu, Sep 27, 1:45 PM · Analytics-Kanban
JAllemandou added a comment to T205457: Why do dumps and pageview api have slightly different counts?.

I have investigated this topic and have some answers.

Thu, Sep 27, 1:45 PM · Analytics-Kanban
JAllemandou renamed T205620: Escape end-of-line or other special characters from pageview-dumps generation from Escape end-of-line special characters from pageview-dumps generation to Escape end-of-line or other special characters from pageview-dumps generation.
Thu, Sep 27, 1:45 PM · Analytics
JAllemandou created T205620: Escape end-of-line or other special characters from pageview-dumps generation.
Thu, Sep 27, 1:26 PM · Analytics
JAllemandou added a comment to T204981: Keeping Node services documentation in sync.

I agree with @Nuria on option 3 being the preferred solution for the AQS use-case.
The solution I'd like would be a single RESTBase module whose job would be to redirect any call to https://wikimedia.org/api/rest_v1/metrics/{PARAMS} to another external endpoint base to which PARAMS would be appended, without manipulating the parameters. This would let AQS act as the main server for requests.
I think however the use case for AQS might be simplistic in comparison to others, so the simplistic solution might not be the best.
Happy to talk about it further.
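
The redirect idea can be sketched as a pure URL rewrite. This is only an illustration of the mapping, not a RESTBase module; the function name and the backend base URL used in the example are hypothetical:

```python
METRICS_PREFIX = "https://wikimedia.org/api/rest_v1/metrics/"


def redirect_target(request_url, backend_base):
    """Return the backend URL for a metrics request, or None if not a metrics call.

    Everything after the metrics prefix (PARAMS) is appended untouched to
    the backend base, so the backend stays responsible for validating it.
    """
    if not request_url.startswith(METRICS_PREFIX):
        return None
    params = request_url[len(METRICS_PREFIX):]
    return backend_base.rstrip("/") + "/" + params


# Hypothetical backend base, for illustration only.
target = redirect_target(
    METRICS_PREFIX + "pageviews/top/en.wikipedia/all-access/2018/09/01",
    "http://aqs.example/v1/",
)
```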

Thu, Sep 27, 12:42 PM · Services (designing), Reading-Infrastructure-Team-Backlog
JAllemandou created T205617: Update datasets to have explicit timestamp for druid indexation facilitation.
Thu, Sep 27, 12:17 PM · Analytics-Kanban, Analytics

Wed, Sep 26

JAllemandou moved T203403: Correct end point documentation about parameter specification on AQS api docs from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Wed, Sep 26, 4:41 PM · Analytics-Kanban
JAllemandou added a comment to T204572: Add index to mediawiki_page_create_3.

> @nettrom_WMF what can we do to get your queries moved to Hive instead of MySQL? You can use the hive table event.mediawiki_page_create.

Wed, Sep 26, 4:01 PM · Analytics-Kanban, Analytics

Tue, Sep 25

JAllemandou moved T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics from In Progress to Ready to Deploy on the Analytics-Kanban board.
Tue, Sep 25, 5:44 PM · Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

Jobs have been successful for the past months. However rerunning the jobs manually made the data-points appear. This is very bizarre.
Let's keep this open and monitor next month.

Tue, Sep 25, 3:06 PM · Wikidata-Campsite-Iteration-∞, User-Addshore, Patch-For-Review, Wikidata-Campsite, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T205367: Attempting to select all columns of mediawiki_history sometimes fails with a cryptic error message.

Hi @Neil_P._Quinn_WMF ,
The error in the first query comes from map memory being too small for the query you run. This can be solved by setting the memory limit for maps higher: SET mapreduce.map.memory.mb=4096;
Also, hue is not good at giving you proper error messages. I got the informative error message by running the query in a CLI from stat1004 (or 5).
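
A small sketch of how the setting and the query could be combined into a single CLI invocation, assuming the hive client and its -e option are used to pass the statements. The helper name and the sample query are illustrative assumptions:

```python
def hive_cli_command(query, settings):
    """Build an argument list that runs `query` with session settings applied first."""
    # Each setting becomes a "SET key=value;" statement ahead of the query.
    preamble = "".join(f"SET {key}={value}; " for key, value in settings.items())
    return ["hive", "-e", preamble + query]


command = hive_cli_command(
    "SELECT 1",  # placeholder query
    {"mapreduce.map.memory.mb": 4096},
)
```

The resulting list can be handed to subprocess.run on a host where the hive client is installed.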

Tue, Sep 25, 7:35 AM · Analytics-Kanban, Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics
JAllemandou added a comment to T192959: Only hdfs (or authenticated user) should be able to run Druid indexing jobs.

The implemented solution is not a real one: it's an oozie check preventing running indexations on production datasources when user is not hdfs.
I think this task is about a real way to prevent indexations.

Tue, Sep 25, 7:21 AM · User-Elukey, Analytics

Thu, Sep 20

JAllemandou created T204958: Improve AQS request-parameter validation/normalization.
Thu, Sep 20, 2:40 PM · Analytics
JAllemandou moved T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics from Next Up to In Progress on the Analytics-Kanban board.
Thu, Sep 20, 2:01 PM · Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T204537: Document on wikitech presto choice for data lake .

See https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto

Thu, Sep 20, 2:00 PM · Analytics-Kanban
JAllemandou moved T204537: Document on wikitech presto choice for data lake from Next Up to In Code Review on the Analytics-Kanban board.
Thu, Sep 20, 2:00 PM · Analytics-Kanban
JAllemandou added a comment to T204765: Try multi group by in druid 0.11 with current data.

Tested with variations of this query:

time curl -L -H'Content-Type: application/json' -XPOST --data-binary '
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "mediawiki_history_reduced_2018_08",
      "granularity": "day",
      "dimensions": [
        "user_text",
         {
            "type": "extraction",
            "dimension": "__time",
            "outputName": "time",
            "extractionFn": {
              "type": "timeFormat",
              "format": "yyyy-MM-ddZZ",
              "timeZone": "UTC",
              "locale": "en-US"
            }
          }
      ],
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "event_entity", "value": "revision" },
          { "type": "regex", "dimension": "project", "pattern": ".*\\.wikipedia" }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "events", "fieldName": "events" }
      ],
      "postAggregations": [],
      "intervals": [ "2018-01-01T00:00:00/2018-03-01T00:00:00" ]
    }
  },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "bound", "dimension": "events", "lower": "1", "upper": "4" , "ordering": "numeric" }
    ]
  },
  "granularity": "all",
  "dimensions": [ "time" ],
  "limitSpec": { "type": "default", "limit": 1000 },
  "aggregations": [
    { "type": "count", "name": "users" }
  ],
  "postAggregations": [],
  "intervals": [ "2018-08-01T00:00:00/2018-09-01T00:00:00" ],
  "context": {
    "groupByStrategy": "v2"
  }
}
' http://druid1004.eqiad.wmnet:8082/druid/v2/
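
For readers less familiar with nested groupBy queries, here is a rough in-memory sketch of the two aggregation levels in the query above: the inner level sums events per user and day, and the outer level counts, per day, the users whose total falls within the bound. It assumes rows are already filtered to the wanted entity and projects, and that the bound is inclusive on both ends; the function name and row shape are assumptions for the example:

```python
from collections import Counter, defaultdict


def users_per_day_with_few_events(rows, lower=1, upper=4):
    """Count, for each day, the users whose summed events lie in [lower, upper]."""
    # Inner groupBy: sum of events per (user_text, day).
    per_user_day = defaultdict(int)
    for row in rows:
        per_user_day[(row["user_text"], row["day"])] += row["events"]

    # Outer groupBy: filter on the summed value, then count users per day.
    users = Counter()
    for (_user, day), events in per_user_day.items():
        if lower <= events <= upper:
            users[day] += 1
    return dict(users)


# Hypothetical sample data, for illustration only.
sample = [
    {"user_text": "A", "day": "2018-08-01", "events": 2},
    {"user_text": "A", "day": "2018-08-01", "events": 3},
    {"user_text": "B", "day": "2018-08-01", "events": 1},
    {"user_text": "B", "day": "2018-08-02", "events": 4},
    {"user_text": "C", "day": "2018-08-01", "events": 4},
]
result = users_per_day_with_few_events(sample)
```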
Thu, Sep 20, 1:24 PM · Analytics-Kanban, Analytics
JAllemandou moved T204765: Try multi group by in druid 0.11 with current data from Next Up to Done on the Analytics-Kanban board.
Thu, Sep 20, 1:17 PM · Analytics-Kanban, Analytics
JAllemandou set the point value for T204765: Try multi group by in druid 0.11 with current data to 1.
Thu, Sep 20, 1:17 PM · Analytics-Kanban, Analytics

Tue, Sep 18

MusikAnimal awarded T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics a Love token.
Tue, Sep 18, 4:22 PM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics.
Tue, Sep 18, 1:43 PM · Patch-For-Review, Analytics-Kanban
JAllemandou created T204707: Update top-(editor/pages) endpoints in AQS to follow top-pageviews semantics.
Tue, Sep 18, 1:43 PM · Patch-For-Review, Analytics-Kanban

Sep 11 2018

JAllemandou added a comment to T203906: Negative total number of bytes for German Wikipedia in 2001?.

The closest to total number of bytes we have is the metric you have checked. As explained previously, the negative aspect of it is due to historical changes, and should be negligible after 2005. For before, there is no easy solution as of now.

Sep 11 2018, 12:09 PM · Analytics, Analytics-Wikistats

Sep 10 2018

JAllemandou added a comment to T203906: Negative total number of bytes for German Wikipedia in 2001?.

My understanding of the question is: how is it possible that the global sum of net bytes since the beginning of de.wikipedia.org can be negative?

Sep 10 2018, 5:27 PM · Analytics, Analytics-Wikistats
JAllemandou added a comment to T203826: Decide whether enable per-editor edits stats (community decision).

The data is not public as there are no public stats of edits per country. Your edit history does not include geographical locations where edits were made.

Sep 10 2018, 4:26 PM · Community-consensus-needed, Analytics

Sep 7 2018

JAllemandou added a comment to T203826: Decide whether enable per-editor edits stats (community decision).

Thank you for the quick response! We definitely don't want to enable the tools "silently". As for "perception" of privacy breach, this is exactly what we want to avoid. So let's make sure we don't hurt any feelings and don't enable the tool for now.

Sep 7 2018, 5:55 PM · Community-consensus-needed, Analytics
JAllemandou renamed T203826: Decide whether enable per-editor edits stats (community decision) from Per-editor edits stats - Community-liaison help to Decide whether enable per-editor edits stats (community decision).
Sep 7 2018, 5:42 PM · Community-consensus-needed, Analytics
JAllemandou added a project to T203824: Fix download-project-namespace-map script to send alert if it fails: Analytics.
Sep 7 2018, 5:40 PM · Analytics
JAllemandou created T203826: Decide whether enable per-editor edits stats (community decision).
Sep 7 2018, 5:38 PM · Community-consensus-needed, Analytics
JAllemandou created T203824: Fix download-project-namespace-map script to send alert if it fails.
Sep 7 2018, 5:34 PM · Analytics

Sep 6 2018

nettrom_WMF awarded T186559: Provide data dumps in the Analytics Data Lake a Love token.
Sep 6 2018, 11:02 PM · Research, Analytics

Sep 4 2018

JAllemandou moved T202490: Automate XML-to-parquet transformation for XML dumps (oozie job) from Next Up to In Progress on the Analytics-Kanban board.
Sep 4 2018, 4:15 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou added a project to T202490: Automate XML-to-parquet transformation for XML dumps (oozie job): Analytics-Kanban.
Sep 4 2018, 4:15 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T203403: Correct end point documentation about parameter specification on AQS api docs from Next Up to In Code Review on the Analytics-Kanban board.
Sep 4 2018, 4:11 PM · Analytics-Kanban
JAllemandou moved T202489: Copy monthly XML files from public-dumps to HDFS from In Progress to In Code Review on the Analytics-Kanban board.
Sep 4 2018, 4:11 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T203175: Add endpoints to RESTBase for new WKS2 endpoints from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 4 2018, 4:11 PM · Analytics-Kanban
JAllemandou added a comment to T172532: Refactor analytics cronjobs to alarm on failure reliably.

The good thing is that all the crons become systemd units, that have their own journald configuration (so stdout is saved, and stderr should be as well but not separately) and they can trigger failure/alarms if the return code of the script is non zero. We can also ensure that only one occurrence of the script/cron/unit is executed at the same time (for example, if a long job is still running systemd will not create another occurrence of the timer/script until the other completes).

Sep 4 2018, 7:12 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics

Aug 31 2018

JAllemandou renamed T203258: Add project-families to AQS for additive-metrics from Add project-families to backend for additive-metrics to Add project-families to AQS for additive-metrics.
Aug 31 2018, 3:19 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T203258: Add project-families to AQS for additive-metrics from In Progress to In Code Review on the Analytics-Kanban board.
Aug 31 2018, 3:19 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T203258: Add project-families to AQS for additive-metrics from Next Up to In Progress on the Analytics-Kanban board.
Aug 31 2018, 3:19 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou triaged T203258: Add project-families to AQS for additive-metrics as High priority.
Aug 31 2018, 3:18 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou changed the point value for T201620: Fix mediawiki-history-druid oozie job from 3 to 5.
Aug 31 2018, 10:04 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T203175: Add endpoints to RESTBase for new WKS2 endpoints from Next Up to In Code Review on the Analytics-Kanban board.
Aug 31 2018, 9:59 AM · Analytics-Kanban
JAllemandou claimed T203175: Add endpoints to RESTBase for new WKS2 endpoints.
Aug 31 2018, 9:59 AM · Analytics-Kanban
JAllemandou added a comment to T203175: Add endpoints to RESTBase for new WKS2 endpoints.

See https://github.com/wikimedia/restbase/pull/1056

Aug 31 2018, 9:59 AM · Analytics-Kanban
JAllemandou updated subscribers of T203180: Wikistats: add functions you apply to dimensional data such as "accumulate".

Quick note on that idea: As per the discussion with @fdans and @Nuria yesterday, it would be great to implement the notion of a function to be applied to a dataset in a generic way, since other functions are already in mind (moving average for instance).

Aug 31 2018, 9:57 AM · Analytics-Kanban, Patch-For-Review, Analytics

Aug 30 2018

JAllemandou created T203175: Add endpoints to RESTBase for new WKS2 endpoints.
Aug 30 2018, 4:03 PM · Analytics-Kanban

Aug 29 2018

JAllemandou moved T201620: Fix mediawiki-history-druid oozie job from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T201617: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff) from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Patch-For-Review, Analytics-Wikistats, Analytics-Kanban, Analytics
JAllemandou moved T192481: Add Mediawiki-History data-quality check stage in oozie using statistics from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats
JAllemandou moved T192483: Add data-quality check on mediawiki-history-reduced before druid indexation from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Analytics-Kanban, Patch-For-Review, Analytics-Wikistats
JAllemandou moved T202305: Update mediawiki-history-reduced datasource-name in druid to use underscore from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Analytics-Kanban
JAllemandou moved T202269: Correct user registration date in mediawiki-history from Ready to Deploy to Done on the Analytics-Kanban board.
Aug 29 2018, 3:03 PM · Analytics-Kanban, Patch-For-Review

Aug 27 2018

JAllemandou moved T201620: Fix mediawiki-history-druid oozie job from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 27 2018, 3:08 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T201617: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff) from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 27 2018, 3:08 PM · Patch-For-Review, Analytics-Wikistats, Analytics-Kanban, Analytics
JAllemandou created T202899: Approximate mediawiki-history user creation dates using user-id/registration-date coherence.
Aug 27 2018, 3:06 PM · Analytics

Aug 24 2018

JAllemandou added a comment to T186828: Productionize per-country daily & monthly active app user stats.

A comment on something that struck me today: for unique-devices computation using last-access-cookie, we store tables per country and we assume additivity of the metric across countries, i.e. we act as if unique devices don't change country. If we make the same assumption for mobile-uniques-devices, it means we can go with 2 jobs only instead of 4: we'd keep the old global values for historical purposes, and store only per-country daily and monthly moving forward. This would mean fewer jobs, fewer tables to maintain, less redundancy etc.
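
The additivity assumption can be made concrete with a toy example. The device identifiers below are purely hypothetical and exist only to make the comparison possible; the sketch just shows that summing per-country counts matches the global count only when no device appears in more than one country:

```python
def summed_per_country(sightings):
    """Count unique devices per country, then add the per-country counts."""
    per_country = {}
    for device, country in sightings:
        per_country.setdefault(country, set()).add(device)
    return sum(len(devices) for devices in per_country.values())


def true_global(sightings):
    """Count unique devices regardless of country."""
    return len({device for device, _country in sightings})


# Hypothetical sightings as (device, country) pairs.
stable = [("d1", "FR"), ("d2", "FR"), ("d3", "US")]
moved = stable + [("d1", "US")]  # d1 is seen in a second country
```

With the stable data both counts agree; once a device shows up in two countries, the per-country sum overcounts by one.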

Aug 24 2018, 3:38 PM · Patch-For-Review, Product-Analytics, Analytics, Discovery-Analysis, Reading-analysis
JAllemandou added a comment to T201168: [Trailblaze] Create recommendation system prototype for property suggestions.

Hi @Jonas - A quick comment as per a quick chat with @Addshore on IRC. If you want to implement recommendation based on collaborative filtering for instance, I suggest you go for Spark MLLib (Spark Machine Learning LIBrary). It has all the classical ML algorithms, including collaborative filtering (https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html), and currently has a lot more traction than Mahout (the 'old' way). Let's talk when you want :)

Aug 24 2018, 1:40 PM · Analytics, wikidata-tech-focus, User-Jonas, Wikidata
JAllemandou awarded T202664: Calculate precisely number of unqiue users for IOS and Android in a privacy conscious manner that does not require opt in to send data a Love token.
Aug 24 2018, 7:21 AM · Wikipedia-iOS-App-Backlog, Product-Analytics, Analytics

Aug 22 2018

JAllemandou moved T201620: Fix mediawiki-history-druid oozie job from In Progress to In Code Review on the Analytics-Kanban board.
Aug 22 2018, 4:17 PM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T202489: Copy monthly XML files from public-dumps to HDFS.
Aug 22 2018, 3:09 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou added a comment to T202489: Copy monthly XML files from public-dumps to HDFS.

@diego I agree we could use NiFi to copy the files, but it seems like too much work in terms of system setup for this single use-case.

Aug 22 2018, 2:08 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou added a comment to T202489: Copy monthly XML files from public-dumps to HDFS.

I think the file dependencies are too complex for oozie here. Many projects!

Aug 22 2018, 1:42 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou added a comment to T202489: Copy monthly XML files from public-dumps to HDFS.

What I've done manually until now is referenced here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/409960
I have started a python script to do pretty much the same thing with more flexibility and incremental growth.
Comments/ideas welcome :)

Aug 22 2018, 1:27 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T202489: Copy monthly XML files from public-dumps to HDFS from Next Up to In Progress on the Analytics-Kanban board.
Aug 22 2018, 7:55 AM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou created T202490: Automate XML-to-parquet transformation for XML dumps (oozie job).
Aug 22 2018, 7:54 AM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou claimed T186559: Provide data dumps in the Analytics Data Lake.
Aug 22 2018, 7:53 AM · Research, Analytics
JAllemandou created T202489: Copy monthly XML files from public-dumps to HDFS.
Aug 22 2018, 7:52 AM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou merged task T190858: Productionize MediaWiki content processing into T186559: Provide data dumps in the Analytics Data Lake.
Aug 22 2018, 7:51 AM · Analytics
JAllemandou merged T190858: Productionize MediaWiki content processing into T186559: Provide data dumps in the Analytics Data Lake.
Aug 22 2018, 7:51 AM · Research, Analytics
JAllemandou removed a parent task for T186559: Provide data dumps in the Analytics Data Lake: T190858: Productionize MediaWiki content processing.
Aug 22 2018, 7:51 AM · Research, Analytics
JAllemandou removed a subtask for T190858: Productionize MediaWiki content processing: T186559: Provide data dumps in the Analytics Data Lake.
Aug 22 2018, 7:51 AM · Analytics
JAllemandou added a subtask for T190858: Productionize MediaWiki content processing: T186559: Provide data dumps in the Analytics Data Lake.
Aug 22 2018, 7:50 AM · Analytics
JAllemandou added a parent task for T186559: Provide data dumps in the Analytics Data Lake: T190858: Productionize MediaWiki content processing.
Aug 22 2018, 7:50 AM · Research, Analytics

Aug 21 2018

JAllemandou moved T192481: Add Mediawiki-History data-quality check stage in oozie using statistics from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 21 2018, 6:58 PM · Patch-For-Review, Analytics-Kanban, Analytics-Wikistats
JAllemandou moved T192483: Add data-quality check on mediawiki-history-reduced before druid indexation from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 21 2018, 6:58 PM · Analytics-Kanban, Patch-For-Review, Analytics-Wikistats
JAllemandou moved T202305: Update mediawiki-history-reduced datasource-name in druid to use underscore from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 21 2018, 11:22 AM · Analytics-Kanban
JAllemandou moved T202269: Correct user registration date in mediawiki-history from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Aug 21 2018, 11:22 AM · Analytics-Kanban, Patch-For-Review
JAllemandou moved T201617: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff) from In Progress to In Code Review on the Analytics-Kanban board.
Aug 21 2018, 11:21 AM · Patch-For-Review, Analytics-Wikistats, Analytics-Kanban, Analytics
JAllemandou merged task T189620: Add wikistats metric "top-by-edits" into T201617: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff).
Aug 21 2018, 11:21 AM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou merged T189620: Add wikistats metric "top-by-edits" into T201617: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff).
Aug 21 2018, 11:21 AM · Patch-For-Review, Analytics-Wikistats, Analytics-Kanban, Analytics