JAllemandou (joal)
Data Engineer


User Details

User Since
Feb 11 2015, 6:02 PM (209 w, 5 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Today

JAllemandou moved T216414: Purge wikitext snapshots from In Progress to In Code Review on the Analytics-Kanban board.
Tue, Feb 19, 8:30 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T216414: Purge wikitext snapshots.
Tue, Feb 19, 8:30 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T216414: Purge wikitext snapshots.
Tue, Feb 19, 7:58 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T216414: Purge wikitext snapshots.

Ahhhh! I didn't get it - The checksum is computed once per argument set, and doesn't change if the dates the script works on change. This means I can get the checksum manually and set it up in the cron as a manual parameter :)
Thanks mforns for the clarification :)
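
For illustration only, a minimal sketch (in Scala) of why a digest computed over the non-date arguments is stable across runs - the argument list and hashing scheme here are hypothetical, not the refinery script's actual mechanism:

import java.security.MessageDigest

// The dates the script works on are excluded from the digested arguments,
// so the digest is identical on every run and can be precomputed.
val args = Seq(
  "--base-path=/wmf/data/raw/mediawiki/xmldumps/pages_meta_history",
  "--older-than=90"
)
val checksum = MessageDigest.getInstance("SHA-1")
  .digest(args.mkString(" ").getBytes("UTF-8"))
  .map(b => f"${b & 0xff}%02x")
  .mkString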

Tue, Feb 19, 7:57 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

One note about hadoop blobs: HDFS stores files split into chunks, and those chunks are not colocated. If we use transfer.py on local files after having brought them back from HDFS to the local machine, then yes, I assume it would work. I actually think the best option would be to make transfer.py stream from standard input, so that files don't need to be moved from HDFS to the local filesystem and then to the remote host. Now, the most efficient way to transfer data "hadoop-style" would be to prevent the hop through the local machine entirely and stream directly from the various datanodes to the destination - the main concern I can see here is that access would be needed for all hadoop workers.
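
A minimal sketch of the streaming idea using the Hadoop FileSystem API (illustrative only - transfer.py is a separate tool, and the path is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

// Open the file as a stream directly from HDFS; the bytes can then be piped
// to the transfer process without ever touching the local filesystem.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val in = fs.open(new Path("/wmf/data/archive/some_file"))  // placeholder path
// ... feed `in` to the transfer's standard input, then:
in.close()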

Tue, Feb 19, 7:56 AM · Research, Operations, Discovery, Analytics
JAllemandou updated subscribers of T216425: Volunteer NDA for AWight.

I support the idea. I have no clue, however, how much formality must be put in place so that we can keep your access. @leila is probably the best person to engage with in that regard.

Tue, Feb 19, 7:50 AM · WMF-NDA-Requests

Yesterday

JAllemandou added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

Hi @Isaac, I have generated some parquet data here: /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204, with the following query:

spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")
Mon, Feb 18, 4:30 PM · MediaWiki-Database, Wikidata, DBA, Analytics, Research
JAllemandou updated subscribers of T216414: Purge wikitext snapshots.

The strategy for deleting the raw data would be to use the new refinery-drop-older-than script:

refinery-drop-older-than \
    --base-path=/wmf/data/raw/mediawiki/xmldumps/pages_meta_history \
    --path-format='(?P<year>[0-9]{4})(?P<month>[0-9]{2})01' \
    --older-than=90

@mforns : I'd like your view on how to automate getting the checksum, so that the job itself can be automated :)

Mon, Feb 18, 2:12 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T216414: Purge wikitext snapshots.
Mon, Feb 18, 2:10 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T216414: Purge wikitext snapshots from Next Up to In Progress on the Analytics-Kanban board.
Mon, Feb 18, 2:10 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T216414: Purge wikitext snapshots.
Mon, Feb 18, 2:09 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T216414: Purge wikitext snapshots.

Taking advantage of the existing strategy and job for mediawiki-oriented snapshots, I have provided a patch allowing us to keep 6 parquet snapshots. It's probably more than needed, but it prevents having to change the drop script.

Mon, Feb 18, 2:09 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T216414: Purge wikitext snapshots.
Mon, Feb 18, 2:07 PM · Patch-For-Review, Analytics-Kanban, Analytics

Sat, Feb 16

JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Many thanks @ArielGlenn :)

Sat, Feb 16, 5:28 PM · Analytics, Dumps-Generation, Wikidata

Fri, Feb 15

JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Works for me :) I assume the system would work similarly to the existing XML dumps, meaning that dumps would be generated in the same date folders (1st, 8th, 15th, 22nd of every month, for instance), one after the other, with availability information provided in a json file?

Fri, Feb 15, 9:45 AM · Analytics, Dumps-Generation, Wikidata
JAllemandou added a comment to T215589: Migrate users to dbstore100[3-5].

@Neil_P._Quinn_WMF Hi!
We plan to release the change_tags raw table next month (February snapshot, released at the beginning of March). The data will however probably not be integrated into mediawiki_history before the following snapshot (maybe, but there is a high risk it won't be). Will it be possible for you to use the raw data to generate your report?

Fri, Feb 15, 9:43 AM · User-Marostegui, Analytics-Kanban, Analytics
JAllemandou added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

@ArielGlenn : Could we decide on regular day-in-month patterns for the various entity-dumps that need to be generated?
Here is my suggestion:

Entities   Formats           Current frequency   New suggested frequency
all        json / nt / ttl   Every Monday        1st, 8th, 15th, 22nd of every month
truthy     nt                Every Wednesday     3rd, 10th, 17th, 24th of every month
lexemes    nt / ttl          Every Friday        5th, 12th, 19th, 26th of every month
Fri, Feb 15, 9:39 AM · Analytics, Dumps-Generation, Wikidata

Thu, Feb 14

JAllemandou awarded T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday a Love token.
Thu, Feb 14, 5:47 PM · Analytics, Dumps-Generation, Wikidata
JAllemandou added a project to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday: Analytics.
Thu, Feb 14, 5:47 PM · Analytics, Dumps-Generation, Wikidata
JAllemandou created T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.
Thu, Feb 14, 5:43 PM · Analytics, Dumps-Generation, Wikidata
JAllemandou added a comment to T214384: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema.

just to make sure I have the correct sequence of actions in mind:

  • Update the event.navigationtiming hive table so that the deviceMemory field is a double
  • Backfill the refine job for the NavigationTiming schema as far back as possible
  • Update the not-backfilled data to make deviceMemory a double, even without recovering the lost information (preventing query failures)

Is that correct?

Thu, Feb 14, 11:22 AM · Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), Analytics
JAllemandou added a comment to T215863: Coarse alarm on data quality for refined data based on entropy calculations.

I imagine we would add entropy-stats tables generated hourly (for hourly datasets). The entropy-generation code could (and should!) be generic and reusable, and I guess the alarming mechanism as well.
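
A rough sketch of what the hourly computation could look like in Spark - the dataset and column here are placeholders, and the real job would be parameterized:

import org.apache.spark.sql.functions._

// Hourly value distribution of one column (placeholder dataset and column).
val counts = spark.table("wmf.webrequest")
  .where("webrequest_source = 'text' and year = 2019 and month = 2 and day = 14 and hour = 7")
  .groupBy("http_status")
  .count()

// Shannon entropy (in nats) of the distribution; an alarm would compare this
// value against the same hour on previous days.
val total = counts.agg(sum("count")).first.getLong(0).toDouble
val entropy = counts
  .select((col("count") / total).as("p"))
  .agg(-sum(col("p") * log(col("p"))))
  .first.getDouble(0)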

Thu, Feb 14, 11:18 AM · Analytics
JAllemandou added a comment to T215987: Verify that hit/miss stats in WebRequest are correct.

Hey @Pchelolo - I think talking to the traffic team is the way to go here.
I ran a query to get results for the x_cache field over 1 hour of webrequest (in spark):

spark.sql("select x_cache, count(1) as c from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 2 and day = 14 and hour = 7 group by x_cache order by c desc").show(50, false)
Thu, Feb 14, 11:10 AM · Traffic, Operations, Core Platform Team Backlog (Later), Analytics, Services (blocked), RESTBase
JAllemandou added a comment to T215082: Punjabi Wikisource WikiStats 2.0.

It should appear in the February snapshot, generated in March, yes :)

Thu, Feb 14, 10:29 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Wikistats

Mon, Feb 11

JAllemandou added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@diego :
This worked for me (it takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :)

spark.sql("SET spark.sql.shuffle.partitions=512")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")
Mon, Feb 11, 5:02 PM · MediaWiki-Database, Wikidata, DBA, Analytics, Research
JAllemandou added a comment to T212127: Clean up home dirs for users jamesur and nithum.

More info on this db: it contains sqooped data from tlwiki, as the naming suggests (the number of revisions is consistent with a recent snapshot). The data format is not optimal (hive-oriented, not even avro) and the data is old compared to what is currently provided. I suggest we drop it.

Mon, Feb 11, 12:47 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

I wonder if the information present in the mentioned table is the same as what we could extract from the site-links in the wikidata items. @diego : Could you please triple check that? If that is the case, this task is another one in need of the wikidata-json dumps being productionized :)

Mon, Feb 11, 11:45 AM · MediaWiki-Database, Wikidata, DBA, Analytics, Research
JAllemandou added a comment to T215442: Spike: Can Refine handle map types if Hive Schema already exists with map fields?.

Arf - I'll try to be clearer: instead of getting schema data by reading the json, and reading twice in case the schema inferred from the json is not castable to the expected schema, could we get the "real" schema from the schema repo, convert it to a spark schema, and then read the json with that? I know you'd always prefer not to, but now that we're back to double-reading, maybe it'd make more sense? (For schemas with a big number of events, double reading will be expensive!)
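
A minimal sketch of the idea, assuming a hypothetical helper that converts a schema from the repo into a Spark StructType:

import org.apache.spark.sql.types.StructType

// Get the "real" schema from the schema repo and convert it to a spark
// schema (schemaRepoToSparkSchema is hypothetical), then read with it:
val sparkSchema: StructType = schemaRepoToSparkSchema("some/schema/uri")
// A single read, no schema inference and no second pass over the data.
val events = spark.read.schema(sparkSchema).json("/wmf/data/raw/some_dataset")  // placeholder path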

Mon, Feb 11, 9:00 AM · Analytics-Kanban, EventBus, Analytics
JAllemandou added a comment to T215655: Generate edit totals by country by month.

Hey @Milimetric - Could we add "sum_edit_counts" to the existing dataset instead of creating a new one?

Mon, Feb 11, 8:57 AM · Patch-For-Review, Analytics-Kanban, Analytics

Fri, Feb 8

JAllemandou added a comment to T213525: Update big spark jobs conf with better settings.

For the record @Milimetric : spilled files are the temporary files generated between steps when data doesn't fit in memory (they're called spilled because you first fill up memory, and the rest spills out to disk). For big jobs, those represent a lot of data and IOs.
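
For illustration, the kind of settings involved (values hypothetical - the actual ones are in the patch and doc for this task):

// More, smaller shuffle partitions reduce the chance that a single task's
// working set overflows memory and spills to disk.
spark.sql("SET spark.sql.shuffle.partitions=512")
// More executor memory at submit time also reduces spill, for instance:
//   --executor-memory 8G --conf spark.memory.fraction=0.8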

Fri, Feb 8, 7:04 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T215442: Spike: Can Refine handle map types if Hive Schema already exists with map fields?.

I must say I also pushed for stopping the double reading, and I continue to think so.
I wonder if a first refine step gathering schemas and converting json to spark schemas wouldn't make more sense here (and actually applying schema changes in hive from schema-change detection only, before even reading).

Fri, Feb 8, 6:58 PM · Analytics-Kanban, EventBus, Analytics
JAllemandou created T215636: Move FR banner-impression jobs to events (lambda).
Fri, Feb 8, 5:09 PM · Analytics
JAllemandou moved T205940: Add change tag tables to monthly mediawiki_history sqoop from Paused to In Code Review on the Analytics-Kanban board.
Fri, Feb 8, 5:02 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics
JAllemandou added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

I like it, thanks @elukey !

Fri, Feb 8, 2:20 PM · Product-Analytics, Patch-For-Review, Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
JAllemandou awarded T92966: Machine readable interface for dumps.wikimedia.org a Party Time token.
Fri, Feb 8, 10:30 AM · Datasets-Archiving, Datasets-General-or-Unknown, Wikidata
JAllemandou added a comment to T215442: Spike: Can Refine handle map types if Hive Schema already exists with map fields?.

@Ottomata: Double reading is the way to go when we have schema discrepancies that can't be solved through casting (struct -> map).
See the test I ran below:

case class TestMap(
    k1: String = "v1",
    k2: String = "v2",
    k3: String = "v3",
    k4: String = "v4",
    k5: String = "v5"
)
Fri, Feb 8, 9:45 AM · Analytics-Kanban, EventBus, Analytics
JAllemandou moved T215442: Spike: Can Refine handle map types if Hive Schema already exists with map fields? from Next Up to In Code Review on the Analytics-Kanban board.
Fri, Feb 8, 9:42 AM · Analytics-Kanban, EventBus, Analytics

Thu, Feb 7

JAllemandou created T215550: Test sqooping from the new dedicated labsdb host.
Thu, Feb 7, 8:52 PM · Analytics-Kanban, Analytics
JAllemandou merged task T215549: Update sqoop base script for new analytics-db infra into T215290: update mw scooping to be able to scoop from new db cluster .
Thu, Feb 7, 8:52 PM · Analytics-Kanban, Analytics
JAllemandou created T215549: Update sqoop base script for new analytics-db infra.
Thu, Feb 7, 8:49 PM · Analytics-Kanban, Analytics
JAllemandou moved T212928: [Spike] Spark job for digests-only mediawiki-history-reduced from In Progress to Paused on the Analytics-Kanban board.
Thu, Feb 7, 8:46 PM · Analytics, Analytics-Kanban
JAllemandou moved T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time from Paused to Done on the Analytics-Kanban board.
Thu, Feb 7, 8:34 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

I confirm the fix :) Closing this task.

Thu, Feb 7, 8:33 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T215547: Bump graphframes version to 0.6.0+ from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Thu, Feb 7, 8:32 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T213525: Update big spark jobs conf with better settings from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Thu, Feb 7, 8:07 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T215547: Bump graphframes version to 0.6.0+ from Next Up to In Code Review on the Analytics-Kanban board.
Thu, Feb 7, 7:59 PM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T215547: Bump graphframes version to 0.6.0+.
Thu, Feb 7, 7:58 PM · Patch-For-Review, Analytics-Kanban
JAllemandou created T215547: Bump graphframes version to 0.6.0+.
Thu, Feb 7, 7:58 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T215082: Punjabi Wikisource WikiStats 2.0 from Ready to Deploy to Done on the Analytics-Kanban board.
Thu, Feb 7, 4:54 PM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Wikistats
JAllemandou moved T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 from In Code Review to Done on the Analytics-Kanban board.
Thu, Feb 7, 4:54 PM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002.

Thanks @Halfak for the double check :) Also, if you like hive speed, try Spark :D
Massive thanks again to @Marostegui for handling the death of the dbstore1002 beast :)

Thu, Feb 7, 4:53 PM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T154370: Create script for moving orphaned revisions to the archive table.

As part of the work the Analytics team does on providing stats over (almost) all wikis, we can provide a list of orphan revisions (no associated page_id, or no associated user_id/user_text even if rev_delete < 4). Please ping us if needed :)

Thu, Feb 7, 1:45 PM · Patch-For-Review, MediaWiki-Maintenance-scripts
JAllemandou added a comment to T213670: dbstore1002 Mysql errors.

The sqoop for the actor and comment tables just finished, and we should be using the new hardware next month, so no problem for me either :)

Thu, Feb 7, 1:18 PM · Patch-For-Review, Operations, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a comment to T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002.

I suggest using hdfs:///wmf/data/archive/sqldumps as a base for sql-dumps, with content-oriented subfolders, leading to hdfs:///wmf/data/archive/sqldumps/mep_word_persistence/staging.sql.
Would that be ok for you @elukey?

Thu, Feb 7, 12:18 PM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002.

@Marostegui : Indeed, I started the job later (the comment time is almost synchronous).

Thu, Feb 7, 7:26 AM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou moved T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 from In Progress to In Code Review on the Analytics-Kanban board.
Thu, Feb 7, 7:21 AM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002.

@Marostegui : Sqoop finished yesterday at 23:31 UTC with the expected number of rows. Maybe the db crash was unrelated?
@Halfak : Data is available for vetting in your hdfs user folder: /user/halfak/mep_word_persistence. I also created the related hive table:

use halfak;
CREATE EXTERNAL TABLE mep_word_persistence (  
  rev_id BIGINT,
  rev_timestamp STRING,
  page_id BIGINT,
  page_namespace BIGINT,
  page_title STRING,
  user_id BIGINT,
  user_text STRING,
  comment STRING,
  minor BOOLEAN,
  sha1 STRING,
  revisions_processed BIGINT,
  non_self_processed BIGINT,
  seconds_possible BIGINT,
  tokens_added BIGINT,
  persistent_tokens BIGINT,
  non_self_persistent_tokens BIGINT,
  censored BOOLEAN,
  non_self_censored BOOLEAN,
  sum_log_persisted DOUBLE,
  sum_log_non_self_persisted DOUBLE,
  sum_log_seconds_visible DOUBLE
) STORED AS PARQUET LOCATION "/user/halfak/mep_word_persistence";

Let's triple check the data is correct before calling it done :)
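
A possible quick vetting pass (sketch - the reference count would come from the source table in MySQL):

// Row count to compare against the source table, plus a spot check of rows.
val df = spark.read.parquet("/user/halfak/mep_word_persistence")
df.count()
df.select("rev_id", "rev_timestamp", "tokens_added").show(5, false)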

Thu, Feb 7, 7:19 AM · Analytics-Kanban, User-Elukey, Analytics

Wed, Feb 6

JAllemandou moved T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 from Next Up to In Progress on the Analytics-Kanban board.
Wed, Feb 6, 7:53 PM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002.

Job started with the command:

sudo -u hdfs sqoop import \
  -D mapred.job.name='sqoop-staging-mep_word_persistence' \
  --username research \
  --password-file /user/hdfs/mysql-analytics-research-client-pw.txt \
  --connect jdbc:mysql://analytics-store/staging \
  --query "select rev_id, convert(rev_timestamp using utf8) rev_timestamp, page_id, page_namespace, convert(page_title using utf8) page_title, user_id, convert(user_text using utf8) user_text, convert(comment using utf8) comment, minor, convert(sha1 using utf8) sha1, revisions_processed, non_self_processed, seconds_possible, tokens_added, persistent_tokens, non_self_persistent_tokens, censored, non_self_censored, sum_log_persisted, sum_log_non_self_persisted, sum_log_seconds_visible from mep_word_persistence where \$CONDITIONS" \
  --target-dir /user/joal/mep_word_persistence \
  --num-mappers 5 \
  --split-by rev_id \
  --as-parquetfile \
  --map-column-java "minor=Boolean,censored=Boolean,non_self_censored=Boolean,sum_log_persisted=Double,sum_log_non_self_persisted=Double,sum_log_seconds_visible=Double"

I hope 5 parallel workers are not too many ...

Wed, Feb 6, 7:53 PM · Analytics-Kanban, User-Elukey, Analytics
Marostegui awarded T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 a Love token.
Wed, Feb 6, 7:09 PM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5].

Did a quick test, it's working for me (one of them) :)
Thanks a million @Marostegui and @elukey !

Wed, Feb 6, 5:27 PM · Patch-For-Review, User-Banyek, Analytics-Kanban, DBA, Analytics

Tue, Feb 5

JAllemandou moved T213525: Update big spark jobs conf with better settings from In Progress to In Code Review on the Analytics-Kanban board.
Tue, Feb 5, 8:52 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T213525: Update big spark jobs conf with better settings.

Doc available here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Spark_tuning_for_big_jobs

Tue, Feb 5, 8:52 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated the task description for T213525: Update big spark jobs conf with better settings.
Tue, Feb 5, 8:51 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T215082: Punjabi Wikisource WikiStats 2.0 from Next Up to In Code Review on the Analytics-Kanban board.
Tue, Feb 5, 9:40 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Wikistats
JAllemandou claimed T215082: Punjabi Wikisource WikiStats 2.0.
Tue, Feb 5, 9:40 AM · Analytics-Kanban, Patch-For-Review, Analytics, Analytics-Wikistats
JAllemandou added a comment to T215043: Upgrade to Spark 2.4.0.

If we upgrade, let's move to 2.4.0?

Tue, Feb 5, 9:30 AM · Analytics
JAllemandou added a comment to T215171: Archival of home directories on servers with very large homes.

For analytics-machines, should we use hdfs as an archive store?
And, if we archive, how long should we keep the archives?

Tue, Feb 5, 9:25 AM · Operations

Mon, Feb 4

JAllemandou added a comment to T211627: Mediawiki history has no data on IP blocks.

Thanks @nettrom_WMF for the follow-up :)
I'll try to include that in the next batch of big changes I'm working on for mediawiki-history :)

Mon, Feb 4, 5:32 PM · Anti-Harassment, Product-Analytics, Analytics
JAllemandou added a comment to T214897: data for analyzing and visualizing the identifier landscape of Wikidata.

Hi folks - Sorry for the late answer, I was at the WMF all-hands last week and did not check tasks.
I have started work on having the wikidata-json dumps imported on the cluster, and while some data is available for ad-hoc analysis (see hdfs:///user/joal/wmf/data/wmf/mediawiki/wikidata_parquet), this dataset is not updated on a regular basis (not production-ready).
I however think that a manual update every 3 months could be easy.
@GoranSMilovanovic - What do you think?

Mon, Feb 4, 8:46 AM · WMDE-Analytics-Engineering, User-GoranSMilovanovic, Wikidata

Tue, Jan 22

JAllemandou updated the task description for T213603: Coordinate work on minor changes for Edit Data Quality.
Tue, Jan 22, 8:21 AM · Patch-For-Review, Analytics-Kanban

Mon, Jan 21

JAllemandou added a comment to T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate.

I had in mind that one of the reasons for using avro originally was data size. Currently mediawiki_ApiAction takes ~1T per month and mediawiki_CirrusSearchRequestSet ~2T per month. A small test with hive showed me that the growth factor from avro to json would be ~4.5 for mediawiki_CirrusSearchRequestSet and ~3 for mediawiki_ApiAction (if the events don't change). This would lead to ~12T monthly (see the figures below), which is largely acceptable for HDFS.
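
For reference, the projected totals from the figures above: ~1T × 3 ≈ 3T for mediawiki_ApiAction and ~2T × 4.5 ≈ 9T for mediawiki_CirrusSearchRequestSet, hence ~12T monthly in total.
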
Last but not least: What about kafka?

Mon, Jan 21, 5:02 PM · Patch-For-Review, Services (watching), Discovery, Analytics-EventLogging, EventBus, Analytics

Jan 17 2019

JAllemandou added a comment to T213770: Remove Zero support in analytics.

To discuss with the team: do we want to drop the column, or would nullifying the field be enough? For webrequest, since data is dropped after 2 months, we can first nullify and then drop for real. However for pageview, since the data goes back in time, I think nullification must be done.
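
A sketch of the nullification idea (table and column names here are assumptions to double-check; the real job would be an INSERT OVERWRITE over the affected partitions):

import org.apache.spark.sql.functions.lit

// Replace the zero-related field with nulls while keeping the column in
// place, so existing queries keep working (hypothetical column name).
val nullified = spark.table("wmf.pageview_hourly")
  .withColumn("zero_carrier", lit(null).cast("string"))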

Jan 17 2019, 2:08 PM · Analytics-Kanban, Technical-Debt, Analytics
JAllemandou added a comment to T213716: Alarms for virtualpageview should exist (probably in oozie) for jobs that have been idle too long.

I have a suggestion. We could set the <timeout>XXX</timeout> control in the oozie coordinators, replacing XXX with the number of seconds before the materialized job times out. When a job times out, it is marked as failed by oozie, sending us a failure email. I think an error email would make us react more strongly, as it means data will be missing. Also, the action might just be to re-run the job, which is simple in hue. Finally, this is the approach we have in webrequest-load, and it has proven successful.

Jan 17 2019, 2:06 PM · Analytics
JAllemandou added a comment to T212778: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset.

Turnilo needed a patch (the webrequest_sampled_128 datasource had introspection disabled), and I manually updated the columns in superset (the automagic column scan failed...)

Jan 17 2019, 1:50 PM · Analytics-Kanban, Patch-For-Review, Analytics

Jan 11 2019

JAllemandou added a project to T205594: mediawiki_history missing page events: Analytics-Kanban.
Jan 11 2019, 9:24 AM · Analytics-Kanban, Analytics-Data-Quality, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou moved T211717: Clickstream job failing due to change of types of namespace column from Paused to Done on the Analytics-Kanban board.
Jan 11 2019, 9:22 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T213524: Add 'mediawiki_history_unchecked' dataset to oozie from Next Up to In Code Review on the Analytics-Kanban board.
Jan 11 2019, 9:20 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T213525: Update big spark jobs conf with better settings from Next Up to In Progress on the Analytics-Kanban board.
Jan 11 2019, 9:20 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T213525: Update big spark jobs conf with better settings.
Jan 11 2019, 9:19 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T213524: Add 'mediawiki_history_unchecked' dataset to oozie.
Jan 11 2019, 9:17 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T213524: Add 'mediawiki_history_unchecked' dataset to oozie.
Jan 11 2019, 9:15 AM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 9 2019

JAllemandou moved T213290: Add Chinese Wikiversity edit-related metrics to Wikistats 2 from Next Up to Ready to Deploy on the Analytics-Kanban board.
Jan 9 2019, 5:24 PM · Chinese-Sites, Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou claimed T213290: Add Chinese Wikiversity edit-related metrics to Wikistats 2.
Jan 9 2019, 5:24 PM · Chinese-Sites, Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou added a comment to T213290: Add Chinese Wikiversity edit-related metrics to Wikistats 2.

It's not present in the wiki-list we sqoop: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv
Providing a patch now.

Jan 9 2019, 5:12 PM · Chinese-Sites, Analytics-Kanban, Patch-For-Review, Analytics

Jan 8 2019

JAllemandou moved T211717: Clickstream job failing due to change of types of namespace column from Ready to Deploy to Paused on the Analytics-Kanban board.
Jan 8 2019, 8:25 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T211000: Failure while refining webrequest upload 2018-12-01-14. Upgrade alarms from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2019, 8:25 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T212862: Update IP addresses of cloud labs to mark internal traffic on refinery code from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2019, 8:25 AM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou moved T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time from Ready to Deploy to Paused on the Analytics-Kanban board.
Jan 8 2019, 8:17 AM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T209822: Add new wikis to analytics from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2019, 8:17 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T212778: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset from Ready to Deploy to Done on the Analytics-Kanban board.
Jan 8 2019, 8:16 AM · Analytics-Kanban, Patch-For-Review, Analytics

Jan 7 2019

JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

Bug found and corrected (patches above).
Data is available now and the rerun problem should be solved.

Jan 7 2019, 9:15 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 7 2019, 9:02 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time from Paused to In Code Review on the Analytics-Kanban board.
Jan 7 2019, 5:05 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T210522: Refactor Sqoop, join actor and comment from analytics replicas from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Jan 7 2019, 4:15 PM · Analytics-Kanban, Analytics
JAllemandou moved T210542: Update datasets definitions and oozie jobs for dual-sqoop of comments and actors from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Jan 7 2019, 4:15 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T210543: Update refinery-source jobs to join labsdb with actor and comment from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Jan 7 2019, 4:15 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T212778: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 7 2019, 1:46 PM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

Hi @WMDE-leszek - the core data has not been computed yet (usually done around the 9th of the following month).
I'll be sure to keep an eye on data showing up for month 12 and rerun the job if needed.

Jan 7 2019, 9:46 AM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata