
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (479 w, 5 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF)

Recent Activity

Thu, Apr 18

JAllemandou added a comment to T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total).

Global execution times have been divided by 3 (10mins for 170 jobs). We are using a new launchers queue to launch small jobs and have scaled the airflow parallelization to 10 tasks. We can replicate this model to other jobs :)

Thu, Apr 18, 6:24 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total) from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Thu, Apr 18, 6:22 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou added a comment to T351117: Move analytics log from Varnish to HAProxy.

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

Thu, Apr 18, 1:59 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
JAllemandou awarded T362839: Create data-problem Phabricator tag a Yellow Medal token.
Thu, Apr 18, 12:59 PM · Project-Admins, Movement-Insights
JAllemandou added a comment to T362870: Define a strategy to deal with xml-dumps huge files on the datalake.

Some of the big files listed above are due to the dumps job not splitting files (for cawiki and cswiki for instance).
For the rest, big files come from big pages (many revisions and big text):

spark.sql("""
  SELECT
    wiki_db,
    page_id,
    COUNT(1) AS revision_count,
    SUM(revision_text_bytes) AS text_weight
  FROM wmf.mediawiki_history
  WHERE snapshot = '2024-03'
    AND event_entity = 'revision'
    AND event_type = 'create'
    AND NOT revision_is_deleted_by_page_deletion
  GROUP BY wiki_db, page_id
  ORDER BY text_weight DESC
  LIMIT 20
""").show(100, false)
+-----------+--------+--------------+------------+                              
|wiki_db    |page_id |revision_count|text_weight |
+-----------+--------+--------------+------------+
|enwiki     |5137507 |1346638       |463589202714|
|ruwiki     |205407  |327734        |98327067901 |
|frwiki     |7846555 |213413        |95076531961 |
|dewiki     |9082349 |373094        |90919451888 |
|commonswiki|1894972 |750375        |88226610099 |
|enwiki     |5149102 |411056        |76082914673 |
|enwiki     |36395484|400807        |74132557523 |
|ruwiki     |148254  |131666        |72615443749 |
|enwiki     |2535910 |505135        |70774375375 |
|dewiki     |7076401 |218421        |66236030573 |
|enwiki     |11424955|220512        |64807978487 |
|enwiki     |972034  |401277        |61837314055 |
|enwiki     |68479621|74790         |57447551495 |
|enwiki     |1470141 |411101        |55771136049 |
|dewiki     |6529924 |200046        |55555153937 |
|zhwiki     |84599   |577396        |54912852986 |
|enwiki     |34745517|378634        |50069312070 |
|zhwiki     |284591  |165702        |49405881699 |
|ruwiki     |15920   |202182        |49323471028 |
|hewiki     |13822   |216934        |48579444220 |
+-----------+--------+--------------+------------+
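
To put these weights in perspective, here is a minimal Python sketch that converts two of the figures above to GiB. It assumes that revision_text_bytes counts uncompressed bytes (the column name suggests so, but this is not confirmed here); bytes_to_gib is a hypothetical helper, not part of any existing tooling.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# (wiki_db, page_id, text_weight) copied from the first two rows of the table above.
top_pages = [
    ("enwiki", 5137507, 463589202714),
    ("ruwiki", 205407, 98327067901),
]

for wiki_db, page_id, text_weight in top_pages:
    # text_weight is summed over all revisions of the page, so it is a cumulative size.
    print(f"{wiki_db} page {page_id}: {bytes_to_gib(text_weight):.1f} GiB")
```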
Thu, Apr 18, 12:55 PM · Data-Engineering
JAllemandou renamed T362870: Define a strategy to deal with xml-dumps huge files on the datalake from Define a strategy to deal with xml-dumps huge files to Define a strategy to deal with xml-dumps huge files on the datalake.
Thu, Apr 18, 11:30 AM · Data-Engineering
JAllemandou created T362870: Define a strategy to deal with xml-dumps huge files on the datalake.
Thu, Apr 18, 11:23 AM · Data-Engineering

Wed, Apr 17

JAllemandou moved T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total) from In progress to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Wed, Apr 17, 9:48 AM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Wed, Apr 17, 6:32 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T361242: Unique devices tables have missing or incorrect data for January and February 2024 from Ready to Deploy to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Wed, Apr 17, 6:32 AM · Data-Engineering, Movement-Insights, Patch-For-Review, Data-Platform
JAllemandou added a comment to T361242: Unique devices tables have missing or incorrect data for January and February 2024.

The problem has been fixed.
The bug was introduced when we migrated the downstream jobs of the unique-devices tables to use the new Iceberg tables. Druid loading of unique devices happens in 3 jobs for each unique-devices type (per-domain and per-project-family): 1 daily job for daily uniques, 1 monthly job for monthly uniques, and 1 monthly job to compact daily uniques into monthly segments. This last job was the one causing issues.
The job wrongly used the first-of-the-month date parameter instead of the table's day field as the ingestion date, so data for every day of the month was labelled with the 1st of the month.
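
A minimal Python sketch of this kind of mislabelling, for illustration only. The rows, the function names and the month_start parameter are hypothetical and do not come from the actual Druid loading job.

```python
from datetime import date

# Hypothetical daily rows for one month: (day, uniques) pairs.
rows = [(date(2024, 1, d), 100 + d) for d in (1, 2, 3)]

# Hypothetical job parameter: the first day of the month being compacted.
month_start = date(2024, 1, 1)


def label_buggy(rows, month_start):
    # Bug pattern: every row is stamped with the job's date parameter,
    # so all days of the month collapse onto the 1st.
    return [(month_start, uniques) for _day, uniques in rows]


def label_fixed(rows):
    # Fix pattern: each row keeps its own day field as the ingestion date.
    return [(day, uniques) for day, uniques in rows]


buggy_days = {d for d, _ in label_buggy(rows, month_start)}
fixed_days = {d for d, _ in label_fixed(rows)}
print(len(buggy_days), len(fixed_days))
```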

Wed, Apr 17, 6:32 AM · Data-Engineering, Movement-Insights, Patch-For-Review, Data-Platform

Tue, Apr 16

JAllemandou merged T362201: Fix and validate browser report DAG and queries into T354552: [Maintenance] Migrate ReportUpdater browser queries to Airflow.
Tue, Apr 16, 7:26 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou merged task T362201: Fix and validate browser report DAG and queries into T354552: [Maintenance] Migrate ReportUpdater browser queries to Airflow.
Tue, Apr 16, 7:26 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T354552: [Maintenance] Migrate ReportUpdater browser queries to Airflow from Ready to Deploy to In progress on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Tue, Apr 16, 7:25 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T361242: Unique devices tables have missing or incorrect data for January and February 2024 from In progress to Ready to Deploy on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Tue, Apr 16, 7:25 AM · Data-Engineering, Movement-Insights, Patch-For-Review, Data-Platform
JAllemandou reassigned T354552: [Maintenance] Migrate ReportUpdater browser queries to Airflow from JAllemandou to amastilovic.
Tue, Apr 16, 7:24 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mon, Apr 15

JAllemandou claimed T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total).
Mon, Apr 15, 8:59 AM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total) from Next Up to In progress on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Mon, Apr 15, 8:59 AM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)

Fri, Apr 12

JAllemandou added a project to T362201: Fix and validate browser report DAG and queries: Data-Engineering (Q4 2024 April 1st - June 30th).
Fri, Apr 12, 6:32 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Thu, Apr 11

JAllemandou added a comment to T359993: Slowdown when querying via Hive.

Indeed, we would like people to use Spark or Presto instead of Hive, and this is a good example of why :)

Thu, Apr 11, 1:36 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform

Fri, Apr 5

JAllemandou added a comment to T328472: analytics/refinery: Stop using git-fat.

Thank you so much @hashar for unblocking us!

Fri, Apr 5, 9:28 AM · Patch-For-Review, git-lfs, Release-Engineering-Team (Now this 🫠), Data-Engineering, Data-Platform-SRE, Scap
JAllemandou added a comment to T356762: [Refine refactoring] Extract refine schema management into a dedicated tool.

I prefer the "by functionality" organization, for separating schema vs data code.
I think we need the 2 different functions so that the Iceberg one deletes data before inserting. Actually, this could be discussed as well: I think we want this behaviour by default in the Iceberg write function. Do you agree?

Fri, Apr 5, 9:27 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review

Thu, Apr 4

JAllemandou added a comment to T356762: [Refine refactoring] Extract refine schema management into a dedicated tool.

I have been wondering about how to organize this code.
I did not want to replicate the DataFrameToHive pattern, because of the apply function indeed, and also to avoid putting code for both schema and data management in the same place, as we are trying to split them functionally.
Should we have 2 lib files, one for schema and one for data, for both Hive and Iceberg? Or one file doing both as it is now?

Thu, Apr 4, 1:00 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review
JAllemandou added a comment to T357472: Add movement insights group/users to MWH denormalize job alerts.

Done using the airflow variable.
I also sent a PR to have the defaults set: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/642

Thu, Apr 4, 8:49 AM · Data-Engineering, Data-Platform

Sat, Mar 30

JAllemandou added a comment to T361242: Unique devices tables have missing or incorrect data for January and February 2024.

Good catch milimetric! Reviewing right now, will deploy this next week.

Sat, Mar 30, 12:35 PM · Data-Engineering, Movement-Insights, Patch-For-Review, Data-Platform

Wed, Mar 27

JAllemandou added a comment to T357859: Skip Wikidata when loading XML dumps to the Data Lake.

Thanks a lot @nshahquinn-wmf :)

Wed, Mar 27, 1:47 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou added a comment to T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration.

Would we still want this integrated email functionality within refinery, when it's running under airflow?

Wed, Mar 27, 1:47 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Tue, Mar 26

JAllemandou added a comment to T359435: [Airflow] SparkSqlOperator fails when executing via Skein with master=local.

We currently have use-cases doing exactly this that work. There must have been another issue than the one described here. I think this ticket is invalid.

Tue, Mar 26, 5:55 PM · Data-Engineering
JAllemandou renamed T360968: [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS from [Developer Experience] [SPIKE] Investigate process to automate deployment of hdfs artifacts to [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS.
Tue, Mar 26, 5:50 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike
JAllemandou added a comment to T357472: Add movement insights group/users to MWH denormalize job alerts.

Done using airflow variable mechanism.

Tue, Mar 26, 1:17 PM · Data-Engineering, Data-Platform

Mar 21 2024

JAllemandou added a comment to T358311: Check home/HDFS leftovers of goransm.

I have run our script to list user content on our various machines; the result is below.
@AndrewTavis_WMDE, I'll let you review. Please let us know when you have copied what you wish to keep, so that we can delete the rest.

Mar 21 2024, 9:19 AM · Wikidata, Wikidata Analytics (Kanban), Data-Platform-SRE

Mar 6 2024

JAllemandou renamed T359215: mediawiki_cirrussearch_request data is regularly late from mediawiki_cirrussearch_request refine job is regularly taking too long to run to mediawiki_cirrussearch_request data is regularly late.
Mar 6 2024, 1:45 PM · Performance Issue, Data-Platform

Feb 29 2024

JAllemandou added a comment to T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.

I think we're gonna use this ticket: https://phabricator.wikimedia.org/T262201

Feb 29 2024, 10:22 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Data Products, Structured-Data-Backlog
JAllemandou moved T345771: Adapt Sqoop to pagelinks schema change from In Review to Done on the Data-Engineering (Sprint 9) board.
Feb 29 2024, 10:20 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou changed the point value for T345771: Adapt Sqoop to pagelinks schema change from 8 to 3.
Feb 29 2024, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou claimed T345771: Adapt Sqoop to pagelinks schema change.
Feb 29 2024, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou moved T345771: Adapt Sqoop to pagelinks schema change from Next Up to In Review on the Data-Engineering (Sprint 9) board.
Feb 29 2024, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou added a comment to T342911: Data Quality Issue: Wikitext History Job fail / rerun in Airflow.

Nothing done on my end - possibly one of the 2 jobs failed for real?

Feb 29 2024, 7:31 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Data Products, Movement-Metrics, Movement-Insights
JAllemandou added a comment to T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes.

Indeed, the job will not be affected by next month's changes:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blame/main/analytics/dags/clickstream/clickstream_monthly_dag.py#L66
We'll need to keep looking for when those change though :)

Feb 29 2024, 7:28 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Data Products

Feb 28 2024

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

We're gonna build a quick fix for next month's sqoop to be successful (null values in dropped fields for some projects).

Feb 28 2024, 3:55 PM · Data-Engineering (Sprint 9), Data Products
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from Ready to Deploy to Done on the Data-Engineering (Sprint 9) board.
Feb 28 2024, 11:30 AM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Feb 27 2024

JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from In Review to Ready to Deploy on the Data-Engineering (Sprint 9) board.
Feb 27 2024, 8:14 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from In progress to In Review on the Data-Engineering (Sprint 9) board.
Feb 27 2024, 5:36 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou moved T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration from In progress to In Review on the Data-Engineering (Sprint 9) board.
Feb 27 2024, 4:30 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from Next Up to In progress on the Data-Engineering (Sprint 9) board.
Feb 27 2024, 4:30 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Feb 22 2024

JAllemandou added a comment to T358196: [Presto] Use JWT authentication instead of Kerberos for cluster-internal communication.

I don't think this change would really affect query performance, but I'm in favor of doing it for the benefit of relieving some pressure from Kerberos.

Feb 22 2024, 5:47 PM · Data-Platform-SRE, Data-Platform
JAllemandou updated subscribers of T358205: Investigate late/delayed Airflow task failure notifications.

Thank you for the thorough investigation @BTullis !
This example gives us more traction on the need to move toward google-groups instead of using mailman.
Let's see how this could be prioritized (ping @Ahoelzl and @Gehel :) )

Feb 22 2024, 1:01 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Platform
JAllemandou moved T357419: Turn off ReportUpdater jobs no longer used from In Review to Done on the Data-Engineering (Sprint 9) board.
Feb 22 2024, 12:57 PM · Data-Engineering (Sprint 9)
JAllemandou renamed T358210: Delete reportupdater jobs data/puppet-code from Delete reportupdater jobs data to Delete reportupdater jobs data/puppet-code.
Feb 22 2024, 12:54 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou created T358210: Delete reportupdater jobs data/puppet-code.
Feb 22 2024, 12:18 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou added a comment to T354557: Dataset Config Store.

Have we looked around to see if there are existing 'dataset' config formats/specs we can already use?

Feb 22 2024, 12:15 PM · Epic, Data-Engineering

Feb 21 2024

JAllemandou moved T357419: Turn off ReportUpdater jobs no longer used from Next Up to In Review on the Data-Engineering (Sprint 9) board.
Feb 21 2024, 6:19 PM · Data-Engineering (Sprint 9)
JAllemandou claimed T357419: Turn off ReportUpdater jobs no longer used.
Feb 21 2024, 6:19 PM · Data-Engineering (Sprint 9)
JAllemandou added a comment to T357859: Skip Wikidata when loading XML dumps to the Data Lake.

Implementation plan:

Feb 21 2024, 6:12 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou claimed T357859: Skip Wikidata when loading XML dumps to the Data Lake.
Feb 21 2024, 6:07 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Feb 12 2024

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

Thank you so much @Ladsgroup for the recap.

Feb 12 2024, 3:37 PM · Data-Engineering (Sprint 9), Data Products

Feb 8 2024

JAllemandou updated subscribers of T345771: Adapt Sqoop to pagelinks schema change.

Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking this for us to adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).

Feb 8 2024, 6:06 PM · Data-Engineering (Sprint 9), Data Products

Feb 7 2024

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

This has started, testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considered private now.

Feb 7 2024, 4:12 PM · Data-Engineering (Sprint 9), Data Products
JAllemandou created T356866: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables.
Feb 7 2024, 2:13 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Feb 5 2024

JAllemandou added a comment to T354692: [Data Quality] Implement basic data quality metrics for MW history.

Indeed! here is the code:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiHistoryChecker.scala

Feb 5 2024, 6:04 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Feb 1 2024

JAllemandou added a comment to T356400: User aqsloader hasn't MODIFY permissions on image_suggestions.* Cassandra tables anymore.

Data engineering team has written some code for our cassandra-loading jobs to be able to read a password from a file on HDFS:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/utils/WmfCassandraAuthConfFactory.scala

Feb 1 2024, 5:16 PM · Patch-For-Review, Discovery-Search (Current work), Structured-Data-Backlog (Current Work), User-Eevans, Cassandra, Data Products
JAllemandou added a comment to T324017: Set up Spark SQL Server.

While that could be useful, the spark-thrift server doesn't support user impersonation. The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this.

Feb 1 2024, 4:20 PM · Data-Platform-SRE
JAllemandou moved T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration from Next Up to In progress on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:31 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T352669: [Iceberg Migration] Migrate aqs hourly tables to Iceberg from Ready to Deploy to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:31 AM · Data-Engineering (Sprint 8)
JAllemandou moved T352670: [Iceberg Migration] Migrate browser_general tables to Iceberg from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:30 AM · Data-Engineering (Sprint 8)
JAllemandou moved T349743: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:30 AM · Data-Engineering (Sprint 8)

Jan 30 2024

JAllemandou added a comment to T356112: Generated Data Platform (neé AQS): remove (unused/uneeded) test_spark3_loading keyspace.

Go for it :)

Jan 30 2024, 9:25 AM · Cassandra

Jan 29 2024

JAllemandou added a comment to T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.

Thanks a lot for not forgetting about this ticket @mfossati :)
The Data Engineering team is on the road toward providing you with (hopefully) an easy enough way to configure data deletion for your datasets.
In the meantime, manual deletion every now and then should be enough.
I don't think it's worth investing time on this before the new system comes in (probably a few months).
Is that ok for you?

Jan 29 2024, 2:31 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Data Products, Structured-Data-Backlog
JAllemandou added a comment to T355920: DISCUSS: Relocate Generated Data Platform (neé AQS) test/dev tables?.

Does your tooling let you control size and throughput?

Jan 29 2024, 2:28 PM · Cassandra

Jan 26 2024

JAllemandou added a comment to T355920: DISCUSS: Relocate Generated Data Platform (neé AQS) test/dev tables?.

I think this is a good idea :)
The smaller size shouldn't be an issue as we should not test scalability but functions.

Jan 26 2024, 2:10 PM · Cassandra

Jan 24 2024

JAllemandou added a comment to T297944: Set up regular-repairs for AQS cassandra cluster tables.

The task is old but the objective is still valid IMO.
We should talk to @Eevans about this.

Jan 24 2024, 9:26 PM · Cassandra, Data-Engineering
JAllemandou closed T299961: Investigate Superset query templating as a mean to optimize partition pruning as Declined.

Closing as the strategy is to migrate to Iceberg.

Jan 24 2024, 8:46 PM · superset.wikimedia.org, Product-Analytics, Data-Engineering

Jan 23 2024

JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from Blocked/Paused to Done on the Data-Engineering (Sprint 7) board.
Jan 23 2024, 2:54 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines

Jan 19 2024

JAllemandou triaged T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey as Unbreak Now! priority.
Jan 19 2024, 8:18 AM · Data-Engineering (Sprint 7)
JAllemandou created T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey.
Jan 19 2024, 8:18 AM · Data-Engineering (Sprint 7)

Jan 18 2024

dcausse awarded T355352: Users in archiva-deployer group can't upload artifacts anymore. a Love token.
Jan 18 2024, 6:42 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou moved T354696: [Refine System] Define a concept and an approach for refactoring the Refine system from Next Up to In Review on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7)
JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from In Review to Blocked/Paused on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou added a comment to T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

Blocked on https://phabricator.wikimedia.org/T355352

Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou moved T355352: Users in archiva-deployer group can't upload artifacts anymore. from Next Up to Radar (External Teams) on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou updated the task description for T355352: Users in archiva-deployer group can't upload artifacts anymore..
Jan 18 2024, 6:29 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou created T355352: Users in archiva-deployer group can't upload artifacts anymore..
Jan 18 2024, 6:29 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 11 2024

JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from In progress to In Review on the Data-Engineering (Sprint 7) board.
Jan 11 2024, 6:59 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou renamed T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects from Give user joal the right to create branches in the `wmf-jvm-parent-pom` gitlab project to Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects.
Jan 11 2024, 1:34 PM · GitLab (Auth & Access), Release-Engineering-Team

Jan 10 2024

JAllemandou closed T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects as Resolved.

@brennen has updated my rights on gitlab, giving me ownership rights on the project. Problem solved.

Jan 10 2024, 9:14 PM · GitLab (Auth & Access), Release-Engineering-Team
JAllemandou created T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects.
Jan 10 2024, 9:09 PM · GitLab (Auth & Access), Release-Engineering-Team

Jan 9 2024

JAllemandou added a comment to T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

The steward subgroup is to support wikimedia stewards, and therefore not the correct place for our project. We decided to put it in ci-tools, even if it's probably a bit of a stretch :)

Jan 9 2024, 1:58 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines

Dec 12 2023

JAllemandou added a comment to T346463: Identify and label prefetch proxy data in our traffic.

So ya let's go with VCL!

+1

Dec 12 2023, 2:01 PM · Traffic, Movement-Insights, Data-Engineering

Dec 11 2023

JAllemandou added a comment to T350009: Coalesce SEAL output.

Now output files dropped to 1k! 🎉

Dec 11 2023, 6:17 PM · Structured-Data-Backlog (Current Work), Image-Suggestions

Dec 7 2023

JAllemandou updated subscribers of T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

Hi @brennen - I've been told you could be the one to ask this question to: I'd like to create a new gitlab project for our global JVM POM file, reused globally at the foundation (therefore not under a team's name). I have identified the ci-tools subgroup and the stewards subgroup, and wondered if you thought the latter would be good? Thanks

Dec 7 2023, 5:42 PM · Data-Engineering (Sprint 7), Java-Scala-Standardization, Discovery-Search, Data Pipelines

Dec 5 2023

JAllemandou added a comment to T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool().

You guys rock <3

Dec 5 2023, 8:28 PM · Data-Platform-SRE, Data-Engineering, Data Products
JAllemandou added a comment to T340863: Mechanism for error logging when doing MERGE INTO.

I'm also eager to check if we run into parquet-decompression issues, which I think could happen. Thanks a lot for running those experiments @xcollazo :)

Dec 5 2023, 8:27 PM · Data Products (Data Products Sprint 05), Patch-For-Review, Dumps 2.0
JAllemandou updated subscribers of T346463: Identify and label prefetch proxy data in our traffic.

@JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?

Dec 5 2023, 5:22 PM · Traffic, Movement-Insights, Data-Engineering
JAllemandou added a comment to T346463: Identify and label prefetch proxy data in our traffic.

If we start having data about which webrequest hits are prefetch or not, we definitely would be able to investigate! I'm in favor of moving fast and passing this header through as a new webrequest field. No change would be needed in Gobblin, only in the wmf_raw.webrequest and wmf.webrequest schemas, as well as in the refine_webrequest hql to forward the field.

Dec 5 2023, 2:24 PM · Traffic, Movement-Insights, Data-Engineering
JAllemandou awarded T350106: Implement a spark job that converts a RDF triples table into a RDF file format a Burninate token.
Dec 5 2023, 8:39 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
JAllemandou updated subscribers of T326386: Use of "self" in callables is deprecated in php8.2 from liuggio/statsd-php-client package.

Thanks for the ping @Jdforrester-WMF. Data-engineering has not been using statsd as far as I know. We have helped the performance team with some of its usage if I recall correctly (it was work done with @Krinkle and Gilles), but we have not been maintaining or using statsd.
Let's talk and see how statsd is used nowadays, as the doc is old and talks about Graphite: https://wikitech.wikimedia.org/wiki/Graphite#Data_sources, https://wikitech.wikimedia.org/wiki/Statsd

Dec 5 2023, 8:37 AM · MediaWiki-Platform-Team, MediaWiki-libs-Stats, PHP 8.2 support, Upstream, MediaWiki-Vendor

Dec 4 2023

JAllemandou added a comment to T350009: Coalesce SEAL output.

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets as data is relatively small and using coalesce impairs job scalability (and the number of files is far too big in comparison to the data size :).

Dec 4 2023, 4:20 PM · Structured-Data-Backlog (Current Work), Image-Suggestions
JAllemandou updated subscribers of T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool().

Just added data-platform-SRE project to the list of projects. Ping @BTullis on this as well.

Dec 4 2023, 4:02 PM · Data-Platform-SRE, Data-Engineering, Data Products
JAllemandou added a project to T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool(): Data-Platform-SRE.
Dec 4 2023, 4:01 PM · Data-Platform-SRE, Data-Engineering, Data Products