
Ottomata (Andrew Otto)
User

User Details

User Since
Oct 9 2014, 4:50 PM (526 w, 4 d)
Availability
Available
IRC Nick
ottomata
LDAP User
Ottomata
MediaWiki User
Ottomata [ Global Accounts ]

Recent Activity

Yesterday

Ottomata added a comment to T373144: [SPIKE] Learn and document how to use Flink-CDC from MediaWiki MariaDB locally.

is this what you were hoping to see?

Mon, Nov 11, 7:29 PM · Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a project to T379158: Monthly pageview stats for Oct 2024 missing: Data Products.
Mon, Nov 11, 5:50 PM · Data Products, Data-Engineering

Fri, Nov 8

Ottomata added a comment to T376063: Hypothesis WE5.2.3 (Q2 FY24/25): Introduce a system of events and listeners into MediaWiki core.

Also

Fri, Nov 8, 8:09 PM · Epic, MediaWiki-Core-Hooks, MW-Interfaces-Team, OKR-Work, FY2024-25 KR 5.2 Simplify feature development
Ottomata added a comment to T376063: Hypothesis WE5.2.3 (Q2 FY24/25): Introduce a system of events and listeners into MediaWiki core.

Reading

Fri, Nov 8, 7:56 PM · Epic, MediaWiki-Core-Hooks, MW-Interfaces-Team, OKR-Work, FY2024-25 KR 5.2 Simplify feature development
Ottomata added a project to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU.: Data-Platform-SRE.
Fri, Nov 8, 4:23 PM · Data-Platform-SRE, Goal, Machine-Learning-Team
Ottomata updated subscribers of T379385: Plan for a Hadoop and Hive upgrade for the Data Platform.
Fri, Nov 8, 4:20 PM · Epic, Data-Engineering, Data-Platform-SRE
Ottomata added projects to T377362: EPIC: Trino/minIO/Hive/Hadoop Implementation: Data-Platform, Data-Engineering, Data-Platform-SRE.
Fri, Nov 8, 2:23 PM · Data-Platform-SRE, Data-Engineering, Data-Platform, Epic, BDC-Implementation
Ottomata added a project to T367123: [minIO] Investigate packaging, install, security monitoring.: Data-Platform-SRE.
Fri, Nov 8, 2:20 PM · Data-Platform-SRE, BDC-Implementation, fundraising-tech-ops, SecTeam-Processed, Privacy Engineering, Security Preview
Ottomata added a comment to T377635: [Hive] Investigate packaging, install, security monitoring..
Fri, Nov 8, 2:18 PM · Data-Platform-SRE, fundraising-tech-ops
Ottomata added a project to T377635: [Hive] Investigate packaging, install, security monitoring.: Data-Platform-SRE.
Fri, Nov 8, 2:17 PM · Data-Platform-SRE, fundraising-tech-ops

Thu, Nov 7

Ottomata added a comment to T379265: Cleanup the UNIX runuser accross all the airflow instances .

Interesting!

run as different users by passing the spark.kerberos.principal parameter to spark

Thu, Nov 7, 4:24 PM · Patch-For-Review, Data-Platform-SRE (2024.11.09 - 2024.11.29)
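
The quoted idea of running Spark jobs as different users through `spark.kerberos.principal` could look roughly like the sketch below. It only builds a `spark-submit` argument list. The principal, keytab path, and application file are hypothetical placeholders, and pairing the principal with `spark.kerberos.keytab` is an assumption about how credentials would be supplied, not something stated in the task.

```python
from typing import List


def build_spark_submit_args(principal: str, keytab: str, app: str) -> List[str]:
    """Return a spark-submit command line that authenticates as `principal`."""
    return [
        "spark-submit",
        "--conf", f"spark.kerberos.principal={principal}",
        "--conf", f"spark.kerberos.keytab={keytab}",
        app,
    ]


# Hypothetical values, for illustration only.
args = build_spark_submit_args(
    principal="analytics-example@EXAMPLE.ORG",
    keytab="/etc/security/keytabs/analytics-example.keytab",
    app="job.py",
)
print(" ".join(args))
```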
Ottomata added a comment to T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.

Thanks for this write up Antoine! This will make it much easier to remember this stuff in the future.

Thu, Nov 7, 2:50 PM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

as far as I can tell, mediawiki-content-dump is the only production case where we are currently doing this.

Thu, Nov 7, 2:40 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)

Wed, Nov 6

Ottomata removed a project from T379031: Revert Metrics Platform changes that were made for the Recommended Content feature: Data-Engineering.
Wed, Nov 6, 9:20 PM · Metrics Platform, Wikipedia-Android-App-Backlog
Ottomata removed a project from T378909: eswiki most viewed pages from Spain 2015-2024 : Data-Engineering.
Wed, Nov 6, 9:17 PM · Data Products, Data-Platform
Ottomata added a project to T378549: Pageviews Analysis 3.0 (Vue + Codex): Data Products.
Wed, Nov 6, 9:11 PM · Data Products, Data-Engineering, Data-Engineering-Wikistats, Tool-wikistatistics2-0, PageViewInfo, Tool-Pageviews
Ottomata closed T378184: Rate Limited for data science project as Resolved.
Wed, Nov 6, 9:05 PM · Research, Data-Engineering
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

The downside of deploy-mode cluster is that driver logs are not available to the process that submits the spark job.

Wed, Nov 6, 7:57 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata updated the task description for T367315: [Epic] Replace Archiva with Gitlab artifact repositories.
Wed, Nov 6, 7:32 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Java-Scala-Standardization, Epic, Discovery-Search, Data-Platform-SRE
Ottomata updated the task description for T367315: [Epic] Replace Archiva with Gitlab artifact repositories.
Wed, Nov 6, 7:32 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Java-Scala-Standardization, Epic, Discovery-Search, Data-Platform-SRE
Ottomata added a comment to T367406: Migrate Python projects that depend on Archiva for deployment.

In T379187: Create new GitLab project: repos/wmf-packages repos/wmf-packages was created as a WMF global package registry. We can publish shareable python packages there.

Wed, Nov 6, 7:31 PM · Java-Scala-Standardization, Discovery-Search, Data-Platform-SRE
Ottomata added a parent task for T379187: Create new GitLab project: repos/wmf-packages: T367406: Migrate Python projects that depend on Archiva for deployment.
Wed, Nov 6, 7:29 PM · Data-Engineering, GitLab (Project Migration), Release-Engineering-Team
Ottomata added a subtask for T367406: Migrate Python projects that depend on Archiva for deployment: T379187: Create new GitLab project: repos/wmf-packages.
Wed, Nov 6, 7:29 PM · Java-Scala-Standardization, Discovery-Search, Data-Platform-SRE
Ottomata updated the task description for T367322: Create a Maven package registry in Gitlab.
Wed, Nov 6, 7:29 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata added a project to T379187: Create new GitLab project: repos/wmf-packages: Data-Engineering.
Wed, Nov 6, 7:24 PM · Data-Engineering, GitLab (Project Migration), Release-Engineering-Team
Ottomata awarded T379187: Create new GitLab project: repos/wmf-packages a Love token.
Wed, Nov 6, 6:55 PM · Data-Engineering, GitLab (Project Migration), Release-Engineering-Team
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Volume mounts based on either managed PVs or images.

Wow TIL k8s images and OCI 'artifacts'. Interesting!

Wed, Nov 6, 6:53 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Does that sound right to you @Ottomata?

Wed, Nov 6, 6:43 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata added a comment to T228175: [Metrics Platform] Event Platform Client Libraries.

Can we decline / resolve this task?

Wed, Nov 6, 6:39 PM · Data Products, Analytics, Epic, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated
Ottomata renamed T228175: [Metrics Platform] Event Platform Client Libraries from Event Platform Client Libraries to [Metrics Platform] Event Platform Client Libraries.
Wed, Nov 6, 6:39 PM · Data Products, Analytics, Epic, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated
Ottomata added a subtask for T214430: Event Platform: Stream Connectors: T374341: Add support for Spark producers in Event Platform.
Wed, Nov 6, 6:38 PM · Data-Engineering, Analytics, Goal, Services (watching), Event-Platform
Ottomata added a parent task for T374341: Add support for Spark producers in Event Platform: T214430: Event Platform: Stream Connectors.
Wed, Nov 6, 6:38 PM · Patch-For-Review, Dumps 2.0 (Kanban Board), Data-Engineering (Q2 2024 October 1st - December 31th), Discovery-Search (Current work)
Ottomata added a comment to T374341: Add support for Spark producers in Event Platform.

@gmodena @pfischer can you please add a definition of done / deliverables for this task?

done.

Wed, Nov 6, 6:38 PM · Patch-For-Review, Dumps 2.0 (Kanban Board), Data-Engineering (Q2 2024 October 1st - December 31th), Discovery-Search (Current work)
Ottomata updated the task description for T374341: Add support for Spark producers in Event Platform.
Wed, Nov 6, 6:37 PM · Patch-For-Review, Dumps 2.0 (Kanban Board), Data-Engineering (Q2 2024 October 1st - December 31th), Discovery-Search (Current work)
Ottomata updated subscribers of T379179: Send captcha API response data to event logging.
Wed, Nov 6, 5:28 PM · Metrics Platform, Data Products, MediaWiki-extensions-EventLogging, ConfirmEdit (CAPTCHA extension), Data-Engineering
Ottomata added projects to T379179: Send captcha API response data to event logging: Data Products, Metrics Platform.
Wed, Nov 6, 5:27 PM · Metrics Platform, Data Products, MediaWiki-extensions-EventLogging, ConfirmEdit (CAPTCHA extension), Data-Engineering
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Just so I'm clear, we don't have the driver logs in Airflow at the moment though, do we?

Wed, Nov 6, 5:22 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Could we not just add a spark.jars default configuration option, pointing to an HDFS location of the iceberg jar?

It isn't?

Wed, Nov 6, 5:20 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
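
As a rough illustration of the `spark.jars` suggestion quoted above, the sketch below renders a `spark-defaults.conf` fragment that points `spark.jars` at an HDFS location. The HDFS path is a hypothetical placeholder (the real jar location is not given in the thread), and `render_spark_defaults` is an illustrative helper, not an existing tool.

```python
from typing import Dict


def render_spark_defaults(conf: Dict[str, str]) -> str:
    """Render key/value pairs in spark-defaults.conf style: one 'key value' per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items())) + "\n"


# Hypothetical HDFS location of the Iceberg jar, for illustration only.
defaults = {
    "spark.jars": "hdfs:///path/to/iceberg-spark-runtime.jar",
}
print(render_spark_defaults(defaults), end="")
```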
Ottomata added a comment to T367322: Create a Maven package registry in Gitlab.

If @Gehel and @brennen generally agree, that seems okay? Perhaps putting WMF or wikimedia in the name could be useful?

Wed, Nov 6, 4:10 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata added a comment to T255818: Refine drops $schema field values.

This should be fixed as part of the work done for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation. We can resolve this after we are done with T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment

Wed, Nov 6, 2:41 PM · Data-Engineering, Event-Platform
Ottomata added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T255818: Refine drops $schema field values.
Wed, Nov 6, 2:39 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
Ottomata added a parent task for T255818: Refine drops $schema field values: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
Wed, Nov 6, 2:39 PM · Data-Engineering, Event-Platform
Ottomata added a comment to T378920: Determine whether or not we need expose an internal service for yarn .

Hm, are we sure we need this? IIRC yarn client handles picking the hostname to talk to directly via yarn ResourceManager HA stuff.

Wed, Nov 6, 2:10 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Iceberg jar:

Perhaps it would be possible to retrieve this jar from HDFS, rather than bake it into the airflow image. Not sure what's best here.

Wed, Nov 6, 2:04 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata moved T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query from Radar (External Teams) to Incoming (new tickets) on the Data-Engineering board.
Wed, Nov 6, 1:48 PM · Data-Engineering, Data-Platform-SRE, SRE Observability
Ottomata updated subscribers of T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query.

@CDanis I believe people currently want this kind of work to be planned more strategically, and to prioritize it appropriately. I agree this would be very useful!

Wed, Nov 6, 1:48 PM · Data-Engineering, Data-Platform-SRE, SRE Observability

Tue, Nov 5

Ottomata added a comment to T376752: Add cu_log_event and cu_private_event CheckUser tables to data lake.

The data in cu_private_event is needed to do proper analysis for T372702: editors are repeatedly getting logged out (August 2024), as so far it seems that the queries have been limited to just enwiki.

Tue, Nov 5, 4:56 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Privacy Engineering, CheckUser
Ottomata added a comment to T321850: Add schema diffing support to jsonschema-tools and run diff in CI.

My MR that adds a jsonschema-tools diff subcommand is nice. We could configure GitLab CI to run it in the schema repos using dyff, and get reviewable CI pipeline output like this:

Tue, Nov 5, 4:53 PM · Data Products, Data-Engineering, Event-Platform
Ottomata added a comment to T376841: Render human-readable schemas on schema.wikimedia.org.

The problem will be that jsfh renders the schema file but also renders js and css

Tue, Nov 5, 4:36 PM · Data Products (Data Products Sprint 22), Data-Engineering, Event-Platform, Documentation
Ottomata added a comment to T376841: Render human-readable schemas on schema.wikimedia.org.

schema.wikimedia.org is currently hosted using nginx.

Tue, Nov 5, 4:34 PM · Data Products (Data Products Sprint 22), Data-Engineering, Event-Platform, Documentation
Ottomata added a comment to T326179: Expose revision revert risk scores in EventStreams.

Hm, yes I think both could be done!

Tue, Nov 5, 4:30 PM · Event-Platform, Data-Engineering, Machine-Learning-Team, Research
Ottomata added a project to T321850: Add schema diffing support to jsonschema-tools and run diff in CI: Data Products.
Tue, Nov 5, 1:14 AM · Data Products, Data-Engineering, Event-Platform

Mon, Nov 4

Ottomata added a comment to T367322: Create a Maven package registry in Gitlab.

I expect the access token to the central repo would be the only real stumbling block.

Mon, Nov 4, 9:09 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata added a comment to T367322: Create a Maven package registry in Gitlab.

I think we should have 1 WMF wide global package registry in GitLab.
I think one registry would help simplify configuration. Publishing and dependency configuration could then all use the same package registry location configuration.

Mon, Nov 4, 8:04 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata added a comment to T367322: Create a Maven package registry in Gitlab.

Reposting a use case from https://phabricator.wikimedia.org/T367322#9884996

Mon, Nov 4, 5:58 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata added a comment to T367322: Create a Maven package registry in Gitlab.

Okay, so it looks like the answer to my question

Mon, Nov 4, 5:56 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
Ottomata awarded T378933: Explore mechanism for publishing domain events a Love token.
Mon, Nov 4, 5:50 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review, MW-Interfaces-Team
Ottomata added a comment to T378923: Sqoop all mysql tables from production replicas instead of CloudDB replicas.

Agree.

Mon, Nov 4, 5:49 PM · Data Products, Data-Engineering
Ottomata added a parent task for T370712: Upgrade to Pyspark ≥ 3.4: T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0.
Mon, Nov 4, 5:46 PM · Data-Platform-SRE
Ottomata added a parent task for T370713: Upgrade to Pyspark ≥ 3.5: T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0.
Mon, Nov 4, 5:46 PM · Data-Platform-SRE
Ottomata added subtasks for T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0: T370713: Upgrade to Pyspark ≥ 3.5, T370712: Upgrade to Pyspark ≥ 3.4.
Mon, Nov 4, 5:46 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a subtask for T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0: T378899: Upgrade to Spark 3.2 to support Spark lineage for Iceberg tables.
Mon, Nov 4, 5:43 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a parent task for T378899: Upgrade to Spark 3.2 to support Spark lineage for Iceberg tables: T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0.
Mon, Nov 4, 5:43 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data-Catalog, Data Pipelines
Ottomata renamed T378899: Upgrade to Spark 3.2 to support Spark lineage for Iceberg tables from Upgrade to Spark 3.2 to support Spark lineage to Upgrade to Spark 3.2 to support Spark lineage for Iceberg tables.
Mon, Nov 4, 5:42 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data-Catalog, Data Pipelines
Ottomata added a comment to T373144: [SPIKE] Learn and document how to use Flink-CDC from MediaWiki MariaDB locally.

Yeah I was thinking of trying that next too! So
MariaDB -> Debezium -> Kafka
  -> Flink CDC Iceberg
  or
  -> Paimon Sync Database action.

Mon, Nov 4, 5:42 PM · Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a comment to T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.

Re 4/ Refined table that won't Refine with new process:

Mon, Nov 4, 3:59 PM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added projects to T378786: Request Kerberos identity for jsn.sherman: SRE-Access-Requests, Data-Platform-SRE.
Mon, Nov 4, 3:24 PM · SRE, Data-Platform-SRE, SRE-Access-Requests, Data-Engineering
Ottomata added a comment to T367403: Validate CI integration so that Ci can release Maven artifacts on user's demand.

There was never an answer to some questions there.

Mon, Nov 4, 3:24 PM · Discovery-Search (Current work), Release-Engineering-Team (Radar), Data-Engineering (Q1 2024 July 1st - September 30th), Java-Scala-Standardization, Data-Platform-SRE
Ottomata added a comment to T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.

But keeping those 171 custom blocks statically in the MediaWiki-config repo is OK.

Still not understanding. Why are there 171 static custom blocks? Won't the defaults just add them automatically?

Mon, Nov 4, 2:58 PM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata renamed T326179: Expose revision revert risk scores in EventStreams from Proposal: Create a stream end point for Revision Risk Model to Expose revision revert risk scores in EventStreams.
Mon, Nov 4, 2:01 PM · Event-Platform, Data-Engineering, Machine-Learning-Team, Research
Ottomata updated subscribers of T326179: Expose revision revert risk scores in EventStreams.
Mon, Nov 4, 2:00 PM · Event-Platform, Data-Engineering, Machine-Learning-Team, Research

Fri, Nov 1

Ottomata added a comment to T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.

Is there a reason why this can't be added to the default block, like we do for Gobblin?

Fri, Nov 1, 8:15 PM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added projects to T378772: Decide on how data platform wants to monitor bundle sizes: Metrics Platform, Data Products.
Fri, Nov 1, 7:52 PM · Data Products, Metrics Platform, Patch-For-Review, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, Data-Engineering, Test-Infrastructure
Ottomata added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

allowing metric owners to choose which mechanism and data store best suits them?

FWIW, with the event platform consolidation proposal, metrics owners do not have to choose. All of these metrics will go both to Prometheus and to the Data Lake automatically.

Fri, Nov 1, 4:29 PM · Patch-For-Review, Event-Platform, Data-Engineering, Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics

Thu, Oct 31

Ottomata added a comment to T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.

or adding the 171 block into the PHP file manually.

Thu, Oct 31, 7:02 PM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a project to T376841: Render human-readable schemas on schema.wikimedia.org: Event-Platform.
Thu, Oct 31, 4:10 PM · Data Products (Data Products Sprint 22), Data-Engineering, Event-Platform, Documentation
Ottomata added a comment to T376841: Render human-readable schemas on schema.wikimedia.org.

You know what else would be cool!?

Thu, Oct 31, 4:09 PM · Data Products (Data Products Sprint 22), Data-Engineering, Event-Platform, Documentation
Ottomata renamed T377600: [refine] Add support for custom partitioning from [refine] Add support for extra partitioning to [refine] Add support for custom partitioning.
Thu, Oct 31, 3:31 PM · Data-Engineering, Data Products
Ottomata updated subscribers of T377600: [refine] Add support for custom partitioning.
Thu, Oct 31, 3:29 PM · Data-Engineering, Data Products

Tue, Oct 29

Ottomata updated the task description for T331399: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page.
Tue, Oct 29, 6:58 PM · Data-Engineering, Event-Platform, Machine-Learning-Team
Ottomata updated subscribers of T338792: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities.
Tue, Oct 29, 6:58 PM · Data-Engineering, Epic, Wikimedia Enterprise, Machine-Learning-Team, Event-Platform
Ottomata added a comment to T370424: Streamline Data Platform access approvals for WMF staff.

Also updated docs here:
https://wikitech.wikimedia.org/w/index.php?title=SRE%2FClinic_Duty%2FAccess_requests&diff=2239783&oldid=2224064

Tue, Oct 29, 6:41 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data-Platform-SRE, SRE
Ottomata moved T370424: Streamline Data Platform access approvals for WMF staff from In Review to Done on the Data-Engineering (Q2 2024 October 1st - December 31th) board.
Tue, Oct 29, 5:58 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data-Platform-SRE, SRE
Ottomata added a comment to T370424: Streamline Data Platform access approvals for WMF staff.

Merged!

Tue, Oct 29, 5:58 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data-Platform-SRE, SRE
Ottomata added a comment to T378517: Requesting access to the analytics cluster for CDobbins.

Approved

Tue, Oct 29, 5:53 PM · SRE, SRE-Access-Requests
Ottomata added a comment to T370363: Document when to use Event Platform vs Metrics Platform for instrumentation.

<3 thank you!

Tue, Oct 29, 3:37 PM · Metrics Platform
Ottomata updated subscribers of T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Skein support in Kubernetes might not be required

Indeed!

Tue, Oct 29, 3:34 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
Ottomata added a comment to T377928: Ensure that we can submit spark jobs via `spark3-submit` from airflow .

we need to have the spark3-submit binary be a symlink to spark-submit, as we use it extensively in airflow-dags

Tue, Oct 29, 3:28 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Patch-For-Review
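
A minimal sketch of the symlink arrangement described above, assuming both names live in the same directory. It runs against a temporary directory with an empty stand-in file rather than a real Spark installation, and `ensure_alias` is a hypothetical helper name.

```python
import os
import tempfile


def ensure_alias(bin_dir: str, target: str = "spark-submit", alias: str = "spark3-submit") -> str:
    """Create `alias` as a relative symlink to `target` inside `bin_dir` if it is missing."""
    alias_path = os.path.join(bin_dir, alias)
    if not os.path.islink(alias_path):
        os.symlink(target, alias_path)
    return alias_path


with tempfile.TemporaryDirectory() as bin_dir:
    # Empty stand-in for the real spark-submit binary.
    open(os.path.join(bin_dir, "spark-submit"), "w").close()
    alias_path = ensure_alias(bin_dir)
    resolved = os.path.basename(os.path.realpath(alias_path))
    print(resolved)
```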
JAllemandou awarded T377739: [Refine Refactoring] Refine Data Quality - late events, RefineMonitor refactor, etc. a Yellow Medal token.
Tue, Oct 29, 10:46 AM · Data-Engineering (Q2 2024 October 1st - December 31th)

Mon, Oct 28

Ottomata added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .

Incident report has been moved to Wikitech.

Mon, Oct 28, 8:51 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a comment to T373144: [SPIKE] Learn and document how to use Flink-CDC from MediaWiki MariaDB locally.

Just sent this email to the flink and paimon user email groups.

Mon, Oct 28, 4:15 PM · Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata added a comment to T375402: Tune Dumps 2.0 hourly ingestion jobs.

how do we help the revision level MERGE INTO? The suggested partitioning schema doesn't, because we have, at an hour level about ~150k events, and at a day level it is ~3.6M

Mon, Oct 28, 3:32 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)
Ottomata added a comment to T375402: Tune Dumps 2.0 hourly ingestion jobs.

@xcollazo maybe stupid idea:

Mon, Oct 28, 1:38 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)
Ottomata updated subscribers of T375402: Tune Dumps 2.0 hourly ingestion jobs.
Mon, Oct 28, 1:37 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)

Sat, Oct 26

Ottomata added a comment to T373144: [SPIKE] Learn and document how to use Flink-CDC from MediaWiki MariaDB locally.

I think MW will write various logs to files in ./cache, which is a mounted volume so you should be able to tail them from your host machine. Is there anything tricky in there?

Sat, Oct 26, 5:19 PM · Data-Engineering (Q2 2024 October 1st - December 31th)
Ottomata updated the task description for T368927: [Epic] Migrate Data Platform Engineering maintained git repos to GitLab.
Sat, Oct 26, 5:15 PM · Epic, Data-Engineering
Ottomata added a comment to T373144: [SPIKE] Learn and document how to use Flink-CDC from MediaWiki MariaDB locally.

Been trying to do MariaDB -> Kafka -> Paimon with flink-cdc pipeline + Paimon sync database action, but I got stuck on the Kafka -> Paimon part. Just sent this email to the flink and paimon user email groups.

Sat, Oct 26, 3:36 AM · Data-Engineering (Q2 2024 October 1st - December 31th)

Fri, Oct 25

Ottomata added a comment to T358373: [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions.

so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.

Fri, Oct 25, 6:46 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)
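
The filter-then-self-join approach described in that comment can be sketched in plain Python, standing in for the Spark job. The `Revision` fields and the in-memory `history` list are illustrative assumptions and do not reflect the actual schema of wmf_dumps.wikitext_raw_rc2. Note that the filter step still scans the whole history, which is the cost the comment points out.

```python
import difflib
from typing import Dict, List, NamedTuple, Optional, Set


class Revision(NamedTuple):
    page_id: int
    rev_id: int
    parent_rev_id: Optional[int]
    text: str


def diffs_for_changed_pages(history: List[Revision], changed_pages: Set[int]) -> Dict[int, str]:
    """Map rev_id to a unified diff against its parent, for changed pages only."""
    # Filter step: keep only revisions of pages touched in this interval.
    subset = [r for r in history if r.page_id in changed_pages]
    # Self-join step: index the subset by rev_id, then look up each parent.
    by_id = {r.rev_id: r for r in subset}
    result: Dict[int, str] = {}
    for rev in subset:
        parent = by_id.get(rev.parent_rev_id)
        old = parent.text.splitlines() if parent else []
        new = rev.text.splitlines()
        result[rev.rev_id] = "\n".join(difflib.unified_diff(old, new, lineterm=""))
    return result


# Tiny made-up history: page 1 changed this hour, page 2 did not.
history = [
    Revision(1, 10, None, "hello"),
    Revision(1, 11, 10, "hello\nworld"),
    Revision(2, 20, None, "untouched page"),
]
diffs = diffs_for_changed_pages(history, changed_pages={1})
print(diffs[11])
```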
Ottomata closed T265966: Proposal: drop kafka-php dependency from MediaWiki as Resolved.

A quick codesearch (https://codesearch.wmcloud.org/search/?q=kafka-php&files=&excludeFiles=&repos=) and local grep yields no results, so this knot might have neatly tied itself

Fri, Oct 25, 4:05 PM · Data-Engineering-Icebox, Analytics-Radar, Platform Team Workboards (Clinic Duty Team), MediaWiki-General
Ottomata added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

Other use cases:

Fri, Oct 25, 4:01 PM · Patch-For-Review, Event-Platform, Data-Engineering, Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics