xcollazo (Xabriel J. Collazo Mojica)
Staff Data Engineer for Wikimedia

User Details

User Since: Jun 9 2022, 6:42 PM (200 w, 2 d)
Availability: Available
LDAP User: Unknown
MediaWiki User: XCollazo-WMF

Recent Activity

Fri, Apr 10

xcollazo added a comment to T417694: Perform a one-time clean up of retained data sets in event_sanitize.

Confirming we aren't using the data in any of the /wmf/data/event_sanitized/mobilewikiapp* schemas; signing off on data deletion.

Fri, Apr 10, 7:39 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo created P90375 hdfs dfs -rms for mobilewikiapp*.
Fri, Apr 10, 7:31 PM
xcollazo created P90374 DROP TABLEs for mobilewikiapp*.
Fri, Apr 10, 7:30 PM
xcollazo added a comment to T391135: Iceberg table maintenance for tables under wmf_product database.

I think we implemented this via T373693. @KCVelaga_WMF can you confirm?

Fri, Apr 10, 5:05 PM · Data-Engineering, Product-Analytics
xcollazo updated subscribers of T422030: Surge in webrequest validation check.
Fri, Apr 10, 2:55 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Traffic
xcollazo placed T422030: Surge in webrequest validation check up for grabs.
Fri, Apr 10, 2:54 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Traffic
xcollazo reopened T422030: Surge in webrequest validation check as "Open".

Reopening and tagging Data-Engineering since we do still need to make the pipeline changes that @JAllemandou suggests:

Fri, Apr 10, 2:54 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Traffic

Thu, Apr 9

xcollazo updated the task description for T417694: Perform a one-time clean up of retained data sets in event_sanitize.
Thu, Apr 9, 1:51 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo created P90342 Paying technical debt with AI: an event_sanitized cleanup case study.
Thu, Apr 9, 1:51 PM

Wed, Apr 8

xcollazo updated the task description for T417694: Perform a one-time clean up of retained data sets in event_sanitize.
Wed, Apr 8, 7:26 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo updated subscribers of T417694: Perform a one-time clean up of retained data sets in event_sanitize.

@SNowick_WMF and @phuedx:

Wed, Apr 8, 6:31 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo assigned T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps to APizzata-WMF.
Wed, Apr 8, 5:47 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content
xcollazo updated subscribers of T417694: Perform a one-time clean up of retained data sets in event_sanitize.

We investigated two angles: the top offenders by file count and size (from the ticket description), and all schemas previously removed from the sanitization allowlist (identified via git log). This is not an exhaustive scan of the full namespace — there may be additional datasets worth reviewing. Findings are grouped by suggested action.

Wed, Apr 8, 1:42 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo moved T417694: Perform a one-time clean up of retained data sets in event_sanitize from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Wed, Apr 8, 12:58 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo claimed T417694: Perform a one-time clean up of retained data sets in event_sanitize.
Wed, Apr 8, 12:58 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo changed the status of T417694: Perform a one-time clean up of retained data sets in event_sanitize from Open to In Progress.
Wed, Apr 8, 12:57 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo moved T421982: Update Commons Impact Metrics allow-list March 2026 from Next Up to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Wed, Apr 8, 12:55 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Commons-Impact-Metrics-Requests, Commons-Impact-Metrics
xcollazo moved T401296: Add a --output-dir argument to wikibase rdf and json dumps from In Review to Ready to Deploy on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Wed, Apr 8, 12:46 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service
xcollazo added a comment to T401296: Add a --output-dir argument to wikibase rdf and json dumps.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2080

Refactor WikibaseDump to use OOP pattern and add output_dir support

Wed, Apr 8, 12:46 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service

Tue, Apr 7

xcollazo moved T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines from In progress to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Tue, Apr 7, 6:52 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo renamed T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines from Sunset in-pipeline metrics computation for the MW Content Pipeline to Sunset in-pipeline metrics computation for the MW Content Pipelines.
Tue, Apr 7, 4:04 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review

Mon, Apr 6

xcollazo updated the task description for T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines.
Mon, Apr 6, 8:22 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo moved T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines from Next Up to In progress on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Apr 6, 6:52 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo renamed T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines from Optimize metrics computation for the MW Content Pipeline to Sunset in-pipeline metrics computation for the MW Content Pipeline.
Mon, Apr 6, 6:51 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo changed the status of T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), from Open to In Progress.
Mon, Apr 6, 6:51 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
xcollazo closed T395139: MediaWiki Content History alerts too much for minor reconcile issues as Invalid.

Boldly closing this as out of date. Other tickets like T419055 have better context.

Mon, Apr 6, 6:47 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), DPE-Mediawiki-Content
xcollazo closed T395139: MediaWiki Content History alerts too much for minor reconcile issues, a subtask of T384382: Production-level file export (aka dump) of MW Content in XML, as Invalid.
Mon, Apr 6, 6:47 PM · Data-Engineering, DPE-Mediawiki-Content, Epic
xcollazo moved T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes from Blocked/Paused to Done on the Data-Engineering (Q4 FS25/26 April 1st - June 30st) board.
Mon, Apr 6, 6:44 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo updated subscribers of T422040: Migrate clouddumps https/rsync interfaces behind LVS.

CC @BTullis

Mon, Apr 6, 5:21 PM · Patch-For-Review, Traffic, Data-Services, tools-infrastructure-team, Datasets-General-or-Unknown
xcollazo added a comment to T420974: when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None.

I concur that this looks like a bug.

Mon, Apr 6, 2:52 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Dumps-Generation
xcollazo added a comment to T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes.

OK, confirming that this particular issue seems to be solved now. Note how overall inconsistencies are way down:

Mon, Apr 6, 2:32 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo added a comment to T422327: phabricator public tasks dump is not being synced.

I confirm that the dump is being produced and exists on phab1004.

Mon, Apr 6, 2:10 PM · collaboration-services, Datasets-General-or-Unknown, Phabricator

Fri, Mar 27

xcollazo moved T420787: Visualizing inconsistencies and reconciles via Superset from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Mar 27, 8:32 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo added a comment to T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines.

I am now of the opinion that we should just sunset these metrics:

Fri, Mar 27, 3:13 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

Noting that this has only happened once so far.

Fri, Mar 27, 2:44 PM · Essential-Work, Data-Engineering, DPE-Mediawiki-Content
xcollazo removed a project from T408819: When wikis cannot be exported due to SiteInfo, don't fail them: Data-Platform-SRE (2026-03-27 - 2026-04-17).
Fri, Mar 27, 2:43 PM · Essential-Work, Data-Engineering, DPE-Mediawiki-Content
xcollazo added a comment to T420787: Visualizing inconsistencies and reconciles via Superset.
  1. T420787 — MWCH Data Quality Dashboard: Summary
Fri, Mar 27, 1:06 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Wed, Mar 25

xcollazo added a comment to T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

Note further that there seem to be other streams with ValidationErrors over the last 30 days, but I did not dig deeper than the following query:

Wed, Mar 25, 5:37 PM · Data-Engineering, Event-Platform
xcollazo updated subscribers of T421257: EventBus: Unable to deliver all events: 503: Service Unavailable.
Wed, Mar 25, 4:36 PM · Data-Engineering, Event-Platform
xcollazo added a comment to T420787: Visualizing inconsistencies and reconciles via Superset.

More serious EventBus error rate: T421257: EventBus: Unable to deliver all events: 503: Service Unavailable.

Wed, Mar 25, 4:36 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo created T421257: EventBus: Unable to deliver all events: 503: Service Unavailable.
Wed, Mar 25, 4:35 PM · Data-Engineering, Event-Platform
xcollazo moved T420752: Improvements to local dev environment for Airflow from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Wed, Mar 25, 4:20 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo added a comment to T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

One (redacted) example from logstash:

Wed, Mar 25, 2:49 PM · Data-Engineering, Event-Platform
xcollazo updated the task description for T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.
Wed, Mar 25, 2:43 PM · Data-Engineering, Event-Platform
xcollazo added a comment to T420787: Visualizing inconsistencies and reconciles via Superset.

Minor bug on EventGate found: T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.

Wed, Mar 25, 1:53 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo created T421237: `mediawiki.page_change.v1`: two schema validation errors causing events to be silently dropped by EventGate.
Wed, Mar 25, 1:52 PM · Data-Engineering, Event-Platform

Tue, Mar 24

xcollazo updated subscribers of T420787: Visualizing inconsistencies and reconciles via Superset.

If there is a way to relate this to missing events in the event sources, I would be much obliged! Xabriel came up with a nice query for this but I haven't had time to explore it.

Folks often ask me "but how lossy are the events?". A visual chart that answers this would be very helpful!

Tue, Mar 24, 8:23 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo closed T419291: enwiki File Export failed for 2026-03-01 as Resolved.
Tue, Mar 24, 7:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T418780: HDFS usage dashboard is quadruple counting file counts and file sizes as Resolved.
Tue, Mar 24, 7:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T336738: Refactor our existing Airflow dags to use EasyDAG & DagProperties, a subtask of T336739: Post Oozie -> Airflow migration refactorings, as Resolved.
Tue, Mar 24, 7:26 PM · Data-Engineering-Icebox, Data-Engineering, Patch-For-Review, Epic, Data Pipelines
xcollazo closed T336738: Refactor our existing Airflow dags to use EasyDAG & DagProperties as Resolved.
Tue, Mar 24, 7:26 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T419889: Add Fixture Generation for KubernetesPodOperator Tasks as Resolved.
Tue, Mar 24, 7:25 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T418754: Do multiple code and data clean ups for content tables as Resolved.
Tue, Mar 24, 7:25 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T266374: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history.

Flagging for @Ahoelzl : This could be something we wish to consider.

Tue, Mar 24, 7:09 PM · Data-Engineering, Product-Analytics
xcollazo moved T420752: Improvements to local dev environment for Airflow from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Mar 24, 5:13 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo moved T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes from In progress to Blocked/Paused on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Mar 24, 5:10 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo added a comment to T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes.

(We are now waiting on the next monthly reconcile to happen to see where we are at.)

Tue, Mar 24, 5:10 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo added a comment to T418804: table_maintenance_iceberg_monthly permission issue fails task due to permission on Ivy cache artifact.

I got the same error running an airflow devenv while developing a Spark 3.3.2 DAG.

Ivy Default Cache set to: /tmp/runuser/ivy_spark3/cache
The jars for the packages stored in: /tmp/runuser/ivy_spark3/home/jars
org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6700c2db-77a4-421a-864c-fcab30a038e7;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.6.1 in mirrored
Exception in thread "main" java.io.FileNotFoundException: /tmp/runuser/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-6700c2db-77a4-421a-864c-fcab30a038e7-1.0.xml (Permission denied)
	at java.io.FileOutputStream.open0(Native Method)
Tue, Mar 24, 3:28 PM · Data-Engineering

Mon, Mar 23

xcollazo added a comment to T306550: Move dumps.wikimedia.org HTTP service behind CDN edge.

The option of telling the few consumers of the rsync to change the hostname/IP they're using to mirror us (to separate it from HTTP via the CDN) would be far simpler and avoid the need for such a proxy.

Mon, Mar 23, 3:26 PM · Traffic, Datasets-General-or-Unknown, cloud-services-team, Data-Services

Fri, Mar 20

xcollazo created T420787: Visualizing inconsistencies and reconciles via Superset.
Fri, Mar 20, 7:58 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
xcollazo moved T420752: Improvements to local dev environment for Airflow from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Mar 20, 7:20 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo updated the task description for T420752: Improvements to local dev environment for Airflow.
Fri, Mar 20, 7:20 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo updated the task description for T420752: Improvements to local dev environment for Airflow.
Fri, Mar 20, 6:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo added a comment to T420752: Improvements to local dev environment for Airflow.
  1. Make local development compatible with git worktrees
Fri, Mar 20, 6:43 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo created T420752: Improvements to local dev environment for Airflow.
Fri, Mar 20, 4:28 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
xcollazo moved T401296: Add a --output-dir argument to wikibase rdf and json dumps from Next Up to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Mar 20, 3:53 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service
xcollazo added a comment to T401296: Add a --output-dir argument to wikibase rdf and json dumps.

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2080

Refactor WikibaseDump to use OOP pattern and add output_dir support

Fri, Mar 20, 3:52 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service
xcollazo added a comment to T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes.

spark_process_reconciliation_events ingest for 2026-03-16 failed multiple times. I suspect this happened due to the same cluster issues reported on T419291#11723404: T420168 and T415002.

Rerunning as is now via https://yarn.wikimedia.org/proxy/application_1773845446826_10194

Fri, Mar 20, 3:28 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo moved T419291: enwiki File Export failed for 2026-03-01 from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Mar 20, 2:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

File export successful. All files now exposed publicly at https://dumps.wikimedia.org/other/mediawiki_content_history/enwiki/2026-03-01/xml/bzip2/.

Fri, Mar 20, 2:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Thu, Mar 19

xcollazo added a comment to T419055: Monthly reconcile continues to emit a really large amount of events after user_id changes.

spark_process_reconciliation_events ingest for 2026-03-16 failed multiple times. I suspect this happened due to the same cluster issues reported on T419291#11723404: T420168 and T415002.

Thu, Mar 19, 7:19 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I think it's easier to make it happen this way (reducing the mapper weight in sqoop script) than changing puppet. If ok for everyone, let's make it happen (with a comment in the code :) )

Thu, Mar 19, 7:06 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo moved T418780: HDFS usage dashboard is quadruple counting file counts and file sizes from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Mar 19, 6:25 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T418780: HDFS usage dashboard is quadruple counting file counts and file sizes.

For completeness, a link to the dashboard that was fixed: https://superset.wikimedia.org/superset/dashboard/409

Thu, Mar 19, 6:24 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T336738: Refactor our existing Airflow dags to use EasyDAG & DagProperties from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Mar 19, 1:37 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T336738: Refactor our existing Airflow dags to use EasyDAG & DagProperties.

Opened T420582: Migrate Airflow Search instance code away from deprecated VariableProperties to tackle that last bit of VariableProperties dependency.

Thu, Mar 19, 1:37 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo created T420582: Migrate Airflow Search instance code away from deprecated VariableProperties.
Thu, Mar 19, 1:36 PM · Discovery-Search (2026.03.03 - 2026.04.03)
xcollazo added a comment to T336738: Refactor our existing Airflow dags to use EasyDAG & DagProperties.

All usage of VariableProperties from within Airflow DAGs has now been migrated to DagProperties.

Thu, Mar 19, 1:34 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T419889: Add Fixture Generation for KubernetesPodOperator Tasks from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Mar 19, 1:25 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

Failed again. There was cluster instability due to T420168 and T415002 so retrying as is.

https://yarn.wikimedia.org/proxy/application_1773845446826_0041

Thu, Mar 19, 12:29 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Wed, Mar 18

xcollazo renamed T401296: Add a --output-dir argument to wikibase rdf and json dumps from Add a --output-dir argument to wikibase rdf dumps to Add a --output-dir argument to wikibase rdf and json dumps.
Wed, Mar 18, 7:19 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service
xcollazo renamed T401296: Add a --output-dir argument to wikibase rdf and json dumps from Add a --date argument to wikibase rdf dumps to Add a --output-dir argument to wikibase rdf dumps.
Wed, Mar 18, 5:02 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

What if we split by another column like lu_local_id?

I don't think that using lu_local_id is a good idea because it's not a table index, making splitting the data a lot less efficient.

Wed, Mar 18, 4:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

My way of dealing with that would be to change the puppet code to using 32 mappers.

Wed, Mar 18, 4:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I can also see that we are doing a split-by on a String column when sqooping the centralauth_localuser table, so it does make sense. What if we split by another column like lu_local_id?
https://github.com/wikimedia/analytics-refinery/blob/35e7f416fe4bc9e3aeb194474a4fa803d8983823/python/refinery/sqoop.py#L1355

Wed, Mar 18, 4:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I've read our code and the bug report again, and there is something I don't understand: the bug is supposed to happen when splitting a table on a String type field, but we split on a Long type field:
https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L1320
I'd really like for us to investigate more. Let's sync, @APizzata-WMF.

Wed, Mar 18, 4:02 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

Failed again. There was cluster instability due to T420168 and T415002 so retrying as is.

Wed, Mar 18, 2:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T419889: Add Fixture Generation for KubernetesPodOperator Tasks from In Review to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Wed, Mar 18, 2:03 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo changed the status of T401296: Add a --output-dir argument to wikibase rdf and json dumps from Open to In Progress.
Wed, Mar 18, 2:02 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Wikidata, Wikidata-Query-Service

Tue, Mar 17

xcollazo moved T419291: enwiki File Export failed for 2026-03-01 from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Mar 17, 8:16 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

Rerunning via https://yarn.wikimedia.org/proxy/application_1764064841637_2255221/jobs/

Tue, Mar 17, 7:34 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

Ran the following as a one off:

$ hostname -f
an-launcher1003.eqiad.wmnet
$ whoami
analytics
Tue, Mar 17, 7:23 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T418780: HDFS usage dashboard is quadruple counting file counts and file sizes from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Mar 17, 5:52 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T419291: enwiki File Export failed for 2026-03-01.

xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2096

mw_content_xml_export: fix enwiki timeout and clear output folder on retry

Tue, Mar 17, 5:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T418780: HDFS usage dashboard is quadruple counting file counts and file sizes from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Mar 17, 3:29 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T348774: [Maintenance] Add a deletion job for `hdfs_usage` data.

This continues to be an issue:

Tue, Mar 17, 3:27 PM · Data-Engineering
xcollazo updated subscribers of T418780: HDFS usage dashboard is quadruple counting file counts and file sizes.

Context: this dashboard was put together as part of T381707: Low available space on Hadoop / HDFS.

Tue, Mar 17, 3:22 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo claimed T418780: HDFS usage dashboard is quadruple counting file counts and file sizes.
Tue, Mar 17, 1:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo claimed T419291: enwiki File Export failed for 2026-03-01.
Tue, Mar 17, 1:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mon, Mar 16

xcollazo added a comment to T408918: Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17.

Being bold and reverting changes to mw-page-content-change-enrich to avoid inadvertently repeating T408918#11715866.

Mon, Mar 16, 9:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work