
xcollazo (Xabriel J. Collazo Mojica)
Staff Data Engineer for Wikimedia

User Details

User Since
Jun 9 2022, 6:42 PM (183 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
XCollazo-WMF [ Global Accounts ]

Recent Activity

Fri, Dec 12

xcollazo added a comment to T409105: mediawiki.page_change.v1 event stream - Investigate mistmatched meta.dt and dt (and rev_dt) fields.

I think our model definitely has a gap if we are to account for imports.

Fri, Dec 12, 7:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), MW-Interfaces-Team, Event-Platform
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

The optimization_predicates computed with the 'set_of_page_ids' pushdown_strategy lists the page_id 178775087 and since it will not match with the page_id 100282687 already stored in the table for the revision_id 1118294298, the insert clause will be performed.
This causes the duplication of the revision_id 1118294298.

Awesome finding @APizzata-WMF !!

Fri, Dec 12, 5:48 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
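
A toy model of the failure described in the comment above may help. This is a simplified sketch under stated assumptions, not the pipeline's actual code: `toy_merge` is a hypothetical helper that matches rows on `revision_id` only, and `pushdown_page_ids` is a hypothetical stand-in for the predicate produced by the 'set_of_page_ids' strategy.

```python
def toy_merge(target, source, pushdown_page_ids=None):
    """Toy MERGE keyed on revision_id: update on match, insert otherwise.

    pushdown_page_ids (hypothetical) restricts which target rows are
    visible to the match, mimicking a pushed-down page_id predicate.
    """
    result = [dict(row) for row in target]
    visible = {
        row["revision_id"]: row
        for row in result
        if pushdown_page_ids is None or row["page_id"] in pushdown_page_ids
    }
    for event in source:
        match = visible.get(event["revision_id"])
        if match is not None:
            match.update(event)          # WHEN MATCHED THEN UPDATE
        else:
            result.append(dict(event))   # WHEN NOT MATCHED THEN INSERT
    return result


# Values taken from the comment: same revision, two different page_ids.
target = [{"revision_id": 1118294298, "page_id": 100282687}]
source = [{"revision_id": 1118294298, "page_id": 178775087}]

no_pushdown = toy_merge(target, source)
with_pushdown = toy_merge(target, source, pushdown_page_ids={178775087})

print(len(no_pushdown))    # 1: the existing row is found and updated
print(len(with_pushdown))  # 2: the existing row is hidden, so the event is inserted
```

In this model the pushed-down predicate hides the stored row (page_id 100282687) from the match, so the incoming event falls through to the insert branch and the revision_id appears twice.
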
xcollazo moved T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter from Next Up to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Dec 12, 3:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo claimed T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.
Fri, Dec 12, 3:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo changed the status of T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter from Open to In Progress.
Fri, Dec 12, 3:53 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo added a comment to T412443: Handle `network_flows_internal` data growth.

@xcollazo what's the reason for the failure ? No disk space ?

Fri, Dec 12, 2:01 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Resuming all MW Content DAGs.

Fri, Dec 12, 1:36 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Thu, Dec 11

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

There are a lot of reconcile events to be ingested (See Flink app last 1h.) All these events are legit, but did inadvertently come from an airflow-devenv instance. Will debug that issue separately.

Thu, Dec 11, 10:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a subtask for T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2): T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision.
Thu, Dec 11, 9:42 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
xcollazo added a parent task for T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision: T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2).
Thu, Dec 11, 9:42 PM · Data-Engineering, DPE-Mediawiki-Content
xcollazo created T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision.
Thu, Dec 11, 9:42 PM · Data-Engineering, DPE-Mediawiki-Content
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.

@Harej please confirm whether rsync works on your end.

Thu, Dec 11, 9:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

(Side note: I just bumped TaskManagers to 20 temporarily due to T411803).

Thu, Dec 11, 9:20 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T412428: Wikidata full .json.gz dumps not published since 20250625.

20251210 is available at https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20251210/.

Thu, Dec 11, 8:15 PM · Wikidata, Data-Engineering, Dumps-Generation, Wikidata data dumps
xcollazo added a comment to T412443: Handle `network_flows_internal` data growth.

( For now, we have paused the pipeline, as it is continually failing: https://airflow.wikimedia.org/dags/druid_load_network_flows_internal_daily/grid )

Thu, Dec 11, 8:03 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.

Confirmed via contact email that the request is legit.

Thu, Dec 11, 5:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation

Wed, Dec 10

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Copy pasting from MR 88, for completeness:

Wed, Dec 10, 7:02 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

Yes, Presto is even better!

Wed, Dec 10, 2:59 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering, DPE-Mediawiki-Content
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

Thanks for the effort @brouberol!

Wed, Dec 10, 2:47 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering, DPE-Mediawiki-Content

Fri, Dec 5

xcollazo moved T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily from Urgent to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Dec 5, 3:22 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo moved T411803: Fix reconcile bug where user_id is not being populated correctly. from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Dec 5, 3:21 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo moved T410688: Implement a new pipeline and table with reconciled historical revision data from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Dec 5, 3:19 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Thu, Dec 4

xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

In my opinion, in general, it looks potentially a bit dangerous to have a single partition on a topic; for example, if due to an issue, we need to send 500M records for reconcile to that topic, we'll need days to process that, and one broker will get a lot of data suddenly.

Thu, Dec 4, 6:34 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
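
A back-of-the-envelope estimate makes the "days to process" concern concrete. The sketch below assumes that consumer throughput scales linearly with the partition count; the per-partition rate is a hypothetical number chosen for illustration, not a measurement from this topic.

```python
def drain_time_hours(records, records_per_sec_per_partition, partitions):
    """Hours to consume a backlog, assuming throughput scales linearly with partitions."""
    total_rate = records_per_sec_per_partition * partitions
    return records / total_rate / 3600


BACKLOG = 500_000_000          # figure from the comment above
RATE_PER_PARTITION = 2_000     # hypothetical records/second, for illustration only

for partitions in (1, 4, 20):
    hours = drain_time_hours(BACKLOG, RATE_PER_PARTITION, partitions)
    print(f"{partitions:>2} partition(s): {hours:6.1f} h ({hours / 24:.1f} days)")
```

At the assumed rate, a single partition needs roughly 69 hours (close to three days) to drain 500M records, while 20 partitions would need about 3.5 hours.
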
xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

Opened T411803: Fix reconcile bug where user_id is not being populated correctly. to fix the reconcile issue.

Thu, Dec 4, 4:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo created T411803: Fix reconcile bug where user_id is not being populated correctly..
Thu, Dec 4, 4:53 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

because this process runs only once a month.

Thu, Dec 4, 4:37 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Wed, Dec 3

xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I don't know if my other script runs have tidied all of those things up... But I doubt it, as they don't delete rows... But it will have fixed some of the lu_local_id is null entries as per T303590#11411628

Are those ocwiki dupes legit? ...

Wed, Dec 3, 7:02 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T360794: Implement stream of HTML content on mw.page_change event.

@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink, and this would be a good opportunity to spike on a Java pipeline instead of doing a quick implementation now and dealing with migration pains later.

Wed, Dec 3, 4:20 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Research, Event-Platform

Tue, Dec 2

xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

I used joal.test_mediawiki_revision_history_druid, a test run of https://gerrit.wikimedia.org/r/1214023, to do some data validations.

Tue, Dec 2, 9:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Mon, Dec 1

xcollazo moved T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1) from Urgent to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Dec 1, 3:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

If that works as expected, I'll try to integrate it into the dump DAGs.

Mon, Dec 1, 3:30 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering, DPE-Mediawiki-Content

Wed, Nov 26

xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

@JAllemandou table wmf_content.mediawiki_revision_history has been backfilled, and the Airflow DAG is now live at https://airflow.wikimedia.org/dags/mw_revision_merge_events_to_mw_revision_history_daily/grid

Wed, Nov 26, 10:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Backfilling rerun from T410688#11410585 finished successfully, but it struggled with a couple of task failures:

Response code
Time taken: 10965.418 seconds
Wed, Nov 26, 9:50 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

How often do we run into this case? Could this be handled by documentation / run book for the person on OpsWeek?

Wed, Nov 26, 8:39 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering, DPE-Mediawiki-Content
xcollazo merged T411079: `central_auth.local_user` table contains corrupted records into T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
Wed, Nov 26, 6:38 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo merged task T411079: `central_auth.local_user` table contains corrupted records into T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
Wed, Nov 26, 6:38 PM · MediaWiki-Platform-Team, MediaWiki-Core-AuthManager
xcollazo added a comment to T411079: `central_auth.local_user` table contains corrupted records.

Ah, I just duplicated this one on T411116: CentralAuth's localuser table contains many nulls and duplicate mappings. There is more context on that other one, however, thus taking the liberty of closing this one in favor of the other.

Wed, Nov 26, 6:38 PM · MediaWiki-Platform-Team, MediaWiki-Core-AuthManager
xcollazo created T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
Wed, Nov 26, 6:27 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Now running a backfilling SQL that deduplicates wmf_raw.centralauth_localuser with a heuristic:

Wed, Nov 26, 6:19 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

The table's documentation doesn't help much: https://www.mediawiki.org/wiki/Extension:CentralAuth/localuser_table

Wed, Nov 26, 2:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Funnily enough, there are now 4.3M more rows in the target table than in the source

I have investigated @xcollazo's finding, and it's not great: the centralauth.local_user table contains rows with NULL values for local_user_id for many projects, and for other projects (ocwiki and outreachwiki at least) it has multiple rows for the same local_user_id and global_user_id...
This explains the row duplication :(
I have been trying the MERGE approach on my test table removing corrupted data, but the job still fails. I'll continue my investigations in that direction.

Wed, Nov 26, 2:53 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
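
The row duplication described above can be reproduced on a tiny in-memory example. This is a sketch using SQLite rather than Spark, with hypothetical table names (`revisions`, `localuser`) and made-up values; the `SELECT DISTINCT` step is just one illustrative way to remove exact duplicates and is not necessarily the heuristic used for the backfill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revisions (wiki_id TEXT, revision_id INTEGER, user_id INTEGER);
CREATE TABLE localuser (lu_wiki TEXT, lu_local_id INTEGER, lu_global_id INTEGER);
INSERT INTO revisions VALUES ('ocwiki', 1, 10), ('ocwiki', 2, 20);
-- user 10 is mapped twice with identical values (toy data)
INSERT INTO localuser VALUES ('ocwiki', 10, 500), ('ocwiki', 10, 500), ('ocwiki', 20, 600);
""")

source_rows = conn.execute("SELECT count(*) FROM revisions").fetchone()[0]

# Every duplicate mapping on the right side yields an extra output row.
joined_rows = conn.execute("""
SELECT count(*) FROM revisions c
LEFT JOIN localuser u ON c.wiki_id = u.lu_wiki AND c.user_id = u.lu_local_id
""").fetchone()[0]

# Deduplicating the right side first restores a one-to-one join.
deduped_rows = conn.execute("""
SELECT count(*) FROM revisions c
LEFT JOIN (SELECT DISTINCT lu_wiki, lu_local_id, lu_global_id FROM localuser) u
  ON c.wiki_id = u.lu_wiki AND c.user_id = u.lu_local_id
""").fetchone()[0]

print(source_rows, joined_rows, deduped_rows)
```

The plain join returns more rows than the source table, which is the same shape of discrepancy as the extra 4.3M rows reported for the target table.
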
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

(Opened T411100: Bump mem of an-launcher1003.eqiad.wmnet to 32GB to follow up on the driver memory limitation)

Wed, Nov 26, 2:32 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo created T411100: Bump mem of an-launcher1003.eqiad.wmnet to 32GB.
Wed, Nov 26, 2:31 PM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Funnily enough, there are now 4.3M more rows in the target table than in the source:

spark-sql (default)> SELECT count(1) FROM wmf_content.mediawiki_revision_history_v1                      
                   > ;
count(1)
7493104804
Time taken: 48.148 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT count(1) FROM wmf_content.mediawiki_content_history_v1
                   > ;
count(1)
7488763332
Time taken: 214.82 seconds, Fetched 1 row(s)
Wed, Nov 26, 2:05 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Tue, Nov 25

xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Moved the predicates to the ON condition so that they only apply to the right table.

...
FROM mwch_no_content c
LEFT JOIN wmf_raw.centralauth_localuser u
   ON ( u.snapshot='2025-10'
        AND u.wiki_db='centralauth'
        AND c.wiki_id = u.lu_wiki
        AND c.user_id = u.lu_local_id
   )
ORDER BY c.revision_dt
Tue, Nov 25, 8:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
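
Why the placement of these predicates matters can be shown with a small SQLite sketch. It assumes the earlier version filtered the right table in a WHERE clause, which the comment implies but does not show; the table names (`revisions`, `localuser`) and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revisions (wiki_id TEXT, revision_id INTEGER, user_id INTEGER);
CREATE TABLE localuser (snapshot TEXT, lu_wiki TEXT, lu_local_id INTEGER);
INSERT INTO revisions VALUES ('enwiki', 1, 10), ('enwiki', 2, 20), ('enwiki', 3, NULL);
INSERT INTO localuser VALUES ('2025-10', 'enwiki', 10);
""")

JOIN = """
SELECT count(*) FROM revisions c
LEFT JOIN localuser u
  ON c.wiki_id = u.lu_wiki AND c.user_id = u.lu_local_id {extra_on}
{where}
"""

# Right-table predicate in WHERE: unmatched rows have u.snapshot = NULL and are dropped.
rows_where = conn.execute(
    JOIN.format(extra_on="", where="WHERE u.snapshot = '2025-10'")
).fetchone()[0]

# Same predicate inside ON: it only limits which right-side rows can match.
rows_on = conn.execute(
    JOIN.format(extra_on="AND u.snapshot = '2025-10'", where="")
).fetchone()[0]

print(rows_where, rows_on)
```

With the predicate in WHERE, the LEFT JOIN effectively behaves like an inner join and loses every revision without a matching user row; with the predicate in ON, all left-side rows survive.
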
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Hmm, the conditions must be wrong, as a count check is missing 1B rows:

spark-sql (default)> select count(1) from wmf_content.mediawiki_revision_history_v1;
count(1)
6548697332
Time taken: 32.89 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) from wmf_content.mediawiki_content_history_v1;
count(1)
7488763332
Time taken: 218.654 seconds, Fetched 1 row(s)
Tue, Nov 25, 8:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

DDL as of now: https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-content-pipelines/-/blob/e9689ce7be310560f796f1da3631e0880c42038e/hql/create-wmf_content_mediawiki_revision_history_v1.hql

Tue, Nov 25, 8:44 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Since we are going to need to backfill wmf_content.mediawiki_revision_history_v1 from wmf_content.mediawiki_content_history_v1, perhaps it is best if we backfill user_central_id to the latter first: T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1).

It's a possible solution. Another one is to backfill the user_central_id value only onto the new table, as it'll be a lot easier to rewrite (smaller size).

Tue, Nov 25, 1:47 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1).

I attempted to backfill this table yesterday, just for enwiki, and the query failed, so backfilling will need more tuning. Leaving the script here for later:

Tue, Nov 25, 1:44 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Mon, Nov 24

xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Had to restart the VM, and lost the mounted volume. We were missing a mount definition for /mnt/docker-scratch in fstab:

$ cat /etc/fstab 
PARTUUID=60e1fb21-856d-4220-8d87-f9d6ffcda7be / ext4 rw,discard,errors=remount-ro,x-systemd.growfs 0 1
PARTUUID=f9abe075-7aa9-4f0c-bc89-11cddd2df78b /boot/efi vfat defaults 0 0
UUID=e9d0b85d-9385-43b2-99bd-b5393b0d8e51 /mnt/docker-scratch ext4 defaults 0 2
Mon, Nov 24, 10:39 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo updated subscribers of T377023: Add CI step to event schema repositories to test to fail if a schema is deleted.

I'm not sure if we have similar cases in other repositories.

Mon, Nov 24, 9:32 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Fri, Nov 21

xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Since we are going to need to backfill wmf_content.mediawiki_revision_history_v1 from wmf_content.mediawiki_content_history_v1, perhaps it is best if we backfill user_central_id to the latter first: T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1).

Fri, Nov 21, 8:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo moved T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1) from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Nov 21, 8:29 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo moved T410688: Implement a new pipeline and table with reconciled historical revision data from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Fri, Nov 21, 8:29 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T377023: Add CI step to event schema repositories to test to fail if a schema is deleted.

I agree this particular check should not be a concern of jsonschema-tools.

Fri, Nov 21, 4:23 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Thu, Nov 20

xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Renamed relevant Gitlab project from:
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump
to:
https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-content-pipelines

Thu, Nov 20, 8:12 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a parent task for T410688: Implement a new pipeline and table with reconciled historical revision data: T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes.
Thu, Nov 20, 7:59 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a subtask for T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes: T410688: Implement a new pipeline and table with reconciled historical revision data.
Thu, Nov 20, 7:59 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo created T410688: Implement a new pipeline and table with reconciled historical revision data.
Thu, Nov 20, 7:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T398160: Check home/HDFS leftovers of mnz.

Noting here that there are 7.7M HDFS files associated with user mnz. See latest HDFS usage dashboard at https://superset.wikimedia.org/superset/dashboard/409/?native_filters_key=bIfs5MQRkyBgn-V4IHFj_xsMhf3haMa0acFKr7ny_R-tWM1ZaxKHinKcr3X7yFr5.

Thu, Nov 20, 3:44 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work

Tue, Nov 18

xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Restoring all MW Content pipelines.

Tue, Nov 18, 9:14 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

wmf_content.mediawiki_content_current_v1 also has duplicates:

spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         wiki_id,
         revision_id
  FROM wmf_content.mediawiki_content_current_v1
  GROUP BY wiki_id, revision_id
  HAVING count > 1
)
""").show(300, truncate=False)
Tue, Nov 18, 9:05 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Following procedure from T404975#11197939:

Tue, Nov 18, 8:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo moved T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Tue, Nov 18, 8:34 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo moved T408019: Upgrade gitlab-docker-runner to latest debian version from Urgent to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Tue, Nov 18, 8:34 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Going to apply same fix as in T404975#11197939 and T404975#11198200.

Tue, Nov 18, 7:57 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Total duplicates:

spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         wiki_id,
         revision_id
  FROM wmf_content.mediawiki_content_history_v1
  GROUP BY wiki_id, revision_id
  HAVING count > 1
)
""").show(300, truncate=False)
[Stage 2:====================================================>(1021 + 3) / 1024]
+-----+
|count|
+-----+
|2457 |
+-----+
Tue, Nov 18, 7:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
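
To make explicit what the query above counts, here is a plain-Python equivalent on hypothetical sample rows. It is a sketch for reading the result, not part of the pipeline: the reported number is the count of (wiki_id, revision_id) keys that occur more than once, which can differ from the number of surplus rows.

```python
from collections import Counter

# Hypothetical sample rows; only the (wiki_id, revision_id) key matters here.
rows = [
    ("enwiki", 101),
    ("enwiki", 102),
    ("enwiki", 102),   # duplicate
    ("frwiki", 101),
    ("frwiki", 101),   # duplicate
    ("frwiki", 101),   # duplicate
]

key_counts = Counter(rows)                                         # inner GROUP BY
duplicated_keys = {k: n for k, n in key_counts.items() if n > 1}   # HAVING count > 1
total_duplicated = len(duplicated_keys)                            # outer count(1)
surplus_rows = sum(n - 1 for n in duplicated_keys.values())        # rows beyond the first

print(total_duplicated, surplus_rows)
```
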
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Looks like (yet) another instance of T404975: Another instance of duplicate rows on wmf_content.mediawiki_content_history_v1.

Tue, Nov 18, 7:12 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo changed the status of T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily from Open to In Progress.
Tue, Nov 18, 7:11 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.

To recap, the fix we merged should resolve this issue from Oct 22 onward:

Tue, Nov 18, 2:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), DPE-Mediawiki-Content
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

I think we are done here for now. Bookworm should give us a 2-3 year runway.

Tue, Nov 18, 2:01 AM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408543: MTU setting in IPv6 VMs causes issues with Docker.
Tue, Nov 18, 2:00 AM · Patch-For-Review, cloud-services-team, Cloud-VPS
xcollazo updated the task description for T408019: Upgrade gitlab-docker-runner to latest debian version.
Tue, Nov 18, 1:55 AM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Deleted old gitlab-docker-runner VM instance, and deleted old gitlab-docker-runner-workspace volume to give back the resources to WMCS.

Tue, Nov 18, 1:54 AM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work

Mon, Nov 17

xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

gitlab-docker-runner-v2 successfully built the latest conda-analytics artifact. 🎉

Mon, Nov 17, 9:15 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.
Mon, Nov 17, 8:45 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

OK, I think I found the root cause: an MTU mismatch between the host network and the bridge network that Docker creates:

xcollazo@gitlab-docker-runner-v2:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5c:86:83 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet 172.16.20.94/21 metric 100 brd 172.16.23.255 scope global dynamic ens3
       valid_lft 67821sec preferred_lft 67821sec
    inet6 2a02:ec80:a000:1::13f/128 scope global dynamic noprefixroute 
       valid_lft 73212sec preferred_lft 73212sec
    inet6 fe80::f816:3eff:fe5c:8683/64 scope link 
       valid_lft forever preferred_lft forever
54: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d7:dd:02:ab brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
Mon, Nov 17, 5:26 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
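
The mismatch in the output above can also be spotted programmatically. The following is a minimal sketch that parses an abbreviated copy of that `ip a` output; `interface_mtus` and `oversized_interfaces` are hypothetical helper names, and treating `ens3` as the uplink is an assumption based on the listing.

```python
import re

# Interface header lines copied from the `ip a` output above.
IP_A_OUTPUT = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
54: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
"""

HEADER = re.compile(r"^\d+: ([^:@\s]+)(?:@[^:\s]+)?: <[^>]*> mtu (\d+)", re.MULTILINE)


def interface_mtus(text):
    """Map interface name to MTU from `ip a` style header lines."""
    return {name: int(mtu) for name, mtu in HEADER.findall(text)}


def oversized_interfaces(mtus, uplink):
    """Non-loopback interfaces whose MTU exceeds the uplink's MTU."""
    limit = mtus[uplink]
    return [
        name for name, mtu in mtus.items()
        if name not in (uplink, "lo") and mtu > limit
    ]


mtus = interface_mtus(IP_A_OUTPUT)
print(oversized_interfaces(mtus, uplink="ens3"))
```

Here `docker0` (MTU 1500) is flagged because it exceeds the 1450 configured on `ens3`.
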
xcollazo moved T408019: Upgrade gitlab-docker-runner to latest debian version from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Nov 17, 4:11 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T401022: Implement the data layout, UI, and documentation for the XML file export.

Marked https://wikitech.wikimedia.org/wiki/MediaWiki_Content_File_Exports as paused.

Mon, Nov 17, 3:43 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo closed T409565: enwiki file export failed due to OOM as Resolved.
Mon, Nov 17, 3:13 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo closed Restricted Task, a subtask of T384382: Production-level file export (aka dump) of MW Content in XML, as Resolved.
Mon, Nov 17, 3:12 PM · Data-Engineering, DPE-Mediawiki-Content, Epic
xcollazo closed T407902: Certain languages on wmf_raw.mediawiki_project_namespace_map appear wrong as Resolved.
Mon, Nov 17, 3:09 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Fri, Nov 14

xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Tried setting a proxy via https://wikitech.wikimedia.org/wiki/HTTP_proxy, but no luck either.

Fri, Nov 14, 8:34 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

OK, I can repro directly by creating a manual container from the Docker image:

xcollazo@gitlab-docker-runner-v2:~$ sudo docker image list
REPOSITORY                                                          TAG              IMAGE ID       CREATED       SIZE
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper   x86_64-v18.5.0   e3cc82a0845d   2 hours ago   94.4MB
docker-registry.wikimedia.org/bullseye                              20251019         807cec67eba7   3 weeks ago   80.7MB
Fri, Nov 14, 8:22 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo claimed T408019: Upgrade gitlab-docker-runner to latest debian version.
Fri, Nov 14, 8:13 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

From the Cloud VPS instance itself, we can reach the offending hosts:

$ hostname -f
gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud
Fri, Nov 14, 8:12 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo edited projects for T408019: Upgrade gitlab-docker-runner to latest debian version, added: Data-Engineering (Q2 FY25/26 October 1st - December 31th); removed Data-Engineering.
Fri, Nov 14, 8:08 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo renamed T408019: Upgrade gitlab-docker-runner to latest debian version from Check what projects, if any, are still using gitlab-docker-runner to Upgrade gitlab-docker-runner to latest debian version.
Fri, Nov 14, 8:08 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo updated subscribers of T408019: Upgrade gitlab-docker-runner to latest debian version.

Currently stuck on what seems to be a proxying issue:

Fri, Nov 14, 8:07 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

(Deleted old conda-analytics packages from https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/packages)

Fri, Nov 14, 5:33 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T321736: Modify conda-analytics CI pipeline to use a custom gitlab runner that can run docker.

(We migrated this server to latest debian version over at T408019: Upgrade gitlab-docker-runner to latest debian version).

Fri, Nov 14, 5:30 PM · Data Pipelines (Sprint 03)
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Similar to T321736#8357399:

Fri, Nov 14, 5:29 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Even though applying the docker puppet class had failed before (T410083#11372512), applying it now (via configs in horizon) was successful:

$ docker --version
Docker version 20.10.24+dfsg1, build 297e128
Fri, Nov 14, 5:16 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Following T321736#8353296 we added a 60GB volume to keep all the docker images:

ssh -J xcollazo@bastion.wmcloud.org xcollazo@gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud
Fri, Nov 14, 4:52 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo added a comment to T409970: Increase volume storage on project analytics.

Thanks!

Fri, Nov 14, 4:11 PM · Cloud-VPS (Quota-requests)
xcollazo closed T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH as Resolved.

Accessible again, closing.

Fri, Nov 14, 3:50 PM · VPS-Projects
xcollazo added a comment to T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH.

If the initial set of Puppet runs fails then the instance will get stuck in a weird state. I have manually kicked off a run which might or might not get it working again.

Fri, Nov 14, 3:49 PM · VPS-Projects

Nov 13 2025

xcollazo added a comment to T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH.

@taavi, I removed all puppet roles, and it still doesn't allow me to SSH in.

Nov 13 2025, 11:45 PM · VPS-Projects
xcollazo added a comment to T408019: Upgrade gitlab-docker-runner to latest debian version.

Hit a blocker with T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH.

Nov 13 2025, 7:48 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Essential-Work
xcollazo created T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Nov 13 2025, 7:45 PM · VPS-Projects