User Details
- User Since: Jun 9 2022, 6:42 PM
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: XCollazo-WMF
Fri, Dec 12
I think our model definitely has a gap if we are to account for imports.
The optimization_predicates computed with the 'set_of_page_ids' pushdown_strategy list page_id 178775087. Since that does not match page_id 100282687, which is already stored in the table for revision_id 1118294298, the merge falls through to the insert clause.
This duplicates revision_id 1118294298.
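To make the gap concrete, here is a minimal plain-Python sketch (hypothetical names, not the actual pipeline code) of how a page_id-based pushdown predicate can duplicate a revision after an import: the target scan is pre-filtered to the incoming page_ids, so an existing row stored under a different page_id is never seen by the match step, and the insert clause fires.

```python
# Sketch only: simulates a MERGE whose target scan is pruned by the
# 'set_of_page_ids' pushdown strategy. All names are illustrative.

def merge(target, incoming):
    # Pushdown: only target rows whose page_id appears in the incoming
    # batch are scanned for revision_id matches.
    page_ids = {r["page_id"] for r in incoming}
    visible = {r["revision_id"] for r in target if r["page_id"] in page_ids}
    for row in incoming:
        if row["revision_id"] in visible:
            pass  # matched -> update path (not shown)
        else:
            target.append(row)  # no match seen -> insert, possibly a duplicate

# Existing row stored under the old page_id; the import moved the page.
target = [{"page_id": 100282687, "revision_id": 1118294298}]
incoming = [{"page_id": 178775087, "revision_id": 1118294298}]
merge(target, incoming)
assert len(target) == 2  # revision_id 1118294298 now appears twice
```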
Awesome finding @APizzata-WMF !!
Resuming all MW Content DAGs.
Thu, Dec 11
There are a lot of reconcile events to be ingested (see the Flink app over the last 1h). All these events are legit, but they inadvertently came from an airflow-devenv instance. Will debug that issue separately.
@Harej please confirm whether rsync works on your end.
(Side note: I just bumped TaskManagers to 20 temporarily due to T411803).
20251210 is available at https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20251210/.
( For now, we have paused the pipeline, as it is continually failing: https://airflow.wikimedia.org/dags/druid_load_network_flows_internal_daily/grid )
Confirmed via contact email that the request is legit.
Wed, Dec 10
Copy pasting from MR 88, for completeness:
Yes, Presto is even better!
Thanks for the effort @brouberol!
Fri, Dec 5
Thu, Dec 4
In my opinion, having a single partition on a topic looks potentially dangerous in general: if, due to an issue, we need to send 500M reconcile records to that topic, it will take days to process them, and one broker will suddenly receive a lot of data.
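A back-of-envelope sketch of the concern: with a single partition, consumption cannot be parallelized across consumers, so drain time scales linearly with the backlog. The per-partition rate below is an assumption for illustration only, not a measured figure.

```python
# Assumed numbers for illustration; only the 500M backlog comes from the
# discussion above.
backlog = 500_000_000        # reconcile records
rate_per_partition = 2_000   # records/sec per partition (assumed)

def drain_days(partitions: int) -> float:
    """Days to drain the backlog given N partitions consumed in parallel."""
    return backlog / (rate_per_partition * partitions) / 86_400

print(f"1 partition:   {drain_days(1):.1f} days")
print(f"12 partitions: {drain_days(12):.1f} days")
```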
Opened T411803 (Fix reconcile bug where user_id is not being populated correctly) to fix the reconcile issue.
because this process runs only once a month.
Wed, Dec 3
Tue, Dec 2
I used joal.test_mediawiki_revision_history_druid, a test run of https://gerrit.wikimedia.org/r/1214023, to do some data validations.
Mon, Dec 1
If that works as expected, I'll try to integrate it into the dump DAGs.
Wed, Nov 26
@JAllemandou table wmf_content.mediawiki_revision_history has been backfilled, and the Airflow DAG is now live at https://airflow.wikimedia.org/dags/mw_revision_merge_events_to_mw_revision_history_daily/grid
Backfilling rerun from T410688#11410585 finished successfully, but it struggled with a couple of task failures:
Time taken: 10965.418 seconds
Ah, I just marked this one as a duplicate of T411116: CentralAuth's localuser table contains many nulls and duplicate mappings. There is more context on that other task, so I'm taking the liberty of closing this one in favor of it.
Now running a backfilling SQL that deduplicates wmf_raw.centralauth_localuser with a heuristic:
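The actual backfill SQL and its heuristic are not shown above; purely as an illustration, a dedup of this shape keeps one row per (lu_wiki, lu_local_id) key using some tie-breaker column. The choice of lu_attached_timestamp (earliest wins) below is hypothetical, not the heuristic actually used.

```python
# Generic dedup sketch for wmf_raw.centralauth_localuser-like rows.
# Tie-breaker (earliest lu_attached_timestamp) is an assumed heuristic.
rows = [
    {"lu_wiki": "enwiki", "lu_local_id": 1, "lu_attached_timestamp": "20200101"},
    {"lu_wiki": "enwiki", "lu_local_id": 1, "lu_attached_timestamp": "20210101"},  # dup key
    {"lu_wiki": "dewiki", "lu_local_id": 7, "lu_attached_timestamp": "20190101"},
]

best = {}
for r in rows:
    key = (r["lu_wiki"], r["lu_local_id"])
    # Keep one row per key; earliest attachment wins (illustrative choice).
    if key not in best or r["lu_attached_timestamp"] < best[key]["lu_attached_timestamp"]:
        best[key] = r

deduped = list(best.values())
assert len(deduped) == 2
```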
Documentation from table doesn't help much: https://www.mediawiki.org/wiki/Extension:CentralAuth/localuser_table
(Opened T411100: Bump mem of an-launcher1003.eqiad.wmnet to 32GB to follow up on the driver memory limitation)
Funnily enough, there are now 4.3M more rows on the target table than on the source:
spark-sql (default)> SELECT count(1) FROM wmf_content.mediawiki_revision_history_v1
> ;
count(1)
7493104804
Time taken: 48.148 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT count(1) FROM wmf_content.mediawiki_content_history_v1
> ;
count(1)
7488763332
Time taken: 214.82 seconds, Fetched 1 row(s)
Tue, Nov 25
Moved the predicates to the ON condition so that they only apply to the right table.
...
FROM mwch_no_content c
LEFT JOIN wmf_raw.centralauth_localuser u
ON ( u.snapshot='2025-10'
AND u.wiki_db='centralauth'
AND c.wiki_id = u.lu_wiki
AND c.user_id = u.lu_local_id
)
ORDER BY c.revision_dt
Hmm, the conditions must be wrong, as a count check is missing 1B rows:
spark-sql (default)> select count(1) from wmf_content.mediawiki_revision_history_v1;
count(1)
6548697332
Time taken: 32.89 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) from wmf_content.mediawiki_content_history_v1;
count(1)
7488763332
Time taken: 218.654 seconds, Fetched 1 row(s)
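The ON-vs-WHERE distinction matters here: predicates on the right table of a LEFT JOIN, when placed in WHERE, are evaluated after the join, so left rows that got NULLs on the right side fail them and are discarded, silently turning the LEFT JOIN into an INNER JOIN. A tiny plain-Python simulation of that semantics (toy data, not the real tables):

```python
# Simulate LEFT JOIN with a right-table predicate either in ON or in WHERE.
left = [{"user_id": 1}, {"user_id": 2}]               # e.g. revision rows
right = [{"lu_local_id": 1, "snapshot": "2025-10"}]   # e.g. centralauth rows

def left_join(pred_in_on: bool):
    out = []
    for l in left:
        matches = [r for r in right
                   if r["lu_local_id"] == l["user_id"]
                   and (not pred_in_on or r["snapshot"] == "2025-10")]
        if matches:
            out += [(l, r) for r in matches]
        else:
            out.append((l, None))  # LEFT JOIN keeps the left row with NULLs
    if not pred_in_on:
        # Predicate in WHERE: runs after the join; a NULL right side fails it,
        # so unmatched left rows are dropped (the missing ~1B rows above).
        out = [(l, r) for (l, r) in out if r is not None and r["snapshot"] == "2025-10"]
    return out

assert len(left_join(pred_in_on=True)) == 2   # unmatched left row preserved
assert len(left_join(pred_in_on=False)) == 1  # unmatched left row silently lost
```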
I attempted to backfill this table yesterday, just for enwiki, and the query failed, so backfilling will need more tuning. Leaving the script here for later:
Mon, Nov 24
Had to restart the VM and lost the mounted volume. We were missing a mount definition for /mnt/docker-scratch in fstab:
$ cat /etc/fstab
PARTUUID=60e1fb21-856d-4220-8d87-f9d6ffcda7be / ext4 rw,discard,errors=remount-ro,x-systemd.growfs 0 1
PARTUUID=f9abe075-7aa9-4f0c-bc89-11cddd2df78b /boot/efi vfat defaults 0 0
UUID=e9d0b85d-9385-43b2-99bd-b5393b0d8e51 /mnt/docker-scratch ext4 defaults 0 2
I'm not sure if we have similar cases in other repositories.
Fri, Nov 21
Since we are going to need to backfill wmf_content.mediawiki_revision_history_v1 from wmf_content.mediawiki_content_history_v1, perhaps it is best if we backfill user_central_id to the latter first: T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1).
I agree this particular check should not be a concern of jsonschema-tools.
Thu, Nov 20
Renamed relevant Gitlab project from:
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump
to:
https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-content-pipelines
Tue, Nov 18
Restoring all MW Content pipelines.
wmf_content.mediawiki_content_current_v1 also has duplicates:
spark.sql("""
SELECT count(1) as count FROM (
SELECT count(1) as count,
wiki_id,
revision_id
FROM wmf_content.mediawiki_content_current_v1
GROUP BY wiki_id, revision_id
HAVING count > 1
)
""").show(300, truncate=False)
Following procedure from T404975#11197939:
Going to apply same fix as in T404975#11197939 and T404975#11198200.
Total duplicates:
spark.sql("""
SELECT count(1) as count FROM (
SELECT count(1) as count,
wiki_id,
revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id, revision_id
HAVING count > 1
)
""").show(300, truncate=False)
[Stage 2:====================================================>(1021 + 3) / 1024]
+-----+
|count|
+-----+
|2457 |
+-----+
Looks like (yet) another instance of T404975: Another instance of duplicate rows on wmf_content.mediawiki_content_history_v1.
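The duplicate check above groups by (wiki_id, revision_id) and counts keys seen more than once; the fix then keeps a single row per key. A plain-Python stand-in for that shape (toy rows; column names are the ones used in the Spark SQL above):

```python
from collections import Counter

# Toy rows with one duplicate key, mirroring the GROUP BY ... HAVING count > 1 check.
rows = [
    {"wiki_id": "enwiki", "revision_id": 1118294298},
    {"wiki_id": "enwiki", "revision_id": 1118294298},  # duplicate
    {"wiki_id": "enwiki", "revision_id": 42},
]

counts = Counter((r["wiki_id"], r["revision_id"]) for r in rows)
duplicates = {k: c for k, c in counts.items() if c > 1}
assert duplicates == {("enwiki", 1118294298): 2}

# Dedup: keep the first row per (wiki_id, revision_id) key.
seen, deduped = set(), []
for r in rows:
    key = (r["wiki_id"], r["revision_id"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)
assert len(deduped) == 2
```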
To recap, the fix we merged should address this issue from Oct 22 onward:
I think we are done here for now. Bookworm should give us a 2-3 year runway.
Deleted old gitlab-docker-runner VM instance, and deleted old gitlab-docker-runner-workspace volume to give back the resources to WMCS.
Mon, Nov 17
gitlab-docker-runner-v2 successfully built the latest conda-analytics artifact. 🎉
Ok, I think I found the root cause: an MTU mismatch between the host network and the bridge network that Docker creates:
xcollazo@gitlab-docker-runner-v2:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:5c:86:83 brd ff:ff:ff:ff:ff:ff
altname enp0s3
inet 172.16.20.94/21 metric 100 brd 172.16.23.255 scope global dynamic ens3
valid_lft 67821sec preferred_lft 67821sec
inet6 2a02:ec80:a000:1::13f/128 scope global dynamic noprefixroute
valid_lft 73212sec preferred_lft 73212sec
inet6 fe80::f816:3eff:fe5c:8683/64 scope link
valid_lft forever preferred_lft forever
54: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:d7:dd:02:ab brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
Marked https://wikitech.wikimedia.org/wiki/MediaWiki_Content_File_Exports as paused.
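For reference on the MTU mismatch above (docker0 at 1500 vs ens3 at 1450): one common remedy is setting Docker's default bridge MTU via the `"mtu"` key in /etc/docker/daemon.json so it matches the host NIC, avoiding silently dropped large packets. A sketch of the config content only (writing the file and restarting dockerd are left out; whether this was the fix actually applied here is not stated above):

```python
import json

# From the `ip a` output above: the host NIC (ens3) has MTU 1450 while the
# docker0 bridge defaults to 1500.
host_nic_mtu = 1450

# Docker's daemon.json supports an "mtu" key for the default bridge network.
config = {"mtu": host_nic_mtu}
print(json.dumps(config, indent=2))
```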
Fri, Nov 14
Tried setting a proxy via https://wikitech.wikimedia.org/wiki/HTTP_proxy, but no luck either.
Ok, I can repro directly by creating a manual container from the Docker image:
xcollazo@gitlab-docker-runner-v2:~$ sudo docker image list
REPOSITORY                                                         TAG             IMAGE ID       CREATED       SIZE
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper  x86_64-v18.5.0  e3cc82a0845d   2 hours ago   94.4MB
docker-registry.wikimedia.org/bullseye                             20251019        807cec67eba7   3 weeks ago   80.7MB
From cloudvps instance itself we can reach the offending hosts:
$ hostname -f
gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud
Currently stuck on what seems to be a proxying issue:
(Deleted old conda-analytics packages from https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/packages)
(We migrated this server to latest debian version over at T408019: Upgrade gitlab-docker-runner to latest debian version).
Similar to T321736#8357399:
Even though applying the docker puppet class had failed before (T410083#11372512), applying it now (via configs in Horizon) was successful:
$ docker --version
Docker version 20.10.24+dfsg1, build 297e128
Following T321736#8353296 we added a 60GB volume to keep all the docker images:
ssh -J xcollazo@bastion.wmcloud.org xcollazo@gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud
Thanks!
Accessible again, closing.
Nov 13 2025
@taavi, I removed all puppet roles, and it still doesn't allow me to SSH in.
Hit a blocker with T410083: gitlab-docker-runner-v2.analytics instance is inaccessible via SSH.
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!