
Migrate the additional dump types from snapshot1016 to Airflow
Closed, Resolved (Public)

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Dumps_v1: Add missing path element in the wikibase sync destination | repos/data-engineering/airflow-dags!1535 | btullis | fix_wikibase_sync_path | main
Dumps_v1: Fix the command for the shorturls dump | repos/data-engineering/airflow-dags!1502 | btullis | fix_shorturls_dump_cmd | main
Dumps_v1: Convert the pagetitles dump into two DAGs. | repos/data-engineering/airflow-dags!1499 | btullis | fix_mediatitles_dump | main
(fix) common: differentiate dev instance script environments with k8s devenvs | repos/data-engineering/airflow-dags!1485 | brouberol | hotfix-common-util-root-folder-location | main
test_k8s/dumps: deprecate schedule_interval in favor for schedule | repos/data-engineering/airflow-dags!1484 | brouberol | T394389 | main
Add links to the dumps on airflow/k8s documentation in the DAGs documentation | repos/data-engineering/airflow-dags!1483 | brouberol | T394389 | main
test_k8s/dumps/wikibase/lexemes: fix dump command arguments | repos/data-engineering/airflow-dags!1470 | brouberol | T394389 | main
Dumps_v1: Add the shorturls dump | repos/data-engineering/airflow-dags!1462 | btullis | add_shorturls_dump | main
Dumps_v1: Fix some sync task parameters | repos/data-engineering/airflow-dags!1449 | btullis | fix_sync_targets | main
Dumps_v1: Add the cirrussearch dumps | repos/data-engineering/airflow-dags!1445 | btullis | add_cirrussearch_dumps | main
Dumps_v1: Add a DAG to execute the media info dump | repos/data-engineering/airflow-dags!1438 | btullis | dumps_imageinfo | main
Dumps_v1: Fix the categories dump job spec fetcher | repos/data-engineering/airflow-dags!1417 | btullis | fix_categories_dumps | main
Dumps_v1: Add the wikibase dumps | repos/data-engineering/airflow-dags!1413 | btullis | wikibase_dumps | main
Dumps_v1: Add the categories RDF dumps DAG. | repos/data-engineering/airflow-dags!1408 | btullis | add_categories_dump | main
Add the mediawiki content translation dump | repos/data-engineering/airflow-dags!1362 | btullis | add_more_dumps_dags | main
Fix the paths of some of the dumps scripts and config files | repos/releng/release!181 | btullis | fix_dump_script_symlinks | main
Add a DAG to run the pagetitles dumps | repos/data-engineering/airflow-dags!1337 | btullis | dumps_pagetitles | main
mediawiki-cli: Add the dcat repository and enable some dumps scripts | repos/releng/release!171 | btullis | add_dumps_scripts | main
dumps: Add the addschanges dump | repos/data-engineering/airflow-dags!1325 | btullis | adds_changes_dag | main

Event Timeline

There are a very large number of changes, so older changes are hidden.

The draft addschanges dump is emitting warnings like the following:

[2025-05-15, 14:19:42 UTC] {pod_manager.py:477} INFO - [base] [WARNING]: Warning: Undefined array key "x3" in /srv/mediawiki/src/etcd.php on line 136
[2025-05-15, 14:19:43 UTC] {pod_manager.py:477} INFO - [base] Warning: foreach() argument must be of type array|object, null given in /srv/mediawiki/src/etcd.php on line 136

But they don't seem to stop the dumps from completing.
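To get a sense of how widespread these warnings are in a task log, the lines can be tallied by source file and line number. The following is a minimal sketch, assuming the log line format shown above; the function name `count_php_warnings` and the regular expression are illustrative and not part of any Airflow or MediaWiki tooling.

```python
import re
from collections import Counter

# Matches the PHP warning text as it appears in the log lines above:
#   Warning: <message> in <file> on line <number>
PHP_WARNING = re.compile(r"Warning: (?P<msg>.+) in (?P<file>\S+) on line (?P<line>\d+)")


def count_php_warnings(log_lines):
    """Tally PHP warnings by (file, line number)."""
    counts = Counter()
    for text in log_lines:
        match = PHP_WARNING.search(text)
        if match:
            counts[(match["file"], int(match["line"]))] += 1
    return counts


# The two log lines quoted above.
sample = [
    '[2025-05-15, 14:19:42 UTC] {pod_manager.py:477} INFO - [base] [WARNING]: '
    'Warning: Undefined array key "x3" in /srv/mediawiki/src/etcd.php on line 136',
    "[2025-05-15, 14:19:43 UTC] {pod_manager.py:477} INFO - [base] "
    "Warning: foreach() argument must be of type array|object, null given "
    "in /srv/mediawiki/src/etcd.php on line 136",
]

for (path, line), n in count_php_warnings(sample).items():
    print(f"{path}:{line} -> {n} warning(s)")
```

Both sample lines point at the same location in etcd.php, so they are grouped under a single key.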

Change #1147028 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Add a copy of the dump scripts that are in puppet

https://gerrit.wikimedia.org/r/1147028

btullis opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/171

mediawiki-cli: Add the dcat repository and enable some dumps scripts

Change #1147028 merged by Btullis:

[operations/dumps@master] Add a copy of the dump scripts that are in puppet

https://gerrit.wikimedia.org/r/1147028

dancy merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/171

mediawiki-cli: Add the dcat repository and enable some dumps scripts

Mentioned in SAL (#wikimedia-operations) [2025-05-19T14:54:24Z] <dancy@deploy1003> Started scap sync-world: Updating images for T394389

Mentioned in SAL (#wikimedia-operations) [2025-05-19T14:59:48Z] <dancy@deploy1003> dancy: Updating images for T394389 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-05-19T15:07:19Z] <dancy@deploy1003> Finished scap sync-world: Updating images for T394389 (duration: 12m 55s)

Change #1148365 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] mediawiki-dumps-legacy: Bump dumps toolbox image tag

https://gerrit.wikimedia.org/r/1148365

Change #1148365 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-dumps-legacy: Bump dumps toolbox image tag

https://gerrit.wikimedia.org/r/1148365

Change #1148863 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Adapt dump scripts for running in containers

https://gerrit.wikimedia.org/r/1148863

Change #1148863 merged by Btullis:

[operations/dumps@master] Adapt dump scripts for running in containers

https://gerrit.wikimedia.org/r/1148863

Change #1152058 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] dumps: Bump toolbox mediawiki image

https://gerrit.wikimedia.org/r/1152058

Change #1152058 merged by jenkins-bot:

[operations/deployment-charts@master] dumps: Bump toolbox mediawiki image

https://gerrit.wikimedia.org/r/1152058

Currently running the first manual cirrussearch dump for one section (s1) in the toolbox pod.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/dse-k8s-services/mediawiki-dumps-legacy$ kubectl exec -it mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p -- bash
Defaulted container "toolbox" out of: toolbox, mediawiki-dumps-legacy-resources-tls-proxy

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/$ /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s1.dblist
Dumping 7000834 documents (7000834 in the index)

It seems to have been killed, possibly because of an out-of-memory (OOM) error.
The output was:

Dumping 7000834 documents (7000834 in the index)
	2% done...
	4% done...
	6% done...
	8% done...
	10% done...
	12% done...

We can also see that it used up to 8 GB of RAM.

image.png (468×1 px, 104 KB)
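One way to narrow down whether a process was killed rather than exiting on its own is to look at its exit status. A process terminated by SIGKILL, which is the signal the Linux OOM killer sends, is reported by a shell as 137 (128 + 9) and by Python's `subprocess` as -9. The sketch below is a hypothetical helper (`describe_exit` is not part of the dumps tooling) that decodes both conventions. A SIGKILL on its own is consistent with an OOM kill but does not prove one.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe an exit status as reported by subprocess or by a shell."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # subprocess convention: -N means "killed by signal N"
        signum = -returncode
    elif returncode > 128:
        # shell convention: 128 + N means "killed by signal N"
        signum = returncode - 128
    else:
        return f"exited with status {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number, so treat it as a plain exit status.
        return f"exited with status {returncode}"
    hint = " (consistent with, but not proof of, an OOM kill)" if name == "SIGKILL" else ""
    return f"killed by {name}{hint}"


for status in (0, 1, 137, -9):
    print(status, "->", describe_exit(status))
```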

However, apart from a directory with today's date and an updated current symlink, no files seem to have been created.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/cirrussearch$ ls -l
total 1
drwxrwsr-x 2 www-data www-data 0 May 20 15:53 20250520
drwxrwsr-x 2 www-data www-data 0 May 21 08:40 20250521
drwxrwsr-x 2 www-data www-data 0 May 29 12:35 20250529
lrwxrwxrwx 1 www-data www-data 8 May 29 12:35 current -> 20250529
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/cirrussearch$ find
.
./20250520
./20250521
./current
./20250529

I'll have to keep investigating this.

Oh, it seems that it did produce a 14 GB temp file after all. I ran it with the --dryrun option to see the commands that would be executed.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/$ /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s1.dblist --dryrun
mkdir -p '/mnt/dumpsdata/otherdumps/cirrussearch/20250529'
/mnt/dumpsdata/otherdumps/cirrussearch/20250529/enwiki-20250529-cirrussearch-content.json.gz or /mnt/dumpsdata/xmldatadumps/temp/enwiki-20250529-cirrussearch-content.json.gz already exists, skipping...
/usr/bin/php8.1 '/srv/mediawiki/multiversion/MWScript.php' extensions/CirrusSearch/maintenance/DumpIndex.php --wiki='enwiki' --indexSuffix='general' | /bin/gzip > '/mnt/dumpsdata/xmldatadumps/temp/enwiki-20250529-cirrussearch-general.json.gz'
mv '/mnt/dumpsdata/xmldatadumps/temp/enwiki-20250529-cirrussearch-general.json.gz' '/mnt/dumpsdata/otherdumps/cirrussearch/20250529/enwiki-20250529-cirrussearch-general.json.gz'
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/$ ls -l /mnt/dumpsdata/xmldatadumps/temp/
total 13962480
drwxrwsr-x 79 www-data www-data          77 Apr 30 09:21 a
drwxrwsr-x 71 www-data www-data          69 Apr 30 09:48 b
-rw-rw-r--  1 www-data www-data 14297579520 May 29 14:29 enwiki-20250529-cirrussearch-content.json.gz
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/$
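The dry-run output suggests a write-to-temp-then-move pattern, with a dump being skipped when either the final file or the temp file already exists. The sketch below models that behaviour. It is inferred only from the dry-run output above and not from the source of dumpcirrussearch.sh, and the function name `plan_cirrussearch_dump` is hypothetical.

```python
from pathlib import Path


def plan_cirrussearch_dump(wiki, date, suffix, out_root, temp_dir):
    """Return (temp_path, final_path, skip) for one index suffix.

    The path layout and the skip rule are assumptions based on the
    dry-run output, not on the real script.
    """
    name = f"{wiki}-{date}-cirrussearch-{suffix}.json.gz"
    final_path = Path(out_root) / date / name
    temp_path = Path(temp_dir) / name
    # Skip when either the finished file or a temp file is already present.
    skip = final_path.exists() or temp_path.exists()
    return temp_path, final_path, skip


for suffix in ("content", "general"):
    temp_path, final_path, skip = plan_cirrussearch_dump(
        "enwiki",
        "20250529",
        suffix,
        "/mnt/dumpsdata/otherdumps/cirrussearch",
        "/mnt/dumpsdata/xmldatadumps/temp",
    )
    action = "skip" if skip else f"dump to {temp_path}, then move to {final_path}"
    print(f"{suffix}: {action}")
```

If the real script does work this way, the partial temp file left behind by the killed run would cause the content dump to be skipped on a rerun for the same date, so it may need to be removed first.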

So maybe these cirrussearch dumps will just need more RAM.
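For reference, the size of the temp file can be cross-checked against the `total` line of the same listing. This is a small arithmetic sketch, assuming that `ls` reports its total in 1 KiB blocks (the GNU default); the variable names are illustrative.

```python
size_bytes = 14_297_579_520   # size of the temp file in the ls -l output
ls_total_blocks = 13_962_480  # the "total" line of the same listing

size_gb = round(size_bytes / 10**9, 2)   # decimal gigabytes
size_gib = round(size_bytes / 2**30, 2)  # binary gibibytes
blocks_match = size_bytes // 1024 == ls_total_blocks

print(f"{size_gb} GB, {size_gib} GiB, total matches file size: {blocks_match}")
```

The file is about 14.3 GB (13.32 GiB), and the block total equals the file size exactly, so the temp file accounts for essentially all of the space used in that directory listing.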

I executed a manual dump run for the content translation dump and it seems to be progressing well.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/$ /usr/local/bin/dumpcontentxlation.sh

The files for this look good.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/contenttranslation/20250529$ pstree -a 105
dumpcontentxlat /usr/local/bin/dumpcontentxlation.sh
  └─php8.1 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki enwiki -q --split-at 500 --outputdir /mnt/dumpsdata/otherdumps/contenttranslation/20250529-
      └─sh -c gzip > '/mnt/dumpsdata/otherdumps/contenttranslation/20250529/cx-corpora.en2ar.html.json.gz'
          └─gzip
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/contenttranslation/20250529$

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/contenttranslation/20250529$ ls -lh
total 577M
-rw-rw-r-- 1 www-data www-data 4.0M May 29 15:39 cx-corpora._2af.html.json.gz
-rw-rw-r-- 1 www-data www-data  61M May 29 15:39 cx-corpora.en2af.html.json.gz
-rw-rw-r-- 1 www-data www-data 2.8M May 29 15:39 cx-corpora.en2ak.html.json.gz
-rw-rw-r-- 1 www-data www-data 510M May 29 15:45 cx-corpora.en2ar.html.json.gz
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/contenttranslation/20250529$

Similarly, I ran the growth-mentorship dump manually and it finished quickly, with no errors.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:~$ /usr/local/bin/dump-growth-mentorship.sh --config /etc/dumps/confs/wikidump.conf.other

The output files in /mnt/dumpsdata/otherdumps/growthmentorship/20250529 look good.

I can start creating DAG files for these two dump types now.

The /usr/local/bin/create-media-per-project-lists.sh dump fails due to a missing /etc/dumps/dblists/globalusage.dblist file.
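A failure like this could be caught before the dump script starts with a small pre-flight check. This is only a sketch: `missing_files` is a hypothetical helper, and the list of required files contains just the one path named above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Only the file mentioned above; a real check would list every input.
required = ["/etc/dumps/dblists/globalusage.dblist"]

for path in missing_files(required):
    print(f"missing required file: {path}")
```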

The shortURLs dump seems to work with the command:

/usr/bin/python3 /srv/deployment/dumps/xmldumps-backup/onallwikis.py --wiki metawiki --configfile /etc/dumps/confs/wikidump.conf.dumps:monitor  --filenameformat 'shorturls-{d}.gz' --outdir '/mnt/dumpsdata/otherdumps/shorturls' --script extensions/UrlShortener/maintenance/dumpURLs.php 'compress.zlib://{DIR}'

The generated file is 79 MB in size.

www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/shorturls$ ls -lh
total 79M
-rw-r--r-- 1 www-data www-data 79M May 29 15:53 shorturls-20250529.gz
www-data@mediawiki-dumps-legacy-toolbox-5475d4b969-2d28p:/mnt/dumpsdata/otherdumps/shorturls$
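The shorturls command contains braces and a URL-style argument, which are easy to mangle when a command is passed around as a single string. One option when moving it into a DAG is to keep it as an argument list and only join it for display. The sketch below uses the standard library `shlex` module; how the DAGs actually pass commands to the pod is not shown in this task, so treat the list form as an assumption.

```python
import shlex

# The command from above, split into one list element per argument.
argv = [
    "/usr/bin/python3",
    "/srv/deployment/dumps/xmldumps-backup/onallwikis.py",
    "--wiki", "metawiki",
    "--configfile", "/etc/dumps/confs/wikidump.conf.dumps:monitor",
    "--filenameformat", "shorturls-{d}.gz",
    "--outdir", "/mnt/dumpsdata/otherdumps/shorturls",
    "--script", "extensions/UrlShortener/maintenance/dumpURLs.php",
    "compress.zlib://{DIR}",
]

# shlex.join quotes only the arguments that need it (here, those with braces).
command_line = shlex.join(argv)
print(command_line)
```

The `{d}` and `{DIR}` placeholders are left untouched here. They are expanded by onallwikis.py itself, and the resulting file name shorturls-20250529.gz suggests that `{d}` becomes the run date.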

Mentioned in SAL (#wikimedia-operations) [2025-06-10T11:35:32Z] <cgoubert@deploy1003> Started scap sync-world: mediawiki-cli: Fix the paths of some of the dumps scripts and config files - T394389

Mentioned in SAL (#wikimedia-operations) [2025-06-10T11:44:22Z] <cgoubert@deploy1003> Finished scap sync-world: mediawiki-cli: Fix the paths of some of the dumps scripts and config files - T394389 (duration: 08m 49s)

Change #1156862 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the mediawiki-dumps-legacy toolbox image

https://gerrit.wikimedia.org/r/1156862

Change #1156862 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the mediawiki-dumps-legacy toolbox image

https://gerrit.wikimedia.org/r/1156862

I'm now able to start a test run of the CategoriesRDF dump in the toolbox, like this:

/usr/local/bin/dumpcategoriesrdf.sh --config /etc/dumps/confs/wikidump.conf.other --list /srv/mediawiki/dblists/categories-rdf.dblist

The process tree looks OK.

www-data@mediawiki-dumps-legacy-toolbox-5fc88c9c76-cxk7t:/$ pstree -a 7
bash
  └─dumpcategoriesr /usr/local/bin/dumpcategoriesrdf.sh --config /etc/dumps/confs/wikidump.conf.other --list /srv/mediawiki/dblists/categories-rdf.dblist
      └─dumpcategoriesr /usr/local/bin/dumpcategoriesrdf.sh --config /etc/dumps/confs/wikidump.conf.other --list /srv/mediawiki/dblists/categories-rdf.dblist
          ├─gzip
          └─php8.1 /srv/mediawiki/multiversion/MWScript.php maintenance/dumpCategoriesAsRdf.php --wiki=arwiki --format=ttl

Files are being generated.

www-data@mediawiki-dumps-legacy-toolbox-5fc88c9c76-cxk7t:/$ find /mnt/dumpsdata/otherdumps/categoriesrdf/
/mnt/dumpsdata/otherdumps/categoriesrdf/
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/alswiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/acewiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/amwiktionary-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/adywiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/abwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/anwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/afwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/anwiktionary-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/afwikiquote-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/amwikimedia-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/afwikibooks-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/akwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/afwiktionary-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/angwiktionary-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/amwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/arcwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/lastdump/angwiki-categories.last
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/adywiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/acewiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/angwiktionary-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/afwikibooks-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/afwiktionary-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/anwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/akwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/afwikiquote-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/arcwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/anwiktionary-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/amwikimedia-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/amwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/alswiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/afwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/arwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/angwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/abwiki-20250613-categories.ttl.gz
/mnt/dumpsdata/otherdumps/categoriesrdf/20250613/amwiktionary-20250613-categories.ttl.gz
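Comparing the two directories in the listing gives a quick progress check. The sketch below uses the wiki names copied from the `find` output and reports any wiki that has a dump file but no `.last` marker. It assumes that the marker is written once a wiki's dump has finished, which is an inference from the listing and not confirmed from the script.

```python
# Wikis with a lastdump/<wiki>-categories.last marker in the listing above.
LAST_MARKERS = """
alswiki acewiki amwiktionary adywiki abwiki anwiki afwiki anwiktionary
afwikiquote amwikimedia afwikibooks akwiki afwiktionary angwiktionary
amwiki arcwiki angwiki
""".split()

# Wikis with a 20250613/<wiki>-20250613-categories.ttl.gz file.
DUMP_FILES = """
adywiki acewiki angwiktionary afwikibooks afwiktionary anwiki akwiki
afwikiquote arcwiki anwiktionary amwikimedia amwiki alswiki afwiki arwiki
angwiki abwiki amwiktionary
""".split()

# Dump file present but no marker yet: probably still being written.
without_marker = sorted(set(DUMP_FILES) - set(LAST_MARKERS))
# Marker present but no dump file: would indicate a problem.
without_dump = sorted(set(LAST_MARKERS) - set(DUMP_FILES))

print("dump file but no marker:", without_marker)
print("marker but no dump file:", without_dump)
```

The only wiki with a dump file and no marker is arwiki, which is the wiki being processed in the `pstree` output above.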

Change #1163341 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] dse-k8s: Increase maximum container/pod size for mediawiki-dumps-legacy

https://gerrit.wikimedia.org/r/1163341

Change #1163341 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: Configure limitranges for mediawiki-dumps-legacy

https://gerrit.wikimedia.org/r/1163341

BTullis updated the task description. (Show Details)

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1483

Add links to the dumps on airflow/k8s documentation in the DAGs documentation

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1485

(fix) common: differentiate dev instance script environments with k8s devenvs

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1485

(fix) common: differentiate dev instance script environments with k8s devenvs

Change #1172285 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Dumps: bump the mediawiki image deployed to the toolbox pod

https://gerrit.wikimedia.org/r/1172285

Change #1172285 merged by jenkins-bot:

[operations/deployment-charts@master] Dumps: bump the mediawiki image deployed to the toolbox pod

https://gerrit.wikimedia.org/r/1172285