Shall I add these as a weekly run?
The run is now complete. I'll open a separate task later for the mw maintenance script behavior.
The following files, as well as the associated 7z files, have been generated.
/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p39078190p39156572.bz2
/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p39022925p39078189.bz2
/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p38438419p38499657.bz2
I am manually running a noop job in a screen session on snapshot1005 to update hash sums, status files, html files and latest links. Once that is done the dump run will be complete.
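For the record, the noop job gets run like any other dump job; a hypothetical invocation (the flag names here are from memory and may differ from the current worker.py) would look roughly like:
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps --date 20200201 --job noop wikidatawiki
It produces no dump output; it only regenerates the run's metadata (hash sums, status and html files, latest links).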
Mon, Feb 17
7zs are being produced now, but those three bz2 files are still missing. I'll copy them in at the end of the run, likely late today. Then I'll manually generate the 7zs and run a noop job to update hashes and status files.
Fri, Feb 14
Thu, Feb 13
Wed, Feb 12
This is pending https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/556346/ and related patches, so we're looking at March 1 if all goes well.
Mon, Feb 10
I've asked @JAllemandou to check the hadoop import tools too.
Fri, Feb 7
Adding @Ottomata as a heads up that these log lines will have an additional element in them, in case that impacts analytics processing.
Thu, Feb 6
Wed, Feb 5
Ah yes it is! The flag does all it needs to, sorry about that.
How does the above ETA look, now that All Hands is done and you have a better idea of what's on your plate?
Fri, Jan 31
After a short IRC chat, the new estimate for the wbterms migration to complete is 3-4 weeks. I'll update this task around then.
Thu, Jan 30
I can see that the challenges get set on the dns hosts a little past the hour, e.g. via dig @188.8.131.52 -t txt _acme-challenge.wiki-pedia.org, and that appropriate responses come back for the TXT record.
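For convenience when watching for the record to show up a little past the hour, something like this (a hypothetical polling wrapper around the same query) does the job:
watch -n 60 "dig @188.8.131.52 -t txt _acme-challenge.wiki-pedia.org +short"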
Fri, Jan 24
Ok, that will let me plan to fold it in by March 1 then. Thanks for the update.
Thu, Jan 23
I don't know that we can do a rollback after there have already been some reverts, maybe? There's https://wikitech.wikimedia.org/wiki/Phabricator#Revert_all_activity_of_a_given_user (preferably with a db snapshot first, etc.), but it might not be worth it for a small number of changes like this.
Did the rest (I think), but someone ought to double-check that none were missed.
Wed, Jan 22
Tue, Jan 21
From chat on IRC: we're waiting on a release containing the upstream patch; then folks here will see if they can tweak the local hadoop dependencies to pull in that version. In the meantime, since this is a rare issue, a script that re-converts the file to bz2 in the event of an import failure can work around it. More updates here as things happen upstream.
Thanks for the heads up. This is probably a side effect of using lbzip2 as the compressor for these files. I'll be monitoring the progress of the upstream bug. In the meantime, might you be able to use bzip2 to decompress and recompress the problem file(s) so that you can get your import into hadoop done?
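A minimal sketch of that workaround, with a hypothetical file name (bzcat reads all of the concatenated streams, and plain bzip2 then writes a single stream):
bzcat problem-file.xml.bz2 | bzip2 > problem-file-singlestream.xml.bz2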
Mon, Jan 20
The run completed early this morning or late last night.
Jan 20 2020
Adding @Bstorm for the labstore boxes, which is where these files will land when published.
Jan 16 2020
@dcausse is going to check over the ttl dump and let me know if it looks ok; if so, I'll flip the switch for weekly generation and make sure there's cleanup too.
In https://dumps.wikimedia.org/other/wikibase/commonswiki/ there are two ttl files, gz and bz2 compressed. Please have a look!
@Benjavalero I think you are using BZip2CompressorInputStream in your code? You must tell it to decompress multiple concatenated streams if there are any. See: https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html Let me know if this works!
@Benjavalero Thanks for testing! I think we can handwave about the python2 script, since Python 2 is officially EOL. The Java tool concerns me, however; can you give me a link to the tool, or even better, to its source? And please also let me know the exact command you run, with flags. I'll try to duplicate it here and see what's up. Thanks!
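For testing, a tiny multistream file can be built by hand; it's just two complete bz2 streams concatenated, the same shape lbzip2 output can have:
printf 'first stream\n' | bzip2 > multi.bz2
printf 'second stream\n' | bzip2 >> multi.bz2
bzcat multi.bz2
A correct reader prints both lines; one that handles only a single stream stops after the first.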
Wikidata 7z files up through part of part27, and the associated hash files, are done, and I'm producing more with a manual run a couple of times a day. We should be in good shape to finish the run in time.
I found a ticket that mentions use of ttl files, so I'll run
/usr/local/bin/dumpwikibaserdf.sh commons full ttl
and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!
Jan 14 2020
I've adjusted the script to parallelize checking the containers, and adjusted the bash script to invoke it with 4 workers. The workers coordinate via an output lock so that only one writes at a time, each with a reasonable per-thread buffer. It seems to work reasonably well.
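The gist of the coordination, sketched in shell with flock (check_containers is a made-up stand-in for the real per-worker check, which lives in the script itself):
run_worker() {
    local shard="$1" buf
    # buffer this worker's results, then take the lock only for the write
    buf="$(check_containers --shard "$shard")"
    {
        flock 9    # exclusive lock on fd 9: only one writer at a time
        printf '%s\n' "$buf" >> containers-report.txt
    } 9>/tmp/containers-report.lock
}
for shard in 0 1 2 3; do run_worker "$shard" & done
wait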
I've manually pushed https://gerrit.wikimedia.org/r/#/c/operations/dumps/+/562828/ to the version of the dumps repo in use by the current job, on snapshot1006 where the wikidata run is continuing. This will ensure that the 7z job will skip any 7z files generated in the meantime instead of cleaning them all up first.
Jan 13 2020
I plan to try running
/usr/local/bin/dumpwikibaserdf.sh commons full nt
on Thursday morning and see how long it takes with the 8 shards that are currently configured. @Abit, is the nt format the one needed for WDQS testing?
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1 2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4 2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz
and it also ran fine.
Note to self that a run of
This morning the job was terminated by the oom killer:
Since this task is nominally about the wikidata run abort, I'll put the catchup measures for that run here too. I'm starting wikidata page content 7z recompression runs in a screen session on snapshot1005:
bash fixup_scripts/do_7z_jobs.sh --config /etc/dumps/confs/wikidump.conf.dumps:wd --jobinfo 1,2,3,4,5,6,7,8 --date 20200101 --numjobs 20 --skiplock --wiki wikidatawiki
for various jobinfo values.
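That is, repeated invocations along these lines, where the second batch of part numbers is only illustrative (the real values depend on which parts still need recompression):
for jobs in 1,2,3,4,5,6,7,8 9,10,11,12,13,14,15,16; do
    bash fixup_scripts/do_7z_jobs.sh --config /etc/dumps/confs/wikidump.conf.dumps:wd --jobinfo "$jobs" --date 20200101 --numjobs 20 --skiplock --wiki wikidatawiki
done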
Jan 10 2020
A batch size of 50k turned out to be too large; same with 5k. I'm now running with a batch size of 500, which will surely be too small, but at least I'm getting output. I'll check on it tomorrow and see how it's doing.
Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:
Jan 9 2020
Redid page history content, 7zs, and the noop for eswiki with the new code; it looks ok. I want to test the new code further and deploy it before closing this task, though.
Brilliant! I'll be doing some fun things tomorrow then. Thanks!
It would work for getting a test dump out, and yeah, I'll do a little test first. But for production I'd like to be able to not write them at all; there's no point to it.
@leila I still really want these to happen. As RESTBase moves towards being phased out, I'm trying to have the discussion about access to its replacement and how we might keep bulk access for dumps in mind. But it's going to need a lot of thought yet.
@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.
Jan 8 2020
Fail, this was a part of the code path that apparently never got exercised. Heh. Have a patch to the patch but it's too late now for even testing. Tomorrow.
I have a fix that looks like it might work; going to try it on the missing eswiki page history content files now. They'll finish up sometime overnight and I'll check them tomorrow.
Still looking into the source of the weird arguments to writeuptopageid that cause the problem. More updates tomorrow.
Well, that's not true. The current bz2 and 7z files for some of the page history content are wrong. I don't know what went wrong, so I will toss them all, plus the temp stubs, and rerun them.
I am now running a noop job on eswiki, so it should be good shortly. Just need to make sure no 'extra' files are copied to dumpsdata1003 or to labstore1006,7.
I've started the 7z job in a screen session on snapshot1005, since there's available cores and no new dumps that will take those resources.
Huh. Well, seeing as this is merged already, I guess this is done.
This is already tested and works fine. I'll merge it when we want to run the eswiki 7z step, as needed for T242209.
The missing bz2 content files are being generated now.
I'd like to move forward with switching to multiple stream files in all cases. Have sent another email to the list to see if we get any more interest or any pushback.
eswiki produced an exception during its run, and left around duplicate and truncated temporary stub files. I cleaned up the following temp stubs:
-rw-r--r-- 1 dumpsgen dumpsgen 6302 Jan 7 04:53 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p712340p768034.gz
-rw-r--r-- 1 dumpsgen dumpsgen 108155079 Jan 6 04:48 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p712340p768089.gz
-rw-r--r-- 1 dumpsgen dumpsgen 907 Jan 7 04:54 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p768035p838115.gz
-rw-r--r-- 1 dumpsgen dumpsgen 107605336 Jan 6 04:49 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p768090p838203.gz
-rw-r--r-- 1 dumpsgen dumpsgen 895 Jan 7 04:55 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p838116p908010.gz
-rw-r--r-- 1 dumpsgen dumpsgen 107743034 Jan 6 04:49 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p838204p908213.gz
-rw-r--r-- 1 dumpsgen dumpsgen 5092 Jan 7 04:56 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p908011p986575.gz
-rw-r--r-- 1 dumpsgen dumpsgen 106798396 Jan 6 04:50 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p908214p986877.gz
-rw-r--r-- 1 dumpsgen dumpsgen 2275 Jan 7 04:56 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p986576p986877.gz
-rw-r--r-- 1 dumpsgen dumpsgen 108538927 Jan 6 04:51 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p986878p1063682.gz
Yesterday evening I checked dewiki output files and temp stub files but saw no anomalies, so removing the temp output files looks like it was sufficient for that run.
Here's the listing before removal in case we need to revisit the issue:
-rw-r--r-- 1 dumpsgen dumpsgen 1843064429 Jan 6 19:35 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p796847p841103.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 2554761161 Jan 7 07:49 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p841104p886901.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 80971717 Jan 6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p841104p886912.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1350372220 Jan 7 07:22 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p886902p941391.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 10420343 Jan 6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p886913p941293.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 107818682 Jan 6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p941294p992526.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1301482595 Jan 7 07:17 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p941392p992651.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 84718874 Jan 6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p992527p1044968.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 47810144 Jan 6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p1044969p1095940.bz2.inprog
Jan 7 2020
Might I be able to get this by Jan 25? That would allow me to do setup and have it ready to go by Feb 1st.
Jan 3 2020
A couple questions as I read through the patch:
Dec 30 2019
Remediation measures have been applied; lowering the priority of this task for the moment, but not closing it yet.
Not yet resolved. Re-opening for now.