ArielGlenn (ariel)
User

User Details

User Since
Oct 8 2014, 7:09 PM (228 w, 1 d)
Availability
Available
IRC Nick
apergos
LDAP User
ArielGlenn
MediaWiki User
ArielGlenn [ Global Accounts ]

Recent Activity

Yesterday

ArielGlenn added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

@Melderick Restarts are done automatically by the script running the specific dump, up to a certain number of times. That is independent of the starting date. If we have one wrapper script run them one after another, the same will still hold.

Thu, Feb 21, 8:23 AM · Analytics, Dumps-Generation, Wikidata

Sat, Feb 16

ArielGlenn committed R1891:06df9b86c131: showcrcs: util to write out crc information from a bzip2 file (authored by ArielGlenn).
Sat, Feb 16, 5:46 PM
ArielGlenn moved T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Sat, Feb 16, 4:34 PM · Analytics, Dumps-Generation, Wikidata
ArielGlenn added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

I've sent mail to wikitech-l, xmldatadumps-l and research-internal (not sure if that last arrived, I may not be subscribed and tbh I am full up on subscriptions). See https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-February/001455.html

Sat, Feb 16, 4:34 PM · Analytics, Dumps-Generation, Wikidata
ArielGlenn added a comment to T216009: See if we can recombine ordinary page content bz2 files by cleverly recalculating the file crc etc, after removing some blocks.

The last iteration of this utility is both fast and seemingly accurate at generating the file crc from the block crcs. The code is still a bit rough but it does the trick. Next up: a utility that takes a sequence of filespecs plus block ranges to extract and merge into a new file; this will be a general bz2 file rewriter with no decompression.

Sat, Feb 16, 4:28 PM · Patch-For-Review, Dumps-Generation
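
The comment above mentions generating the file crc from the block crcs. As a rough sketch of that idea: the reference bzip2 implementation folds each block CRC into a running stream CRC by rotating the running value left by one bit and XORing in the block CRC. The function name and sample values below are illustrative only and are not taken from the showcrcs utility; the block CRCs are assumed to have been extracted already.

```python
def combine_block_crcs(block_crcs):
    """Fold 32-bit bzip2 block CRCs into the stream's combined CRC.

    Mirrors the update used in the reference bzip2 code:
    rotate the running value left one bit, then XOR in the block CRC.
    """
    combined = 0
    for crc in block_crcs:
        # 32-bit rotate left by one bit
        combined = ((combined << 1) | (combined >> 31)) & 0xFFFFFFFF
        combined ^= crc & 0xFFFFFFFF
    return combined


if __name__ == "__main__":
    # Hypothetical block CRC values, for illustration only.
    print(hex(combine_block_crcs([0x12345678, 0x9ABCDEF0])))
```

Because the combined value depends only on the ordered list of block CRCs, dropping or appending whole blocks would only require re-running this fold over the new sequence, with no decompression.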
ArielGlenn moved T215414: Index formating issue in multistream-index file from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Sat, Feb 16, 4:26 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

Merged and deployed, with a commit message that I forgot to update, though I double-checked everything else (of course). No longer a work in progress and has been tested with large files. We'll see how everything goes on the run of the 20th.

Sat, Feb 16, 4:26 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

The last fixup jobs are complete, and all updated files should be available for download along with corrected checksums later today, if they are not already.

Sat, Feb 16, 4:04 PM · Patch-For-Review, Dumps-Generation

Fri, Feb 15

ArielGlenn committed R1891:cd9fd15b57cf: showcrcs: util to write out crc information from a bzip2 file (authored by ArielGlenn).
Fri, Feb 15, 10:57 PM
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

Wikidatawiki's run is complete. I am now running the no-op job for that; the enwiki noop job is still running but should complete in some hours.

Fri, Feb 15, 5:09 PM · Patch-For-Review, Dumps-Generation
ArielGlenn committed R1891:60e58de5166a: showcrcs: util to write out crc information from a bzip2 file (authored by ArielGlenn).
Fri, Feb 15, 10:45 AM
ArielGlenn added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

The directories and any links or status files would be as they are now, but the date could, I presume, be passed into the script. Again, we need to see what existing users need and expect, though.

Fri, Feb 15, 9:50 AM · Analytics, Dumps-Generation, Wikidata
ArielGlenn added a comment to T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

What I'd prefer to do, if we are changing things around, is to do one right after another, so: all, truthy, lexemes back to back, and decide on a starting date for the first one. This ensures we aren't running more than one type at once (control over resource use on the server) and that there's no dead time between runs (so server maintenance etc can be scheduled easier).

Fri, Feb 15, 9:43 AM · Analytics, Dumps-Generation, Wikidata
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

Status and sha1/md5sum files are being regenerated for enwiki now and should be available for download later in the day. Wikidata's dump run should complete later today and then these files can be rebuilt for it as well.

Fri, Feb 15, 8:22 AM · Patch-For-Review, Dumps-Generation

Thu, Feb 14

ArielGlenn updated subscribers of T216160: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday.

Adding @hoo and @Smalyshev because they may have an idea of the needs of current users of these entity dumps.

Thu, Feb 14, 5:45 PM · Analytics, Dumps-Generation, Wikidata
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

New files have been generated for all of the above wikis; new sha1/md5 sums and status files are being generated for all but enwiki and wikidatawiki right now, and should be available for download sometime tomorrow.

Thu, Feb 14, 5:38 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T216067: Recover from corrupted beta MySQL slave (deployment-db04).

Not I. Anyone else?

Thu, Feb 14, 3:26 PM · Beta-Cluster-Infrastructure
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

https://gist.github.com/apergos/49c2812292c0ec26ccedc8bc9d5d69e5 is a standalone script which does what the gerrit patch does but regenerates just the combined index file, to an arbitrary location as opposed to the live dump directory.

Thu, Feb 14, 3:00 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T216067: Recover from corrupted beta MySQL slave (deployment-db04).

Correct, it needed python3, and the directory exists; with python3 the error is different, and it's likely the error jcrespo mentioned earlier.

Thu, Feb 14, 1:46 PM · Beta-Cluster-Infrastructure
ArielGlenn added a comment to T216067: Recover from corrupted beta MySQL slave (deployment-db04).

When I go to the directory and do the import manually I get:

Thu, Feb 14, 1:04 PM · Beta-Cluster-Infrastructure
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

Only the so-called 'big wikis' are subject to this bug, since only for them do we generate smaller index files and recombine them. Of those, the only ones big enough to have the issue are:
en, wikidata, commons, de, fr, es, it, ja, ru wikis. Once there is a verified standalone script to just fix up the index file, I'll run it followed by removal of the sha1 and md5sums for the bad index files, and then a no-op on all of these wikis so that status files are regenerated. Wikidata will need to wait until it's completed the 7z step that it's currently in the middle of, since we can't regenerate status files for a wiki currently in the middle of a dump run.

Thu, Feb 14, 12:34 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T216067: Recover from corrupted beta MySQL slave (deployment-db04).

@Marostegui /srv/sqldata has 39G on it on db04, presumably that's pretty close to the amount of data on the master.

Thu, Feb 14, 7:23 AM · Beta-Cluster-Infrastructure

Wed, Feb 13

ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

This turns out to be an annoying limitation of (m)awk, that numbers are actually doubles. I'll have to write a tiny python script to do the index file merge.

Wed, Feb 13, 4:24 PM · Patch-For-Review, Dumps-Generation
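
A minimal sketch of what such a merge step might look like in Python, where integers have arbitrary precision. It assumes index lines of the form offset:pageid:title and assumes that merging means shifting each partial index's offsets by the size of the compressed data that precedes it; the function name is hypothetical and this is not the script from the ticket.

```python
def shift_index_lines(lines, base_offset):
    """Yield index lines with base_offset added to the leading offset field.

    Only the first ':' is used as a separator, so page titles that
    themselves contain colons are left untouched.
    """
    for line in lines:
        offset, rest = line.split(":", 1)
        yield f"{int(offset) + base_offset}:{rest}"


if __name__ == "__main__":
    # Doubles cannot represent every integer above 2**53:
    print(float(2**53 + 1) == float(2**53))  # True
    # Python ints stay exact however large the offset gets.
    part = ["100:12:Foo:Bar\n", "2048:13:Baz\n"]
    for out in shift_index_lines(part, 5_000_000_000):
        print(out, end="")
```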
ArielGlenn committed R1891:7704095ceaf8: showcrcs: util to write out crc information from a bzip2 file (authored by ArielGlenn).
Wed, Feb 13, 11:20 AM
ArielGlenn moved T216009: See if we can recombine ordinary page content bz2 files by cleverly recalculating the file crc etc, after removing some blocks from Backlog to Active on the Dumps-Generation board.
Wed, Feb 13, 10:48 AM · Patch-For-Review, Dumps-Generation
ArielGlenn triaged T216009: See if we can recombine ordinary page content bz2 files by cleverly recalculating the file crc etc, after removing some blocks as Normal priority.
Wed, Feb 13, 10:27 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T215414: Index formating issue in multistream-index file from Backlog to Active on the Dumps-Generation board.
Wed, Feb 13, 10:22 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T215414: Index formating issue in multistream-index file.

Hmm, that will be an artifact of the new recombine code. I'll have a look.

Wed, Feb 13, 10:22 AM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T215216: 2019-02-02 01:42:58 enwiki: dump aborted as Resolved.

This run is complete. Typically when you see that a run has been aborted, you simply need to wait a couple of days for the job to have been restarted and some output and status files to be produced and copied out to the public web servers. I have the advantage here because I was able to check the internal files at the time and see that everything was ok. Thanks for the report.

Wed, Feb 13, 10:22 AM · Dumps-Generation

Tue, Feb 12

ArielGlenn added a comment to T214984: PHP7's stricter JSON parsing breaks some wiki content.

Do we know which of those spaces in particular was the culprit, or if all of them were? Asking just out of curiosity.

Tue, Feb 12, 12:47 PM · Graphs, Maps (Kartographer), PHP 7.2 support

Fri, Feb 8

ArielGlenn added a comment to T214984: PHP7's stricter JSON parsing breaks some wiki content.

Ah note also that if you click through on the (empty) map in the wp article, you do see a properly rendered full screen map. Huh!

Fri, Feb 8, 7:36 PM · Graphs, Maps (Kartographer), PHP 7.2 support
ArielGlenn added a comment to T214984: PHP7's stricter JSON parsing breaks some wiki content.

After removal of that one trailing comma, I have done the sandbox trick and stuffed the wikidata id entry in where it should be (Q867944 in place of (if(?id = wd:blah...) and I get a nice little rendered map with two stroke colors and the whole thing. This was taking a copy of the OSM template and plugging some values directly into it (lat/long, frame width/height, zoom, plain=yes). So uh? I must be overlooking something obvious.

Fri, Feb 8, 7:34 PM · Graphs, Maps (Kartographer), PHP 7.2 support

Wed, Feb 6

ArielGlenn added a comment to T214984: PHP7's stricter JSON parsing breaks some wiki content.

Poking at this carefully: afaict the wikidata query that gets run is https://query.wikidata.org/#SELECT%20%3Fid%20%3Flength%0A%20%20%28if%28%3Fid%20%3D%20wd%3AQ867944%2C%20%27%23C12838%27%2C%20%27%2307c63e%27%29%20as%20%3Fstroke%29%0A%20%20%28concat%28%27Line%20length%3A%20%27%2C%20str%28%3Flength%29%2C%20%27%20km%27%29%20as%20%3Fdescription%29%0A%20%20%28if%28BOUND%28%3Flink%29%2C%0A%20%20%20%20%20%20concat%28%27%5B%5B%27%2C%20substr%28str%28%3Flink%29%2C31%2C500%29%2C%20%27%7C%27%2C%20%3FidLabel%2C%20%27%5D%5D%27%29%2C%0A%20%20%20%20%20%20%3FidLabel%29%0A%20%20%20as%20%3Ftitle%29%0AWHERE%20%7B%20%20%0A%20%20%7B%3Fid%20wdt%3AP16%20wd%3AQ5228578.%7D%0A%20%20OPTIONAL%20%7B%20%3Fid%20wdt%3AP2043%20%3Flength.%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%27en%27%20.%0A%20%20%20%20%3Fid%20rdfs%3Alabel%20%3FidLabel%20.%0A%20%20%7D%0A%20%20OPTIONAL%20%7B%3Flink%20schema%3Aabout%20%3Fid.%0A%20%20%3Flink%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%7D%0A%7D%20GROUP%20BY%20%3Fid%20%3Flink%20%3FidLabel%20%3Flength%0A
I got this by plugging in Q5228578 for the whole if clause with highway_system_qid and {{wikidata|property|raw|P16}}, Q867944 for {{wikidata|label|raw}} (I grabbed this value by evaluating the template on a preview copy of the Dōtō Expressway article), and by replacing {{!}} with the pipe symbol. But the result gives a list of all the expressways in Japan, not just the one we want. Did I get the query right? Surely this is not what's desired.

Wed, Feb 6, 7:08 PM · Graphs, Maps (Kartographer), PHP 7.2 support
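
The query URL in the comment above is hard to read in its percent-encoded form. A small sketch for decoding the part after the # back into SPARQL, using only the standard library; the function name is made up for illustration, and only the first two lines of the query are reproduced here.

```python
from urllib.parse import unquote


def decode_query_fragment(url):
    """Return the percent-decoded text after the first '#' in url."""
    _, _, fragment = url.partition("#")
    return unquote(fragment)


# Opening portion of the URL quoted in the comment.
encoded = (
    "https://query.wikidata.org/#SELECT%20%3Fid%20%3Flength%0A"
    "%20%20%28if%28%3Fid%20%3D%20wd%3AQ867944%2C%20%27%23C12838%27%2C"
    "%20%27%2307c63e%27%29%20as%20%3Fstroke%29"
)

if __name__ == "__main__":
    print(decode_query_fragment(encoded))
```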
ArielGlenn moved T213200: refactor recompressxml and writeuptopageid from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Wed, Feb 6, 4:56 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T213200: refactor recompressxml and writeuptopageid as Resolved.

Everything sure looks good. Closing.

Wed, Feb 6, 4:56 PM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T213912: Produce multistream dumps in parallel from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Wed, Feb 6, 4:55 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T213912: Produce multistream dumps in parallel as Resolved.

Spot checked the content of combined files, also checked for links on the web page, everything looks good. Closing.

Wed, Feb 6, 4:55 PM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T204531: Wikidata dumps creating large amounts of log spam from Backlog to Done on the Dumps-Generation board.
Wed, Feb 6, 1:40 PM · MW-1.32-notes, MW-1.30-release-notes, MW-1.31-release-notes, Performance-Team, Datacenter-Switchover-2018, MediaWiki-Logging, Wikidata, Dumps-Generation, Wikimedia-production-error
ArielGlenn moved T182572: cat gzipped files together instead of uncompressing during recombines from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Wed, Feb 6, 1:40 PM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T212349: Make miscdumplib and its callers write to log file from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Wed, Feb 6, 1:40 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T182572: cat gzipped files together instead of uncompressing during recombines as Resolved.

These files look good; I spot-checked a few for content. Closing this task.

Wed, Feb 6, 1:39 PM · Patch-For-Review, Dumps-Generation

Tue, Jan 29

ArielGlenn moved T214293: See why wikidata xml/sql dumps pages-meta-history is so much slower than enwiki from Backlog to Active on the Dumps-Generation board.
Tue, Jan 29, 4:25 PM · Performance, Wikidata, Dumps-Generation

Mon, Jan 28

ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

Wikidata's web page is now updated. This ticket will remain open until we see multistream files generated and recombined *and published properly* for the next run.

Mon, Jan 28, 2:50 PM · Patch-For-Review, Dumps-Generation

Sun, Jan 27

ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

The wikidata fix is now running and should complete sometime later today.

Sun, Jan 27, 2:52 PM · Patch-For-Review, Dumps-Generation

Thu, Jan 24

ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

commonswiki and enwiki fixups are running now; only wikidata left, which will be taken care of at the end of its current run.

Thu, Jan 24, 10:35 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

The manual reruns are complete except for commons and wikidata.

Thu, Jan 24, 6:37 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

The fix has been deployed and I am manually running updates for all big wikis now except for commons and wikidata; these will need to wait until their runs complete.

Thu, Jan 24, 2:57 PM · Patch-For-Review, Dumps-Generation

Wed, Jan 23

ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

The html links for the recombined multistream files for the big wikis are wrong. Woops! I'll fix these manually once the runs for these are complete, and get a fix to the code in before the next run.

Wed, Jan 23, 7:01 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T214293: See why wikidata xml/sql dumps pages-meta-history is so much slower than enwiki .

I should point out that in the original files enwiki-20190101-pages-meta-history10.xml-p2534537p2554779.bz2 and wikidatawiki-20190101-pages-meta-history27.xml-p56428595p56649675.bz2, the revision counts were comparable: 1500850 vs 1484162. The line counts of the uncompressed content were not: 693941886 vs 28694664 lines. That's due to the very particular structure of a wikidata entry as stored, and it seems that structure is one of the worst cases for the standard bzip2 implementation. Anyways, more testing soon!

Wed, Jan 23, 4:26 PM · Performance, Wikidata, Dumps-Generation
ArielGlenn added a comment to T214293: See why wikidata xml/sql dumps pages-meta-history is so much slower than enwiki .

It does indeed look like the specific compression implementation.

Wed, Jan 23, 4:20 PM · Performance, Wikidata, Dumps-Generation

Jan 21 2019

ArielGlenn added a comment to T214293: See why wikidata xml/sql dumps pages-meta-history is so much slower than enwiki .

I don't see a big difference in prefetch percentages for the two wikis. It's worth checking if the compression itself is slower for wikidata revisions due to their content structure.

Jan 21 2019, 12:51 PM · Performance, Wikidata, Dumps-Generation
ArielGlenn triaged T214293: See why wikidata xml/sql dumps pages-meta-history is so much slower than enwiki as High priority.
Jan 21 2019, 12:47 PM · Performance, Wikidata, Dumps-Generation
ArielGlenn added a comment to T213200: refactor recompressxml and writeuptopageid.

This has been deployed and is in effect for the xml/sql dumps run that started yesterday. Will wait for the end of the run and check results before closing the ticket.

Jan 21 2019, 11:53 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T182572: cat gzipped files together instead of uncompressing during recombines from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Jan 21 2019, 11:52 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T213912: Produce multistream dumps in parallel from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Jan 21 2019, 11:52 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

Watching these til the end of the current run. If all looks good, the ticket can be closed then.

Jan 21 2019, 11:52 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

Watching these til the end of the current run. If all looks good, the ticket can be closed then.

Jan 21 2019, 11:51 AM · Patch-For-Review, Dumps-Generation

Jan 19 2019

ArielGlenn added a comment to T213912: Produce multistream dumps in parallel.

Installed updated mwbzutils and deployed all code. This will be in effect for the next xml/sql dumps run, which starts tomorrow.

Jan 19 2019, 10:14 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

Installed updated mwbzutils and deployed all code. This will be in effect for the next xml/sql dumps run, which starts tomorrow.

Jan 19 2019, 10:13 PM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T213912: Produce multistream dumps in parallel from Backlog to Active on the Dumps-Generation board.
Jan 19 2019, 10:13 PM · Patch-For-Review, Dumps-Generation
ArielGlenn committed R1891:094958f86c36: version 0.0.9 (authored by ArielGlenn).
Jan 19 2019, 8:44 PM
ArielGlenn committed R1891:e733b345fbe1: options for writeuptopageid to skip writing header or footer (authored by ArielGlenn).
Jan 19 2019, 8:44 PM
ArielGlenn committed R1891:5c2a2df6b37f: option to skip siteinfo header, mw footer for recompressing files (authored by ArielGlenn).
Jan 19 2019, 8:44 PM

Jan 18 2019

ArielGlenn committed R1891:037b5a7b6a5a: fix up iohandlers to write separate streams for header and footer again (authored by ArielGlenn).
Jan 18 2019, 8:35 PM

Jan 16 2019

ArielGlenn triaged T213912: Produce multistream dumps in parallel as High priority.
Jan 16 2019, 12:27 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

This is looking good; it is ready to go once the current dump run ends. Wikidata is finishing up the last pages-meta-history file; then there will be the 7z files and it will be complete.

Jan 16 2019, 10:28 AM · Patch-For-Review, Dumps-Generation

Jan 14 2019

ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

At a blocksize of 256k we get 19 minutes for a dd of a 51GB file over NFS, as opposed to the 3 hours it takes to gunzip/gzip files for the stubs recombine step. That's good enough to move forward. Larger blocksizes show no appreciable gain.

Jan 14 2019, 4:42 PM · Patch-For-Review, Dumps-Generation
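
For illustration, a Python analogue of a dd-style copy of a byte range using 256k blocks, the block size quoted in the timing comment above. This is a sketch under assumptions, not the dumps code: the function name, the in-memory test data and the return value are all invented here.

```python
import io

BLOCK_SIZE = 256 * 1024  # 256k, the block size from the timing test


def copy_range(src, dst, start, length, block_size=BLOCK_SIZE):
    """Copy up to length bytes from src (starting at start) to dst.

    Reads in fixed-size blocks so memory use stays bounded regardless
    of file size. Returns the number of bytes actually copied.
    """
    src.seek(start)
    remaining = length
    while remaining > 0:
        chunk = src.read(min(block_size, remaining))
        if not chunk:
            break  # source ended before length bytes were read
        dst.write(chunk)
        remaining -= len(chunk)
    return length - remaining


if __name__ == "__main__":
    source = io.BytesIO(bytes(range(256)) * 4096)  # 1 MiB of sample data
    target = io.BytesIO()
    print(copy_range(source, target, 10, 600_000))
```

The same function works on real files opened in binary mode, since it only relies on seek, read and write.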
ArielGlenn added a comment to T182351: Make HTML dumps available.

These are full html of the pages or 'just' of the parsed/rendered wikitext, or...? And, is there a notion of what the code looks like or what components are involved, so we can know if it can be folded into our infrastructure? Note there is draft code already for parsed/rendered wikitext, as pulled from Restbase.

Jan 14 2019, 1:18 PM · Datasets-Archiving, Analytics, Research
ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

Doing dd over nfs timing tests now to make sure we gain a reasonable amount of time back.

Jan 14 2019, 12:17 PM · Patch-For-Review, Dumps-Generation

Jan 11 2019

ArielGlenn added a comment to T213405: zhwiki pages-meta-history bz2 dump hangs.

Run is completed. I am still trying to create smaller input files that reproduce the problem, no luck yet. In the meantime, the revision that breaks everything is this one: https://zh.wikipedia.org/w/index.php?title=Template:X2&oldid=35165097 (WARNING, this is a 12 megabyte text revision so it may break your browser!)

Jan 11 2019, 11:50 AM · Chinese-Sites, Dumps-Generation

Jan 10 2019

ArielGlenn added a comment to T213405: zhwiki pages-meta-history bz2 dump hangs.

New bz2 file has been copied into place and current zhwiki processes shot; the scheduler should pick it up later today or tomorrow and complete the run.

Jan 10 2019, 6:37 PM · Chinese-Sites, Dumps-Generation
ArielGlenn added a comment to T213405: zhwiki pages-meta-history bz2 dump hangs.

Uncompressed stubs and output file don't help; uncompressed prefetch files seem to make a difference. The output file is now about halfway complete and much farther on than the two attempts that hung. I'll go ahead and let this complete, compress the file and move it into place so that the regular run can continue on, while I continue looking into the cause of the problem.

Jan 10 2019, 2:03 PM · Chinese-Sites, Dumps-Generation
ArielGlenn added a comment to T213405: zhwiki pages-meta-history bz2 dump hangs.

Right now I'm checking to see if compression/decompression of any of the files plays a role; I've also noted which revision of which page is the last to be written. The following revision, at least in the stub file, sure looks harmless enough.

Jan 10 2019, 11:53 AM · Chinese-Sites, Dumps-Generation
ArielGlenn moved T213200: refactor recompressxml and writeuptopageid from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Jan 10 2019, 10:40 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T213405: zhwiki pages-meta-history bz2 dump hangs from Backlog to Active on the Dumps-Generation board.
Jan 10 2019, 10:40 AM · Chinese-Sites, Dumps-Generation
ArielGlenn triaged T213405: zhwiki pages-meta-history bz2 dump hangs as High priority.
Jan 10 2019, 10:37 AM · Chinese-Sites, Dumps-Generation
ArielGlenn added a comment to T213200: refactor recompressxml and writeuptopageid.

Results of timing tests:

$ time (bzcat /mnt/dumpsdata/xmldatadumps/public/enwiki/20181201/enwiki-20181201-pages-articles.xml.bz2 | ./recompressxml_prod --pagesperstream 100 --buildindex prod_index.bz2 > prod_pages.bz2)
Jan 10 2019, 10:30 AM · Patch-For-Review, Dumps-Generation

Jan 9 2019

ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

https://github.com/apergos/misc-wmf-crap/commit/b973e9a1b6d8d73d63e933ffb362c02b25b9db2c Here's a link to that test code; the real thing will of course be added to the recombinejobs class and live in gerrit with the rest of the dumps scripts.

Jan 9 2019, 10:21 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T182572: cat gzipped files together instead of uncompressing during recombines.

The right way to do this seems to be:

  • for each numbered stub output file for big wikis, gzip the xml header and write to output file, gzip the body as it gets appended to the output file, and gzip the footer and cat that on the end.
  • for the combined file, look at each numbered stub output file in turn, looking for gzip file headers; note the body and footer offsets for each; then with that info cat the header from first file, body from all files, footer from last file.
Jan 9 2019, 12:53 PM · Patch-For-Review, Dumps-Generation
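
The two steps above can be sketched in Python with the standard gzip module. This relies on the fact that concatenated gzip members form a valid gzip file. As a simplification, the sketch records the body and footer offsets when each part is written rather than scanning for gzip headers afterwards; the function names and the sample XML are hypothetical and this is not the recombinejobs code.

```python
import gzip


def build_part(header, body, footer):
    """Compress header, body and footer as three separate gzip members.

    Returns (data, body_start, footer_start), where the offsets mark
    where the body and footer members begin in the compressed data.
    """
    members = [gzip.compress(chunk) for chunk in (header, body, footer)]
    body_start = len(members[0])
    footer_start = body_start + len(members[1])
    return b"".join(members), body_start, footer_start


def recombine(parts):
    """Header from the first part, body from every part, footer from the last."""
    first_data, first_body_start, _ = parts[0]
    last_data, _, last_footer_start = parts[-1]
    pieces = [first_data[:first_body_start]]
    for data, body_start, footer_start in parts:
        pieces.append(data[body_start:footer_start])
    pieces.append(last_data[last_footer_start:])
    return b"".join(pieces)


if __name__ == "__main__":
    head, foot = b"<mediawiki>\n", b"</mediawiki>\n"
    parts = [
        build_part(head, b"<page>1</page>\n", foot),
        build_part(head, b"<page>2</page>\n", foot),
    ]
    print(gzip.decompress(recombine(parts)).decode())
```

No decompression happens during the recombine itself; the final decompress call is only there to check the result.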
ArielGlenn added a comment to T213200: refactor recompressxml and writeuptopageid.

Timing tests are still ongoing. If those pan out, packages are ready to go on install1002 in my home directory. We need to wait until the current run is complete, merge and deploy everything, push out the package to the repo, and then update the package on all snapshots.

Jan 9 2019, 9:49 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T213200: refactor recompressxml and writeuptopageid from Backlog to Active on the Dumps-Generation board.
Jan 9 2019, 9:38 AM · Patch-For-Review, Dumps-Generation

Jan 8 2019

ArielGlenn committed R1891:b9ec06a9f8d1: version 0.0.9 (authored by ArielGlenn).
Jan 8 2019, 9:02 PM
ArielGlenn committed R1891:c679cff5e2ba: option to skip siteinfo header, mw footer for recompresing files (authored by ArielGlenn).
Jan 8 2019, 9:02 PM
ArielGlenn committed R1891:1895be25852c: options for writeuptopageid to skip writing header or footer (authored by ArielGlenn).
Jan 8 2019, 9:02 PM
ArielGlenn committed R1891:5d01fd9e1032: use iohandlers for recompressxml input and output (authored by ArielGlenn).
Jan 8 2019, 9:02 PM
ArielGlenn committed R1891:d026f51b2a02: move iohandler code for compression/decompression out to a separate file (authored by ArielGlenn).
Jan 8 2019, 9:02 PM
ArielGlenn added a comment to T213200: refactor recompressxml and writeuptopageid.

While these changes have all been thoroughly tested to make sure they work as advertised, I need to do timing tests yet, to make sure they don't slow down the appropriate dump steps appreciably.

Jan 8 2019, 6:16 PM · Patch-For-Review, Dumps-Generation
ArielGlenn triaged T213200: refactor recompressxml and writeuptopageid as Normal priority.
Jan 8 2019, 6:07 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T212349: Make miscdumplib and its callers write to log file as Resolved.

Run back to normal output levels, closing.

Jan 8 2019, 1:10 PM · Patch-For-Review, Dumps-Generation

Jan 7 2019

ArielGlenn moved T212349: Make miscdumplib and its callers write to log file from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.

Tests looked good but I'll leave it open for a day so we can doublecheck that the cronspam only contains errors and that the run otherwise looks good.

Jan 7 2019, 2:04 PM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T209006: Missing wikidatawiki-20181101-pages-articles.xml.bz2 in md5/sha1sums.txt from Backlog to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Wikidata, Dumps-Generation
ArielGlenn moved T212462: Truncated XML from Backlog to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T211039: fiwiki-20181201-pages-articles.xml.bz2 doesn't have corresponding md5/sha1 entries from Backlog to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Dumps-Generation
ArielGlenn moved T210623: CirrusSearch dumps for the Norwegian Bokmål Wikipedia link to the Italian Wikipedia from Backlog to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Discovery-Search (Current work), Dumps-Generation, CirrusSearch
ArielGlenn moved T209362: Russian wiki dump misses articles categories from Active to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Dumps-Generation
ArielGlenn moved T207030: wikidata rdf dumps cron job complaining for lexemes phase from Active to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Patch-For-Review, Dumps-Generation, Wikidata
ArielGlenn moved T179059: Consider skipping or modifying recombine step for page content dumps for wikidata from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Jan 7 2019, 11:50 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T29112: Select of revisions for stub history files does not explicitly order revisions from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Jan 7 2019, 11:49 AM · Dumps-Generation, User-ArielGlenn, MW-1.28-release-notes, MW-1.28-release (WMF-deploy-2016-06-21_(1.28.0-wmf.7)), Patch-For-Review, DBA, Datasets-General-or-Unknown
ArielGlenn moved T203382: clean up fulldumps cron job on snapshot1005 from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Jan 7 2019, 11:49 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T199204: Check for slow meta-history runs for small wikis and see about speedups from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Jan 7 2019, 11:49 AM · Dumps-Generation