
ArielGlenn (ariel)
User

User Details

User Since
Oct 8 2014, 7:09 PM (279 w, 6 d)
Availability
Available
IRC Nick
apergos
LDAP User
ArielGlenn
MediaWiki User
ArielGlenn [ Global Accounts ]

Recent Activity

Today

ArielGlenn added a comment to T236431: Data dumps for the MachineVision extension.

Shall I add these as a weekly run?

Wed, Feb 19, 7:33 AM · SDC-Statements (Machine-vision-depicts), Structured-Data-Backlog, Dumps-Generation, MachineVision

Yesterday

ArielGlenn closed T245193: Wikidata aborted page-meta-history jobs after db1087 depooled as Resolved.

The run is now complete. I'll open a separate task later for the mw maintenance script behavior.

Tue, Feb 18, 12:14 PM · Dumps-Generation
ArielGlenn added a comment to T245193: Wikidata aborted page-meta-history jobs after db1087 depooled.

The following files, as well as the associated 7z files, have been generated.

/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p39078190p39156572.bz2
/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p39022925p39078189.bz2
/mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20200201/wikidatawiki-20200201-pages-meta-history27.xml-p38438419p38499657.bz2

I am running a noop job manually in a screen session on snapshot1005 to update hash sums, status files, HTML files and 'latest' links. Once that is done, the dump run will be complete.

Tue, Feb 18, 9:45 AM · Dumps-Generation

Mon, Feb 17

ArielGlenn added a comment to T245193: Wikidata aborted page-meta-history jobs after db1087 depooled.

7z's are being produced now, but those three bz2 files are still missing. I'll copy them in at the end of the run, likely late today. Then I'll manually generate the 7zs and do a no-op job to update hashes and status files.

Mon, Feb 17, 6:48 AM · Dumps-Generation

Fri, Feb 14

ArielGlenn moved T245193: Wikidata aborted page-meta-history jobs after db1087 depooled from Backlog to Active on the Dumps-Generation board.
Fri, Feb 14, 7:57 AM · Dumps-Generation

Thu, Feb 13

ArielGlenn triaged T245193: Wikidata aborted page-meta-history jobs after db1087 depooled as Medium priority.
Thu, Feb 13, 7:32 PM · Dumps-Generation

Wed, Feb 12

ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

@Cparle, no blocks on your side; the ball is now in @dcausse's court. :-)

Wed, Feb 12, 5:18 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn awarded T244132: Discuss adding grafana dashboard json to codesearch a 100 token.
Wed, Feb 12, 9:14 AM · User-Addshore, Graphite, VPS-project-codesearch
ArielGlenn added a comment to T238972: switch xml/sql (and adds-changes) dumps to use 0.11 schema with content from multiple slots.

This is pending https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/556346/ and related patches, so we're looking at March 1 if all goes well.

Wed, Feb 12, 6:36 AM · User-notice, Wikidata, Research, Dumps-Generation

Mon, Feb 10

ArielGlenn updated subscribers of T239866: Investigate use of bz2 decompression tools on multistream files.

I've asked @JAllemandou to check the hadoop import tools too.

Mon, Feb 10, 6:10 PM · Dumps-Generation

Fri, Feb 7

ArielGlenn updated subscribers of T244545: Add x-request-id to httpd (apache) logs.

Adding @Ottomata as a heads up that these log lines will have an additional element in them, in case that impacts analytics processing.

Fri, Feb 7, 8:50 AM · Operations, Traffic, serviceops

Thu, Feb 6

ArielGlenn added a comment to T243701: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service).

Over the past few weeks, we have noticed a huge increase in Wikidata content. Maybe that's something worth looking at?

Thu, Feb 6, 10:11 AM · Wikidata-Campsite, Traffic, Operations, Performance Issue, Discovery, Wikidata-Query-Service, Wikidata

Wed, Feb 5

ArielGlenn added a comment to T241149: rdfDump.php generates error messages when dumping for pages without mediainfo items.

Ah yes it is! The flag does all it needs to, sorry about that.

Wed, Feb 5, 4:20 PM · Structured-Data-Backlog (Current Work), Structured Data Engineering, WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T241794: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet.

How does the above ETA look, now that all-hands is done and you have a better idea of what's on your plate?

Wed, Feb 5, 9:20 AM · Patch-For-Review, ops-eqiad, Dumps-Generation, Operations

Fri, Jan 31

ArielGlenn added a comment to T226167: audit public tables and make sure we dump them all.

After a short IRC chat, the new estimate for the wbterms migration to complete is 3-4 weeks. I'll update this task around then.

Fri, Jan 31, 9:34 AM · Patch-For-Review, Dumps-Generation

Thu, Jan 30

ArielGlenn updated subscribers of T243948: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left.

I can see that the challenges get set on the DNS hosts a little past the hour (e.g. dig @208.80.154.238 -t txt _acme-challenge.wiki-pedia.org) and that appropriate responses come back for the TXT record.

Thu, Jan 30, 2:06 PM · Operations, Traffic

Fri, Jan 24

ArielGlenn added a comment to T241794: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet.

Ok, that will let me schedule to fold it in by March 1 then. Thanks for the update.

Fri, Jan 24, 8:16 PM · Patch-For-Review, ops-eqiad, Dumps-Generation, Operations

Thu, Jan 23

ArielGlenn added a comment to T243481: mass revert of phabricator vandalism by user Nafees791 needed.

I don't know whether we can do a rollback after there have already been some manual reverts. It might not be worth it for a small number of changes like this, though: https://wikitech.wikimedia.org/wiki/Phabricator#Revert_all_activity_of_a_given_user (take a db snapshot first, etc.)

Thu, Jan 23, 9:41 AM · Phabricator
ArielGlenn added a comment to T243481: mass revert of phabricator vandalism by user Nafees791 needed.

Did the rest (I think) but someone ought to double check that none were missed.

Thu, Jan 23, 9:30 AM · Phabricator
ArielGlenn renamed T241688: Work to reduce smarty cruft on the server from ruri to Work to reduce smarty cruft on the server.
Thu, Jan 23, 9:27 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Wikimedia-Fundraising-CiviCRM, Fundraising-Backlog, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T241688: Work to reduce smarty cruft on the server as "Open".
Thu, Jan 23, 9:26 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Wikimedia-Fundraising-CiviCRM, Fundraising-Backlog, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T243340: Normalize createPayment responses for Ingenico and Adyen PaymentProviders from ljgyb to Normalize createPayment responses for Ingenico and Adyen PaymentProviders.
Thu, Jan 23, 9:24 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Patch-For-Review, Fundraising-Backlog, FR-Adyen, Recurring-Donations, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T243340: Normalize createPayment responses for Ingenico and Adyen PaymentProviders, a subtask of T238101: EPIC recurring for Adyen, as Open.
Thu, Jan 23, 9:24 AM · Epic, Recurring-Donations, FR-Adyen, Fundraising-Backlog
ArielGlenn reopened T243340: Normalize createPayment responses for Ingenico and Adyen PaymentProviders as "Open".
Thu, Jan 23, 9:24 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Patch-For-Review, Fundraising-Backlog, FR-Adyen, Recurring-Donations, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T243098: 16 Multilingual Thank You Emails from ghj to 16 Multilingual Thank You Emails.
Thu, Jan 23, 9:21 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn changed the status of T243098: 16 Multilingual Thank You Emails from Stalled to Open.
Thu, Jan 23, 9:21 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T198733: Move queue message docs to mw.o, update from ih to Move queue message docs to mw.o, update.
Thu, Jan 23, 9:19 AM · Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T198733: Move queue message docs to mw.o, update, a subtask of T209872: epic: frtech onwiki documentation overhaul, as Open.
Thu, Jan 23, 9:19 AM · Epic, fundraising-tech-ops, Fundraising-Backlog
ArielGlenn reopened T198733: Move queue message docs to mw.o, update as "Open".
Thu, Jan 23, 9:19 AM · Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T232636: Restore target smart data lost on legacy merge screen from fvhs to Restore target smart data lost on legacy merge screen.
Thu, Jan 23, 9:17 AM · Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fundraising Sprint Trojan Horse Wisperer, Fundraising Sprint Usual Subscripts, Fundraising Sprint Visual Basic Instinct, Fundraising Sprint A Wrinkle in Timezones, Fundraising Sprint X-rays, Fundraising Sprint YAMLton, the Musical, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn changed the status of T232636: Restore target smart data lost on legacy merge screen from Stalled to Open.
Thu, Jan 23, 9:17 AM · Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fundraising Sprint Trojan Horse Wisperer, Fundraising Sprint Usual Subscripts, Fundraising Sprint Visual Basic Instinct, Fundraising Sprint A Wrinkle in Timezones, Fundraising Sprint X-rays, Fundraising Sprint YAMLton, the Musical, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T233374: Automatic DAF thank you email from gon to Automatic DAF thank you email.
Thu, Jan 23, 9:12 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fundraising Sprint Usual Subscripts, Fundraising Sprint Visual Basic Instinct, Fundraising Sprint A Wrinkle in Timezones, Fundraising Sprint X-rays, Fundraising Sprint Autocorrect Astrology Ascendant, Patch-For-Review, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T233374: Automatic DAF thank you email as "Open".
Thu, Jan 23, 9:11 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fundraising Sprint Usual Subscripts, Fundraising Sprint Visual Basic Instinct, Fundraising Sprint A Wrinkle in Timezones, Fundraising Sprint X-rays, Fundraising Sprint Autocorrect Astrology Ascendant, Patch-For-Review, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn renamed T242160: Maintenance script to perform recurring payment capture from tnnh to Maintenance script to perform recurring payment capture.
Thu, Jan 23, 9:07 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, FR-Adyen, Recurring-Donations, Fundraising Sprint Autocorrect Astrology Ascendant, Patch-For-Review, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T242160: Maintenance script to perform recurring payment capture as "Open".
Thu, Jan 23, 9:06 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Fundraising-Backlog, FR-Adyen, Recurring-Donations, Fundraising Sprint Autocorrect Astrology Ascendant, Patch-For-Review, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T242160: Maintenance script to perform recurring payment capture, a subtask of T238101: EPIC recurring for Adyen, as Open.
Thu, Jan 23, 9:06 AM · Epic, Recurring-Donations, FR-Adyen, Fundraising-Backlog
ArielGlenn renamed T242277: Update Adyen SmashPig code to be able to create recurring donations from smui to Update Adyen SmashPig code to be able to create recurring donations.
Thu, Jan 23, 9:03 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Recurring-Donations, Patch-For-Review, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising-Backlog, FR-Adyen, Fundraising Sprint Byzantine Empire Strikes Back
ArielGlenn reopened T242277: Update Adyen SmashPig code to be able to create recurring donations, a subtask of T238101: EPIC recurring for Adyen, as Open.
Thu, Jan 23, 9:02 AM · Epic, Recurring-Donations, FR-Adyen, Fundraising-Backlog
ArielGlenn reopened T242277: Update Adyen SmashPig code to be able to create recurring donations as "Open".
Thu, Jan 23, 9:02 AM · Fundraising Sprint Dampness, Fundraising Sprint CAPS LOCK CULTS, Recurring-Donations, Patch-For-Review, Fundraising Sprint Autocorrect Astrology Ascendant, Fundraising-Backlog, FR-Adyen, Fundraising Sprint Byzantine Empire Strikes Back

Wed, Jan 22

ArielGlenn moved T243434: Dumps should write pagerange info for page content jobs to a file from Backlog to Active on the Dumps-Generation board.
Wed, Jan 22, 7:33 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

Some unexpected (?) triples are popping up that @dcausse is looking into, so the dumps will not be turned on in cron until we have the thumbs-up on that. See T243292.

Wed, Jan 22, 6:19 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a subtask for T221917: Create RDF dump of structured data on Commons: T243292: Fix the munger to support commons RDF dump.
Wed, Jan 22, 6:17 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a parent task for T243292: Fix the munger to support commons RDF dump: T221917: Create RDF dump of structured data on Commons.
Wed, Jan 22, 6:17 PM · Wikidata, Wikidata-Query-Service
ArielGlenn triaged T243434: Dumps should write pagerange info for page content jobs to a file as Medium priority.
Wed, Jan 22, 6:06 PM · Patch-For-Review, Dumps-Generation

Tue, Jan 21

ArielGlenn added a comment to T243279: Document cases when bzip decompression of dumps files may fail.

Relevant tasks: T208647 T239866 T243241

Tue, Jan 21, 12:06 PM · Dumps-Generation
ArielGlenn moved T243279: Document cases when bzip decompression of dumps files may fail from Backlog to Active on the Dumps-Generation board.
Tue, Jan 21, 12:05 PM · Dumps-Generation
ArielGlenn triaged T243279: Document cases when bzip decompression of dumps files may fail as Medium priority.
Tue, Jan 21, 12:05 PM · Dumps-Generation
ArielGlenn moved T243055: Publish SQL dumps of CodeReview tables from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Tue, Jan 21, 11:59 AM · DBA, Dumps-Generation, MediaWiki-extensions-CodeReview
ArielGlenn moved T243241: Some xml-dumps files don't follow BZ2 'correct' definition from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Tue, Jan 21, 11:58 AM · Analytics-Kanban, Dumps-Generation, Analytics
ArielGlenn added a comment to T243241: Some xml-dumps files don't follow BZ2 'correct' definition.

From chat on IRC: we're waiting on a release containing the upstream patch; then folks here will see if they can tweak the local hadoop dependencies to pull in that version. In the meantime, since this is a rare issue, a script that re-converts the file to standard bz2 whenever an import fails can work around it. More updates here as things happen upstream.

Tue, Jan 21, 11:58 AM · Analytics-Kanban, Dumps-Generation, Analytics
ArielGlenn added a comment to T243241: Some xml-dumps files don't follow BZ2 'correct' definition.

Thanks for the heads up. This is probably a side effect of using lbzip2 as the compressor for these files. I'll be monitoring the progress of the upstream bug. In the meantime, might you be able to use bzip2 to decompress and recompress the problem file(s) so that you can get your import into hadoop done?
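
As a rough sketch of that workaround (the file name below is a placeholder rather than one of the real dump files, and this assumes the standard bzip2 tools are installed):

# placeholder file name; bzcat decompresses all streams, bzip2 rewrites the data as one stream
bzcat problem-file.xml.bz2 | bzip2 -9 > problem-file-recompressed.xml.bz2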

Tue, Jan 21, 8:29 AM · Analytics-Kanban, Dumps-Generation, Analytics

Mon, Jan 20

ArielGlenn closed T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort as Resolved.

The run completed early this morning or late last night.

Mon, Jan 20, 4:11 PM · Patch-For-Review, Dumps-Generation

Jan 20 2020

ArielGlenn updated subscribers of T243055: Publish SQL dumps of CodeReview tables.

Adding @Bstorm for the labstore boxes which is where these files will land when published.

Jan 20 2020, 10:02 AM · DBA, Dumps-Generation, MediaWiki-extensions-CodeReview

Jan 16 2020

ArielGlenn updated subscribers of T221917: Create RDF dump of structured data on Commons.

@dcausse is going to check over the ttl dump and let me know if it looks ok; if so then I'll flip the switch for generation weekly and make sure there's cleanup too.

Jan 16 2020, 5:23 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn moved T221917: Create RDF dump of structured data on Commons from Blocked/Stalled/Waiting for event to Active on the Dumps-Generation board.
Jan 16 2020, 5:09 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

@ArielGlenn Yes, it works! And with this additional parameter it works for both bzip2 files, so I can already adapt the tool.
One final doubt: does this task mean that the xxwiki-xxxxxxxx-pages-articles.xml.bz2 dumps will no longer be generated?

Jan 16 2020, 3:47 PM · Dumps-Generation
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

In https://dumps.wikimedia.org/other/wikibase/commonswiki/ there are two ttl files, gz and bz2 compressed. Please have a look!

Jan 16 2020, 2:13 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

@Benjavalero I think you are using BZip2CompressorInputStream in your code? You must tell it that you want it to decompress multiple concatenated streams if there are any. See: https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html Let me know if this works!

Jan 16 2020, 12:50 PM · Dumps-Generation
ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

@Benjavalero Thanks for testing! I think we can handwave about the python2 script, since Python 2 is officially EOL. The Java tool concerns me, however; can you give me a link to the tool, or even better, to its source? And please also let me know the exact command you run, with flags. I'll try to duplicate it here and see what's up. Thanks!

Jan 16 2020, 12:34 PM · Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Wikidata 7z files through part of part27, along with the associated hash files, are done, and I'm producing more with a manual run a couple of times a day. We should be in good shape to finish the run in time.

Jan 16 2020, 10:20 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

I found a ticket that mentions use of ttl files so I'll run

/usr/local/bin/dumpwikibaserdf.sh commons full ttl

and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!

Jan 16 2020, 10:17 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata

Jan 14 2020

ArielGlenn added a comment to T240520: Produce dumps of commons thumbnail URLs.

<snip>

I've adjusted the script to parallelize checking the containers, and adjusted the bash script to invoke it with 4 workers. The workers coordinate via an output lock so only one writes at a time, each with a reasonable per-thread buffer. Seems to work well.
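
For illustration only, here is the same coordination pattern as a small shell sketch rather than the actual worker code; the list_one_container helper and the container names are made up, and flock(1) stands in for the in-process lock:

touch thumb-urls.lock thumb-urls.txt
for container in container-00 container-01 container-02 container-03; do
    (
        tmp=$(mktemp)
        list_one_container "$container" > "$tmp"    # hypothetical helper: list one swift container
        # only one worker appends to the shared output file at a time
        flock thumb-urls.lock -c "cat $tmp >> thumb-urls.txt"
        rm -f "$tmp"
    ) &
done
wait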

Jan 14 2020, 9:03 AM · Patch-For-Review, Dumps-Generation, Internet-Archive, Datasets-Archiving
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

I've manually pushed https://gerrit.wikimedia.org/r/#/c/operations/dumps/+/562828/ to the version of the dumps repo in use by the current job, on snapshot1006 where the wikidata run is continuing. This will ensure that the 7z job will skip any 7z files generated in the meantime instead of cleaning them all up first.

Jan 14 2020, 8:40 AM · Patch-For-Review, Dumps-Generation

Jan 13 2020

ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

I plan to try running

/usr/local/bin/dumpwikibaserdf.sh commons full nt

on Thursday morning and see how long it takes with the 8 shards that are currently configured. @Abit, is the nt format the one needed for WDQS testing?

Jan 13 2020, 12:11 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz
Jan 13 2020, 11:55 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4  2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz

and it also ran fine.

Jan 13 2020, 11:02 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

Note to self that a run of

Jan 13 2020, 10:58 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

This morning the job was terminated by the oom killer:

Jan 13 2020, 10:06 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Since this task is nominally about the wikidata run abort, I'll put the catchup measures for that run here too. I'm starting wikidata page content 7z recompression runs in a screen session on snapshot1005:

bash fixup_scripts/do_7z_jobs.sh --config /etc/dumps/confs/wikidump.conf.dumps:wd --jobinfo 1,2,3,4,5,6,7,8 --date 20200101 --numjobs 20 --skiplock  --wiki wikidatawik

for various jobinfo values.

Jan 13 2020, 9:28 AM · Patch-For-Review, Dumps-Generation

Jan 10 2020

ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

A batchsize of 50k turned out to be too large. Same with 5k. I'm now running with a batchsize of 500, which will surely be too small, but at least I am getting output. I'll check on it tomorrow and see how it's doing.

Jan 10 2020, 8:24 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata
ArielGlenn added a comment to T221917: Create RDF dump of structured data on Commons.

Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:

Jan 10 2020, 1:56 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata

Jan 9 2020

ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Redid the page history content, 7z, and noop steps for eswiki with the new code; it looks OK. I want to further test the new code and deploy it before closing this task, though.

Jan 9 2020, 6:53 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T241149: rdfDump.php generates error messages when dumping for pages without mediainfo items.

Brilliant! I'll be doing some fun things tomorrow then. Thanks!

Jan 9 2020, 4:12 PM · Structured-Data-Backlog (Current Work), Structured Data Engineering, WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T241149: rdfDump.php generates error messages when dumping for pages without mediainfo items.

It would work to get a test dump out, and yeah, I'll do a little test first. But for production I'd like to be able to not write them at all; there's no point to it.

Jan 9 2020, 3:25 PM · Structured-Data-Backlog (Current Work), Structured Data Engineering, WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T182351: Make HTML dumps available.

@leila I still really want these to happen. As RESTBase moves towards being phased out, I'm trying to have the discussion about access to its replacement with bulk access for dumps in mind. But it's going to need a lot of thought yet.

Jan 9 2020, 7:57 AM · Research-Backlog, Datasets-Archiving, Analytics
ArielGlenn updated subscribers of T221917: Create RDF dump of structured data on Commons.

@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.

Jan 9 2020, 7:50 AM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Wikidata-Query-Service, Commons, Wikidata

Jan 8 2020

ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Fail, this was a part of the code path that apparently never got exercised. Heh. Have a patch to the patch but it's too late now for even testing. Tomorrow.

Jan 8 2020, 11:56 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

I have a fix that looks like it might work, going to try it on the missing eswiki page history content files now. They'll finish up sometime over night and I'll check them tomorrow.

Jan 8 2020, 10:17 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Still looking into the source of the weird arguments to writeuptopageid that cause the problem. More updates tomorrow.

Jan 8 2020, 7:23 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Well, that's not true: the current bz2 and 7z files for some of the page history content are wrong. I don't know what went wrong, so I will toss them all, along with the temp stubs, and rerun them.

Jan 8 2020, 5:03 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

I am now running a noop job on eswiki, so it should be good shortly. Just need to make sure no 'extra' files are copied to dumpsdata1003 or to labstore1006,7.

Jan 8 2020, 4:40 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

I've started the 7z job in a screen session on snapshot1005, since there's available cores and no new dumps that will take those resources.

Jan 8 2020, 3:02 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T242221: Don't remove generated 7z content files when rerunning 7z step as Resolved.

Huh. Well, seeing as this is merged already, I guess this is done.

Jan 8 2020, 2:33 PM · Dumps-Generation
ArielGlenn added a comment to T242221: Don't remove generated 7z content files when rerunning 7z step.

This is already tested and works fine. I'll merge when we want to run the eswiki 7z step as needed for T242209.

Jan 8 2020, 1:49 PM · Dumps-Generation
ArielGlenn triaged T242221: Don't remove generated 7z content files when rerunning 7z step as Medium priority.
Jan 8 2020, 1:45 PM · Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

The missing bz2 content files are being generated now.

Jan 8 2020, 12:42 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

I'd like to move forward with switching to multiple stream files in all cases. Have sent another email to the list to see if we get any more interest or any pushback.

Jan 8 2020, 10:49 AM · Dumps-Generation
ArielGlenn moved T241794: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Jan 8 2020, 10:47 AM · Patch-For-Review, ops-eqiad, Dumps-Generation, Operations
ArielGlenn moved T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort from Backlog to Active on the Dumps-Generation board.
Jan 8 2020, 10:47 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

eswiki produced an exception during its run, and left around duplicate and truncated temporary stub files. I cleaned up the following temp stubs:

-rw-r--r-- 1 dumpsgen dumpsgen      6302 Jan  7 04:53 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p712340p768034.gz
-rw-r--r-- 1 dumpsgen dumpsgen 108155079 Jan  6 04:48 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p712340p768089.gz
-rw-r--r-- 1 dumpsgen dumpsgen       907 Jan  7 04:54 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p768035p838115.gz
-rw-r--r-- 1 dumpsgen dumpsgen 107605336 Jan  6 04:49 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p768090p838203.gz
-rw-r--r-- 1 dumpsgen dumpsgen       895 Jan  7 04:55 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p838116p908010.gz
-rw-r--r-- 1 dumpsgen dumpsgen 107743034 Jan  6 04:49 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p838204p908213.gz
-rw-r--r-- 1 dumpsgen dumpsgen      5092 Jan  7 04:56 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p908011p986575.gz
-rw-r--r-- 1 dumpsgen dumpsgen 106798396 Jan  6 04:50 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p908214p986877.gz
-rw-r--r-- 1 dumpsgen dumpsgen      2275 Jan  7 04:56 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p986576p986877.gz
-rw-r--r-- 1 dumpsgen dumpsgen 108538927 Jan  6 04:51 /data/xmldatadumps/temp/e/eswiki/eswiki-20200101-stub-meta-history3.xml-p986878p1063682.gz
Jan 8 2020, 10:33 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort.

Yesterday evening I checked dewiki output files and temp stub files but saw no anomalies, so removing the temp output files looks like it was sufficient for that run.
Here's the listing before removal in case we need to revisit the issue:

-rw-r--r-- 1 dumpsgen dumpsgen 1843064429 Jan  6 19:35 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p796847p841103.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 2554761161 Jan  7 07:49 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p841104p886901.bz2
-rw-r--r-- 1 dumpsgen dumpsgen   80971717 Jan  6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p841104p886912.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1350372220 Jan  7 07:22 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p886902p941391.bz2
-rw-r--r-- 1 dumpsgen dumpsgen   10420343 Jan  6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p886913p941293.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen  107818682 Jan  6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p941294p992526.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen 1301482595 Jan  7 07:17 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p941392p992651.bz2
-rw-r--r-- 1 dumpsgen dumpsgen   84718874 Jan  6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p992527p1044968.bz2.inprog
-rw-r--r-- 1 dumpsgen dumpsgen   47810144 Jan  6 20:13 ../public/dewiki/20200101/dewiki-20200101-pages-meta-history2.xml-p1044969p1095940.bz2.inprog
Jan 8 2020, 10:26 AM · Patch-For-Review, Dumps-Generation
ArielGlenn triaged T242209: Clean up after killing snapshot1006 dump processes due to wikidata dump run abort as Medium priority.
Jan 8 2020, 10:22 AM · Patch-For-Review, Dumps-Generation

Jan 7 2020

ArielGlenn added a comment to T241794: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet.

Might I be able to get this by Jan 25? This will allow me to do set-up and have it ready to go by Feb 1st.

Jan 7 2020, 3:10 PM · Patch-For-Review, ops-eqiad, Dumps-Generation, Operations
ArielGlenn moved T241149: rdfDump.php generates error messages when dumping for pages without mediainfo items from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Jan 7 2020, 3:05 PM · Structured-Data-Backlog (Current Work), Structured Data Engineering, WikibaseMediaInfo, Dumps-Generation
ArielGlenn moved T241794: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet from Backlog to Active on the Dumps-Generation board.
Jan 7 2020, 3:04 PM · Patch-For-Review, ops-eqiad, Dumps-Generation, Operations
ArielGlenn added a comment to T240520: Produce dumps of commons thumbnail URLs.

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get URLs from swift at about 20k/sec; for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but I could look into it. If we want things to take less time, the work could be parallelized over the list of containers to dump (255); probably we could do 4 at a time or some such.

The script as written will also produce a listing for commonswiki; do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as above, it takes around 18 hours. Compressed, the output is around 7 GB.
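
Working from the figures above, that is roughly 1.3 billion URLs at 20,000 URLs/sec, i.e. about 65,000 seconds, or a bit over 18 hours for a single sequential pass; four-way parallelism over the containers would bring that down to the neighborhood of 4-5 hours if throughput scales.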

Jan 7 2020, 10:21 AM · Patch-For-Review, Dumps-Generation, Internet-Archive, Datasets-Archiving

Jan 3 2020

ArielGlenn added a comment to T240520: Produce dumps of commons thumbnail URLs.

A couple questions as I read through the patch:

Jan 3 2020, 3:04 PM · Patch-For-Review, Dumps-Generation, Internet-Archive, Datasets-Archiving

Dec 30 2019

ArielGlenn lowered the priority of T241573: Investigate all-site outage on 2019-12-30 (HTTP 504 error) from Unbreak Now! to High.

Remediation measures have been applied; lowering the priority of this task for the moment but not yet closing it.

Dec 30 2019, 8:34 AM · Wikimedia-Incident, Operations
ArielGlenn reopened T241573: Investigate all-site outage on 2019-12-30 (HTTP 504 error) as "Open".

Not yet resolved. Re-opening for now.

Dec 30 2019, 8:24 AM · Wikimedia-Incident, Operations
ArielGlenn added projects to T241573: Investigate all-site outage on 2019-12-30 (HTTP 504 error): Operations, Wikimedia-Incident.
Dec 30 2019, 8:18 AM · Wikimedia-Incident, Operations