User Details
- User Since
- Oct 8 2014, 7:09 PM (468 w, 2 d)
- Availability
- Available
- IRC Nick
- apergos
- LDAP User
- ArielGlenn
- MediaWiki User
- ArielGlenn [ Global Accounts ]
Mon, Sep 25
Tue, Sep 19
Just a note that sometimes size/page count/visible rev count might go down, if a large batch of pages are deleted for e.g. copyvio (more likely to occur on a small wiki).
Wed, Sep 13
We could make sure that for commonswiki, the setting config "sevenzipprefetch" is 0. I'll need to check that this is one of the settings that can be overriden, and that the code will recognize 0 as a 'false' value. This should get done before next month's full run.
There's one item in the checklist left before this task can be closed. And basically the holdup is just about getting the signoff from Tyler that the deployment trainings were completed; then we can get the rest of that item done.
Tue, Sep 12
To expand on this a bit more: we saw the same error and stack trace on a slightly different page range, but with the identical symptoms. Logstash link here: https://logstash.wikimedia.org/goto/62b164dd91e2763a0a402d02087be836 Running the job hangs at the same point every time, even if nothing else is happening on the host; there aren't a particularly large number of revisions for the problem page, and their size isn't very large either. As before, using bz2 prfetch files permits the job to run to completion.
So the patches went around and I checked that they are on snapshot03, but unfortunately I still see the error:
2023-09-12 05:20:33: enwiki (ID 14793) 683 pages (694.3|694.3/sec all|curr), 1000 revs (1016.6|1016.6/sec all|curr), ETA 2023-09-12 05:30:22 [max 600437] MWUnknownContentModelException from line 192 of /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php: The content model 'JadeJudgment' is not registered on this wiki. See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model. #0 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(247): MediaWiki\Content\ContentHandlerFactory->validateContentHandler('JadeJudgment', NULL) #1 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(181): MediaWiki\Content\ContentHandlerFactory->createContentHandlerFromHook('JadeJudgment') #2 /srv/mediawiki/php-master/includes/content/ContentHandlerFactory.php(93): MediaWiki\Content\ContentHandlerFactory->createForModelID('JadeJudgment') #3 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(474): MediaWiki\Content\ContentHandlerFactory->getContentHandler('JadeJudgment') #4 /srv/mediawiki/php-master/includes/export/XmlDumpWriter.php(402): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1) #5 /srv/mediawiki/php-master/includes/export/WikiExporter.php(554): XmlDumpWriter->writeRevision(Object(stdClass), Array) #6 /srv/mediawiki/php-master/includes/export/WikiExporter.php(492): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass)) #7 /srv/mediawiki/php-master/includes/export/WikiExporter.php(316): WikiExporter->dumpPages('page_id >= 1900...', false) #8 /srv/mediawiki/php-master/includes/export/WikiExporter.php(208): WikiExporter->dumpFrom('page_id >= 1900...', false) #9 /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php(355): WikiExporter->pagesByRange(190001, 195001, false) #10 /srv/mediawiki/php-master/maintenance/dumpBackup.php(82): BackupDumper->dump(1, 1) #11 /srv/mediawiki/php-master/maintenance/includes/MaintenanceRunner.php(685): DumpBackup->execute() #12 /srv/mediawiki/php-master/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run() #13 /srv/mediawiki/multiversion/MWScript.php(159): require_once('/srv/mediawiki/...') #14 {main}
Perhaps the override isn't being respected, or the usage isn't quite right?
Fri, Sep 8
the ops-dumps email alias ought to get notified about things like this; that way all the right people will see it.
Thu, Sep 7
Just a quick note that this breaks testing of dumps for enwiki in deployment-prep. We can work around it by testing only on other wikis, but it would be nice for this to be cleaned up.
We missed you today for the training. I'm guessing that something came up? Go ahead and reschedule, if you are still interested!
Thu, Aug 31
Aug 29 2023
Thanks for the fix(es), everything is working as expected now.
Aug 28 2023
Verified that with those same files from the above command the error is still present, nothing in the MW codebase has changed whatever the underlying issue is.
Related: T324463
Who runs the findBadBlobs.php script in cases like this? It would be nice to get that done.
Not doing this, since we now have WME (Enterprise) dumps in HTML format available for public download.
Going ahead and closing this.
Aug 24 2023
grep on mwmaint1002 for php, looking for long running stuff, gives me only
Jul11 0:00 /bin/bash /usr/local/bin/mwscript eval.php --wiki=commonswiki
The others are all Aug 22 or 23rd just fyi.
I"m presuming you didn't see any instances of this in the meantime, @awight ? Can we close this?
Hey @Sgs we missed you this morning at the deployment window for training. Or were you going to do the UTC late window this time?
Aug 22 2023
I can try to dust off and restructure the troubleshooting guide on wikitech for the sql/xml dumps, if that would be helpful. This would by no means be a replacement for the runbook, but more of a minimal guide if people get stuck. Having a document specifically for dumps newcomers is great and I hope it will be expanded over time!
Aug 17 2023
Counterpoint: knowing the config settings doesn't mean understanding the code activated by those changes or its possible impacts. At least, not for me. Some areas I know, and some I don't.
Aug 16 2023
Aug 14 2023
Note that since dumps snapshot instances are sorta-kinda mediawiki instances, this affects them too.
Aug 9 2023
Aug 7 2023
Aug 5 2023
Jul 27 2023
This training happened, though it was a lot less interactive and useful than it could have been because no patches were scheduled and no one showed up with a patch last minute, in spite of me begging :-D But we went through a description of all the steps, looked at all the relevant dashboards and hosts and commands, so there's that :-)
Jul 19 2023
Dumpsdata1007, running bullseye, is now the fallback host for sql/xml and misc dumps. This means all hosts in production (not spares) are on bullseye now and this task can be closed after a day or so just to make sure things are stable.
Jul 17 2023
Sounds great to me, thanks!
Jul 13 2023
Just for my understanding, it looks like the new patch would exception out in the case where there is a failure with the last connection of whatever sort. Am I reading that right? And if so, how does that help us in the current situation? Sorry for whatever I am missing here. Thanks!
Hey @elukey (or anyone else watching who wants to chime in), I've got a recipe that might maybe possibly could work. (See patch above.) But I have questions. Some recipes in the repo deal with lvm partitions by "unknown ignore" instead of "lvmpv keep", and I wonder which is better. Some recipes without swap specify "d-i partman-basicfilesystems/no_swap boolean false" and some do not, and I wonder which is right. And last but not least, is it still the procedure for testing before merge, to announce "hey I'm testing on installX00Y and disabling puppet for awhile" in the channel and hoping no one speaks up? Thanks in advance!
Jul 12 2023
A note that I did a test run of sql/xml dumps on deployment-prep with the new icu version and it looks fine to me, though I didn't check for any weird details of category sorting or whatever.
Jul 11 2023
See also https://phabricator.wikimedia.org/T341045 for the context. @WDoranWMF please sign off just in case that's needed. Thanks!
Jul 10 2023
It sounds like the reuse-parts.cfg script is the way to go. Let me poke around and see how that's used elsewhere, and I'll come back if I get stuck. Thanks!
Jul 9 2023
One more to add:
Jul 4 09:16:50 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20230703/commonswiki-20230703-cirrussearch-content.json.gz
@JEbe-WMF you will need to folllow the instructions here https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Checklist and create a task, feel free to add me as a subscriber and link this one to it. Make sure you ask for membership in the wmf ldap group. That will give you icinga/grafana/logstash access.
Dan and Xabriel already are members of the wmf group, giving access to grafana and icinga (though contact info might need to be added for executing commands on icinga). Jennifer is not yet in the group.
Swapped dumpsdata1003 in as the live nfs share for misc dumps.
Jul 6 2023
@JEbe-WMF and @xcollazo you should both sign up for MediaWiki deployment training here: https://phabricator.wikimedia.org/project/board/5265/ and get scheduled for that. Once that's done, we can add you to the deployers list in puppet. (Dan you are already a deployer so you're off the hook ;-) )
@JEbe-WMF @Milimetric and @xcollazo would you please subscribe to https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/ and let me know which email addresses you used? I will add them as co-admins of the list. Thanks!
@RobH Would you be willing to add Milimetric, xcollazo and Jebe-WMF to the #acl*procurement-review acl so that they can view procurement tasks? I hope no new tasks will come up for some time but just in case, and it will let us look at psat ones and discuss. They will be working with me on the dumps now. Thanks!
@WDoranWMF we will need your approval for this.
After a conversation with Will, Dan and others, the people who need the above access are @Milimetric , @xcollazo and @JEbe-WMF so now let me get started on that.
Jul 5 2023
Adding two more failures:
Jul 4 09:16:50 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20230703/mlwiki-20230703-cirrussearch-content.json.gz Jul 4 09:16:50 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20230703/metawiki-20230703-cirrussearch-general.json.gz
These came from different db sections so the script running those finished later, and hence the error report was sent later.
Jul 4 2023
I've not spun up any more buster images, and the next one I create will likely be bullseye. Maybe someone else has done so though.
Jun 22 2023
The above rsync is already complete.
Started an rsync from dumpsdata1004 (fallback nfs share for both sql/xml and misc dumps) to dumpsdata1003, in a screen sessnio as ariel, bandwidth limited to 1G as has previously been requested by SRE folks. We'll do another one of these Friday evening and again just before the swap.
After reimaging with bullseye, checked rpcinfo -p on dumpsdata1003 and the ports for mountd, nfs, nlockmgr are all correct, so once this host has the right data on it, it will be ready to go to be swapped in for dumpsdata1002, which can then be prepped for decommissioning.
I think this is resolved; the current run is already available on the public web server and on the nfs share for WMCS instances, with the same number of files as the download on the 1st of the month. Closing.
When I reran one of these jobs, it ran to completion in the usual period of time. Next time we see this behavior, we can try shooting the job and letting it rerun in a timely fashion, rather than blocking for days. Not exactly a resolution to whatever the underlying bug may have been, but it will have to do.
@fgiunchedi I notice that in some cases phab tasks are autocreated when systemd units fail. Is that true for systemd jobs on snapshot hosts? Could we get tagged on those (Dumps-Generation) or could we get emails from those (ops-dumps@wm.o)?
Just a note that we still regularly see these errors on each dump run for a small selection of wikis.
@WDoranWMF am I right to assume that this is long since moot, superceded by various other things? If not, can the task be updated to reflect the current work left to do and who will be taking that on? If you are not the right person to answer this, perhaps you can redirect me to the right person. Thanks!
Closing this task since the dumps do run to completion now.
Nothing left to do here, closing.
Never mind, it's already marked as inactive and it can stay that way. I'll make sure it's gone from our mirrors lists on various web pages too.
Right, I'm going to make this mirror inactive, it's clearly not coming back. If someone over there changes their mind, we can reenable it.
Long since fixed.
Long since fixed.
There's no point in having this open for a once a year check in, so I'll go ahead and close it. When capacity planning needs to be done for dbs in the regular course of things, this can be discussed.