kibana entries for the https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg failure: https://logstash.wikimedia.org/goto/1f1b3fb6e8821924e1d9a185efb9a869
It was indeed discussed this week, and someone from that discussion should be weighing in here in the next few days.
That week we did not meet after all! :-( And this week I was out, so I do not know whether it was discussed. It should not be delayed any longer, though. I will see what this week's clinic duty person can tell us.
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517032/ this is the proposed format for wmf user agent strings for monitoring checks.
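To make that concrete for check-script authors, here's a rough Python sketch of sending a request with such a UA. The UA string below is made up for illustration; the authoritative format is whatever the patch lands with:

import requests

# Hypothetical UA in the spirit of the proposed convention; the real
# format is defined in the Gerrit change linked above, not here.
USER_AGENT = "wmf-monitoring/1.0 (root@wikimedia.org)"

response = requests.get(
    "https://en.wikipedia.org/wiki/Main_Page",
    headers={"User-Agent": USER_AGENT},
    timeout=10,
)
response.raise_for_status()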
Note that with READ_NEW I was able to test for speed improvements and everything looks good now, as mentioned on the patchset.
Cherry picking https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/511126/ and running with the setting
$wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD;
netted me an exception. Here are the details (with some formatting by me):
Wikimedia\Rdbms\DBQueryError from line 1586 of /home/system/www/html/elwikt/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query:
SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1,
  comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`,
  actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`,
  rev_text_id, rev_content_format, rev_content_model,
  page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len,
  slots.rev_id AS `slot_revision_id`, NULL AS `slot_content_id`, slots.rev_id AS `slot_origin`, 'main' AS `role_name`,
  slots.rev_len AS `content_size`, slots.rev_sha1 AS `content_sha1`, CONCAT('tt:',slots.rev_text_id) AS `content_address`,
  slots.rev_text_id AS `rev_text_id`, slots.rev_content_model AS `model_name`,
  page_restrictions, 1 AS `_load_content`
FROM `page`
JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id))
JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id))
JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))
JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id))
JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor))
The current run shows no bad uids for zhwikisource stubs so this is done.
Fri, Jun 21
If we build it for reals, I'd ask @MoritzMuehlenhoff about all that. If we're just doing it for testing, shove it in some directory you have access to and tell me, and we can coordinate swapping it in for a few minutes for a test.
Thu, Jun 20
The gerrit change is ready for me to test now, probably by playing with it in beta a whole bunch.
You can ask, but this time I'll say "not yet" :-D
Svwiki's history run is indeed starting to take long enough that it's worth splitting up. I'll add this to my todo list.
@jcrespo I'm adding you too, please remove yourself if you're already covered by other tasks.
I'm going to go ahead with this; @RobH, I need a quote for a host that looks like dumpsdata1001/2. Previous quote info is here: T161344 and specs are here: https://wikitech.wikimedia.org/wiki/Dumps/Dumpsdata_hosts#Hardware What else do you need from me to proceed? (If I should create a separate task for a different queue and tag it with a specific team, just lmk that too.)
Some useful links:
OK, this means there's nothing in the specific revisions themselves, and it's probably MediaWiki deadlocking itself.
I saw many slow en wiki page loads yesterday, including missing skins. (Logged in user, Europe.)
What do people think of a July 29 deadline (the start of that run)? Unfortunately we can't really do a 1st of the month change.
Because this affects downloaders, might as well blast xmldatadumps-l and tbh I would forward to wikitech-l too.
Time to figure out who/how we notify, and put a date out for the name change.
Wed, Jun 19
Those are just the refactored wb_terms table that is already provided. See T221764 which I happened to be looking at earlier today :-)
From email from @MarkTraceur
I could probably hack the script to let one optionally specify a start date and an end date; is it worth it, though?
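If we decide it's worth it, the hack would look roughly like this (a sketch using argparse; the option names and date format are my guesses, not what the script actually uses):

import argparse
from datetime import datetime

def date_arg(value):
    # Dump run directories are named YYYYMMDD, so accept that format.
    return datetime.strptime(value, "%Y%m%d").date()

parser = argparse.ArgumentParser()
parser.add_argument("--start-date", type=date_arg, default=None,
                    help="skip runs before this date (YYYYMMDD)")
parser.add_argument("--end-date", type=date_arg, default=None,
                    help="skip runs after this date (YYYYMMDD)")
args = parser.parse_args()

def run_in_range(run_date):
    # With neither option given, behavior is unchanged: everything matches.
    if args.start_date is not None and run_date < args.start_date:
        return False
    if args.end_date is not None and run_date > args.end_date:
        return False
    return True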
See T221917 where this is actually being done (1/2, the other half which is the inclusion in xml dumps, depends on some pending changes to core still in the works). Should I merge this task into the other one?
Should we have a ticket for misc cleanup like 'get rid of the BETA in filenames' and 'get rid of the legacy directory stuff for the json dumps'?
I'll leave this ticket open until we see that the next month's report has shown up.
Tue, Jun 18
In lieu of pages, I looked at the catalog (/var/lib/puppet/client_data/catalog/<hostname>.json) and compared the entries for labmon1001 (which pages) and labstore1006. Yeah, I'm paranoid...
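(The comparison itself was nothing fancy; in Python it's roughly the sketch below. The catalog filenames are from memory, and the top-level 'resources' key is an assumption about the compiled catalog's JSON layout.)

import json

def catalog_resources(path):
    with open(path) as f:
        catalog = json.load(f)
    # Each resource in the compiled catalog has a type and a title.
    return {(r["type"], r["title"]) for r in catalog["resources"]}

labmon = catalog_resources("/var/lib/puppet/client_data/catalog/labmon1001.eqiad.wmnet.json")
labstore = catalog_resources("/var/lib/puppet/client_data/catalog/labstore1006.wikimedia.org.json")

print("only on labmon1001:", sorted(labmon - labstore))
print("only on labstore1006:", sorted(labstore - labmon))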
While I"m sure the above is quite broken in a variety of ways, this is the sort of thing I had in mind, being able to drop in one file with just values specific to commons (or whatever wikibase thing might come our way later), and change the project name only, getting everything else 'for free'. If we wind up wanting json dumps for commons/future projects, it should not be hard to do something similar for them. I personally would like to see json dumps happen btw; is that on the road map?
I checked the irc logs to see what icinga told us then:
[09:25:21] <icinga-wm> PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:13] <icinga-wm> RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 28%, RTA = 36.35 ms
[09:31:51] <icinga-wm> PROBLEM - NFS on labstore1006 is CRITICAL: connect to address 188.8.131.52 and port 2049: Connection refused
Sun, Jun 16
Yes, I think that's a fine idea. Just one retry in an hour, let's say, no more than that.
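In pseudo-Python, the behavior I have in mind is this (the job argument is a stand-in for whatever the run actually does):

import time

def run_with_one_retry(job, delay_seconds=3600):
    # First attempt, then exactly one retry an hour later; after that, give up.
    try:
        return job()
    except Exception:
        time.sleep(delay_seconds)
        return job()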
Fri, Jun 14
Found them (thanks, tendril)!
transaction: 78697996224, run time 14s, stamp 2019-06-08 05:37:38
INSERT /* WikiPage::insertOn */ IGNORE INTO `page` (page_namespace, page_title, page_restrictions, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len) VALUES ('0', '??????', '', '0', '1', '0.163800399756', '20190608053733', '0', '0')
I don't know if it's exactly the same, because I didn't see other long (or short) running inserts around that time. I can copy/paste the relevant part of the binlog in a file if that's useful.
First, I can tell that the page and some revisions made it in. If we compare https://zh.wikipedia.org/w/index.php?title=%E9%82%AA%E5%85%B8%E9%9B%BB%E5%BD%B1%E5%88%97%E8%A1%A8&offset=&limit=250&action=history&uselang=en with https://zh.wikiversity.org/w/index.php?title=%E9%82%AA%E5%85%B8%E9%9B%BB%E5%BD%B1%E5%88%97%E8%A1%A8&offset=&limit=250&action=history&uselang=en the first 129 revisions were imported.
It will be live in around half an hour everywhere; sometime after that, please check that you can get to the hosts you expect.
Great! I'll make sure this is brought up at the next SRE meeting then (Monday).
Thu, Jun 13
I've been a bit sandbagged this week which is why you've seen neither a stack trace nor confirmation that I didn't screw up the testing. This weekend, or Monday at the earliest.
Testing was done using a tiny xml file, running:
root@wikitech-static:/srv/mediawiki/w# rm /var/log/debug-wikitech.log
root@wikitech-static:/srv/mediawiki/w# rm /srv/mediawiki/images/wikitech/archive/4/4c/20130119210037\!20130119-2158-PuTTY_Configuration.png
root@wikitech-static:/srv/mediawiki/w# php maintenance/importDump.php --uploads /root/testimport.xml
and looking at the log.
Found this. And it was silly. Just like Camelot.
This has since been set to standalone, and new certs were generated. See T204840#5243222 for the context. Should this task remain open?
The search listed as the second issue now works fine.
What happens to the pages on wikitech-static, which should *not* have access to wikidata data?
New image arrived. Proof: https://wikitech.wikimedia.org/wiki/File:%CE%A3%CF%84%CE%B9%CE%B3%CE%BC%CE%B9%CF%8C%CF%84%CF%85%CF%80%CE%BF_%CE%B1%CF%80%CF%8C_2019-06-12_14-45-40.png and https://wikitech-static.wikimedia.org/wiki/File:%CE%A3%CF%84%CE%B9%CE%B3%CE%BC%CE%B9%CF%8C%CF%84%CF%85%CF%80%CE%BF_%CE%B1%CF%80%CF%8C_2019-06-12_14-45-40.png
Wed, Jun 12
Yes, I agree. The patch is necessary, just not sufficient :-)
One thing that can happen with this changeset, in the case where a disk bounces like that repeatedly, is that before each puppet run chowns the directory back to root, some additional data gets copied to the local filesystem. After a few cycles that can fill / and we'll be back here again.
Out of desperation I have uploaded a new image to wikitech. I'll check tomorrow to see if it made it over.
I'm going to have a look at the code duplication a bit over the next few days and see if I can come up with a counterproposal patch. If not, then we'll just go ahead; I totally understand where you're coming from.
Tue, Jun 11
I'd like to see us test with a locally patched sshd and see if that's indeed the problem, as a first step.
Still waiting on @Tobi_WMDE_SW