
addWiki.php broken creating ES tables
Open, High, Public

Description

Initialising external storage cluster24...
[6f30182d25a2d432f75d027c] [no req]   Wikimedia\Rdbms\DBReadOnlyError from line 1163 of /srv/mediawiki/php-1.33.0-wmf.9/includes/libs/rdbms/database/Database.php: Database is read-only: The database has been automatically locked until the replica database servers become available
Backtrace:
#0 /srv/mediawiki/php-1.33.0-wmf.9/includes/libs/rdbms/database/Database.php(4354): Wikimedia\Rdbms\Database->query(string, string)
#1 /srv/mediawiki/php-1.33.0-wmf.9/extensions/WikimediaMaintenance/addWiki.php(229): Wikimedia\Rdbms\Database->sourceStream(resource)
#2 /srv/mediawiki/php-1.33.0-wmf.9/maintenance/doMaintenance.php(94): AddWiki->execute()
#3 /srv/mediawiki/php-1.33.0-wmf.9/extensions/WikimediaMaintenance/addWiki.php(399): include(string)
#4 /srv/mediawiki/multiversion/MWScript.php(100): include(string)
#5 {main}

While doing T197616: Create a production test wiki in group0 to parallel Wikimedia Commons

Event Timeline


Given that it happens intermittently, is this task a blocker for creating new wikis?

Reedy added a comment. Feb 21 2019, 4:57 PM

How do you know it happens intermittently?

How do you know it happens intermittently?

Oh, I thought it was dbstore1002 being grumpy.

Any progress here?

@Reedy would you have time to review the related patch?

ARR8 added a subscriber: ARR8. Mar 16 2019, 11:34 PM

@Reedy There are several wikis pending; is it possible to try to create them now?

Change 486406 had a related patch set uploaded (by Ladsgroup; owner: Aaron Schulz):
[mediawiki/extensions/WikimediaMaintenance@master] Make addWiki better handle second runs in case of errors

https://gerrit.wikimedia.org/r/486406

Change 486406 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Make addWiki better handle second runs in case of errors

https://gerrit.wikimedia.org/r/486406

Okay, I reviewed the patch and merged it. I also made a follow-up to clean up the big method there, if anyone wants to review it: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMaintenance/+/499014
Later this week I can make all of the needed wikis.

Krinkle closed this task as Resolved. Mar 26 2019, 12:38 AM
aaron removed aaron as the assignee of this task. May 10 2019, 2:47 AM
kchapman claimed this task. May 13 2019, 7:46 PM

Assigned to myself to track down ownership of the script

So, the follow-up for this...

I just did T223770: Please create beta zhwikivoyage, and this was the error that came up, which is why we're not creating any new wikis atm. I think this bug should be closed and a new one filed.

[XOEw1KwQBHcAAH1T3fQAAABO] /wiki/%E9%A6%96%E9%A1%B5 MediaWiki\Revision\RevisionAccessException from line 1466 of /srv/mediawiki/php-master/includes/Revision/RevisionStore.php: Failed to load data blob from tt:1: Failed to load blob from address tt:1

Backtrace:

#0 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(1637): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
#1 [internal function]: Closure$MediaWiki\Revision\RevisionStore::loadSlotRecords(MediaWiki\Revision\SlotRecord)
#2 /srv/mediawiki/php-master/includes/Revision/SlotRecord.php(307): call_user_func(Closure$MediaWiki\Revision\RevisionStore::loadSlotRecords;2290, MediaWiki\Revision\SlotRecord)
#3 /srv/mediawiki/php-master/includes/Revision/RevisionRecord.php(175): MediaWiki\Revision\SlotRecord->getContent()
#4 /srv/mediawiki/php-master/includes/Revision/RenderedRevision.php(226): MediaWiki\Revision\RevisionRecord->getContent(string, integer, NULL)
#5 /srv/mediawiki/php-master/includes/Revision/RevisionRenderer.php(193): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string)
#6 /srv/mediawiki/php-master/includes/Revision/RevisionRenderer.php(142): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#7 [internal function]: Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2(MediaWiki\Revision\RenderedRevision, array)
#8 /srv/mediawiki/php-master/includes/Revision/RenderedRevision.php(197): call_user_func(Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2;958, MediaWiki\Revision\RenderedRevision, array)
#9 /srv/mediawiki/php-master/includes/poolcounter/PoolWorkArticleView.php(196): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#10 /srv/mediawiki/php-master/includes/poolcounter/PoolCounterWork.php(123): PoolWorkArticleView->doWork()
#11 /srv/mediawiki/php-master/includes/page/Article.php(773): PoolCounterWork->execute()
#12 /srv/mediawiki/php-master/includes/actions/ViewAction.php(68): Article->view()
#13 /srv/mediawiki/php-master/includes/MediaWiki.php(499): ViewAction->show()
#14 /srv/mediawiki/php-master/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#15 /srv/mediawiki/php-master/includes/MediaWiki.php(865): MediaWiki->performRequest()
#16 /srv/mediawiki/php-master/includes/MediaWiki.php(515): MediaWiki->main()
#17 /srv/mediawiki/php-master/index.php(42): MediaWiki->run()
#18 /srv/mediawiki/w/index.php(3): include(string)
#19 {main}
Reedy added a comment (edited). May 19 2019, 10:37 AM
MariaDB [zhwikivoyage]> select * from revision;
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1                        | rev_content_model | rev_content_format |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
|      1 |        1 |           1 |             |        0 |               | 20190519102809 |              0 |           0 |    1337 |             0 | k6h2u286l68zxe442zja3k1ajjg0m2m | NULL              | NULL               |
|      2 |        2 |           2 |             |        0 |               | 20190519102810 |              0 |           0 |     108 |             0 | snzf2pvqgfabhypsdu7rumjosbo3wwo | NULL              | NULL               |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
2 rows in set (0.00 sec)

MariaDB [zhwikivoyage]> select * from text;
+--------+-----------------+---------------------+
| old_id | old_text        | old_flags           |
+--------+-----------------+---------------------+
|      1 | DB://cluster1/2 | utf-8,gzip,external |
|      2 | DB://cluster1/3 | utf-8,gzip,external |
+--------+-----------------+---------------------+
2 rows in set (0.00 sec)

MariaDB [zhwikivoyage]> select * from blobs1;
Empty set (0.00 sec)
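
The failure above can be modelled as a lookup chain: a `tt:N` address points at `text.old_id = N`, whose `old_text` holds an external `DB://cluster/id` URL that must resolve to a row in that cluster's blobs table. The sketch below is a simplified in-memory model (the real logic lives in SqlBlobStore and ExternalStoreDB), with table contents mirroring the zhwikivoyage rows shown above:

```python
# Simplified model of the lookup chain behind "Failed to load blob from
# address tt:1".  Not the actual MediaWiki code; illustrative only.

text_table = {
    1: "DB://cluster1/2",   # old_id -> old_text (an external address)
    2: "DB://cluster1/3",
}
blobs1 = {}  # the cluster's blobs table -- empty, as observed above

def load_blob(address):
    """Resolve a tt:N address down to the external blob, or raise."""
    scheme, _, old_id = address.partition(":")
    if scheme != "tt":
        raise ValueError("unsupported address scheme: " + scheme)
    external = text_table[int(old_id)]          # tt:1 -> text.old_id = 1
    cluster, blob_id = external[len("DB://"):].split("/")
    store = {"cluster1": blobs1}[cluster]       # pick the cluster's table
    if int(blob_id) not in store:
        raise LookupError("Failed to load blob from address " + address)
    return store[int(blob_id)]
```

With `blobs1` empty, resolving `tt:1` fails at the very last step, which matches the exception in the trace even though the revision and text rows look perfectly normal.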
Gilles moved this task from Inbox to Radar on the Performance-Team board. May 20 2019, 8:31 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

What's the status of this bug? Anyone investigating?

Ankry added a subscriber: Ankry. Jun 2 2019, 9:15 PM
Yann added a subscriber: Yann. Jun 7 2019, 8:42 AM

What's the status of this bug? Anyone investigating?

...?

Koavf added a subscriber: Koavf. Jul 14 2019, 6:58 AM

Repeating @Urbanecm's question. This is starting to cause a lot of frustration among people working on test projects, as even the active and successful ones can't move forward. This is a real black eye on the whole movement and needs high-priority attention (IMO).

Reedy added a comment. Jul 17 2019, 4:22 PM

I note new wikis were never created in a timely fashion anyway...

Ankry added a comment. Jul 17 2019, 6:19 PM

I note new wikis were never created in a timely fashion anyway...

But does this mean that they need to wait years for creation?

Reedy added a comment. Jul 17 2019, 7:45 PM

I note new wikis were never created in a timely fashion anyway...

But does this mean that they need to wait years for creation?

I don't think any have been waiting (ready to actually be actioned) anywhere near that long.

WDoranWMF triaged this task as Normal priority. Jul 18 2019, 4:05 PM
WDoranWMF removed a project: Core Platform Team.

No, but we do have one that has been waiting just about ten months, and a couple of others eight months. And that doesn't count things I've held out of the approval queue because there was no point. With all due respect to all of you—and especially volunteers—if it takes up to three months under ordinary circumstances, I get it. If it takes up to six months under extraordinary circumstances, I get it. But this is getting very, very difficult. @Reedy, we are now officially past "[not] in a timely fashion".

I note new wikis were never created in a timely fashion anyway...

The issue is not that. The issue is that we've lost the ability to create new wikis. That's rather more serious, because it means no expansion or evolution of the projects is possible.

Krenair added a comment (edited). Jul 27 2019, 10:30 PM

Assigned to myself to track down ownership of the script

Avoiding breaking addWiki.php should be the responsibility of everyone with MediaWiki +2/deployment access or higher.

I made some comments here already; let me explain in more depth why fixing this takes a really long time. I hope this helps you understand the problem better:

  • This is not an issue in MediaWiki core; it's an issue in the wiring of our MediaWiki code to our infrastructure. The problem is that it only happens in production, and only while you are actually creating a new wiki; there is no other way to reproduce it (except in the beta cluster, but that is also a lot of hassle). So it's really, really hard to find out what's wrong: you can't "try" it, change something, and try again to see if it works. This makes fixing the issue quite complicated and time-consuming.
  • The buggy code is not owned by any team. Each team at WMF has clear roles and responsibilities (so people don't step on each other's toes, deliver what's expected of them, and have clear planning, roadmaps, etc.), and this part of the code doesn't fit any team's role, leaving it completely unmaintained. On this matter I am as much a volunteer as you are (and having expectations of me here is not fun). Combined with the issue above, this makes it really hard to fix, because I can't just spend an afternoon and find out what's wrong. In fact, this code has issues quite often; it's just that the old ones were easy to solve.
  • All code that goes to production has to be reviewed by someone else, so you also need another volunteer developer with the rights to actually approve your patches. That makes any proper change take a really long time.

I will try to spend some time on it next week in the evenings, but please be patient with me and don't yell at me. As I said, I'm also a volunteer on this.

@kchapman: For long-term, maybe #Growth can own it?

Change 525999 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/WikimediaMaintenance@master] Make AddWiki::execute() smaller

https://gerrit.wikimedia.org/r/525999

Change 526572 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/WikimediaMaintenance@master] Catch, re-check, and retry in case of DB error in creating ES tables

https://gerrit.wikimedia.org/r/526572
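
The approach named in the patch title (catch the DB error, re-check whether the cluster is still read-only, and retry) can be sketched as follows. This is an illustrative model, not the patch itself: `source_schema`, `is_read_only`, and the backoff values are hypothetical stand-ins for the real Rdbms calls.

```python
import time

class DBReadOnlyError(Exception):
    """Stand-in for Wikimedia\\Rdbms\\DBReadOnlyError."""

def create_es_tables(source_schema, is_read_only, attempts=5, delay=1.0):
    """Source the ES table schema, retrying while the DB reports read-only.

    source_schema: callable that sources the SQL file and may raise
    DBReadOnlyError; is_read_only: callable re-checking the DB state.
    Both are hypothetical stand-ins for the real Rdbms layer.
    """
    for attempt in range(1, attempts + 1):
        try:
            return source_schema()
        except DBReadOnlyError:
            # Give up if we are out of attempts, or if the DB no longer
            # reports read-only (then the error was something else).
            if attempt == attempts or not is_read_only():
                raise
            time.sleep(delay * attempt)  # simple linear backoff
```

A transient read-only window then costs a few retries instead of aborting addWiki.php partway through initialisation.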

aaron added a comment. Jul 31 2019, 2:37 AM

How much of this is unique from T205936 ?

How much of this is unique from T205936 ?

I might be wrong, but they seem to be different issues: this one says the ES database is read-only and nothing can be changed there, while the mentioned bug (which is really bad too) can't even read the data from ES. They might share the same root cause: ES going read-only causes writes to get corrupted, due to the lack of ACID guarantees in our code handling ES storage, and then the data can't be read back because it is corrupt (just thinking out loud; I can't say for sure).
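
The "thinking out loud" hypothesis (nothing ties the ES blob write and the text-table pointer write together) can be illustrated with a toy model. The step order and failure mode here are assumptions for illustration, not the actual ExternalStore code path:

```python
def save_revision(text_table, es_blobs, content, es_write_fails=False):
    """Toy two-step revision write with no cross-database transaction.

    Step 1 writes the blob to the (remote) ES cluster; step 2 writes the
    pointer row into the wiki's text table.  If step 1 is lost but step 2
    still commits, the pointer dangles.
    """
    blob_id = max(es_blobs, default=0) + 1
    if not es_write_fails:
        es_blobs[blob_id] = content  # step 1: blob onto the ES cluster
    # step 2 commits regardless: the two writes are not atomic together
    text_table[max(text_table, default=0) + 1] = f"DB://cluster1/{blob_id}"

def dangling_pointers(text_table, es_blobs):
    """text old_ids whose DB:// address has no matching blob row."""
    return [old_id for old_id, addr in sorted(text_table.items())
            if int(addr.rsplit("/", 1)[1]) not in es_blobs]
```

One lost ES write is enough to produce an unreadable tt: address even though the revision and text rows look perfectly normal afterwards.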

Is https://phabricator.wikimedia.org/T212881#5195101 the error that still happens or is it the read-only one too?

Is https://phabricator.wikimedia.org/T212881#5195101 the error that still happens or is it the read-only one too?

I didn't get that error in production when trying to make hywwiki. Maybe it's a different error.

A couple of related and more overarching proposals are T158730: Automate WMF wiki creation and T228745: Allow creating an independent "incubator wiki" instead of hosting all new wikis in one Incubator wiki with prefixes. They are closely related and overlapping, but T158730 is described from the internal, technical side, whereas T228745 is more functional and user-level, so they should be kept distinct and inform each other (I hope it makes sense).

I'm not suggesting to abandon this task here and wait until they are resolved. There are problems to resolve now: actual wiki communities that are trying to work and are limited. I am mentioning them because I recently had a conversation with @greg, @mark, and @thcipriani about this, and they were asking for examples of current issues that make wiki creation hard and unstable.

We also discussed the frequency with which new wikis are created. It's true that just a handful of new wikis are created each year, and this may make people think that these issues are not a high priority. But it's the other way around: more wikis should be created each year: dozens, if not hundreds, and the lack of a stable, automatic wiki creation process is one of the reasons they are created so rarely.

Change 526572 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/WikimediaMaintenance@master] Catch, re-check, and retry in case of DB error in creating ES tables
https://gerrit.wikimedia.org/r/526572

Anyone available to merge this, so I can give it a try later?

A couple of related and more overarching proposals are T158730: Automate WMF wiki creation and T228745: Allow creating an independent "incubator wiki" instead of hosting all new wikis in one Incubator wiki with prefixes. They are closely related and overlapping, but T158730 is described from the internal, technical side, whereas T228745 is more functional and user-level, so they should be kept distinct and inform each other (I hope it makes sense).
I'm not suggesting to abandon this task here and wait until they are resolved. There are problems to resolve now: actual wiki communities that are trying to work and are limited. I am mentioning them because I recently had a conversation with @greg, @mark, and @thcipriani about this, and they were asking for examples of current issues that make wiki creation hard and unstable.
We also discussed the frequency with which new wikis are created. It's true that just a handful of new wikis are created each year, and this may make people think that these issues are not a high priority. But it's the other way around: more wikis should be created each year: dozens, if not hundreds, and the lack of a stable, automatic wiki creation process is one of the reasons they are created so rarely.

While in general I agree with this, I want to mention that even our SWAT deployments are not automatic, not to mention complex things like the train, etc. I would be more than happy to split up and break down addWiki.php and automate parts of it, but I don't think automating the whole process is going to be doable any time soon.

While in general I agree with this, I want to mention that even our SWAT deployments are not automatic, not to mention complex things like the train, etc. I would be more than happy to split up and break down addWiki.php and automate parts of it, but I don't think automating the whole process is going to be doable any time soon.

Sure, I know it won't be very quick. But it has to start somewhere :)

Mentioned in SAL (#wikimedia-operations) [2019-08-14T12:24:49Z] <James_F|Away> We're going to try making a new wiki. T212881

Reedy added a comment. Aug 14 2019, 1:08 PM

@aaron here's the state of play:

Ok, so we just created napwikisource. addWiki.php executed without any visible errors:

reedy@mwmaint1002:/srv/mediawiki$ mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki nap wikisource napwikisource nap.wikisource.org
Creating database napwikisource for nap.wikisource (Napulitano)
Initialising tables
Initialising external storage cluster24...
Initialising external storage cluster25...
Writing main page to Paggena_prencepale
Writing sidebar donate link to MediaWiki:Sitesupport-url
indexing namespaces...
	Indexing namespaces...done
content index...
	Fetching Elasticsearch version...6.5.4...ok
	Scanning available plugins...
		analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
		analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
		extra-analysis-serbian, extra-analysis-slovak, ltr
	Picking analyzer...default
	Inferring index identifier...napwikisource_content_first
	Creating index...ok
		Validating number of shards...ok
		Validating replica range...ok
		Validating shard allocation settings...done
		Validating max shards per node...ok
	Validating analyzers...ok
	Validating mappings...
		Validating mapping...different...corrected
	Validating aliases...
		Validating napwikisource_content alias...alias is free...corrected
		Validating napwikisource alias...alias not already assigned to this index...corrected
		Updating tracking indexes...done
general index...
	Fetching Elasticsearch version...6.5.4...ok
	Scanning available plugins...
		analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
		analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
		extra-analysis-serbian, extra-analysis-slovak, ltr
	Picking analyzer...default
	Inferring index identifier...napwikisource_general_first
	Creating index...ok
		Validating number of shards...ok
		Validating replica range...ok
		Validating shard allocation settings...done
		Validating max shards per node...ok
	Validating analyzers...ok
	Validating mappings...
		Validating mapping...different...corrected
	Validating aliases...
		Validating napwikisource_general alias...alias is free...corrected
		Validating napwikisource alias...alias not already assigned to this index...corrected
		Updating tracking indexes...done
archive index...
	Fetching Elasticsearch version...6.5.4...ok
	Scanning available plugins...
		analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
		analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
		extra-analysis-serbian, extra-analysis-slovak, ltr
	Picking analyzer...default
	Inferring index identifier...napwikisource_archive_first
	Creating index...ok
		Validating number of shards...ok
		Validating replica range...ok
		Validating shard allocation settings...done
		Validating max shards per node...ok
	Validating analyzers...ok
	Validating mappings...
		Validating mapping...different...corrected
	Validating aliases...
		Validating napwikisource_archive alias...alias is free...corrected
		Updating tracking indexes...done
	indexing namespaces...
		Indexing namespaces...done
	content index...
		Fetching Elasticsearch version...6.5.4...ok
		Scanning available plugins...
			analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
			analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
			extra-analysis-serbian, extra-analysis-slovak, ltr
		Picking analyzer...default
		Inferring index identifier...napwikisource_content_first
		Creating index...ok
			Validating number of shards...ok
			Validating replica range...ok
			Validating shard allocation settings...done
			Validating max shards per node...ok
		Validating analyzers...ok
		Validating mappings...
			Validating mapping...different...corrected
		Validating aliases...
			Validating napwikisource_content alias...alias is free...corrected
			Validating napwikisource alias...alias not already assigned to this index...corrected
			Updating tracking indexes...done
	general index...
		Fetching Elasticsearch version...6.5.4...ok
		Scanning available plugins...
			analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
			analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
			extra-analysis-serbian, extra-analysis-slovak, ltr
		Picking analyzer...default
		Inferring index identifier...napwikisource_general_first
		Creating index...ok
			Validating number of shards...ok
			Validating replica range...ok
			Validating shard allocation settings...done
			Validating max shards per node...ok
		Validating analyzers...ok
		Validating mappings...
			Validating mapping...different...corrected
		Validating aliases...
			Validating napwikisource_general alias...alias is free...corrected
			Validating napwikisource alias...alias not already assigned to this index...corrected
			Updating tracking indexes...done
	archive index...
		Fetching Elasticsearch version...6.5.4...ok
		Scanning available plugins...
			analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
			analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
			extra-analysis-serbian, extra-analysis-slovak, ltr
		Picking analyzer...default
		Inferring index identifier...napwikisource_archive_first
		Creating index...ok
			Validating number of shards...ok
			Validating replica range...ok
			Validating shard allocation settings...done
			Validating max shards per node...ok
		Validating analyzers...ok
		Validating mappings...
			Validating mapping...different...corrected
		Validating aliases...
			Validating napwikisource_archive alias...alias is free...corrected
			Updating tracking indexes...done
		indexing namespaces...
			Indexing namespaces...done
		content index...
			Fetching Elasticsearch version...6.5.4...ok
			Scanning available plugins...
				analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
				analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
				extra-analysis-serbian, extra-analysis-slovak, ltr
			Picking analyzer...default
			Inferring index identifier...napwikisource_content_first
			Creating index...ok
				Validating number of shards...ok
				Validating replica range...ok
				Validating shard allocation settings...done
				Validating max shards per node...ok
			Validating analyzers...ok
			Validating mappings...
				Validating mapping...different...corrected
			Validating aliases...
				Validating napwikisource_content alias...alias is free...corrected
				Validating napwikisource alias...alias not already assigned to this index...corrected
				Updating tracking indexes...done
		general index...
			Fetching Elasticsearch version...6.5.4...ok
			Scanning available plugins...
				analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
				analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
				extra-analysis-serbian, extra-analysis-slovak, ltr
			Picking analyzer...default
			Inferring index identifier...napwikisource_general_first
			Creating index...ok
				Validating number of shards...ok
				Validating replica range...ok
				Validating shard allocation settings...done
				Validating max shards per node...ok
			Validating analyzers...ok
			Validating mappings...
				Validating mapping...different...corrected
			Validating aliases...
				Validating napwikisource_general alias...alias is free...corrected
				Validating napwikisource alias...alias not already assigned to this index...corrected
				Updating tracking indexes...done
done.
Making sure mwstore://local-multiwrite/local-public exists...making 'mwstore://local-multiwrite/local-public' public...done.
Making sure mwstore://local-multiwrite/local-thumb exists...making 'mwstore://local-multiwrite/local-thumb' public...done.
Making sure mwstore://local-multiwrite/local-transcoded exists...making 'mwstore://local-multiwrite/local-transcoded' public...done.
Making sure mwstore://local-multiwrite/local-temp exists...making 'mwstore://local-multiwrite/local-temp' private...done.
Making sure mwstore://local-multiwrite/local-deleted exists...making 'mwstore://local-multiwrite/local-deleted' private...done.
Making sure mwstore://local-multiwrite/timeline-render exists...making 'mwstore://local-multiwrite/timeline-render' public...done.
Done. sync the config as in https://wikitech.wikimedia.org/wiki/Add_a_wiki#MediaWiki_configuration

However, visiting the wiki...

[XVQCfgpAAC4AAAdwgk8AAAAB] 2019-08-14 12:45:50: Errore irreversibbele 'e tipo "MediaWiki\Revision\RevisionAccessException"

2019-08-14 12:45:50 [XVQCfgpAAC4AAAdwgk8AAAAB] mwdebug1002 napwikisource 1.34.0-wmf.17 exception ERROR: [XVQCfgpAAC4AAAdwgk8AAAAB] /wiki/Paggena_prencepale   MediaWiki\Revision\RevisionAccessException from line 1448 of /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionStore.php: Failed to load data blob from tt:1: Failed to load blob from address tt:1 {"exception_id":"XVQCfgpAAC4AAAdwgk8AAAAB","exception_url":"/wiki/Paggena_prencepale","caught_by":"mwe_handler"} 
[Exception MediaWiki\Revision\RevisionAccessException] (/srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionStore.php:1448) Failed to load data blob from tt:1: Failed to load blob from address tt:1
  #0 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionStore.php(1653): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
  #1 [internal function]: Closure$MediaWiki\Revision\RevisionStore::constructSlotRecords(MediaWiki\Revision\SlotRecord)
  #2 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/SlotRecord.php(307): call_user_func(Closure$MediaWiki\Revision\RevisionStore::constructSlotRecords;4585, MediaWiki\Revision\SlotRecord)
  #3 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRecord.php(175): MediaWiki\Revision\SlotRecord->getContent()
  #4 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RenderedRevision.php(226): MediaWiki\Revision\RevisionRecord->getContent(string, integer, NULL)
  #5 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRenderer.php(222): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string)
  #6 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRenderer.php(151): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
  #7 [internal function]: Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#3(MediaWiki\Revision\RenderedRevision, array)
  #8 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RenderedRevision.php(197): call_user_func(Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#3;5923, MediaWiki\Revision\RenderedRevision, array)
  #9 /srv/mediawiki/php-1.34.0-wmf.17/includes/poolcounter/PoolWorkArticleView.php(196): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
  #10 /srv/mediawiki/php-1.34.0-wmf.17/includes/poolcounter/PoolCounterWork.php(125): PoolWorkArticleView->doWork()
  #11 /srv/mediawiki/php-1.34.0-wmf.17/includes/page/Article.php(776): PoolCounterWork->execute()
  #12 /srv/mediawiki/php-1.34.0-wmf.17/includes/actions/ViewAction.php(63): Article->view()
  #13 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(507): ViewAction->show()
  #14 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(302): MediaWiki->performAction(Article, Title)
  #15 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(892): MediaWiki->performRequest()
  #16 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(523): MediaWiki->main()
  #17 /srv/mediawiki/php-1.34.0-wmf.17/index.php(42): MediaWiki->run()
  #18 /srv/mediawiki/w/index.php(3): include(string)
  #19 {main}
Caused by: [Exception MediaWiki\Storage\BlobAccessException] (/srv/mediawiki/php-1.34.0-wmf.17/includes/Storage/SqlBlobStore.php:292) Failed to load blob from address tt:1
  #0 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionStore.php(1444): MediaWiki\Storage\SqlBlobStore->getBlob(string, integer)
  #1 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionStore.php(1653): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
  #2 [internal function]: Closure$MediaWiki\Revision\RevisionStore::constructSlotRecords(MediaWiki\Revision\SlotRecord)
  #3 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/SlotRecord.php(307): call_user_func(Closure$MediaWiki\Revision\RevisionStore::constructSlotRecords;4585, MediaWiki\Revision\SlotRecord)
  #4 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRecord.php(175): MediaWiki\Revision\SlotRecord->getContent()
  #5 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RenderedRevision.php(226): MediaWiki\Revision\RevisionRecord->getContent(string, integer, NULL)
  #6 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRenderer.php(222): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string)
  #7 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RevisionRenderer.php(151): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
  #8 [internal function]: Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#3(MediaWiki\Revision\RenderedRevision, array)
  #9 /srv/mediawiki/php-1.34.0-wmf.17/includes/Revision/RenderedRevision.php(197): call_user_func(Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#3;5923, MediaWiki\Revision\RenderedRevision, array)
  #10 /srv/mediawiki/php-1.34.0-wmf.17/includes/poolcounter/PoolWorkArticleView.php(196): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
  #11 /srv/mediawiki/php-1.34.0-wmf.17/includes/poolcounter/PoolCounterWork.php(125): PoolWorkArticleView->doWork()
  #12 /srv/mediawiki/php-1.34.0-wmf.17/includes/page/Article.php(776): PoolCounterWork->execute()
  #13 /srv/mediawiki/php-1.34.0-wmf.17/includes/actions/ViewAction.php(63): Article->view()
  #14 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(507): ViewAction->show()
  #15 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(302): MediaWiki->performAction(Article, Title)
  #16 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(892): MediaWiki->performRequest()
  #17 /srv/mediawiki/php-1.34.0-wmf.17/includes/MediaWiki.php(523): MediaWiki->main()
  #18 /srv/mediawiki/php-1.34.0-wmf.17/index.php(42): MediaWiki->run()
  #19 /srv/mediawiki/w/index.php(3): include(string)
  #20 {main}

This looks like (though it's not fully confirmed) T205936: Unable to view some pages due to fatal RevisionAccessException: "Failed to load data blob from tt".

If we look at the text table on napwikisource:

wikiadmin@10.64.48.35(napwikisource)> select * from text;
+--------+--------------------+---------------------+
| old_id | old_text           | old_flags           |
+--------+--------------------+---------------------+
|      1 | DB://cluster25/256 | utf-8,gzip,external |
|      2 | DB://cluster25/257 | utf-8,gzip,external |
+--------+--------------------+---------------------+
2 rows in set (0.00 sec)

Why is it using ids of 256 and 257 for the first revision?

If we look at the aawiki text table, this looks fishy...

wikiadmin@10.64.48.35(aawiki)> select * from text ORDER BY old_id DESC LIMIT 1;
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| old_id | old_namespace | old_title | old_text           | old_comment | old_user | old_user_text | old_timestamp | old_minor_edit | old_flags           | inverse_timestamp |
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
|  10817 |               |           | DB://cluster25/255 |             |          |               |               |                | utf-8,gzip,external |                   |
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
1 row in set (0.00 sec)

If we create a new page (for example "Sandbox" on napwikisource)... it works fine

wikiadmin@10.64.48.35(napwikisource)> select * from page;
+---------+----------------+--------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title         | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+--------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|       1 |              0 | Paggena_prencepale |                   |                0 |           1 | 0.722331198926 | 20190814124100 | 20190814124100     |           1 |     1345 | wikitext           | NULL      |
|       2 |              8 | Sitesupport-url    |                   |                0 |           1 |  0.36662219631 | 20190814124100 | 20190814124100     |           2 |      110 | wikitext           | NULL      |
|       3 |              0 | Sandbox            |                   |                0 |           1 |  0.31009594043 | 20190814125456 | 20190814125456     |           3 |       10 | wikitext           | NULL      |
+---------+----------------+--------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
3 rows in set (0.00 sec)

wikiadmin@10.64.48.35(napwikisource)> select * from revision;
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1                        | rev_content_model | rev_content_format |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
|      1 |        1 |           1 |             |        0 |               | 20190814124100 |              0 |           0 |    1345 |             0 | kzxgim5xw3lahegt3rq4f4wg0q428wo | NULL              | NULL               |
|      2 |        2 |           2 |             |        0 |               | 20190814124100 |              0 |           0 |     110 |             0 | e83nrakkhufq82zfs2bacwjs3m45wa3 | NULL              | NULL               |
|      3 |        3 |           3 |             |        0 |               | 20190814125456 |              0 |           0 |      10 |             0 | jrhglkly97hqxkhim2h9zmyde0vhtpo | NULL              | NULL               |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
3 rows in set (0.00 sec)

wikiadmin@10.64.48.35(napwikisource)> select * from text;
+--------+--------------------+---------------------+
| old_id | old_text           | old_flags           |
+--------+--------------------+---------------------+
|      1 | DB://cluster25/256 | utf-8,gzip,external |
|      2 | DB://cluster25/257 | utf-8,gzip,external |
|      3 | DB://cluster25/1   | utf-8,gzip,external |
+--------+--------------------+---------------------+
3 rows in set (0.00 sec)

wikiadmin@10.64.48.35(napwikisource)>

The ids look weird: the first two ids are wrong... possibly something to do with aawiki? Why does it choose 256 and 257, continuing aawiki's sequence, for napwikisource's first text rows?

wikiadmin@10.64.48.35(aawiki)> select * from text ORDER BY old_id DESC LIMIT 10;
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| old_id | old_namespace | old_title | old_text           | old_comment | old_user | old_user_text | old_timestamp | old_minor_edit | old_flags           | inverse_timestamp |
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
|  10817 |               |           | DB://cluster25/255 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10816 |               |           | DB://cluster24/268 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10815 |               |           | DB://cluster24/267 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10814 |               |           | DB://cluster25/254 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10813 |               |           | DB://cluster25/253 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10812 |               |           | DB://cluster25/252 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10811 |               |           | DB://cluster25/251 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10810 |               |           | DB://cluster24/266 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10809 |               |           | DB://cluster24/265 |             |          |               |               |                | utf-8,gzip,external |                   |
|  10808 |               |           | DB://cluster24/264 |             |          |               |               |                | utf-8,gzip,external |                   |
+--------+---------------+-----------+--------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
10 rows in set (0.00 sec)

Need to look at ES...

This comment was removed by Reedy.

This ticket was identified as high priority at Wikimania - @kchapman is this something your team could take on?

kchapman removed kchapman as the assignee of this task.Wed, Sep 4, 8:22 AM
kchapman raised the priority of this task from Normal to High.

@Tnegrin updating the priority to make it higher and it will be looked at through the Core Platform Team clinic duty.

Anomie added a subscriber: Anomie.Thu, Sep 5, 1:18 PM

@Tnegrin updating the priority to make it higher and it will be looked at through the Core Platform Team clinic duty.

...

  • This is not an issue in MediaWiki core; it's an issue in wiring our MediaWiki code to our infrastructure. The problem is that it only happens in production, while you are performing the operation of creating a new wiki; there's no other way to reproduce it (except in the beta cluster, but that is also a lot of hassle). So it's really hard to find out what's wrong: you can't "try" it, change something, and try again to see if it works. This makes fixing the issue quite complicated and time-consuming.

This is the problem I'm running into when first looking into this. How do I even test it to try to see what's going on?

However,

If we look at text on napwikisource

wikiadmin@10.64.48.35(napwikisource)> select * from text;
+--------+--------------------+---------------------+
| old_id | old_text           | old_flags           |
+--------+--------------------+---------------------+
|      1 | DB://cluster25/256 | utf-8,gzip,external |
|      2 | DB://cluster25/257 | utf-8,gzip,external |
+--------+--------------------+---------------------+
2 rows in set (0.00 sec)

Why is it using ids of 256 and 257 for the first revisions?
If we look at aawiki text table, this looks fishy...

[...]

Need to look at ES...

I did that. It turns out the two rows were inserted into ES for aawiki, rather than ES for napwikisource. That explains the weird IDs, and why napwikisource couldn't find them later.

anomie@mwmaint1002:~$ sql --wiki=aawiki --cluster cluster25 -- -s
wikiadmin@10.64.16.187(aawiki)> select blob_id, to_base64(blob_text) from blobs_cluster25 where blob_id in (256,257);
blob_id	to_base64(blob_text)
256	dVTBbtswDL3nK4hesg2ZezfqHlZgWA9dB3RAD4YPtM0kWmzRk+QaAfzxI+XYSYPuZlKP5OPTk7Ps\n99548H1Zc4vGggSOPLk3qmHLDsKeoHKEwbAF3gJCng/mYFqqDabPvYPO8R+qgh9fJe25dxUVBUir\n9Xot2JRs+hO7vjEBLY/nz6IQADRodz3uKMtWqy/wqyH0pJU1g+UAPqALILOCsTvFB6VraQBvAiUw\nsZdP2KMXcoF8mBmBMFb6eW5s1ZcY2KWRZOQOj3NS2H6STRd0SQF1xTdy0vmYjt8kAa/njBRc4rmp\nh2X1dHxuariU4jOgrcEEGEzTQElg2o5dEH0Dw54cJbr54xaO3MPAvZQ35kDTadNBcGh9gypAHGhs\nILfFKiKiHLOGG+gm/U7izZUUx2yg7APsWMvyfDlT6un4LkwshaLYiJy+o8pgA5qd7PA/MnoVKDvi\nEUhkOrIV36CFXuiYeBUxPXXq/Vyf5+2Qjh5bAs/bMKATxaIg32WasTKznbyHJQv9PQ/KXw0RZVVO\nLI0c7MiSE6qq2QY8xd7pD4nSBxaWVhyqkVJZTLD28CSX+1UT061GQh9XPWmFIhN2u8gyy+DFeFFg\neQQgPr7znSxeNeh9dtM18qoaYw/+5n6V394Ow5CoBp0200aRTIwKGC8QujS64ww5he8xpmQ++KVL\njK4Q8lTOAA2uzv/2LO9oBsToCjH5eIGcbP0ec3oYC2h+KFcoPopLz6AYnjDRaeST5d9yHjidXPWq\nMZwhGpzOK25bttd9HqZssbq71cu5X61exCCXNlCzjJdeWH5+zK42dvkBSk4sPd938g8=
257	JYxBDoAgDAR/4xHuJsSnmAoNNkohtI3fN8B1dnZu1Sa796kyKLqPHiqYCFzt2R+m5ZRqPWJYwjbI\nEKwEoYQX9IkilAaUOTC0GVmvUdlM8AXOY/oB

Un-base64ing and gzinflating those results produces the expected contents of the two napwikisource revisions.
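For anyone repeating that check, here is a minimal PHP sketch of the decoding. The `$blobBase64` value is a hypothetical placeholder for the `to_base64(blob_text)` output above; the `utf-8,gzip` flags mean the blob was compressed with `gzdeflate()` before insertion, which `gzinflate()` reverses:

```php
<?php
// Hypothetical placeholder: paste the real to_base64(blob_text) output
// from the ES query above in place of '...'.
$blobBase64 = '...';

// Undo the base64 applied by the SQL query, then undo the raw-deflate
// compression implied by the "gzip" entry in old_flags.
$compressed = base64_decode( $blobBase64 );
$wikitext = gzinflate( $compressed );

echo $wikitext . "\n";
```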

I suppose that does give me a starting point for some static analysis and simulation. I doubt I'll be able to fully test a fix, though.

Anomie added a comment.Thu, Sep 5, 2:07 PM
This comment was removed by Anomie.

Sending this back to the Performance Team Inbox so @aaron can resolve it, as the issue appears to be in the load balancer code.

Based on T212881#5413761, there seem to be bad DB records inserted on the ES hosts; once the root problem is fixed, let me know so we can do any necessary cleanup (you don't need a sysadmin for that, but I am offering my help to do it safely, as with any database delete).

Anomie added a comment.Thu, Sep 5, 2:48 PM

Poking around using shell.php and tracing through the addWiki.php code.

anomie@mwmaint1002:~$ mwscript shell.php --wiki=aawiki
Psy Shell v0.9.9 (PHP 7.2.8-1+0~20180725124257.2+stretch~1.gbp571e56 — cli) by Justin Hileman
>>> $wgDBname;
=> "aawiki"
>>> $lbFactory = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
=> Wikimedia\Rdbms\LBFactoryMulti {#209}
>>> $localLb = $lbFactory->getMainLB();
=> Wikimedia\Rdbms\LoadBalancer {#268}
>>> $localLb->getLocalDomainID();
=> "aawiki"
>>> $lbFactory->redefineLocalDomain( 'napwikisource' );
=> null
>>> $localLb->getLocalDomainID();
=> "napwikisource"

Checks out so far. Let's dig into what happens when it tries to create the page. WikiPage calls PageUpdater, which eventually calls RevisionStore, which eventually calls SqlBlobStore. The SqlBlobStore comes from MediaWikiServices, so

>>> $blobStore = Wikimedia\TestingAccessWrapper::newFromObject( MediaWiki\MediaWikiServices::getInstance()->getBlobStoreFactory()->newSqlBlobStore( false ) );
=> Wikimedia\TestingAccessWrapper {#2699
     +object: MediaWiki\Storage\SqlBlobStore {#2698},
   }
>>> $extStoreAccess = Wikimedia\TestingAccessWrapper::newFromObject( $blobStore->extStoreAccess );
=> Wikimedia\TestingAccessWrapper {#2727
     +object: ExternalStoreAccess {#2716},
   }
>>> $blobStore->dbDomain
=> false
>>> $extStoreAccess->storeFactory->getWriteBaseUrls()
=> [
     "DB://cluster24",
     "DB://cluster25",
   ]
>>> $store = Wikimedia\TestingAccessWrapper::newFromObject( $extStoreAccess->storeFactory->getStoreForUrl( 'DB://cluster25', [ 'domain' => false ] ) );
=> Wikimedia\TestingAccessWrapper {#2692
     +object: ExternalStoreDB {#2739},
   }
>>> $extStoreAccess->storeFactory->getStoreLocationFromUrl( 'DB://cluster25' )
=> "cluster25"
>>> $store->getMaster( 'cluster25' )->getDomainID();
=> "aawiki"

There's the problem: something used by ExternalStoreDB isn't affected by the redefineLocalDomain() call. Now where does $store->getMaster() wind up getting its info from?

>>> $lb = $store->getLoadBalancer( 'cluster25' )
=> Wikimedia\Rdbms\LoadBalancer {#483}
>>> $lb->getLocalDomainID()
=> "napwikisource"
>>> $store->getDomainId( $lb->getServerInfo( $lb->getWriterIndex() ) )
=> "aawiki"
>>> $lb->getServerInfo( $lb->getWriterIndex() )
=> [
     "blobs table" => "blobs_cluster25",
     "flags" => 0,
     "dbname" => "aawiki",
     "user" => "wikiadmin",
     "password" => [REDACTED],
     "type" => "mysql",
     "max lag" => 6,
     "variables" => [
       "innodb_lock_wait_timeout" => 15,
     ],
     "master" => true,
     "host" => "10.64.32.65",
     "hostName" => "10.64.32.65",
     "load" => 0,
   ]

Ah ha. The ->redefineLocalDomain() call doesn't change the actual server configuration in the LoadBalancer and LBFactory; it overrides the "domain" passed to the Database instance. But ExternalStoreDB checks the original configuration to fetch the domain when it has false as its configured domain. A comment in ExternalStoreDB references T200471: LBFactorySimple breaks ExternalStorage, trying to connect to external server with local database name as the reason for that.
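To make the failure mode concrete, here is a simplified sketch of the domain resolution traced above. This is not the actual MediaWiki source, just the observed behaviour for a store configured with 'domain' => false:

```php
<?php
// Simplified sketch (not the real ExternalStoreDB code) of the behaviour
// observed above for an ExternalStoreDB with 'domain' => false.
function getDomainIdSketch( array $serverInfo, $configuredDomain ) {
    if ( $configuredDomain !== false ) {
        // An explicitly configured domain wins.
        return $configuredDomain;
    }
    // With no explicit domain, the store falls back to the static server
    // configuration ('dbname' => 'aawiki' in the trace above), which
    // redefineLocalDomain() never touches -- so writes land in aawiki's
    // blobs_cluster25 table instead of napwikisource's.
    return $serverInfo['dbname'];
}
```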

Since LoadBalancer stuff is mostly handled by @aaron, and in particular he made the change that caused T200471 as well as the change that fixed it, I'd ask him to look at this. A minimal reproduction seems to be

if ( $wgDBname !== 'aawiki' ) {
    echo "Run with --wiki=aawiki\n";
    exit( 1 );
}

$services = MediaWiki\MediaWikiServices::getInstance();
$services->getDBLoadBalancerFactory()->redefineLocalDomain( 'napwikisource' );
echo $services->getExternalStoreFactory()->getStoreForUrl( 'DB://cluster25', [ 'domain' => false ] )->getMaster( 'cluster25' )->getDomainID() . "\n";
// Should be napwikisource, not aawiki
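For what it's worth, one possible direction (an untested sketch, not a reviewed patch, and it may re-raise the T200471 concerns noted above, so it would need @aaron's review) would be to prefer the load balancer's current local domain when no explicit domain is configured:

```php
<?php
// Untested sketch of one possible fix direction, not a reviewed patch:
// when no domain is configured, ask the (possibly redefined) load balancer
// for its current local domain instead of reading the static server config.
function resolveDomainSketch( Wikimedia\Rdbms\ILoadBalancer $lb, $configuredDomain ) {
    if ( $configuredDomain !== false ) {
        return $configuredDomain;
    }
    // getLocalDomainID() reflects LBFactory::redefineLocalDomain(), as the
    // shell.php trace above showed ("napwikisource"), so writes would follow
    // the redefined wiki rather than the original --wiki argument.
    return $lb->getLocalDomainID();
}
```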
Gilles assigned this task to aaron.Mon, Sep 9, 8:05 PM
Gilles moved this task from Inbox to Doing on the Performance-Team board.