
Enable constraint result caching on Wikidata
Closed, Resolved (Public)

Description

Revert this change, so that caching of constraint check results is enabled on Wikidata.

Related Objects

Status | Subtype | Assigned | Task
Resolved | | Lucas_Werkmeister_WMDE |
Open | | None |
Resolved | | Lydia_Pintscher |
Resolved | | Lydia_Pintscher |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | None |
Resolved | | Lucas_Werkmeister_WMDE |
Invalid | | None |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Declined | | None |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Open | | None |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | jcrespo |

Event Timeline


I think we can proceed with this – all the blockers should be addressed, though some only temporarily.

All these fixes are part of wmf.22. However, unfortunately the train is currently blocked – see T183961: 1.31.0-wmf.22 deployment blockers. I’ll submit Ia418cf08e2 for SWAT as soon as the train gets moving again.

Change 413724 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/413724

Mentioned in SAL (#wikimedia-operations) [2018-02-26T14:28:10Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:413724|Enable caching of constraint check results (T184812)]] (duration: 00m 55s)

Lucas_Werkmeister_WMDE claimed this task.

Done, and seems to be working on Wikidata.

jcrespo raised the priority of this task from Medium to Unbreak Now!.

Ouch. Please feel free to revert the config change https://gerrit.wikimedia.org/r/413724 – even if it turns out to be unrelated, it won’t hurt our users too much.

@Ladsgroup will be back in an hour, but he recommends that you first revert my (Lucas’) change (above), and if that doesn’t help, it might be his change https://gerrit.wikimedia.org/r/414667 (or https://gerrit.wikimedia.org/r/414654 – I’m not 100% sure which one he meant, but almost certainly not https://gerrit.wikimedia.org/r/414650). HTH

To clarify – SQL queries? Because I can’t think of anything SQL-related in that change… it shouldn’t result in any new SQL queries, and should indirectly have reduced the number of SQL queries (because some Wikidata items would no longer need to be loaded). But perhaps there’s something silly in the code…

This deserves an incident report.

Because I can’t think of anything SQL-related in that change… it shouldn’t result in any new SQL queries

Okay, that was wrong. There are new SQL queries: for each wbcheckconstraints request, one or two queries to get the latest revision IDs for a set of entity IDs (one on a cache hit, to see whether the cached result is still valid; one on a cache miss, to add the entity IDs to the cache; and both can occur in one request if the cached result turns out to be invalid), via WikiPageEntityMetaDataLookup. And I think that set isn’t explicitly limited to any particular size anywhere, which is probably bad.
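To make that flow concrete, here’s a minimal sketch – all variable and method names below are made up for illustration, only the shape of the two metadata lookups matches the description above:

    // All names here are hypothetical; only the shape of the two
    // metadata lookups matches the real flow.
    $cached = $cache->get( $cacheKey );
    if ( $cached !== false &&
        $metaDataLookup->getLatestRevisionIds( array_keys( $cached['latestRevisionIds'] ) )
            === $cached['latestRevisionIds']
    ) {
        // Cache hit and still valid: one revision ID query, no fresh check.
        $results = $cached['results'];
    } else {
        // Cache miss (or stale hit): run the checks, then one more query to
        // record which revisions the new result depends on.
        $results = $this->runConstraintChecks( $entityId );
        $cache->set( $cacheKey, [
            'results' => $results,
            'latestRevisionIds' => $metaDataLookup->getLatestRevisionIds( $dependedEntityIds ),
        ] );
    }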

Do you have the query text of the killed queries? Were they queries like this code could produce – joining page, revision, and text?

Yup (hadn’t seen @jcrespo’s comment when I wrote this).

Something like:

SELECT /* Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::selectRevisionInformationMultiple */ rev_id, rev_content_format, rev_timestamp, page_latest, page_is_redirect, old_id, old_text, old_flags, page_title FROM `page` INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id)) /* db1109 wikidatawiki 54s */

Okay, then it’s almost certainly caused by this change, yes :( and I’m going to guess that the most likely reason for this query to become so slow would be that we’re asking for way too many revision IDs. Can you perhaps confirm this? The WHERE would have something like one Q1234=page_title AND 0=page_namespace per entity ID.
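For reference, a healthy version of that query would have had its conditions built up per entity ID, roughly like this – illustrative only, the actual quoting and joining is done by Database::makeList:

    // Illustrative only: roughly the per-entity conditions that should have
    // been ORed together by Database::makeList() for a healthy query.
    $where = [
        "(page_title = 'Q1234' AND page_namespace = 0)",
        "(page_title = 'Q5678' AND page_namespace = 0)",
    ];
    // resulting clause:
    //   WHERE (page_title = 'Q1234' AND page_namespace = 0)
    //      OR (page_title = 'Q5678' AND page_namespace = 0)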

I hope that this is just some freaky outlier, and that the cached constraint check results for most items don’t require too many revision IDs. In that case, we could disable the caching for all results that depend on too many entity IDs.
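Purely as a sketch – no such guard exists yet, and the limit and names are placeholders – that could look like:

    // Hypothetical safeguard: skip caching when a result depends on too many
    // entities, so the revalidation query stays bounded. Limit is made up.
    $maxDependencies = 50;
    if ( count( $latestRevisionIds ) > $maxDependencies ) {
        return $results; // serve uncached, don’t store this result
    }
    $cache->set( $cacheKey, $value );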

The WHERE would have something like one Q1234=page_title AND 0=page_namespace per entity ID.

An alternative way to confirm this would be to check the latestRevisionIds in a cached value… perhaps someone can look at the cached value for WikibaseQualityConstraints::checkConstraints::v2::Q42::en, specifically the length of its latestRevisionIds member?
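Something along these lines via eval.php should do it, assuming the entry lives in the main WAN cache and the key needs no extra prefixing (I haven’t verified either):

    // Rough idea, via eval.php. Assumes the entry is in the main WAN cache
    // and is an array with a 'latestRevisionIds' member (both unverified).
    $cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
    $value = $cache->get( 'WikibaseQualityConstraints::checkConstraints::v2::Q42::en' );
    var_dump( count( $value['latestRevisionIds'] ) );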

Okay, so it’s actually the opposite. We’re somehow asking for the latest revision IDs of an empty array of entity IDs. And WikiPageEntityMetaDataLookup has a special safeguard for this:

		if ( empty( $where ) ) {
			// If we skipped all entity IDs, select nothing, not everything.
			return '0';
		}

And here’s what Database::selectSQLText does with $conds => '0'.

		if ( !empty( $conds ) ) {
			if ( is_array( $conds ) ) {
				$conds = $this->makeList( $conds, self::LIST_AND );
			}
			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";
		} else {
			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex $preLimitTail"; // YOU ARE HERE
		}

So we were actually asking for all of the revision IDs. All of them. Some safeguard…
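For the record, here’s the underlying pitfall in isolation: PHP’s empty() treats the string '0' as empty, so the !empty( $conds ) guard above silently takes the no-WHERE branch:

    // '0' is a falsy string in PHP, so empty( '0' ) is true and the
    // !empty( $conds ) branch above is skipped entirely.
    var_dump( empty( '0' ) ); // bool(true)
    var_dump( empty( '1' ) ); // bool(false)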

I thought you had posted/screenshotted reduced versions of the query, and omitted the WHERE part because it could contain sensitive information or whatever…

No – the query just didn’t have any WHERE; that’s why it was so expensive.

Lucas_Werkmeister_WMDE lowered the priority of this task from Unbreak Now! to High. Feb 26 2018, 6:52 PM

			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";

That looks like a recipe for an SQL injection – how did this pass security review?

That looks like a recipe for an SQL injection – how did this pass security review?

Sorry, I should’ve been clearer – this is in Database::selectSQLText. That’s the one place where code like this would be expected, right? (Assuming that all the variables have been sanitized properly before.)

Ok, sorry – for a second I thought you were executing that directly; that code is guaranteed to be sanitized.

			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";

That looks like a recipe for an SQL injection – how did this pass security review?

Why is this public?

Ladsgroup set Security to Software security bug.Feb 26 2018, 7:10 PM
Ladsgroup added a project: acl*security.
Ladsgroup changed the visibility from "Public (No Login Required)" to "Custom Policy".
Ladsgroup changed the visibility from "Custom Policy" to "Public (No Login Required)".

I’ve started writing up an incident report, but I can’t currently save it since wikitech is read-only due to a database migration. If it’s not back to read-write by the time I leave the office, I’ll dump it in an Etherpad or something.

Edit: https://etherpad.wikimedia.org/p/Incident_documentation_20180226_WBQC

PHP considering "0" to be false strikes again!

Wikitech is writable again, here’s the report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180226-WikibaseQualityConstraints

I’ll ask @Ladsgroup to send out the email to ops@ tomorrow (AFAIK I don’t have the required permissions for that, and anyways it’s probably good to have another set of eyes on that document before it goes out).

Well, it seems wikitech is completely down now.

Several of the actionables have been implemented: the main one was merged just before the wmf.23 branch cut, and I’ve now requested backports of the others for today’s European Mid-Day SWAT. If it’s okay, I’d like to try enabling caching again in today’s Morning SWAT…

Can you wait on T188505? I don't think there is a reason why we should rush a deployment after it failed without a proper investigation. CC @greg

Okay… is there anything I can do to help move that task forward?

The reason I want to get caching into place soon is that we’ll roll out constraint checks to more users in the next weeks (starting on March 1st to all usernames beginning with “Z”, see T184069#4000941 for details), so we will start to see more constraint check requests. I think we can bear the first group (just “Z”) without caching, but after that we would probably have to pause the rollout if we can’t get caching working by then. (See also T188386: Delay rollout of constraint violation gadget to all users?.)

So the reason to rush is not that other things are broken, but that you want to deploy more things?

is there anything I can do to help move that task forward?

The ticket is open for anyone to help on; the more help, the faster it can be closed.

@Lydia_Pintscher I think this is being pushed unnecessarily fast after an actual outage happened, without proper investigation of its causes – that looks like disregard for the site's reliability, and I do not like it. There is a specific, very small actionable that I pointed to as a blocker for continuing the deployment: T188505

Okay… is there anything I can do to help move that task forward?

The reason I want to get caching into place soon is that we’ll roll out constraint checks to more users in the next weeks (starting on March 1st to all usernames beginning with “Z”, see T184069#4000941 for details), so we will start to see more constraint check requests. I think we can bear the first group (just “Z”) without caching, but after that we would probably have to pause the rollout if we can’t get caching working by then. (See also T188386: Delay rollout of constraint violation gadget to all users?.)

I don't think we should be rushing things here; we already had a close call on Monday.
We were pretty, pretty close to saturating the network interfaces of the databases.
Our Logstash logged thousands of errors during the time the change was in production, and again, we were pretty close to completely saturating the servers, which could have ended up in a serious outage (throwing errors for 100% of requests) and probably also overloading the master.

If we are going to keep rolling out new things this fast, I would consider investigating and fixing T188505 a must. Why? Basically to be on the safe side in case another bad deployment happens. It is better to kill queries than to get all the servers overloaded and suffer a major outage for Wikidata.

Hey,
We totally understand that T188505 is a blocker for further work on the caching side. It was more a question of how we should proceed with the deployment of something else that might be affected by caching not being enabled. If we don't enable caching, other problems might or might not arise; we are not sure. It's a rather complex situation. This wasn't a discussion about whether T188505 should be a blocker for enabling caching of constraint checks or not. If you think there was any sort of push, I can assure you it was a misunderstanding and not our intention, and I apologize if that was the impression.

Hey,
We totally understand that T188505 is a blocker for further work on the caching side. It was more a question of how we should proceed with the deployment of something else that might be affected by caching not being enabled. If we don't enable caching, other problems might or might not arise; we are not sure. It's a rather complex situation. This wasn't a discussion about whether T188505 should be a blocker for enabling caching of constraint checks or not.

Given how big the problem was, I am a bit concerned about deploying more things, to be honest. Especially if it is such a complex situation that we don't really know which things could pop up.
The idea of investigating and fixing T188505 is more to have a safety net in case unexpected things come up: at least we will be killing bad queries and preventing them from bringing down the whole Wikidata infrastructure.

If there are such big doubts about what could or could not happen if we do or don't enable caching, then from my point of view we should hold any new deployment related to this until we have fixed the aforementioned query-killing task.

If you think there was any sort of push, I can assure you it was a misunderstanding and not our intention, and I apologize if that was the impression.

Thanks for the clarification – we were basically trying to protect the Wikidata servers as a whole :-)

@Lydia_Pintscher I think this is being pushed unnecessarily fast after an actual outage happened, without proper investigation of its causes – that looks like disregard for the site's reliability, and I do not like it. There is a specific, very small actionable that I pointed to as a blocker for continuing the deployment: T188505

Yeah indeed. Sorry. Bad timing. Looking into delaying now.

Apologies for being pushy earlier – I definitely want the Wikidata servers to stay alive as well :)

The bug that caused the incident should be thoroughly fixed now, with several commits already in wmf.23 (either before the cut or backported) and some more in wmf.24, and you managed to fix the query killer as well (thanks a lot!). Would it be okay if we tried enabling caching again after the next train – perhaps on Monday? I promise I’ll keep a very close eye on the Grafana boards to see if anything explodes ;)

Yes, I said I wanted to block it on the ticket I mentioned before, which is now resolved.

Change 416748 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/416748

Alright, I’ve added it to next Monday’s EU SWAT.

Change 416748 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/416748

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:17:22Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 09s)

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:31:34Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 08s)

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:59:57Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 00m 57s)

Deployed, and so far the servers seem to be happy, but I’ll check again in a couple of hours before closing this.

Everything still okay as far as I can tell. I think we can close this.
