Enable constraint result caching on Wikidata
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Lucas_Werkmeister_WMDE
	Jan 12 2018, 5:55 PM

Description

Revert this change, so that caching of constraint check results is enabled on Wikidata.

Details

	Subject	Repo	Branch	Lines +/-
	Enable caching of constraint check results	operations/mediawiki-config	master	+1 -0
	Enable caching of constraint check results	operations/mediawiki-config	master	+1 -0

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Lucas_Werkmeister_WMDE	T173695 Enable constraint checks by default for users
Open	None	T103228 Improve performance of constraint check
Resolved	Lydia_Pintscher	T179839 Cache constraint check results
Resolved	Lydia_Pintscher	T179849 Cache all constraint check results per-entity
Resolved	Lucas_Werkmeister_WMDE	T181060 Cache constraint check results per-entity in ObjectCache (L) (days: 2)
Resolved	Lucas_Werkmeister_WMDE	T184812 Enable constraint result caching on Wikidata
Resolved	None	T183927 Add parameter to only get violations from wbcheckconstraints (days: 1)
Resolved	Lucas_Werkmeister_WMDE	T184937 Change wbcheckconstraints’ status parameter’s default value to cacheable value
Invalid	None	T185104 status parameter removes too many results
Resolved	Lucas_Werkmeister_WMDE	T185688 Constraint check results are cached independent of language
Resolved	Lucas_Werkmeister_WMDE	T185689 Add language to cache key for check results
Resolved	Lucas_Werkmeister_WMDE	T185709 Cache CheckResult serializations per-entity in ObjectCache
Resolved	Lucas_Werkmeister_WMDE	T185710 Introduce class for constraint violation messages
Resolved	Lucas_Werkmeister_WMDE	T185999 Make ViolationMessage serializable
Resolved	Lucas_Werkmeister_WMDE	T187186 Use ViolationMessage in ConstraintParameterException
Resolved	Lucas_Werkmeister_WMDE	T187199 Remove support for plain string messages from ViolationMessageRenderer
Resolved	Lucas_Werkmeister_WMDE	T187201 Remove message argument related parts of ConstraintParameterRenderer
Resolved	Lucas_Werkmeister_WMDE	T187202 Remove injected ConstraintParameterRenderer from checkers and helpers
Resolved	Lucas_Werkmeister_WMDE	T185711 Make Constraint serializable
Resolved	Lucas_Werkmeister_WMDE	T185712 Make Context serializable
Declined	None	T185713 Make CheckResult status serializable
Resolved	Lucas_Werkmeister_WMDE	T185714 Make CheckResult serializable
Resolved	Lucas_Werkmeister_WMDE	T188629 Completely remove support for detail and detailHTML in wbcheckconstraints response
Resolved	Lucas_Werkmeister_WMDE	T189875 Migrate “exception” check results to ViolationMessage
Resolved	Lucas_Werkmeister_WMDE	T189593 Replace CachingResultsBuilder with CachingResultsSource, (de)serializing CheckResults
Open	None	T190672 Cache check results from Special:ConstraintReport checks
Resolved	Lucas_Werkmeister_WMDE	T190684 Serialize NullResults in CachingResultsSource
Resolved	Lucas_Werkmeister_WMDE	T187057 Constraint check results involving the current time are cached indiscriminately [M]
Resolved	Lucas_Werkmeister_WMDE	T187058 Introduce separate messages for range violations involving current time
Resolved	Lucas_Werkmeister_WMDE	T186008 Remove range in numerical notation from violation messages
Resolved	Lucas_Werkmeister_WMDE	T187061 Don’t cache constraint check results involving current time whose validity will change soon
Resolved	Lucas_Werkmeister_WMDE	T188311 Don’t pass empty entity ID lists to WikiPageMetaDataAccessor
Resolved	Lucas_Werkmeister_WMDE	T188312 Don’t pass overly long entity ID lists to WikiPageEntityDataAccessor
Resolved	Lucas_Werkmeister_WMDE	T188418 Deduplicate entity ID lists in DependencyMetadata
Resolved	jcrespo	T188505 Investigate why query killer didn't kill 1-hour long queries

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

gerritbot added a project: Patch-For-Review.Feb 23 2018, 1:41 PM

I think we can proceed with this – all the blockers should be addressed, though sometimes temporarily:

T185688: Constraint check results are cached independent of language is fixed for now by T185689: Add language to cache key for check results, even though the real fix T185709: Cache CheckResult serializations per-entity in ObjectCache is still in progress.
T187057: Constraint check results involving the current time are cached indiscriminately [M] is fixed per its two subtasks (properly).

All these fixes are part of wmf.22. However, unfortunately the train is currently blocked – see T183961: 1.31.0-wmf.22 deployment blockers. I’ll submit Ia418cf08e2 for SWAT as soon as the train gets moving again.

Change 413724 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/413724

Mentioned in SAL (#wikimedia-operations) [2018-02-26T14:28:10Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:413724|Enable caching of constraint check results (T184812)]] (duration: 00m 55s)

Done, and seems to be working on Wikidata.

WMDE-leszek moved this task from Backlog to Done on the Wikidata-Sprint-2018-02-14 board.Feb 26 2018, 3:35 PM

jcrespo reopened this task as Open.Feb 26 2018, 5:24 PM

jcrespo raised the priority of this task from Medium to Unbreak Now!.

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptFeb 26 2018, 5:24 PM

We are investigating if this has caused a massive spike on wikidata replicas: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=10&fullscreen&orgId=1&var-server=db1109&var-network=eth0&from=1519578882341&to=1519665282341

As per https://wikitech.wikimedia.org/wiki/Server_Admin_Log the time matches the deployment of https://gerrit.wikimedia.org/r/#/c/413724/

https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=10&fullscreen&orgId=1&var-server=db1109&var-network=eth0&from=1519578882341&to=1519665282341

Ouch. Please feel free to revert the config change https://gerrit.wikimedia.org/r/413724 – even if it turns out to be unrelated, it won’t hurt our users too much.

@Ladsgroup will be back in an hour, but he recommends that you first revert my (Lucas’) change (above), and if that doesn’t help, it might be his change https://gerrit.wikimedia.org/r/414667 (or https://gerrit.wikimedia.org/r/414654 – I’m not 100% sure which one he meant, but almost certainly not https://gerrit.wikimedia.org/r/414650). HTH

We have reverted the change

And looks like everything is back to normal: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=10&fullscreen&orgId=1&var-server=db1109&var-network=eth0&from=now-1h&to=now&refresh=1m
We also killed queries (and they are not coming back after the revert)

To clarify – SQL queries? Because I can’t think of anything SQL-related in that change… it shouldn’t result in any new SQL queries, and should indirectly have reduced the number of SQL queries (because some Wikidata items would no longer need to be loaded). But perhaps there’s something silly in the code…

This deserves an incident report.

greg added a project: Wikimedia-Incident.Feb 26 2018, 5:54 PM

greg moved this task from Active investigation to Active Situation on the Wikimedia-Incident board.

Screenshot_20180226_185900.png (940×1 px, 158 KB)

Because I can’t think of anything SQL-related in that change… it shouldn’t result in any new SQL queries

Okay, that was wrong. There are new SQL queries: for each wbcheckconstraints request, one or two queries to get the latest revision ID for a set of entity IDs (one for a cache hit to see if the cached result is still valid, one for a cache miss to add the entity IDs to the cache, and it’s possible for both to occur in one request if the cached result is invalid), via WikiPageEntityMetaDataLookup. And I think that set isn’t explicitly limited to any particular size anywhere, which is probably bad.

~~Do you have the query text of the killed queries? Were they queries like this code could produce – joining page, revision, and text?~~ yup (hadn’t seen @jcrespo’s comment when I wrote this)

Something like:

SELECT /* Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::selectRevisionInformationMultiple */ rev_id, rev_content_format, rev_timestamp, page_latest, page_is_redirect, old_id, old_text, old_flags, page_title FROM `page` INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id)) /* db1109 wikidatawiki 54s */

Okay, then it’s almost certainly caused by this change, yes :( and I’m going to guess that the most likely reason for this query to become so slow would be that we’re asking for way too many revision IDs. Can you perhaps confirm this? The WHERE would have something like one Q1234=page_title AND 0=page_namespace per entity ID.

I hope that this is just some freaky outlier, and that the cached constraint check results for most items don’t require too many revision IDs. In that case, we could disable the caching for all results that depend on too many entity IDs.

The WHERE would have something like one Q1234=page_title AND 0=page_namespace per entity ID.

An alternative way to confirm this would be to check the latestRevisionIds in a cached value… perhaps someone can look at the cached value for WikibaseQualityConstraints::checkConstraints::v2::Q42::en, specifically the length of its latestRevisionIds member?

Okay, so it’s actually the opposite. We’re somehow asking for the latest revision IDs of an empty array of entity IDs. And WikiPageEntityMetaDataLookup has a special safeguard for this:

		if ( empty( $where ) ) {
			// If we skipped all entity IDs, select nothing, not everything.
			return '0';
		}

And here’s what Database::selectSQLText does with $conds => '0'.

		if ( !empty( $conds ) ) {
			if ( is_array( $conds ) ) {
				$conds = $this->makeList( $conds, self::LIST_AND );
			}
			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";
		} else {
			$sql = "SELECT $startOpts $vars $from $useIndex $ignoreIndex $preLimitTail"; // YOU ARE HERE
		}

So we were actually asking for all of the revision IDs. All of them. Some safeguard…

I thought you had posted/screenshoted reduced versions of the query, and omitted the WHERE part because it could contain sensitive information or whatever… no, the query just didn’t have any WHERE, that’s why it was so expensive.

Lucas_Werkmeister_WMDE lowered the priority of this task from Unbreak Now! to High.Feb 26 2018, 6:52 PM

SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";

That looks like a recipe for an sql injection, how did this pass security review?

That looks like a recipe for an sql injection, how did this pass security review?

Sorry, I should’ve been clearer – this is in Database::selectSQLText. That’s the one place where code like this would be expected, right? (Assuming that all the variables have been sanitized properly before.)

ok, sorry, for a second I thought you were executing that- on that code is guaranteed to be sanitized.

In T184812#4002850, @jcrespo wrote:
SELECT $startOpts $vars $from $useIndex $ignoreIndex " .
				"WHERE $conds $preLimitTail";
That looks like a recipe for an sql injection, how did this pass security review?

Why this is public?

Ladsgroup set Security to Software security bug.Feb 26 2018, 7:10 PM

Ladsgroup added a project: acl*security.

Ladsgroup changed the visibility from "Public (No Login Required)" to "Custom Policy".

Ladsgroup removed a project: acl*security.Feb 26 2018, 7:15 PM

Ladsgroup changed the visibility from "Custom Policy" to "Public (No Login Required)".

Restricted Application added a project: acl*security. · View Herald TranscriptFeb 26 2018, 7:15 PM

I’ve started writing up an incident report, but I can’t currently save it since wikitech is read-only due to a database migration. If it’s not back to read-write by the time I leave the office, I’ll dump it in an Etherpad or something.

Edit: https://etherpad.wikimedia.org/p/Incident_documentation_20180226_WBQC

Php considering "0" to be false strikes again!

Wikitech is writable again, here’s the report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180226-WikibaseQualityConstraints

I’ll ask @Ladsgroup to send out the email to ops@ tomorrow (AFAIK I don’t have the required permissions for that, and anyways it’s probably good to have another set of eyes on that document before it goes out).

Well, it seems wikitech is completely down now.

Sent to ops-l

Lucas_Werkmeister_WMDE mentioned this in T188386: Delay rollout of constraint violation gadget to all users?.Feb 27 2018, 11:31 AM

Lucas_Werkmeister_WMDE moved this task from Done to In Progress on the Wikidata-Sprint-2018-02-14 board.Feb 27 2018, 1:07 PM

WMDE-leszek added a project: Wikidata-Sprint-2018-02-28.Feb 27 2018, 1:27 PM

WMDE-leszek moved this task from Backlog to In Progress on the Wikidata-Sprint-2018-02-28 board.

Lucas_Werkmeister_WMDE closed subtask T188311: Don’t pass empty entity ID lists to WikiPageMetaDataAccessor as Resolved.Feb 28 2018, 1:03 PM

Several of the actionables have been implemented; the main one was merged just before the wmf.23 branch cut, others I’ve now requested to be backported for the European Mid-Day SWAT today. If it’s okay, I’d like to try enabling caching again in today’s Morning SWAT…

Can you wait on T188505 ? I don't think there is a reason why we should rush a deployment after it failed without a proper investigation. CC @greg

Okay… is there anything I can do to help move that task forward?

The reason I want to get caching into place soon is that we’ll roll out constraint checks to more users in the next weeks (starting on March 1st to all usernames beginning with “Z”, see T184069#4000941 for details), so we will start to see more constraint check requests. I think we can bear the first group (just “Z”) without caching, but after that we would probably have to pause the rollout if we can’t get caching working by then. (See also T188386: Delay rollout of constraint violation gadget to all users?.)

So the reason to rush is not that other things are broken, but that you want to deploy more things?

is there anything I can do to help move that task forward?

The ticket is open for anyone to help on it, the more help, the faster it can be closed.

@Lydia_Pintscher I consider this is being pushed unnecessarily fast after an actual outage happend, without proper investigation of the causes leading to them-- that looks like a disregard for the site's reliability, and I do not like it. There is a specific, very small actionable I pointed as a blocker to continue the deployment: T188505

In T184812#4010371, @Lucas_Werkmeister_WMDE wrote:

Okay… is there anything I can do to help move that task forward?

The reason I want to get caching into place soon is that we’ll roll out constraint checks to more users in the next weeks (starting on March 1st to all usernames beginning with “Z”, see T184069#4000941 for details), so we will start to see more constraint check requests. I think we can bear the first group (just “Z”) without caching, but after that we would probably have to pause the rollout if we can’t get caching working by then. (See also T188386: Delay rollout of constraint violation gadget to all users?.)

I don't think we should be rushing things here, we already had a close call on Monday.
We were pretty pretty pretty close to saturate the network interfaces of the databases.
Our logtash logged thousands of errors during the time the change was in production, and again, we were pretty close to completely saturate the servers, which could have ended up in a serious outage (throwing errors to 100% of the requests) and probably also overloading the master.

If we are going to keep rolling new things this fast, I would consider investigating and fixing T188505 a must. Why? Basically to be on the safe side, in case another bad deployment happens. It is better to kill queries than get all the servers overloaded and suffer a major outage for wikidata.

Hey,
We totally understand T188505 is a blocker for further work on caching-side. It was more of a question of how we should proceed with the deployment of something else that might get affected by caching not enabled. If we don't enable caching, other problems might arise or might not, we are not sure. It's a rather complex situation. This wasn't a discussion about whether T188505 should be a blocker of enabling caching for constraint checks or not. If you think there was any sort of push, I can assure you, it was a misunderstanding and it's not our intention and I apologize if that was the impression.

In T184812#4011213, @Ladsgroup wrote:

Hey,
We totally understand T188505 is a blocker for further work on caching-side. It was more of a question of how we should proceed with the deployment of something else that might get affected by caching not enabled. If we don't enable caching, other problems might arise or might not, we are not sure. It's a rather complex situation. This wasn't a discussion about whether T188505 should be a blocker of enabling caching for constraint checks or not.

Given how big the problem was, I am a bit concerned about deploying more things to be honest. Specially if it is such a complex situation that we don't really know which things could pop up.
The idea of investigating and fixing T188505 is more to have a safety net in case unexpected things come up, at least we will be killing bad queries and preventing those to bring down the whole wikidata infra.

If there are such big doubts about what could/could not happen if we do/don't enable caching, from my point of view, we should hold any new deployment related to this until we have fixed the mentioned killing query task.

If you think there was any sort of push, I can assure you, it was a misunderstanding and it's not our intention and I apologize if that was the impression.

Thanks for the clarification - we were basically trying to protect the whole wikidata servers :-)

In T184812#4010399, @jcrespo wrote:

@Lydia_Pintscher I consider this is being pushed unnecessarily fast after an actual outage happend, without proper investigation of the causes leading to them-- that looks like a disregard for the site's reliability, and I do not like it. There is a specific, very small actionable I pointed as a blocker to continue the deployment: T188505

Yeah indeed. Sorry. Bad timing. Looking into delaying now.

greg moved this task from Active Situation to Follow-up prevention on the Wikimedia-Incident board.Mar 1 2018, 11:37 PM

Lucas_Werkmeister_WMDE closed subtask T188312: Don’t pass overly long entity ID lists to WikiPageEntityDataAccessor as Resolved.Mar 2 2018, 11:10 AM

Lucas_Werkmeister_WMDE added a project: Wikidata-Ministry-Of-Magic.Mar 6 2018, 10:43 AM

Lucas_Werkmeister_WMDE moved this task from Tasks to In Progress on the Wikidata-Ministry-Of-Magic board.

jcrespo closed subtask T188505: Investigate why query killer didn't kill 1-hour long queries as Resolved.Mar 6 2018, 4:34 PM

Apologies for being pushy earlier – I definitely want the Wikidata servers to stay alive as well :)

The bug that caused the incident should be thoroughly fixed now, with several commits already in wmf.23 (either before the cut or backported) and some more in wmf.24, and you managed to fix the query killer as well (thanks a lot!). Would it be okay if we try to enable caching again after the next train – perhaps on Monday? I promise I’ll keep a very close look on the Grafana boards to see if anything explodes ;)

Yes, I said I wanted to block it on the ticket I mentioned before, which is now resolved.

Change 416748 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/416748

Alright, I’ve added it to next Monday’s EU SWAT.

Change 416748 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable caching of constraint check results

https://gerrit.wikimedia.org/r/416748

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:17:22Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 09s)

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:31:34Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 08s)

Mentioned in SAL (#wikimedia-operations) [2018-03-12T13:59:57Z] <zfilipin@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 00m 57s)

Deployed, and so far the servers seem to be happy, but I’ll check again in a couple of hours before closing this.

Everything still okay as far as I can tell. I think we can close this.

Liuxinyu970226 unsubscribed.Mar 13 2018, 11:50 AM

Lydia_Pintscher moved this task from In Progress to Done on the Wikidata-Ministry-Of-Magic board.Mar 14 2018, 12:29 PM

Lucas_Werkmeister_WMDE closed subtask T185688: Constraint check results are cached independent of language as Resolved.Mar 27 2018, 10:43 AM

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu

Lucas_Werkmeister_WMDE claimed this task.Jun 15 2018, 10:13 AM

Lucas_Werkmeister_WMDE raised the priority of this task from Lowest to High.