
Spikes in DB traffic and rows/s reads when reading from new terms store
Closed, Resolved (Public)

Assigned To
Authored By
alaa_wmde
Jul 31 2019, 1:07 PM
Referenced Files
F31464880: image.png
Dec 10 2019, 10:35 AM
F31464878: image.png
Dec 10 2019, 10:35 AM
F31458692: image.png
Dec 5 2019, 7:18 PM
F31456793: image.png
Dec 5 2019, 7:18 PM
F31456795: image.png
Dec 5 2019, 7:18 PM
F31456791: image.png
Dec 5 2019, 7:18 PM
F31075576: image.png
Nov 14 2019, 5:00 PM
F31010750: image.png
Nov 5 2019, 1:30 PM
Tokens
"Love" token, awarded by Marostegui."Baby Tequila" token, awarded by Addshore.

Description

The Incident
When we tried to enable reading property terms from the new store (T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW on production), reads from the database increased dramatically, as seen here: https://grafana.wikimedia.org/dashboard/snapshot/2kuWt7nhGDUdV3rAzYMPFYHMJ3poKf5C?panelId=8&fullscreen&orgId=1

image.png (882×1 px, 83 KB)

The root cause
Our new PrefetchingPropertyTermLookup and PrefetchingItemTermLookup, which are responsible for looking up item/property terms (the main use case being rendering property/item labels in repo), do not use caching the way their equivalent for the old store, BufferingTermLookup, did.

Suggested solution
We have to wrap a cache around PrefetchingPropertyTermLookup and PrefetchingItemTermLookup.

For now, we will use the same cache that the old term index used (MapCacheLRU, with the same buffer size) and test whether that reduces the increase in rows read per second.
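
A minimal sketch of what such a wrapper could look like (the class name, key layout, and the buffer size of 1000 are placeholders here, not the final implementation):

use Wikibase\DataModel\Entity\EntityId;
use Wikibase\DataModel\Services\Term\TermBuffer;

/**
 * Hypothetical decorator: caches single-term lookups in a MapCacheLRU,
 * mirroring what BufferingTermLookup does for the old wb_terms store.
 */
class CachingTermBufferDecorator {
	/** @var TermBuffer */
	private $inner;
	/** @var \MapCacheLRU */
	private $cache;

	public function __construct( TermBuffer $inner, int $bufferSize = 1000 ) {
		$this->inner = $inner;
		$this->cache = new \MapCacheLRU( $bufferSize );
	}

	public function getPrefetchedTerm( EntityId $entityId, string $termType, string $languageCode ) {
		$key = $entityId->getSerialization() . '|' . $termType . '|' . $languageCode;
		if ( !$this->cache->has( $key ) ) {
			$this->cache->set( $key, $this->inner->getPrefetchedTerm( $entityId, $termType, $languageCode ) );
		}
		return $this->cache->get( $key );
	}
}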

Event Timeline


As it is blocking the migration of terms in production.

Before I change the description of this task, here are my findings. It turns out that the MapCacheLRU used inside BufferingTermIndex is only a per-object cache, not really a per-process one, meaning that two instances of BufferingTermIndex will not actually share the cache (since they each create their own MapCacheLRU instance internally).

To test, you can run the following inside MW's eval.php:

$cache1 = new \MapCacheLRU( 5 );
$cache2 = new \MapCacheLRU( 5 );
$cache1->set( 'foo', 'is in cache 1' );
$cache2->set( 'bar', 'is in cache 2' );
$cache3 = new \MapCacheLRU( 5 );

var_dump( $cache1->toArray() );

/** output
/var/www/mediawiki/maintenance/eval.php(78) : eval()'d code:1:
array(1) {
  'foo' =>
  string(13) "is in cache 1"
}
*/

var_dump( $cache2->toArray() );
/** output
/var/www/mediawiki/maintenance/eval.php(78) : eval()'d code:1:
array(1) {
  'bar' =>
  string(13) "is in cache 2"
}
*/

var_dump( $cache3->toArray() );
/** output
/var/www/mediawiki/maintenance/eval.php(78) : eval()'d code:1:
array(0) {
}
*/

Since PrefetchingPropertyTermLookup and PrefetchingItemTermLookup already implement such per-object in-memory caching through a simple array, the increase in retrieved queries per second is probably not due to any missing caching in these two classes.
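
For reference, the buffering these lookup classes already do is essentially the following pattern (a simplified sketch with illustrative names, not the actual Wikibase code):

// Per-object in-memory buffering via a plain array: a batched prefetch fills
// the buffer, and subsequent reads are served from it without further queries.
class ArrayBufferedLabelLookup {
	/** @var string[][] labels buffered per entity ID and language code */
	private $buffer = [];

	public function prefetchLabels( array $entityIds, array $languageCodes ): void {
		foreach ( $this->fetchLabelsFromStore( $entityIds, $languageCodes ) as $id => $labels ) {
			$this->buffer[$id] = $labels + ( $this->buffer[$id] ?? [] );
		}
	}

	public function getLabel( string $entityId, string $languageCode ): ?string {
		return $this->buffer[$entityId][$languageCode] ?? null;
	}

	private function fetchLabelsFromStore( array $entityIds, array $languageCodes ): array {
		return []; // stub: the real classes run one batched SELECT against the term store here
	}
}

Like MapCacheLRU, this buffer lives inside a single object, so it is only shared by callers holding the same lookup instance.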

Looking at the Grafana board again, it seems that we had those spikes in our databases many times today, and not just when we tried to enable reading from the new store: https://grafana.wikimedia.org/dashboard/snapshot/G4s06avbu28Lp0eYIsItGotaiCWr49oy

Maybe there were other effects that resulted in a compound increase that we thought was caused only by reading from the new store?

I'm thinking now that we should just try it again tomorrow for a little longer and see what happens, and maybe collect some logs of queries being executed most often if that's possible? @Ladsgroup

I'm thinking now that we should just try it again tomorrow for a little longer and see what happens, and maybe collect some logs of queries being executed most often if that's possible? @Ladsgroup

Thank you for doing this <3

After the second deployment, we can see spikes on the database (rows per second, but also DB traffic) every two hours.
https://grafana.wikimedia.org/dashboard/snapshot/qsMbl3Y5NG68EyxYmoAqXoa8FuqoxvQZ

Looking at other Grafana graphs, we saw this:

image.png (949×1 px, 176 KB)

https://grafana.wikimedia.org/dashboard/snapshot/UZnHzz2Qan7rnDiHZ8OTBQfDJvXk4j9Y

This shows that DispatchingTermBuffer is the place to look, as most of these selects are coming from calls to the base class EntityTermLookupBase::getLabels.

In DispatchingTermBuffer we use the new store for properties, which is implemented in PrefetchingPropertyTermLookup.

The funny thing is that EntityTermLookupBase::getLabels should not result in any queries (which I tested again locally for sanity); the prior call to EntityTermLookupBase::prefetchTerms should. Not sure if that's a logging issue here.
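
For clarity, the expected call pattern looks roughly like this (an illustrative snippet of the contract, not a literal code path; the variables are assumed to be set up by the caller):

// prefetchTerms() is where the batched SELECT should happen; getLabels()
// should then be answered from the in-object buffer without new queries.
$termBuffer->prefetchTerms( $entityIds, [ 'label' ], $languageCodes );

foreach ( $entityIds as $entityId ) {
	// Expected to read from the buffer filled above, not from the database.
	$labels = $termLookup->getLabels( $entityId, $languageCodes );
}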

The SqlEntityInfoBuilder graph points at queries that must be going to the old store wb_terms (for items, at least, they must be) in calls to SqlEntityInfoBuilder::collectTermsForEntities.

image.png (952×1 px, 279 KB)

https://grafana.wikimedia.org/dashboard/snapshot/zNYsVXAqiVmBMdJEgl4nPgtzMACdX3RL

As I will not be able to work on this before next Monday, feel free to pick it up and continue.

@Ladsgroup any new findings you wanna add on this here?

alaa_wmde renamed this task from Wrap new terms store with caching to Spikes in DB traffic and rows/s reads when reading from new terms store.Aug 2 2019, 12:51 PM

@Ladsgroup any new findings you wanna add on this here?

Yeah, the only thing I found is that this spike is on newItemIdFormatter, which doesn't make any sense at all. We are changing properties, not items.

image.png (296×1 px, 87 KB)

The newPropertyIdFormatter is pretty healthy:
image.png (284×1 px, 75 KB)

It could be that this spike kicks properties out of the cache and they need to be rebuilt all the time.

Thanks @Ladsgroup for all the investigation so far. Even after looking in the Hadoop cluster, we cannot identify any specific requests that might be responsible for those spikes. In order to move forward, we will add another layer of caching for property terms for now. If the issue persists, or appears again for item terms, we can invest more time in figuring out the root cause of these spikes.

I looked into this a bit and found https://grafana.wikimedia.org/dashboard/snapshot/08uo4Z113gB4aJJet3qVUq1Ock5sfkv6?panelId=1&fullscreen&orgId=1 (filter for SqlEntityInfoBuilder_collectTermsForEntities) very interesting.

image.png (972×1 px, 172 KB)

Given that the only user of SqlEntityInfoBuilder::collectTermsForEntities is FullEntityParserOutputGenerator::getEntityInfo, we might want to look into which pages get re-parsed around that time. Another possibility might be a bot running at that interval re-parsing (or, more probably, editing) pages.

Tarrow changed the task status from Open to Stalled.Aug 30 2019, 10:37 AM
Tarrow removed hoo as the assignee of this task.
Tarrow added subscribers: hoo, Tarrow.

Moved to stalled since not touched for more than two weeks.

I looked into this a bit and found https://grafana.wikimedia.org/dashboard/snapshot/08uo4Z113gB4aJJet3qVUq1Ock5sfkv6?panelId=1&fullscreen&orgId=1 (filter for SqlEntityInfoBuilder_collectTermsForEntities) very interesting.

image.png (972×1 px, 172 KB)

Given that the only user of SqlEntityInfoBuilder::collectTermsForEntities is FullEntityParserOutputGenerator::getEntityInfo, we might want to look into which pages get re-parsed around that time. Another possibility might be a bot running at that interval re-parsing (or, more probably, editing) pages.

This is a very valuable insight. We need to consider this for T236681: SqlEntityInfoBuilder reads and writes from the old term store regardless of the config

Lydia_Pintscher changed the task status from Stalled to Open.Nov 12 2019, 1:52 PM

Mentioned in SAL (#wikimedia-operations) [2019-11-14T16:04:20Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase: [[gerrit:550828|Put a layer of APC cache on top of reading wb_terms in SqlEntityInfoBuilder]] (T231011 T229407 T236681), Try II (duration: 00m 56s)
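
Conceptually, the APC layer from that patch puts a local-server cache in front of the wb_terms read; a rough sketch of the pattern (the key name, TTL, and the loadTermsFromWbTerms() helper are made up for illustration and are not the deployed code):

use MediaWiki\MediaWikiServices;

// Local-server object cache (APCu-backed where available) in front of the
// expensive wb_terms SELECT; values are recomputed only on a cache miss.
$cache = MediaWikiServices::getInstance()->getLocalServerObjectCache();

$terms = $cache->getWithSetCallback(
	$cache->makeKey( 'wikibase-entity-terms', $entityId->getSerialization() ),
	$cache::TTL_MINUTE * 10,
	function () use ( $entityId ) {
		// Hypothetical helper standing in for the actual wb_terms query.
		return loadTermsFromWbTerms( $entityId );
	}
);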

Given that the only user of SqlEntityInfoBuilder::collectTermsForEntities is FullEntityParserOutputGenerator::getEntityInfo, we might want to look into which pages get re-parsed around that time. Another possibility might be a bot running at that interval re-parsing (or, more probably, editing) pages.

There's a slight chance that it's ListeriaBot updating queries.

After eight hours of looking at logs, Hadoop queries, MySQL queries, and graphs, the only thing I can say is that putting stats or logs on FullEntityParserOutputGenerator is the best way to move forward here. Also, there's nothing in Wikidata edits that correlates with this problem (so the only possible explanation left is purges, done either by jobs or by POST purges from bots). There's nothing to look at on the client side, as the client side doesn't get called (I actually discovered that some parts of the client are dead code that never gets called).

Change 554496 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Stop rebuildItermTerms temporary

https://gerrit.wikimedia.org/r/554496

I have been working on this since yesterday. I have a hunch what's going on here. Will write more once I'm certain.

Change 554496 merged by Marostegui:
[operations/puppet@production] mediawiki: Stop rebuildItermTerms temporary

https://gerrit.wikimedia.org/r/554496

Change 554511 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Put random debug info for T229407

https://gerrit.wikimedia.org/r/554511

Change 554513 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.5] Put random debug info for T229407

https://gerrit.wikimedia.org/r/554513

Mentioned in SAL (#wikimedia-operations) [2019-12-04T15:07:32Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-12-04T15:09:47Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, part II (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2019-12-04T17:50:00Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, part III (duration: 01m 01s)

Change 554511 abandoned by Ladsgroup:
Put random debug info for T229407

Reason:
Done now

https://gerrit.wikimedia.org/r/554511

Change 554513 abandoned by Ladsgroup:
Put random debug info for T229407

Reason:
Done now

https://gerrit.wikimedia.org/r/554513

The binlogs that @Marostegui gave me have things like this:

REPLACE /* SqlBagOStuff::updateTableKeys RunSingleJob.php@mw1335 */ INTO `pc156` (keyname,value,exptime) VALUES ('wikidatawiki:pcache:idoptions:55899217','<gibberish>...

In a timespan of six minutes, there are 17k parser cache insertions on Wikidata from RunSingleJob.php in one of the three PC nodes. This is, to put it mildly, insane. One of the items that got a new PC entry is Q55864736, which hasn't been visited in the last month and has no templates or entity usage. I think it's one of the Wikibase jobs. I'm digging deeper now.
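
For context, a tally like that "17k in six minutes" can be pulled from a decoded binlog dump with something as simple as the following (hypothetical helper; the dump path and the exact key prefix matched are assumptions):

// Count parser cache REPLACEs attributed to RunSingleJob.php in a decoded
// binlog (e.g. mysqlbinlog output saved to a file). Illustrative only.
$count = 0;
$handle = fopen( '/tmp/pc1-binlog.sql', 'r' ); // hypothetical dump location
while ( ( $line = fgets( $handle ) ) !== false ) {
	if ( strpos( $line, 'SqlBagOStuff::updateTableKeys RunSingleJob.php' ) !== false
		&& strpos( $line, 'wikidatawiki:pcache:idoptions:' ) !== false
	) {
		$count++;
	}
}
fclose( $handle );
echo "Parser cache insertions from RunSingleJob.php: $count\n";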

So after lots of debugging, it turned out that it's the CirrusSearchChecker job that is causing this. The time matches perfectly:

image.png (680×1 px, 119 KB)

image.png (668×1 px, 177 KB)

image.png (668×1 px, 117 KB)

image.png (710×1 px, 281 KB)

It's not the search index being rebuilt; it's actually the index being checked and sanitized. That doesn't need the HTML of the page, but the code, which is generalized for MediaWiki, tries to render it for Wikidata anyway, which it shouldn't. There's T239931 to tackle that.

There are two options:

  • Making it slower
  • Making it not depend on the ParserOutput HTML.

The former is easy; the latter requires some work, but probably no more than a couple of days.

@Ladsgroup I think we can move this from peer review now, great investigative work!
The question is, should it live in the waiting col on the iteration or on the campsite board? ;)

Mentioned in SAL (#wikimedia-operations) [2019-12-09T14:26:23Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:555941|Disable sanity check cirrus jobs for Wikidata (T239931 T229407)]] (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-12-09T14:34:28Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, clean up debugging info (duration: 00m 59s)

@Ladsgroup I think we can move this from peer review now, great investigative work!
The question is, should it live in the waiting col on the iteration or on the campsite board? ;)

I just disabled the job for now; once the migration is done, we can enable it again or fix it properly.
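
For the record, the disabling amounts to a per-wiki config switch; a sketch of what that looks like in wmf-config style, assuming the checker job is gated by the CirrusSearchSanityCheck setting (see gerrit:555941 for the actual change):

// Sketch only: per-wiki override in $wgConf->settings, wmf-config style.
// The setting name is an assumption here; the deployed patch is gerrit:555941.
$wgConf->settings['wmgCirrusSearchSanityCheck'] = [
	'default' => true,
	// Stop the sanity-check (Saneitizer) jobs from re-rendering Wikidata items.
	'wikidatawiki' => false,
];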

Amazing investigation @Ladsgroup!
If I had to choose from the two options you provided to solve this, I would go for the second: "Making it not depend on the ParserOutput HTML", as it looks like a better long term solution.

Thanks a lot for this work!

It will be interesting to see if this changes the size of the parser cache, and also the hit/miss rate.

Regarding the size of the ParserCache: I queried pc255 on the first PC node. It had 975,982 rows yesterday and 975,580 today, a drop of 402 rows on that shard. Assuming perfect randomness in sharding, there should be a total drop of 308,736 rows (154,368 PC entries) since yesterday. Even though I don't see the drop in the graphs, maybe it needs more time.
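
The extrapolation behind those numbers, assuming 256 pcXXX tables per host, three parser cache hosts, and two rows per cached page (these counts are assumptions, not measured here):

// Back-of-the-envelope extrapolation from the single shard pc255.
// The shard count (256 tables x 3 hosts) and rows-per-entry are assumptions.
$perShardDrop = 975982 - 975580;         // 402 fewer rows on pc255
$totalRows    = $perShardDrop * 256 * 3; // 308,736 rows across all shards
$pcEntries    = $totalRows / 2;          // 154,368 parser cache entries

printf( "rows: %d, entries: %d\n", $totalRows, $pcEntries );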

I got my answer: PC gets purged only at 1:00 UTC every day (by the mediawiki::maintenance::parsercachepurging job), meaning I have to wait a full day to get the value.

Going to go ahead and close this now.