
Database 'INSERT' query rate doubled (module_deps regression?)
Closed, ResolvedPublic

Description

db1083 (s1 master, enwiki) 2020-03-05

db1122 (s2 master, e.g. zhwiki, nlwiki, ..) 2020-03-05
Screenshot 2020-05-18 at 23.17.03.png (1×2 px, 535 KB)

According to @jcrespo, these are largely inserts to module_deps table (about 150 per second as of 6 hours ago):

INSERT /* Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti  */ INTO `module_deps`

20:11 <jynus> not a huge fan of the ON DUPLICATE KEY UPDATE, which I think is what it is hitting instead of inserting
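The write path being discussed is an upsert: a single statement that inserts a new row, or overwrites the existing one when the key already exists. A minimal sketch of that shape, using SQLite's ON CONFLICT as a stand-in for MySQL's ON DUPLICATE KEY UPDATE (the schema here is simplified and hypothetical, not MediaWiki's actual DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE module_deps (
           md_module TEXT NOT NULL,
           md_skin   TEXT NOT NULL,
           md_deps   TEXT NOT NULL,
           PRIMARY KEY (md_module, md_skin)
       )"""
)

def store_multi(rows):
    # SQLite's ON CONFLICT DO UPDATE plays the role of MySQL's
    # ON DUPLICATE KEY UPDATE: insert the row, or overwrite md_deps
    # when the (module, skin) key already exists.
    conn.executemany(
        """INSERT INTO module_deps (md_module, md_skin, md_deps)
           VALUES (?, ?, ?)
           ON CONFLICT (md_module, md_skin)
           DO UPDATE SET md_deps = excluded.md_deps""",
        rows,
    )
    conn.commit()

store_multi([("skins.vector.styles.legacy", "vector|ru", '["a.less"]')])
store_multi([("skins.vector.styles.legacy", "vector|ru", '["a.less"]')])  # takes the UPDATE arm
count = conn.execute("SELECT COUNT(*) FROM module_deps").fetchone()[0]
print(count)  # 1: one logical row, but two write statements were sent
```

Note how the second call still issues a full write statement even though nothing changed; that is the "hitting the UPDATE instead of inserting" behaviour the quote refers to.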

Event Timeline

Restricted Application added a subscriber: Aklapper.

This changed recently with two commits for T113916. However, AFAIK the queries were meant to be identical, so if those commits are indeed the cause, this change is likely unintended.

This went back to normal values, at least in total rate (I haven't checked which kinds of queries are being sent):

Screenshot_20200306_075551.png (900×2 px, 258 KB)

I wonder if some kind of cache invalidation causes this (I was told something about memcached integration) and whether it will happen again. I am a bit ignorant about this functionality, but if this is right, I would consider this rate of writes an avalanche, and it would be nice if it could either be pre-hit/pre-heated or smoothed out over time, but I really lack the context to say anything concrete.

Not a high priority, but the change in the pattern of queries was so radical that I was sure it was going to cause an outage (it didn't). The rate may only have made us more prone to lag.

There was a similar rise in INSERTs (with similar queries involved) on 27-28 Feb; however, that didn't raise any alarms because the peak rate was lower (around 50% of the latest one).

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen=true&panelId=2&from=1582405443632&to=1586876803912&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104

Looks like the slams still happen each week. I've installed the "Train deploys" annotation on this dashboard and it indeed correlates. (Each week typically has 3-4 annotations, the last of which is group2 / all wikis.)

Screenshot 2020-04-14 at 16.08.32.png (1×2 px, 587 KB)
Last 90 days

And before March the slams weren't there, which further suggests this was a regression, possibly because the DepStore layer lost or broke part of the logic.

Screenshot 2020-04-14 at 16.11.13.png (1×2 px, 545 KB)
Last 6 months

Since we know where these come from, and they haven't caused any visible production outage (even if the numbers don't look great), I would not close this ticket, but I would lower the priority so this is tracked as a known issue (a regression towards inefficient storage for this functionality, which would be nice to fix).

Note that, because of our setup, extra inserts get multiplied by the number of replicas in each replica set, so unnecessary extra writes stack up across servers.
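A back-of-envelope sketch of that multiplication effect; the 150/s figure comes from this task, while the replica count is an assumption for illustration:

```python
# Every write on a section master is replayed on each replica in that
# section, so one "extra" insert on the master becomes (1 + N) row writes.
insert_rate_master = 150      # inserts/sec on the master (quoted in this task)
replicas_per_section = 10     # hypothetical number of replicas per section
total_row_writes = insert_rate_master * (1 + replicas_per_section)
print(total_row_writes)  # 1650 row writes/sec across the whole section
```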

Also note that the one on 5 Mar was the tallest spike; subsequent ones, while not ideal, are less alarming.

This is the handler view, useful because it proves that not only did the number of statements increase, but the rows affected (updated or inserted) did too (i.e. it is not SQL that performs no actual row writes):

handler.png (954×2 px, 200 KB)

Krinkle triaged this task as Medium priority.Apr 18 2020, 11:15 PM
Marostegui raised the priority of this task from Medium to High.EditedMay 12 2020, 8:28 AM
Marostegui added a subscriber: brennen.

This happened again after yesterday's release and it is causing lag on some slaves.
The inserts doubled everywhere.

This is s1 (enwiki master)

Captura de pantalla 2020-05-12 a las 10.25.30.png (744×1 px, 207 KB)

And s6 (frwiki, jawiki and ruwiki) master:

Captura de pantalla 2020-05-12 a las 10.26.16.png (721×1 px, 221 KB)

This needs to get fixed as it is now causing lag on certain slaves.

I have confirmed (by looking at the binlogs) that it is the same INSERT as originally described when the task was first created:

INSERT /* Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti  */ INTO `module_deps` (md_module,md_skin,md_deps) VALUES ('skins.vector.styles.legacy','vector|ru','[\"../php-1.35.0-wmf.30/resources/src/mediawiki.less/mediawiki.mixins.less\",\"../php-1.35.0-wmf.30/resources/src/mediawiki.skinning/i18n-all-lists-margins.less\",\"../php-1.35.0-wmf.30/resources/src/mediawiki.skinning/i18n-headings.less\",\"../php-1.35.0-wmf.
Jdforrester-WMF raised the priority of this task from High to Unbreak Now!.May 12 2020, 3:24 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

To recap the state before Manuel's update: this happened before on a previous deploy, but we determined with the Performance Team that it was "bad performance/undesired behaviour" but not "high priority", as it didn't cause user impact then.

However, this last time it is causing database lag and real production impact on a couple of servers in the s6 section (the inserts are seen on all sections), and quite a lot of error spam in the database error logs (web requests and the job queue). :-(

If the root cause is not easy to fix, maybe concurrency can be limited somehow to prevent the lag?

@jcrespo It already has a concurrency limit of 1 update for any given module/wiki/language combination; the others don't get the lock and rely on whoever acquired it first to do the write. The only things we could do beyond that are turning off production wikis or letting the train reach fewer wikis at once.
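A rough sketch of the non-blocking lock behaviour described here, assuming memcached-style add() semantics; the FakeCache class and key format are stand-ins, not the real MediaWiki implementation:

```python
import time

# FakeCache stands in for memcached; add() succeeds only when the key is
# absent (or expired), which gives a natural non-blocking mutex.
class FakeCache:
    def __init__(self):
        self._store = {}

    def add(self, key, value, ttl):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return False  # key held by someone else: we lost the race
        self._store[key] = (value, now + ttl)
        return True

cache = FakeCache()
writes = []

def maybe_store(lock_key, deps):
    # Only the request that wins add() performs the INSERT; the others
    # skip and rely on the winner, mirroring the behaviour described above.
    if cache.add("lock:" + lock_key, 1, ttl=30):
        writes.append((lock_key, deps))

maybe_store("enwiki:startup:vector|en", '["a.less"]')
maybe_store("enwiki:startup:vector|en", '["a.less"]')  # lock still held: skipped
print(len(writes))  # 1
```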

I would not describe this as difficult to fix, apart from the difficulty of figuring out why it changed in the first place. No requirements or inputs have (intentionally) been changed. So the behaviour we are seeing is definitely not understood or intended. This used to work fine.

So the behaviour we are seeing is definitely not understood or intended. This used to work fine.

Strange indeed; let Manuel or me know if we can help with debugging. Through the binlogs we can dump the whole write workflow.

@Marostegui noticed a similar spike last week and we thought about filing a UBN, but dismissed it since it had happened the week before and had been considered normal behavior.

Is there some kind of cache being invalidated constantly because it is shared by two different wiki versions?

I am the train conductor this week. Should I just roll back 1.35.0-wmf.32, which is currently only on the testwikis, or would it be fine to promote it to all of group0? ;)

The problematic load only happens on group2 wikis due to the scale and diversity of requests, right? If so, we can at least go to group0 for now.

Looking at the graph and SAL:

Since            | Description                                | Until             | Description
2020-04-16 18:45 | 15 mins before: all wikis to 1.35.0-wmf.28 | 2020-04-17 ~18:50 | ??
2020-04-30 12:00 | later at 13:04: all wikis to wmf.30        | 2020-05-01 ~14:00 | ???
2020-05-04 16:00 | all wikis to wmf.30                        | 2020-05-05 17:00  | 17:35 testwikis to wmf.31
2020-05-07 16:00 | ???                                        | 2020-05-08 00:30  | Revert all but testwiki to wmf.30
2020-05-11 17:00 | all wikis to wmf.31                        | 2020-05-12 17:00  | ???

At least the last two events are of some interest. Reverting all wikis but testwiki back to wmf.30 stopped the INSERT spam. Yesterday the spam started when wmf.31 was sent to all wikis, and stopped after 24 hours.

I am not sure those numbers are conclusive, though. Might be related to a 24-hour TTL somewhere?

Given it has stopped at 17:00, I will proceed with promoting group0.

I have pushed wmf.32 to group0 at 19:05 UTC. That has not changed the rate of INSERT.

I have pushed wmf.32 to group1 at 19:09 UTC and nothing changed.

Removing from train blockers for now.

However, this last time is causing database lag and real production impact on a couple of servers of the s6 section (inserts seen on all sections), and quite some error spam on database errors (webrequests and jobqueue). :-(

I don't know why we went to group1 with this warning, but please do not proceed to group2.

I went with group1 since:

  • it seems to be triggered by promoting the rest of the wikis (I guess enwiki?)
  • it has been going on for a few months now
  • it also happens outside of the train window

I watched the rate after the group1 promotion and would have rolled back if it had risen.

Note that the rate of INSERTs has been rising slowly for the last 12 hours and is at roughly 130 qps right now on db1083.

Background. ResourceLoader stores a dependency tree of module files in the module_deps table. For reasons explained in T113916, we're moving this data away from the main databases to avoid DB master writes from GET requests.

The first step toward that was to move the DB query logic to a separate class, which we could then swap for a different implementation later on. That step was deployed last month and wasn't expected to change behaviour.

We did notice a small increase in INSERTs, but it was considered insignificant. Note that the writes are a one-time thing after a deployment, and their number closely tracks how many modules developers changed the week prior, as well as when users first request them on any given wiki. It varies from week to week.

By now, we know the change is likely significant, and in looking for a probable cause, the only difference I could find is that we now perform the INSERT queries post-send rather than mid-request. It's possible that, as a result, more concurrent initial cache-miss requests see the stale data on the replica and try their own INSERT. The bulk of these are prevented by the (non-blocking) Memc lock we have in place for that, but the overlap may not be perfect, and there is likely more variance when the write is deferred to post-send.

The reason there are concurrent HTTP cache misses, by the way, likely has to do with the Varnish-to-ATS migration. Until Nov 2019, the traffic was routed like this: esams:front-cache → esams:back-cache → eqiad:back-cache → eqiad:mw-app; it is now esams:front-cache → esams:back-cache → eqiad:mw-app. This means request coalescing is now distributed, allowing up to 5 concurrent misses rather than just one at a time. 5 isn't much, but it is still 5x more than 1, which for N wikis x N modules x N languages/skins does add up.
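Illustrative arithmetic only, with made-up counts, to show how a small per-URL concurrency increase multiplies across wikis, modules, and skins:

```python
# All counts here are hypothetical, chosen only to show the scaling.
concurrent_misses_before = 1     # one eqiad back-cache coalesced each URL
concurrent_misses_after = 5      # distributed ATS backends: ~5 can miss at once
unique_urls = 900 * 40           # hypothetical wikis x module/skin/lang combos
extra_write_attempts = unique_urls * (concurrent_misses_after - concurrent_misses_before)
print(extra_write_attempts)  # 144000 potential extra INSERT attempts per deploy
```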

Next steps. If we have no other options, we could try to back out the merging and post-send logic. However, I'd recommend we first try simply disabling the module_deps system more generally in favour of the simpler KeyValue mode. Let's evaluate on Monday and decide whether to roll out KeyValue further, or whether to pay the short-term cost of carefully undoing the post-send refactoring.

Change 596739 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/596739

Change 596740 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/596740
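The idea behind the no-op-upsert patch, sketched under assumptions (SQLite stand-in, simplified schema, hypothetical helper name): read the current row first and skip the upsert when the payload is unchanged, so no-op writes never reach the master or the binlog:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE module_deps (md_module TEXT, md_skin TEXT, md_deps TEXT, "
    "PRIMARY KEY (md_module, md_skin))"
)
statements_sent = 0

def store_if_changed(module, skin, deps):
    """Hypothetical helper: only send the upsert when the payload changed."""
    global statements_sent
    row = conn.execute(
        "SELECT md_deps FROM module_deps WHERE md_module = ? AND md_skin = ?",
        (module, skin),
    ).fetchone()
    if row is not None and row[0] == deps:
        return  # identical payload: a no-op upsert, so skip it entirely
    statements_sent += 1
    conn.execute(
        "INSERT INTO module_deps VALUES (?, ?, ?) "
        "ON CONFLICT (md_module, md_skin) DO UPDATE SET md_deps = excluded.md_deps",
        (module, skin, deps),
    )

store_if_changed("m", "vector|en", '["a"]')  # new row: sent
store_if_changed("m", "vector|en", '["a"]')  # unchanged: skipped
store_if_changed("m", "vector|en", '["b"]')  # changed: sent
print(statements_sent)  # 2
```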

Change 597128 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/597128

Change 596739 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/596739

Change 596740 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/596740

Change 597128 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/597128

Change 597161 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/597161

Change 597161 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/597161

Mentioned in SAL (#wikimedia-operations) [2020-05-18T22:02:38Z] <Krinkle> Clear module_deps on hewiki (group1, s7) to monitor regeneration, ref T247028

Change 597171 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Make DepStore write lock specific to the current wiki

https://gerrit.wikimedia.org/r/597171
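A hedged sketch of what wiki-scoped lock keys could look like; both key formats below are invented for illustration and are not the actual DepStore keys:

```python
def lock_key_shared(module, skin):
    # Assumed pre-fix behaviour: one lock key regardless of wiki, so
    # enwiki's write could suppress dewiki's write to a different row.
    return "resourceloader:deps-write:%s:%s" % (module, skin)

def lock_key_per_wiki(wiki, module, skin):
    # Post-fix behaviour: scope the key to the current wiki, so locks on
    # different wikis never collide.
    return "%s:resourceloader:deps-write:%s:%s" % (wiki, module, skin)

print(lock_key_per_wiki("enwiki", "startup", "vector")
      != lock_key_per_wiki("dewiki", "startup", "vector"))  # True
```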

Krinkle lowered the priority of this task from Unbreak Now! to High.May 18 2020, 11:04 PM

Considering the train unblocked for now. We've rolled out two patches to minimise the impact of the current (legacy) db-writing model. This should bring the inserts back to where they were, or lower.

Change 597171 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Make DepStore write lock specific to the current wiki

https://gerrit.wikimedia.org/r/597171

Thanks for working on this.
This looks better now, though still not at the values we had a month ago. Let's see if those huge spikes come back.

Captura de pantalla 2020-05-19 a las 6.36.55.png (827×1 px, 184 KB)

I am aiming at pushing 1.35.0-wmf.32 to all wikis on Monday May 25th ~13:00 UTC (European train window).

Monday 25th was a no-deploy day and I forgot to update this task. 1.35.0-wmf.32 was rolled out to all remaining wikis today at 13:00 UTC, and we have seen the rate of INSERTs double again.

Manuel and Jaime were around for the monitoring / DB expertise.

So this time we didn't get as much lag as we did last time; there are still some traces of lag on s8 (wikidatawiki), but definitely not as user-impacting as the previous incident (T247028#6128426).
However, I think it is pure luck that we didn't get lag this time, as the rate of INSERTs doubled on some masters.
Master graphs:

s1- db1083:

Captura de pantalla 2020-05-26 a las 18.11.08.png (725×1 px, 157 KB)

s2 - db1122:
Captura de pantalla 2020-05-26 a las 18.11.53.png (727×1 px, 194 KB)

s3 - db1123:
Captura de pantalla 2020-05-26 a las 18.12.29.png (752×1 px, 172 KB)

s4 - db1138:
Captura de pantalla 2020-05-26 a las 18.13.23.png (717×1 px, 191 KB)

s5 - db1100:
Captura de pantalla 2020-05-26 a las 18.13.53.png (721×1 px, 94 KB)

s6 - db1131:
Captura de pantalla 2020-05-26 a las 18.14.21.png (718×1 px, 90 KB)

s7 - db1086:
Captura de pantalla 2020-05-26 a las 18.14.52.png (722×1 px, 85 KB)

s8 - db1109 (hard to see as it has always a very spiky traffic):
Captura de pantalla 2020-05-26 a las 18.14.52.png (722×1 px, 85 KB)

I am not sure how long we can keep deploying with these issues without these sudden INSERT spikes causing lag. Creating lag on the replicas means generating user errors.
We need to work on reducing the rate of INSERTs even more; otherwise, every train rollout has a high chance of creating lag.

@Marostegui Can you clarify to what extent this is an urgent on-going matter vs something that changed or got worse in 2020?

In any event, the parent task tracks the overall removal of this table (T113916). If we're confident that last weeks patches have brought it back to "normal" rate increases around major deployments, I suggest we close this incident and continue at T113916, where there will be on-going work this/next quarter.

@Marostegui Can you clarify to what extent this is an urgent on-going matter vs something that changed or got worse in 2020?

So the issue is that a week ago we started to see lag because of the increase in spikes, and that is what is worrying. If all we had were these spikes, even though they are not ideal, they would not be causing any harm; but for some reason we saw the slaves falling behind, and that is what causes user impact. The reason is yet to be determined, but it could be:

  • more INSERTs
  • more load coming from natural usage and growth
  • just a one-time thing

Or a combination of all of the above.
If this didn't cause lag, we'd be OK with the spikes (obviously keeping in mind that they are not ideal and put unnecessary overhead on the hosts), but as soon as they cause lag, it becomes an incident, as users are affected.
Maybe this happened only last week and was a one-time thing, but we definitely need to keep an eye on it.
Luckily today we didn't see it during the train (or at least not as much as last week), but again: was it just luck? When will it happen again? We definitely need to keep monitoring this closely during rollouts.
As I said above, if this happens more and more often, every rollout will generate errors. I am afraid of this becoming another "it is just normal lag - it happens sometimes during the release", which risks hiding underlying code/performance issues specific to that release.

In any event, the parent task tracks the overall removal of this table (T113916). If we're confident that last weeks patches have brought it back to "normal" rate increases around major deployments, I suggest we close this incident and continue at T113916, where there will be on-going work this/next quarter.

Not sure we should close this; it is still an issue, as the number of INSERTs does double after the release and it takes hours/days to recover and go back to normal.
This is the situation on the enwiki master almost 6 hours after the release:

Captura de pantalla 2020-05-26 a las 19.15.44.png (718×1 px, 89 KB)

eprodromou added a subscriber: eprodromou.

@mark We looked this over, and it seems like the right teams are working on this and a lot of hands are on the ball. If there is something specific you need CPT involved in, let us know by tagging us again with the Platform Engineering tag.

What is the state of this now? Are there any query graphs specific to this table?

I don't know, but at least the spikes we used to see each week no longer exist, and this whole thing will be obsolete once we switch to main stash DB over the next 1-2 months.

From task description
2020-03-05

db1083 (s1 master, enwiki) 2020-03-05

db1122 (s2 master, e.g. zhwiki, nlwiki, ..) 2020-03-05
Screenshot 2020-05-18 at 23.17.03.png (1×2 px, 535 KB)
2020-11-14
a1.png (1×2 px, 147 KB)
b.png (1×2 px, 162 KB)
a2.png (1×2 px, 188 KB)

From my point of view, based on @Krinkle's graphs, we can resolve this.

Is it clear what the underlying problem was and how it was fixed? Just making sure it's not phantom recovery.

We've optimised the queries a fair bit. It likely wasn't a single change that did it, or if it was, we don't know which. We're not putting much effort into this right now, since we're actively moving this feature to a different backend that isn't backed by the core DBs, which obsoletes concerns in this area.