
fundraising database replication lag master thread
Closed, Resolved · Public

Description

The fundraising DBs are displaying some erratic replication behavior. Sometimes one or more slaves will fall behind and refuse to catch up. Neither flushing tables nor restarting slave threads has any effect, but (so far) restarting the server daemon fixes it within minutes. Doing nothing for long enough also seems to work; today frdb1002 was lagging for over 8 hours after English banners were up for an hour. That box is also the read server, so perhaps read locks made it worse.
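For reference, a minimal sketch of the checks and mitigations referred to above, as run from the mysql client on a lagging replica (standard statements; per the above, none of them reliably cleared the lag):

  -- how far behind is the replica?
  SHOW SLAVE STATUS\G        -- look at Seconds_Behind_Master
  -- the mitigations that did not help
  FLUSH TABLES;
  STOP SLAVE;
  START SLAVE;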

Ganglia metrics make it look like something happened in late June that has had all the DB servers basically out of memory since then.

My best guess at this point is that single-threaded replication is simply not fast enough to keep up with the master under load. The one thing that theory doesn't explain is why restarting the daemon fixes it.

We might consider trying row-based replication, which should be faster, though I'm sure there are caveats.
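If we do try it, the switch itself is only a couple of statements on the master (a sketch, assuming we're comfortable changing binlog_format dynamically; existing sessions keep their old format, so sessions with open temporary tables should be drained first):

  SHOW VARIABLES LIKE 'binlog_format';   -- expect STATEMENT or MIXED here today
  SET GLOBAL binlog_format = 'ROW';      -- new sessions will log row events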

I will keep investigating and post findings here.

Adding @Marostegui and @jcrespo because they were helping us poke at it.

Event Timeline

Hi,

Did you guys have time to set up the graphs we requested so we can better check the problems?

Some facts that we saw during the last troubleshooting:

  • Setting innodb_flush_log_at_trx_commit = 0 didn't help, so it might not be disk/replication related
  • We observed that the number of dirty pages was really high during the issue. I suggested we change innodb_max_dirty_pages_pct from 75 to maybe 40 on one host, to see how it would behave in comparison with the others. I believe that in the end, as the problem got fixed, you guys didn't want to change it until it happened again. So maybe it is a good time to do so.

If the problem is the high number of dirty pages, it would explain why restarting the daemon fixes it: they all get flushed and MySQL starts again with no dirty pages.
It would also explain why it recovers on its own after some time, because eventually they all get flushed anyway.

The issue, again, matches a high number of dirty pages:

graph.png (373×747 px, 28 KB)

It also matches the spikes in writes for that host.
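For the record, a quick way to watch the dirty-page count and apply the suggested change on one host (a sketch; 40 is the trial value suggested above, not a tested recommendation):

  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
  SHOW GLOBAL VARIABLES LIKE 'innodb_max_dirty_pages_pct';   -- currently 75
  SET GLOBAL innodb_max_dirty_pages_pct = 40;                -- trial value on one host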

Last night I ran into the first instance where restarting the daemon did NOT cause the lag to drop, on frdb2001. After about an hour it went down on its own. Not that restarting was ever a viable fix, but now there's more reason why not.

@Marostegui - Thanks for the information. Prometheus on frack is almost ready for prime time, maybe today if there aren't too many fires. In the meantime I found this set illustrative yesterday: https://ganglia.wikimedia.org/latest/?r=hour&cs=08%2F16%2F2017+16%3A00&ce=08%2F16%2F2017+19%3A00&m=cpu_report&c=Fundraising+eqiad&h=frdev1001.frack.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS - the dip is when I restarted mysqld

Excessive dirty pages does look like a likely culprit, or at least a contributing factor to the lag. I will try changing it and see what happens next time.

I see lots of inserts happening at the time; are you creating transactions (explicitly or implicitly) with large insert batches? That would explain the delay, since mysql has to transmit each transaction and execute it again on the replicas, and it would mean the issue is application-caused (sort of; technically it is mysql not being able to transmit large transactions instantly, but replication is not 100% transparent: any distributed system requires small batches to be kept in sync, and cannot transmit statements mid-transaction, because the transaction may never be committed or may fail).
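As an illustration of the batching point (a sketch with a hypothetical import table, not the actual application code): instead of wrapping an entire import in one transaction,

  BEGIN;
  -- ...hundreds of thousands of inserted rows in one transaction...
  COMMIT;

commit in smaller chunks so each replicated transaction stays small, e.g.

  BEGIN;
  INSERT INTO staging_import (contact_id, amount)  -- hypothetical table
  VALUES (1, 10.00), (2, 25.00) /* ...at most a few thousand rows per batch... */;
  COMMIT;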

I would like to see the hosts myself and judge the cause by looking at the ongoing activity when the lag first starts. Which hosts are normally affected?

Prometheus on frack is almost ready for prime time, maybe today if there aren't too many fires.

Assuming it is not shared with the other production hosts, you can literally install mysqld_exporter and copy the mysql dashboards: https://grafana.wikimedia.org/dashboard/db/mysql

Also, parallel replication can help, if you are on 10.0 and configure the domain-id.
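A sketch of what that would involve on MariaDB 10.0 (the variable names are the stock MariaDB ones; the thread count is illustrative, not a tuned value):

  -- on each replica: enable parallel applier threads (SQL thread must be stopped)
  STOP SLAVE;
  SET GLOBAL slave_parallel_threads = 4;
  START SLAVE;
  -- on the master, independent write streams can additionally be tagged with
  -- distinct domains so they can be applied out of order on the replicas:
  SET SESSION gtid_domain_id = 1;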

@jcrespo I have brought the idea up with @Ejegg to scan the most likely jobs for long transactions; it sounds like a good thing to do regardless.

The master is frdb1001 and the replicas are frdev1001, frdb1002, and frdb2001. frdb1002 is also the read server. @Jgreen and I are the only ones who get paged for this replag but I would be happy to alert you on IRC next time it happens so you can take a look

We have a local repo of packages for frack hosts, so I have to build prometheus-mysqld-exporter there since there is no jessie candidate; will do that shortly.

Will also look into parallel replication, thank you!

Last night I ran into the first instance where restarting the daemon did NOT cause the lag to drop, on frdb2001. After about an hour it went down on its own. Not that restarting was ever a viable fix, but now there's more reason why not.

That is a different behaviour from the one you and Jeff were describing last week when we last troubleshot it together.

I see lots of inserts happening at the time; are you creating transactions (explicitly or implicitly) with large insert batches? That would explain the delay, since mysql has to transmit each transaction and execute it again on the replicas, and it would mean the issue is application-caused (sort of; technically it is mysql not being able to transmit large transactions instantly, but replication is not 100% transparent: any distributed system requires small batches to be kept in sync, and cannot transmit statements mid-transaction, because the transaction may never be committed or may fail).

I remember that last week, when troubleshooting it, I did mention the amount of writes as the first possible explanation after checking the graphs, but both @cwdent and @Jeff_G mentioned that not all the servers lagged at the same time, and if the writes were the cause I would expect them all to lag together, given that they have the same configuration and HW, which apparently they do.
I also recall they ran the script that generated that amount of writes at that moment, and nothing happened.

So this might be a different scenario from the one experienced last week, in which stopping and starting mysqld fixed the issue straightaway; this time it didn't.

I remember that last week, when troubleshooting it, I did mention the amount of writes as the first possible explanation after checking the graphs, but both @cwdent and @Jeff_G mentioned that not all the servers lagged at the same time, and if the writes were the cause I would expect them all to lag together, given that they have the same configuration and HW, which apparently they do.

@Marostegui Sorry, that was not me. Perhaps you mean someone else.

Jgreen renamed this task from Frack replag master thread to fundraising database replication lag master thread. Sep 22 2017, 12:10 PM

Now that we've worked around the BGP/VPN flap issue and improved our prometheus instrumentation, I'm fairly convinced we're either hitting a mariadb bug or we have a SNAFU somewhere in how the databases are tuned. This is based on these observations:

  • essentially what happens is that one or more slave servers will start rapidly falling behind the master in terms of master log exec position, while keeping up with read position, and once it starts it takes several hours for the server to make headway, after which full recovery happens in a matter of minutes (see the SHOW SLAVE STATUS sketch after this list)
  • there does not appear to be an OS bottleneck, in fact by most metrics the machine is working less during the lag event
  • there are no 'stuck' or otherwise individually long-running queries, rather it looks like the machine just slows down in QPS
  • multiple slaves tuned the same do not exhibit the lag behavior at the same time, rather the problem floats around and if anything seems loosely correlated to mysql uptime
  • the oldest, least-powered, cross-site host (frdb2001) does not behave significantly differently now that we've worked around the flaky cross-site connection
  • the one thing we've found that seems to consistently "fix" slow-replication is restarting mysql, when we do that the host recovers in 3-5 minutes
  • I checked table fragmentation; probably because these hosts were built from dump/restore earlier this year, they don't appear significantly fragmented
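The read-position-vs-exec-position observation in the first bullet is visible directly in SHOW SLAVE STATUS (a sketch of what we watch on a lagging replica):

  SHOW SLAVE STATUS\G
  -- Read_Master_Log_Pos keeps advancing        (IO thread is keeping up)
  -- Exec_Master_Log_Pos falls further behind   (SQL thread is the bottleneck)
  -- Seconds_Behind_Master climbs accordingly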

We've seen this problem with MariaDB 10.0.30 and 10.0.32, and all hosts are currently running the latter. If there were a newer minor release I would say we should try that on one machine and see how it behaves.

I have a new theory: the civicrm log_* tables for the most part do not have a primary or unique key, and thus innodb would employ an implicit key. Apparently this can have a terrible impact on replication.
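A query along these lines can enumerate the suspects (a sketch; it lists InnoDB log_* tables in the civicrm schema that have no primary or unique constraint):

  SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
         ON c.table_schema = t.table_schema
        AND c.table_name = t.table_name
        AND c.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
  WHERE t.table_schema = 'civicrm'
    AND t.engine = 'InnoDB'
    AND t.table_name LIKE 'log\_%'   -- the civicrm log_* tables
    AND c.constraint_name IS NULL;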

Here's an anecdote Eileen found with a similar case, where lag seemed to correlate with upgrading the master to 10.0.23 and went away after rolling the master back to 5.5.50. In that case the tables are MyISAM: https://jira.mariadb.org/browse/MDEV-10530

We ran package updates and restarts on frdb1001, frdb1002, frdev1001, and frdb2001 yesterday, at the same time as we replaced eqiad network hardware. Interestingly there was no significant replication lag during today's banner campaign. This is consistent with the pattern that lag does not seem to happen for a while (days) after a mariadb restart.

Casey noticed that 'open temporary tables' is very high on slaves once they have been up for a few days, typically over 10K open tables. We don't have visibility on what tables are open, and we haven't found a way to close them other than restarting the service.
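The counter in question is the Slave_open_temp_tables status variable; checking it is simple, though it only gives the count, not the table names:

  SHOW GLOBAL STATUS LIKE 'Slave_open_temp_tables';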

We know civicrm creates temporary tables and relies on session termination to drop them. It looks like with mixed and statement-based replication, the master is supposed to inject DROP statements into the binary log when the session closes. That can also happen with row-based replication. See https://dev.mysql.com/doc/refman/5.7/en/replication-rbr-usage.html#replication-rbr-usage-temptables

Inspecting binary logs it appears as though DROP statements often happen before CREATE statements, with no DROP statement afterward. (This behavior seems so obviously broken that I think I must be missing something!) I am unable to recreate the problem by manually creating a temporary table and exiting the session, the events are in the right order in the binary log in that case.

Here's an example; note that another open question is what the X_5839536 vs I_5839536 naming is about.

at 84215453

DROP TEMPORARY TABLE IF EXISTS civicrm.X_5839536 /* generated by server */

at 84215707

DROP TEMPORARY TABLE IF EXISTS civicrm.I_5839536 /* generated by server */

at 84232885

/* https://civicrm.wikimedia.org/user/1 */ CREATE TEMPORARY TABLE X_5839536

(contact_id int primary key)
ENGINE=HEAP

at 84233091

/* https://civicrm.wikimedia.org/user/1 */ INSERT INTO X_5839536 (contact_id)...

at 84233871

/* https://civicrm.wikimedia.org/user/1 */ INSERT INTO X_5839536 (contact_id)...

These events happened on 10/2 and there is no mention of either table in the binary logs after this point.
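For anyone following along, events like the above can be pulled straight from the server without mysqlbinlog (a sketch; the binlog file name is hypothetical, and 84215453 is the offset from the excerpt above):

  SHOW BINLOG EVENTS IN 'mysql-bin.000123' FROM 84215453 LIMIT 20;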

Those tables are created in CRM_Mailing_BAO_Mailing::getRecipients() - the order in the code is correct (which seems obvious :-).

I'm digging into when the function is called. It might be we can bypass it - when you say no mention of either table - you mean the pattern continues with similar tables - but not specifically them? It looks like our rate of increase is around 3 open tables created per second.

One more question / clarification @Jgreen - this is what you are seeing in the binary logs on the slaves but not on master?

Those tables are created in CRM_Mailing_BAO_Mailing::getRecipients() - the order in the code is correct (which seems obvious :-).

I'm digging into when the function is called. It might be we can bypass it - when you say no mention of either table - you mean the pattern continues with similar tables - but not specifically them? It looks like our rate of increase is around 3 open tables created per second.

Yes that's right, same pattern with changing table names. In the example above these are the only lines mentioning X_5839536, with ~2 weeks of logs afterward.

One more question / clarification @Jgreen - this is what you are seeing in the binary logs on the slaves but not on master?

I checked the binlogs on the master and on the slave, they look the same.

OK - I have confirmed that part of the code is reached by the omnimail job that retrieves civimails & stores them, & there is an easy fix. Those mailings DO relate to silverpop mailings & there is some weirdness about how they are stored in the db - i.e. they are being deleted & recreated too often. However, the job itself only runs every 30 mins, so while I will push a fix so it bypasses the code that creates those tables, I don't think it can be causing our persistent increment. Still, will push something soon & see if it helps!

OK - I have confirmed that part of the code is reached by the omnimail job that retrieves civimails & stores them, & there is an easy fix. Those mailings DO relate to silverpop mailings & there is some weirdness about how they are stored in the db - i.e. they are being deleted & recreated too often. However, the job itself only runs every 30 mins, so while I will push a fix so it bypasses the code that creates those tables, I don't think it can be causing our persistent increment. Still, will push something soon & see if it helps!

Cool. Looking at the logs I think this is happening wherever we use temp tables and rely on session cleanup, not just tables with that naming scheme.

We are directly dropping those tables in the code - so session cleanup isn't the pattern

We are directly dropping those tables in the code - so session cleanup isn't the pattern

For today's binlogs 98% of the drop-temporary-table statements are accompanied by the "generated by server" comment. So this may be a pattern not explained by the code you're looking at. I copied today's binlogs to frlog1001:/srv/fr-tech/binlogs so you can take a look too.

I added a subtask T178020 for that change - & it's in review now. Looking at this I think the spikes are around about every 30 mins & that is in line with the job being altered. I have been mostly focussed on the baseline of 3 per second but perhaps the spikes are salient too.

Screenshot 2017-10-12 13.41.55.png (357×650 px, 48 KB)

... Looking at this I think the spikes are around about every 30 mins....

That lines up with the pattern in binlogs. Here's a graph for 10/9 of create-temporary-with-out-of-order-drop queries per minute. Note every instance where a temporary table was created and not dropped (or dropped before it was created) has this same naming scheme.

temp_attack.png (228×1 px, 10 KB)

I am ready to deploy a patch that affects that half-hourly job & should eliminate temp tables from it
https://gerrit.wikimedia.org/r/#/c/383954/1

Closing this task because the patch appears to have entirely fixed the temp table bloat issue.