There is a mismatch between the replication lag reported by MariaDB tooling and by MediaWiki (MW) tooling. One possible cause is that the MW tooling takes into account so-called bogus GTIDs, which materialise in a variety of ways (e.g. maintenance work on a server in the non-master DC). To mitigate this, we should implement logic to filter out the bogus GTIDs.
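As a rough illustration of what "filtering" means here (a sketch, not the actual MediaWiki rdbms implementation; the function and parameter names are hypothetical), the idea is to drop from a GTID position every `domain-server-seq` entry whose domain is not active on the replication chain being waited on:

```python
def filter_bogus_gtids(gtid_pos, active_domains):
    """Keep only 'domain-server-seq' entries whose domain ID is active.

    gtid_pos: comma-separated MariaDB GTID set, e.g.
              '171970645-171970645-288070551,171978924-171978924-268029771'
    active_domains: set of domain IDs considered live (hypothetical input;
                    in practice this would come from the current master).
    """
    kept = [
        entry
        for entry in gtid_pos.split(",")
        if int(entry.split("-")[0]) in active_domains
    ]
    return ",".join(kept)
```

For example, given the GTID set from the error below and only the current master's domain as active, the stale entries from old masters would be stripped before waiting.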
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | aaron | T88445 MediaWiki active/active datacenter investigation and work (tracking)
Declined | | aaron | T221159 FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW
Resolved | | aaron | T224422 Implement logic to filter bogus GTIDs
Event Timeline
Change 517778 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: add ILoadBalancer::getReplicaResumePos method
Change 517830 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase
Change 517778 merged by jenkins-bot:
[mediawiki/core@master] rdbms: add ILoadBalancer::getReplicaResumePos method
Change 517830 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase
I have been taking a look at the comments on the patch; maybe it is easier to follow up here.
Checking the most recent error:
db_server: 10.64.0.96
sql: SELECT MASTER_GTID_WAIT('171978924-171978924-268029771', 1)
normalized_message: Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-268029771,180359242-180359242-170963125
So 171978924 is db1109 (the current s8 master), and the server that errored was db1126 (an s8 replica), so that part makes sense.
However, 171970645 is db1104 (the old s8 master), and 171978778 isn't a current host; it decodes to the IP 10.64.48.26, which corresponds to the old, now decommissioned s8 master (https://gerrit.wikimedia.org/r/#/c/operations/dns/+/540765/1/templates/wmnet).
So there is definitely something going on: it shouldn't be waiting on those old masters; only 171978924 would make sense.
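These server/domain IDs are the hosts' IPv4 addresses packed into a 32-bit integer, which is how 171978778 maps back to 10.64.48.26 above. A quick Python helper for decoding them while debugging (hypothetical function names):

```python
import ipaddress

def server_id_to_ip(server_id):
    """Decode a MariaDB server_id that packs an IPv4 address as a 32-bit int."""
    return str(ipaddress.IPv4Address(server_id))

def ip_to_server_id(ip):
    """Inverse: pack a dotted-quad IPv4 address into the server_id integer."""
    return int(ipaddress.IPv4Address(ip))
```

For example, `server_id_to_ip(171978778)` returns `'10.64.48.26'`, matching the decommissioned s8 master identified above.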
So this is my proposal: GTIDs get "infected" from the master, but they can at least be cleared. I want to depool a host where this happens a lot (db1121), remove its GTIDs, reconnect it to the master, and see whether this keeps happening (I was able to reproduce it on db1121 with its current GTID position). The rationale is that, for the GTID positions that time out, the wait includes domains that exist only on a replica but not on the new master. Thoughts?
+1
I remember having to manually delete GTIDs from the mysql tables when troubleshooting the gtid_domain_id issue on multisource hosts, as RESET SLAVE ALL wouldn't get those cleaned up; I'm not sure whether that is still the case after MariaDB fixed that bug. If you manage to get them completely cleaned up, please paste the procedure here so we can document it :-)
Also, I would wait until tomorrow before depooling a host, as we have many depooled now for the PDU maintenance (or wait until the PDU maintenance is done, which I would expect to finish by 12:00 or 13:00 UTC).
I checked execution on db1084, and local execution seems to work as intended:
root@db1084[(none)]> SELECT MASTER_POS_WAIT('db1138-bin.000467', 303912456, 1);
-> NULL  (1 row in set, 0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('0-180359175-3368394787,171978876-171978876-109300035,171966557-171966557-578966402,180359190-180359190-192195477,171978775-171978775-4822899280,180359175-180359175-43143523,171970589-171970589-201132050', 1);
-> -1  (1 row in set, 1.00 sec)

root@db1084[(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
gtid_binlog_pos:        171978876-171978876-108546562
gtid_binlog_state:      171978876-171978876-108546562
gtid_current_pos:       0-180359175-3368394787,171966557-171966557-578966402,171970589-171970589-201132050,171978775-171978775-4822899280,171978876-171978876-108546562,180359175-180359175-43143523,180359190-180359190-192195477
gtid_domain_id:         171970662
gtid_ignore_duplicates: OFF
gtid_slave_pos:         0-180359175-3368394787,171966557-171966557-578966402,171970589-171970589-201132050,171978775-171978775-4822899280,171978876-171978876-108546562,180359175-180359175-43143523,180359190-180359190-192195477
gtid_strict_mode:       OFF
(7 rows in set, 0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546562', 1);
-> 0  (1 row in set, 0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-100850438', 1);
-> 0  (1 row in set, 0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546562', 1);
-> 0  (1 row in set, 0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546563', 1);
-> -1
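The session above matches MASTER_GTID_WAIT's semantics: it returns 0 once the replica's position covers every domain in the requested set (which is why waiting on the older sequence 100850438 also succeeds immediately), and -1 on timeout. A toy Python model of the "position reached" check (hypothetical helper names; a sketch of the comparison, not the server's implementation):

```python
def parse_gtid_set(pos):
    """Map domain ID -> sequence number for a 'domain-server-seq' GTID set."""
    out = {}
    for entry in pos.split(","):
        domain, _server, seq = entry.split("-")
        out[int(domain)] = int(seq)
    return out

def position_reached(target, current):
    """True if, for every domain in `target`, the replica's current
    sequence number is at least the requested one."""
    cur = parse_gtid_set(current)
    return all(
        domain in cur and cur[domain] >= seq
        for domain, seq in parse_gtid_set(target).items()
    )
```

With the replica at `171978876-171978876-108546562`, waiting on sequence 108546563 is not yet reached, which is the -1 timeout seen in the last query.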
However, I just noticed that while my manual test is rational, other logs on db1084 are not:
normalized_message: Timed out waiting for replication to reach 171966557-171966557-578966402,171978775-171978775-4822899280,171978876-171978876-100850438
Query: SELECT MASTER_GTID_WAIT('171978876-171978876-100850438', 1)
I thought the logs and the query were misaligned, but assuming the log shows the position before filtering and the query shows it after filtering, it is working as intended. Still, we have a large number of errors: are we waiting on the right position? Do we really have lag > 1 s on so many servers?
I think I need more metrics (e.g. the percentage of ChronologyProtector executions that succeed), because either this didn't work, its measurements are not accurate for what we really want to measure, or it worked and exposed worse lag issues than we thought.
So this is the latest news:
On T234948 we found that special page updates can lead to a constant 1-2 seconds of lag, from the confluence of a large long-running query with heavy load from the many s8 edits plus the ongoing Wikibase refactoring. Because the timeout is only 1 second, ChronologyProtector will fail frequently. However, because we allow pooling up to a limit of (I think) 5 seconds of lag, the host is not depooled. Also, some jobs probably use the "slow" replica and try to use ChronologyProtector, which leads to lots of errors. Obviously, increasing the timeout cannot be done lightly, as it could cause worse problems.
The good news is that I believe this was deployed successfully and works as intended. The bad news is that there is a weakness in our infrastructure, and I am not sure how to move forward; suggestions needed. We can open the discussion on another ticket, as this task's scope seems solved.
Indeed, the logging is based on the *whole* raw unfiltered position... I should add a logstash key for the filtered one too.
Change 542268 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() logging clearer about GTID filtering
Change 542268 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() logging clearer about GTID filtering
This same issue happened today (same host, db1087) and will most likely keep happening, as the updateSpecialPages.php script runs heavy queries.
I apologize for assuming the GTID changes were to blame; in my defense, the logging was confusing to me. On the bad-news side, I believe there is still some underlying issue showing "fake" lag / creating extra waits, due to some kind of contention at the MySQL or application layer. Right now this is most visible on the job queue, so I am not sure the timeout is the right knob; it could be just some concurrency configuration to "smooth" MySQL queries/connections, or maybe some migration or other ongoing issue (e.g. it could be fallout from the application-server latency issue we have observed in the past).
Sadly, I have not yet been able to pinpoint a definite cause. :-(
This is what I mean: