
Implement logic to filter bogus GTIDs
Closed, Resolved · Public · 0 Estimated Story Points

Description

There is a mismatch between the replication lag reported by MariaDB tooling and by MW tooling. One possible cause is that MW tooling takes into account so-called bogus GTIDs, which materialise in a variety of ways (e.g. maintenance work on a server in the non-master DC). To mitigate that, we should implement logic to filter out the bogus GTIDs.
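
As a rough illustration (the GTID values below are copied from an error quoted later in this task, and the filtering shown is a sketch of the intended logic rather than the final implementation): the position MediaWiki waits on can include GTIDs from domains that are irrelevant to the replica in question, and only the current master's entry should be kept.

-- Unfiltered position as recorded by MW: 171970645 and 171978778 belong to old
-- s8 masters (see the analysis below), while only 171978924 (db1109, the current
-- s8 master) is relevant to the replica being waited on.
SELECT MASTER_GTID_WAIT('171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-268029771,180359242-180359242-170963125', 1);

-- After filtering out the bogus entries, the wait reduces to:
SELECT MASTER_GTID_WAIT('171978924-171978924-268029771', 1);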

Event Timeline

mobrovac created this task.

Change 517778 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: add ILoadBalancer::getReplicaResumePos method

https://gerrit.wikimedia.org/r/517778

Change 517830 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/517830

Change 517778 merged by jenkins-bot:
[mediawiki/core@master] rdbms: add ILoadBalancer::getReplicaResumePos method

https://gerrit.wikimedia.org/r/517778

Change 517830 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/517830

See my comment on patch, this may not have worked?

I have been taking a look at the comment on the patch; maybe it is easier to follow up here.
Checking the most recent error, which is:

db_server: 10.64.0.96
sql: SELECT MASTER_GTID_WAIT('171978924-171978924-268029771', 1)
normalized_message: Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-268029771,180359242-180359242-170963125

So 171978924 is db1109 (the current s8 master), and the server that errored was db1126 (an s8 slave), so that makes sense.
However, 171970645 is db1104 (the old s8 master), and 171978778 isn't a current host, but its IP, 10.64.48.26, correlates to the old, decommissioned s8 master (https://gerrit.wikimedia.org/r/#/c/operations/dns/+/540765/1/templates/wmnet).

So there is definitely something going on there: it shouldn't be waiting on those old masters; only 171978924 would make sense.
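
For reference, the server_id part of these GTIDs appears to encode the host's IPv4 address as a 32-bit integer; this is an inference from the 171978778 ↔ 10.64.48.26 correlation above rather than something stated in this task, but under that assumption the mapping can be checked directly in MariaDB:

SELECT INET_NTOA(171978778);  -- 10.64.48.26  (the old, decommissioned s8 master)
SELECT INET_NTOA(171970645);  -- 10.64.16.85  (db1104, the old s8 master)
SELECT INET_NTOA(171978924);  -- 10.64.48.172 (db1109, the current s8 master)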

Krinkle moved this task from Doing (old) to Inbox, needs triage on the Performance-Team board.
Krinkle subscribed.

This is blocking https://gerrit.wikimedia.org/r/525147.

@jcrespo wrote on Gerrit:

I would suggest pausing this deploy until the GTID filtering issue gets researched.

So this is my proposal: GTIDs get "infected" from the master, but they can at least be cleared. I want to depool a host where this happens a lot, db1121, remove its GTIDs, reconnect it to the master, and see if this keeps happening (I was able to reproduce it on db1121 with the current GTID position). The rationale is that, for the positions that fail to be waited on, the wait includes GTIDs that exist only on a replica but not on the new master. Thoughts?

+1
I remember having to manually delete GTIDs from the mysql tables when troubleshooting the gtid_domain_id issue on multisource hosts, as RESET SLAVE ALL wouldn't get those cleaned up; I am not sure whether that is still the case after MariaDB fixed that bug. If you manage to get them totally cleaned up, please paste the procedure here so we can document it (a rough sketch of the idea is below) :-)

Also, I would wait until tomorrow before depooling a host, as we have many depooled now for the PDU maintenance (or wait until the PDU maintenance is done, which I would expect to be finished by 12:00 or 13:00 UTC).
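
A minimal sketch of the cleanup idea discussed above, assuming MariaDB's standard GTID state locations (the exact procedure was still to be confirmed and documented at this point):

-- On the depooled replica, with replication stopped:
STOP SLAVE;
RESET SLAVE ALL;                     -- drops the replication configuration
-- Inspect any leftover per-domain GTID state that RESET SLAVE ALL may not clear:
SELECT * FROM mysql.gtid_slave_pos;
-- Clear it explicitly (only allowed while no slave threads are running):
SET GLOBAL gtid_slave_pos = '';
-- Note that gtid_binlog_state is kept separately and is only reset by RESET MASTER,
-- which also purges the binary logs, so that step should not be taken casually.
-- Finally, reconnect to the current master and watch whether the extra domains
-- reappear, e.g.:
-- CHANGE MASTER TO MASTER_HOST='<current s8 master>', MASTER_USE_GTID=slave_pos;
-- START SLAVE;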

I checked db1084, and local execution seems to work as intended:

root@db1084[(none)]> SELECT MASTER_POS_WAIT('db1138-bin.000467', 303912456, 1);
+----------------------------------------------------+
| MASTER_POS_WAIT('db1138-bin.000467', 303912456, 1) |
+----------------------------------------------------+
|                                               NULL |
+----------------------------------------------------+
1 row in set (0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('0-180359175-3368394787,171978876-171978876-109300035,171966557-171966557-578966402,180359190-180359190-192195477,171978775-171978775-4822899280,180359175-180359175-43143523,171970589-171970589-201132050', 1);
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MASTER_GTID_WAIT('0-180359175-3368394787,171978876-171978876-109300035,171966557-171966557-578966402,180359190-180359190-192195477,171978775-171978775-4822899280,180359175-180359175-43143523,171970589-171970589-201132050', 1) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                                                -1 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.00 sec)

root@db1084[(none)]> show global variables like 'gtid%';
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name          | Value                                                                                                                                                                                                      |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gtid_binlog_pos        | 171978876-171978876-108546562                                                                                                                                                                              |
| gtid_binlog_state      | 171978876-171978876-108546562                                                                                                                                                                              |
| gtid_current_pos       | 0-180359175-3368394787,171966557-171966557-578966402,171970589-171970589-201132050,171978775-171978775-4822899280,171978876-171978876-108546562,180359175-180359175-43143523,180359190-180359190-192195477 |
| gtid_domain_id         | 171970662                                                                                                                                                                                                  |
| gtid_ignore_duplicates | OFF                                                                                                                                                                                                        |
| gtid_slave_pos         | 0-180359175-3368394787,171966557-171966557-578966402,171970589-171970589-201132050,171978775-171978775-4822899280,171978876-171978876-108546562,180359175-180359175-43143523,180359190-180359190-192195477 |
| gtid_strict_mode       | OFF                                                                                                                                                                                                        |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546562', 1);                             
+------------------------------------------------------+
| MASTER_GTID_WAIT('171978876-171978876-108546562', 1) |
+------------------------------------------------------+
|                                                    0 |
+------------------------------------------------------+
1 row in set (0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-100850438', 1);
+------------------------------------------------------+
| MASTER_GTID_WAIT('171978876-171978876-100850438', 1) |
+------------------------------------------------------+
|                                                    0 |
+------------------------------------------------------+
1 row in set (0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546562', 1);
+------------------------------------------------------+
| MASTER_GTID_WAIT('171978876-171978876-108546562', 1) |
+------------------------------------------------------+
|                                                    0 |
+------------------------------------------------------+
1 row in set (0.00 sec)

root@db1084[(none)]> SELECT MASTER_GTID_WAIT('171978876-171978876-108546563', 1);
+------------------------------------------------------+
| MASTER_GTID_WAIT('171978876-171978876-108546563', 1) |
+------------------------------------------------------+
|                                                   -1 |
+------------------------------------------------------+

However, I just noticed that while the manual test log above is rational, other logs on db1084 are not:

normalized_message:
Timed out waiting for replication to reach 171966557-171966557-578966402,171978775-171978775-4822899280,171978876-171978876-100850438
Query:
SELECT MASTER_GTID_WAIT('171978876-171978876-100850438', 1)

I thought the logs and the query were misaligned, but assuming the log shows the position before filtering and the query the position after filtering, it is working as intended. But we still have a large number of errors: are we waiting on the right position? Do we have lag > 1 s on so many servers?

I think I need more metrics (e.g. the ratio of chronology protector executions to successful executions), because either this didn't work, its measurements are not accurate for what we really want to measure, or it worked and exposed worse lag issues than we thought.
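
On the lag question above, a hedged way to cross-check what a replica itself reports, independently of the chronology protector waits (the heartbeat table and its ts column follow the standard pt-heartbeat layout and are assumed here rather than quoted from this task):

-- Coarse, SQL-thread-based view on the replica itself:
SHOW SLAVE STATUS\G               -- check Seconds_Behind_Master
-- Sub-second view via pt-heartbeat (latest heartbeat written by the master):
SELECT ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;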

So this is the latest news:

On T234948 we found that special page updates can lead to a constant 1-2 seconds of lag, due to the confluence of a large, long-running query with heavy load from the many s8 edits plus the ongoing Wikibase refactoring. Because the timeout is only 1 second, this means the chronology protector will fail frequently. However, because we allow pooling up to (I think) a 5-second lag limit, the host is not depooled. Also, some jobs probably use the "slow" replica and try to use the chronology protector, which leads to lots of errors. Obviously, increasing the timeout cannot be done lightly, as it could cause worse problems.

The good news is that I believe this was deployed successfully and works as intended. The bad news is that there is a weakness in our infrastructure, but I am not sure how to move forward and need suggestions. We can open a discussion on another ticket, as this task's scope seems solved.

Indeed, the logging is based on the *whole* raw unfiltered position... I should add a logstash key for the filtered one too.

Change 542268 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() logging clearer about GTID filtering

https://gerrit.wikimedia.org/r/542268

Change 542268 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() logging clearer about GTID filtering

https://gerrit.wikimedia.org/r/542268

I've filed T235244 for the additional metrics mentioned above.

This same issue happened today (same host, db1087) and will most likely keep happening, as the updateSpecialPages.php script runs heavy queries.

I'll take a peek at the timeout values.

I apologize for assuming the GTID changes were to blame; in my defense, the logging was confusing to me. On the bad news side, I believe there is still some underlying issue showing "fake" lag / creating extra waits, due to some kind of contention at the MySQL or application layer. Right now this is most visible on the job queue, so I am not sure the timeout is the right way to go; it could be just some concurrency configuration to "smooth" MySQL queries/connections, or maybe some migration or other ongoing issue (e.g. it could be fallout from the application-server latency issue we have observed in the past).

Sadly I have not yet been able to pinpoint a sure cause. :-(

This is what I mean:

Screenshot_20200203_084038.png (screenshot attachment, 102 KB)