User Details
- User Since
- Apr 14 2020, 7:57 AM (39 w, 6 d)
- Availability
- Available
- LDAP User
- Kormat
- MediaWiki User
- SShirley (WMF)
Today
Fri, Jan 15
Thu, Jan 14
s6 codfw progress:
Wed, Jan 13
Mon, Jan 11
root@db2079.codfw.wmnet[wikidatawiki]> select * from information_schema.columns where table_name='revision_actor_temp';
+---------------+--------------+---------------------+--------------------+------------------+----------------+-------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME          | COLUMN_NAME        | ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE |
+---------------+--------------+---------------------+--------------------+------------------+----------------+-------------+
| def           | wikidatawiki | revision_actor_temp | revactor_rev       |                1 | NULL           | NO          |
| def           | wikidatawiki | revision_actor_temp | revactor_actor     |                2 | NULL           | NO          |
| def           | wikidatawiki | revision_actor_temp | revactor_timestamp |                3 |                | NO          |
| def           | wikidatawiki | revision_actor_temp | revactor_page      |                4 | NULL           | NO          |
+---------------+--------------+---------------------+--------------------+------------------+----------------+-------------+
Fri, Jan 8
Thu, Jan 7
Date: Thu Jan 7 14:46:40 2021 +0000
Tue, Jan 5
Dec 11 2020
Dec 9 2020
Startup failure:
$ pytest wmfmariadbpy/test/integration/cli_admin/test_mysqlpy.py
Image already up-to-date
FATAL: Already running
Exit: integration env setup failed: Command 'integration-env build && integration-env start' returned non-zero exit status 1.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
_pytest.outcomes.Exit: integration env setup failed: Command 'integration-env build && integration-env start' returned non-zero exit status 1.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Dec 7 2020
Never mind; the MySQL protocol is not compatible with SNI: the server speaks first (it sends its greeting before the client sends anything), and TLS is only negotiated mid-handshake, so there is no ClientHello with an SNI field available for a proxy to route on. https://stackoverflow.com/a/50301712/13706377
Would it be feasible to use SNI on haproxy to select the backend? It looks like it has SNI support.
Dec 3 2020
Dec 2 2020
Dec 1 2020
I'm still getting failures, but it's not clear where the issue is.
Nov 30 2020
Nov 26 2020
I think it's good enough to resolve at this point. Thanks for the reminder :)
Nov 25 2020
(and re-starting heartbeat makes the lag disappear ~instantly)
I've tested it in pontoon — stopping heartbeat on the master causes immediate lag to show up for the entire tree.
The orchestrator config change has been deployed, and the heartbeat tables for pc{1,2,3} have been cleaned up. Other sections will need similar cleanups before orchestrator can manage them properly.
Completed.
This is still occurring for me. From firefox:
Nov 24 2020
It looks like this is very old. The only mentions in the private repo:
Nov 23 2020
Cleaning up the heartbeat tables in prod is a bit tricky, as there's a lot of cruft, and a mix of STATEMENT vs ROW replication. My suggestion is that we update stale rows to set ts to 0, instead of trying to delete the rows. That makes it easy to filter them out from queries.
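Roughly what I have in mind (a sketch only; 171970589 is one of the stale server_ids visible in the Nov 20 dump below, and this assumes pt-heartbeat's usual varchar ts column):

-- Neutralise a stale row instead of deleting it:
UPDATE heartbeat.heartbeat SET ts = '0' WHERE server_id = 171970589;

-- Lag queries can then filter the neutralised rows out trivially:
SELECT TIMEDIFF(NOW(), ts) FROM heartbeat.heartbeat
WHERE ts != '0' ORDER BY ts DESC LIMIT 1;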
Resolved by https://gerrit.wikimedia.org/r/641986
Nov 20 2020
Ways to achieve this:
- Clean up the heartbeat table so that it only contains entries that are supposed to be there (T268336: Cleanup heartbeat.heartbeat on all production instances)
- With this, we can use the oldest entry in the heartbeat table to measure the current lag, because all entries are relevant.
- This will work for both the current situation, circular replication, and active-active.
- Only run pt-heartbeat on primary masters
- With this, we can use the newest entry in the heartbeat table to measure the current lag, because none of the other entries matter.
- This will work for the current situation, but will not work for circular replication/active-active, as we're back to having >1 entry that matters. (Query sketches for both options follow below.)
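For concreteness, the two measurements would look roughly like this (sketches only, assuming the standard pt-heartbeat heartbeat.heartbeat schema):

-- Option 1: the table only contains live entries, so the oldest one
-- bounds the lag along every replication path feeding this instance:
SELECT TIMEDIFF(NOW(), ts) AS lag FROM heartbeat.heartbeat ORDER BY ts ASC LIMIT 1;

-- Option 2: only primary masters write heartbeats, so only the newest
-- entry matters and older entries can be ignored:
SELECT TIMEDIFF(NOW(), ts) AS lag FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;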
Considerations:
- A non-primary master instance may be running heartbeat (e.g. all parsercache nodes, section masters in backup DC)
- Ignoring any heartbeat entry generated by the local instance causes all primary masters to show an arbitrary amount of lag, as they will only evaluate stale entries in the heartbeat table.
- Solution must allow circular replication, like we do leading up to DC switchovers
- Solution must not assume primary master is in the MW primary DC.
- This would have broken for the misc sections when MW was in CODFW recently.
- Ideally an instance would display lag relative to its DC-local master.
- Otherwise if master in the secondary DC is lagging, _all_ instances in that DC will show lag, which just adds a lot of noise for no information gain.
@jcrespo: Orchestrator only supports a single query for all instances. This means we can't supply per-DC/-section/-etc parameters. It also means we can't rely on the section's primary master to be in the $mw_primary DC, as that wasn't true for the misc sections while mw was running in codfw a couple of months ago. So, in short, there's no existing wheel we can reuse here.
Trying to do this in a reasonable fashion doesn't seem possible without T268336: Cleanup heartbeat.heartbeat on all production instances being done first.
root@db2094.codfw.wmnet[(none)]> select * from heartbeat.heartbeat;
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+--------
| ts                         | server_id | file              | position   | relay_master_log_file | exec_master_log_pos | shard | datacen
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+--------
| 2018-04-25T13:12:01.000970 | 171966669 | db1075-bin.002735 |  145906585 | NULL                  |                NULL | s3    | eqiad
| 2017-04-20T16:15:01.000960 | 171970589 | db1040-bin.002987 |  587360761 | db2019-bin.002532     |           773446073 | s4    | eqiad
| 2018-07-18T06:01:44.001480 | 171970637 | db1052-bin.005945 |  479413784 | NULL                  |                NULL | s1    | eqiad
| 2020-11-20T12:53:24.001090 | 171970661 | db1083-bin.005754 |  610841608 | NULL                  |                NULL | s1    | eqiad
| 2017-01-26T08:16:47.000790 | 171974683 | db1057-bin.002885 |  576316184 | db1052-bin.004556     |           210334207 | s1    | eqiad
| 2019-11-14T06:01:21.000750 | 171974720 | db1067-bin.003024 |  445994072 | NULL                  |                NULL | s1    | eqiad
| 2017-11-16T17:15:23.000810 | 171974884 | db1063-bin.001382 |  234441239 | NULL                  |                NULL | s5    | eqiad
| 2017-08-22T07:34:57.000440 | 171978775 | db1068-bin.000516 |  439889287 | NULL                  |                NULL | s4    | eqiad
| 2018-04-25T13:12:01.000630 | 171978777 | db1070-bin.001894 |   46969145 | NULL                  |                NULL | s5    | eqiad
| 2018-04-25T13:12:01.000630 | 171978778 | db1071-bin.006549 |  147776760 | NULL                  |                NULL | s8    | eqiad
| 2019-09-11T05:33:39.001920 | 180355171 | db2048-bin.005384 |  980454332 | db1067-bin.002779     |           346367189 | s1    | codfw
| 2017-05-10T13:29:31.001050 | 180359172 | db2016-bin.002977 |  503605243 | db1052-bin.004793     |           401461576 | s1    | codfw
| 2017-05-16T06:34:43.000930 | 180359174 | db2018-bin.003211 | 1032717323 | db1075-bin.001439     |           632039344 | s3    | codfw
| 2017-05-09T09:24:39.001160 | 180359175 | db2019-bin.002580 |  364384048 | db1068-bin.000049     |           373607821 | s4    | codfw
|                            | 180359179 | NULL              |       NULL | NULL                  |                NULL | NULL  | NULL
| 2020-11-20T12:53:24.000700 | 180363268 | db2112-bin.002509 |  488527459 | db1083-bin.005754     |           610841608 | s1    | codfw
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+--------
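One possible way to enumerate the cruft (a sketch; the 1-day cutoff is arbitrary, and it relies on the ISO-8601 varchar ts values comparing correctly as strings):

SELECT server_id, ts FROM heartbeat.heartbeat
WHERE ts < DATE_FORMAT(NOW() - INTERVAL 1 DAY, '%Y-%m-%dT%H:%i:%s.%f')
ORDER BY ts;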
root@pc2010.codfw.wmnet[(none)]> SELECT NOW(),server_id,ts,TIMEDIFF(NOW(),ts) FROM heartbeat.heartbeat WHERE server_id!=@@server_id ORDER BY ts DESC LIMIT 1;
+---------------------+-----------+----------------------------+--------------------+
| NOW()               | server_id | ts                         | TIMEDIFF(NOW(),ts) |
+---------------------+-----------+----------------------------+--------------------+
| 2020-11-20 10:50:20 | 171966644 | 2020-11-20T10:49:57.000820 | 00:00:22.999180    |
+---------------------+-----------+----------------------------+--------------------+
1 row in set (0.034 sec)
root@db1139.eqiad.wmnet[(none)]> SELECT NOW(),ts,TIMEDIFF(NOW(),ts) FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;
+---------------------+----------------------------+--------------------+
| NOW()               | ts                         | TIMEDIFF(NOW(),ts) |
+---------------------+----------------------------+--------------------+
| 2020-11-20 10:44:16 | 2020-11-19T02:54:04.001120 | 31:50:11.998880    |
+---------------------+----------------------------+--------------------+
root@db2093.codfw.wmnet[heartbeat]> SELECT TIMEDIFF(ts,NOW()) FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;
+--------------------+
| TIMEDIFF(ts,NOW()) |
+--------------------+
| 00:00:00.000880    |
+--------------------+
1 row in set (0.034 sec)
Nov 19 2020
Just reproduced this in chrome, and got this message:
Access to XMLHttpRequest at 'https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dtrue%26time%3d1605799406.496%26_%3d1605795670420' (redirected from 'https://thanos.wikimedia.org/api/v1/query?query=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true&partial_response=true&time=1605799406.496&_=1605795670420') from origin 'https://thanos.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
Tested a simple case in pontoon, works correctly
Nov 18 2020
One thing that's not currently clear is how to handle starting/stopping pt-heartbeat on masters.
Hi. I have merged the gerrit change, but
- I'm not sure how long it's going to take to take effect
- the current redirect is a 301 (a 'permanent' redirect), so browsers that have visited the URL before are likely to still go to the old location.
Fixed by https://gerrit.wikimedia.org/r/639765. From the commit description:
This should now be fixed by https://gerrit.wikimedia.org/r/641402. Needs testing.
Check completed successfully, we're done \o/
root@db2093.codfw.wmnet[orchestrator]> select hostname from database_instance;
+--------------------+
| hostname           |
+--------------------+
| db1077             |
| pc1010             |
| pc1007.eqiad.wmnet |
| pc1008.eqiad.wmnet |
| pc2007.codfw.wmnet |
| pc2008.codfw.wmnet |
| pc2010.codfw.wmnet |
| db1077.eqiad.wmnet |
| pc1010.eqiad.wmnet |
+--------------------+
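A quick way to spot the offending entries (a sketch, against orchestrator's database_instance table shown above; FQDNs contain a dot, bare hostnames don't):

SELECT hostname FROM database_instance WHERE hostname NOT LIKE '%.%';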
If there's an entry in database_resolve that maps to a bare hostname, e.g.:
Running one last check over all section instances to confirm the change has been made everywhere.