
Early testing of the new Wiki Replicas multi-instance architecture
Open, Medium, Public

Description

The multi-instance architecture cluster is ready for public testing as of February 1st, 2021.

Testing instructions

Remember the new replicas are in beta right now, so there could be unexpected problems. Please let us know how it goes for you, or ask any questions, in the comments of this task or in any of the support channels, such as the #wikimedia-cloud IRC channel.

For testing the new replicas, you will need to use a new hostname. Everything else remains the same, within the announced restrictions (no cross-wiki joins; use the DB corresponding to the hostname you are connecting to), so feel free to check the documentation on Wikitech.

Old -> New
${PROJECT}.{analytics,web}.db.svc.eqiad.wmflabs -> ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud
eswiki.web.db.svc.eqiad.wmflabs -> eswiki.web.db.svc.wikimedia.cloud
s${SECTION_NUMBER}.{analytics,web}.db.svc.eqiad.wmflabs -> s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud
s7.web.db.svc.eqiad.wmflabs -> s7.web.db.svc.wikimedia.cloud
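The rename is mechanical: only the domain suffix changes. A minimal Python sketch of the mapping (the helper name is ours, not part of any official library):

```python
def new_replica_host(old_host: str) -> str:
    """Translate a legacy Wiki Replicas hostname to the new one.

    Only the domain suffix changes; the project/section part and the
    analytics/web service selector stay the same.
    """
    old_suffix = ".db.svc.eqiad.wmflabs"
    new_suffix = ".db.svc.wikimedia.cloud"
    if not old_host.endswith(old_suffix):
        raise ValueError(f"not a legacy replica hostname: {old_host}")
    return old_host[: -len(old_suffix)] + new_suffix

# new_replica_host("eswiki.web.db.svc.eqiad.wmflabs")
#   -> "eswiki.web.db.svc.wikimedia.cloud"
```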

Basic example

Logged in on Toolforge:

$ ssh login.toolforge.org

$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud enwiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
ERROR 1049 (42000): Unknown database 'enwiki_p'

$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
|     1207 |
+----------+

$ mysql --defaults-file=replica.my.cnf -h eswiki.analytics.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
|     1207 |
+----------+

Advanced use cases

There is an up-to-date meta_p database on s7; it is fully correct right now *on the new replicas only*. It has been rebuilt, but it could drift again in the future. Do not rely solely on it, and consider falling back to parsing dblists from noc if there are issues.

If you are reading this and you don't know what it means, please just use the host names as specified above.
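The fallback mentioned above, parsing dblists from noc, boils down to fetching files like https://noc.wikimedia.org/conf/dblists/s7.dblist and reading one database name per line. A hedged sketch of the parsing step (the fetch itself, via urllib or similar, is left to the caller; the '#'-comment convention is an assumption about the format):

```python
def parse_dblist(text: str) -> list:
    """Parse a noc.wikimedia.org dblist: one database name per line.

    Blank lines and '#' comment lines (an assumption about the
    format) are skipped.
    """
    dbs = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            dbs.append(line)
    return dbs

# e.g. parse_dblist(urllib.request.urlopen(
#     "https://noc.wikimedia.org/conf/dblists/s7.dblist").read().decode())
```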


If you can help test the cluster, or want to migrate your tools early to make sure they will work well, please subscribe and comment, and let us know if you have any questions or problems. Thank you!

Event Timeline

Community-Tech would be happy to participate in early testing. All we need to do to test the new infrastructure is update our code in accordance with https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#What_should_I_do%3F , correct?

If so, CopyPatrol and Tool-Pageviews should already work. For XTools, Event Metrics and WS Export, we're writing a Symfony Bundle (T271348) to assist with local development (to address the use case discussed on the mailing list). We will be sure to share our solution once it is ready.

Hi, we would love to test with our code (T263678). We already connect explicitly to the databases we need, and we don't have any inter-wiki joins, so that's good. However, when working locally, we connect to meta and use all other dbs as required, because opening SSH tunnels to *all* the dbs is quite a hassle. I believe this shortcut will not work anymore? I think we will need to handle this with mappings.

[...] All we need to do to test the new infrastructure is update our code in accordance with https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#What_should_I_do%3F , correct?

Yes, and to test the new cluster there will be new hostnames: instead of <wikicodename>.(analytics|web).db.svc.eqiad.wmflabs, the new names will be something along the lines of <wikicodename>.(analytics|web).db.svc.wikimedia.cloud (not finalized; please wait for further info once everything is in place).

Once the old cluster is decommissioned, all the old hostnames will point at the new cluster transparently.

[...] Although when working locally, we connect with meta and use all other dbs as required because connecting to *all* the dbs with SSH is quite a hassle. I believe this shortcut will not work anymore? I think we need to handle this hassle with mappings.

@tanny411 Indeed, this won't be possible in the new cluster. See @MusikAnimal's comment; they have similar concerns for some of their tools and will be coding something that uses the mappings to ease that pain, and they will share their solution. I'm sure it will be helpful later.

There may also be some other way to ease local development, I'm sure there will be more people with the same problems and we can figure best practices together.
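One way to tame the local-development hassle discussed above: since every wiki in a section is served by the same backend, you only need one SSH tunnel per section rather than one per wiki. A sketch, assuming you have already built a wiki-to-section mapping (from meta_p or the noc dblists); the function name and the mapping in the example are illustrative:

```python
def tunnels_needed(wikis, wiki_to_section):
    """Group the wikis you need by replica section.

    wiki_to_section maps e.g. "eswiki" -> "s7" (unknown wikis raise
    KeyError). Returns {section: sorted wikis}, so one SSH tunnel per
    key covers every wiki in its list.
    """
    groups = {}
    for wiki in wikis:
        groups.setdefault(wiki_to_section[wiki], []).append(wiki)
    return {section: sorted(ws) for section, ws in groups.items()}

# With a mapping like {"enwiki": "s1", "eswiki": "s7", "metawiki": "s7"},
# requesting all three wikis needs only two tunnels: one to s1, one to s7.
```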


Thanks, both, for keeping tabs on this; very much appreciated. Once we have clear information I will update the wiki page with the testing instructions (the hostnames) and post here too.

Jhernandez renamed this task from Early testing of the multi-instance architecture to Early testing of the new Wiki Replicas multi-instance architecture. Jan 21 2021, 2:02 PM

The new replicas are ready for testing! Remember this is in beta right now, and please let us know how it goes for you, or ask any questions, here or in any of the support channels such as #wikimedia-cloud on IRC.

I'll be posting these instructions and any further updates in the task description too.


Does ToolsDB have a name under wikimedia.cloud or should I continue using tools.db.svc.eqiad.wmflabs?

Another question: where can I find the heartbeat_p database on the new setup?

We do have the heartbeat database on each instance (i.e., clouddb1013:3311), but I don't see the view there, so that might need to be created. Keep in mind that each instance will now have its own heartbeat table.

That appears to have been manually created or something on the legacy replicas. I'll have to check definitions and such.

It looks like:

root@labsdb1012.eqiad.wmnet[heartbeat_p]> show create table heartbeat\G
*************************** 1. row ***************************
                View: heartbeat
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW `heartbeat` AS select `heartbeat`.`heartbeat`.`shard` AS `shard`,max(`heartbeat`.`heartbeat`.`ts`) AS `last_updated`,(greatest(timestampdiff(MICROSECOND,max(`heartbeat`.`heartbeat`.`ts`),utc_timestamp()),0) / 1000000.0) AS `lag` from `heartbeat`.`heartbeat` group by `heartbeat`.`heartbeat`.`shard`
character_set_client: binary
collation_connection: binary
1 row in set (0.001 sec)
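The view's `lag` expression is just "microseconds since the newest heartbeat, clamped at zero, expressed in seconds". The same computation in Python, for reference (a sketch for illustration, not part of any replica tooling):

```python
from datetime import datetime, timezone

def replication_lag_seconds(last_heartbeat, now=None):
    """Mirror the view's formula:
    greatest(timestampdiff(MICROSECOND, max(ts), utc_timestamp()), 0) / 1000000.0
    """
    if now is None:
        now = datetime.now(timezone.utc)
    micros = (now - last_heartbeat).total_seconds() * 1_000_000
    # A heartbeat timestamped "in the future" clamps to zero lag,
    # exactly like the greatest(..., 0) in the SQL view.
    return max(micros, 0) / 1_000_000.0
```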

The grant for labsdbuser is actually there already:

root@clouddb1013.eqiad.wmnet[(none)]> show grants for labsdbuser;
| GRANT SELECT, SHOW VIEW ON `heartbeat\_p`.* TO `labsdbuser`

@Bstorm if you want me to create the _p database let me know.

@Marostegui If you could, that would be great.

I have created the _p database and the view.
However, the view needs some cleaning as it has some old entries there:

root@cumin1001:/home/marostegui# mysql.py -hclouddb1020:3318 -e "select * from heartbeat_p.heartbeat"
+-------+----------------------------+----------------+
| shard | last_updated               | lag            |
+-------+----------------------------+----------------+
| NULL  | 2013-04-17T19:16:14.000660 | 245990303.9993 |
| s5    | 2018-01-09T06:07:05.001020 |  96738452.9990 |
| s8    | 2021-02-01T21:54:38.000970 |         0.0000 |
+-------+----------------------------+----------------+

I will try to get it cleaned up during the week.

I released new versions of my toolforge library to connect to the new replicas (full documentation):

  • Python: 5.0.0b1 (see changelog)
  • Rust: 0.3.0-beta.1

You should just be able to upgrade and it'll start using the new hostnames (of course, you also need to check your queries to make sure you're not doing any cross-wiki joins, etc.).

Seems to be down

$ ssh login.toolforge.org

$ mysql --defaults-file=replica.my.cnf -h eswiki.analytics.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)"

I can confirm that the replicas that host eswiki are up (everything is up), so maybe it is an issue with the firewall/proxies?
dbproxy1018 and dbproxy1019 do see all the instances as UP.

I think something has changed. I cannot access the new replicas at all.
mysql --defaults-file=replica.my.cnf -h enwiki.web.db.svc.wikimedia.cloud enwiki_p with no query does that. I'll check through the chain. Something may have changed or broken the routing.

I see s1 and s5 are labeled "down" at the proxy. Some other sections are ok.

Actually others seem to be dropping. I'll check the upstream proxy.

Restarting haproxy fixed it. I think I know what is up. Because there are cascading proxies and we are going through the LVS, I suspect the keepalives started failing for a bit and the backend was marked down. I will change the front proxy so that it does not do that; only the back proxy should really handle keepalives and health checks.

Health checks were definitely failing at this layer:

Feb  2 00:48:05 clouddb-wikireplicas-proxy-1 haproxy[5546]: Health check for server mariadb-s6/208.80.154.243-s6 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.4.15-MariaDB", check duration: 2ms, status: 20/20 UP.
Feb  2 00:48:05 clouddb-wikireplicas-proxy-1 haproxy[5546]: Health check for server mariadb-s6/208.80.154.243-s6 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.4.15-MariaDB", check duration: 2ms, status: 20/20 UP.
Feb  2 00:48:16 clouddb-wikireplicas-proxy-1 haproxy[5546]: Health check for server mariadb-s3/208.80.154.243-s3 failed, reason: Layer4 connection problem, info: "Network is unreachable", check duration: 0ms, status: 19/20 UP.

Change 661132 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas-proxy: front proxy should not keep connections down

https://gerrit.wikimedia.org/r/661132

We lost LDAP at some point as well. That goes through LVS too. @BBlack Did anything change yesterday? Otherwise this may just be slowness or latency somehow.

Change 661132 merged by Bstorm:
[operations/puppet@production] wikireplicas-proxy: front proxy should not keep connections down

https://gerrit.wikimedia.org/r/661132

Ok, I have changed the behavior of the front proxy, which unfortunately is likely to be vulnerable to traffic fluctuations at the LVS layer. Now it should recover on its own quickly, and it is also slower to drop a backend. I expect additional tweaks are possible, which I'll work on at T271476.

bstorm@tools-sgebastion-08:~$ mysql --defaults-file=replica.my.cnf -h enwiki.web.db.svc.wikimedia.cloud enwiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
|      366 |
+----------+

Thank you for the bug reports!!!! @Zache and @Jhernandez

Thanks for fixing. I was able to log in, but the connection (or timeouts) is still a little bit shaky.

MariaDB [wikidatawiki_p]> select min(rev_id) from revision where  rev_timestamp>20200005080436;
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [wikidatawiki_p]> select max(rev_id) from revision where  rev_timestamp<20200005080436;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    2424319
Current database: wikidatawiki_p

ERROR 2013 (HY000): Lost connection to MySQL server during query

I'll keep picking at this today. That's really not good.

@Zache I don't see drops at the proxies since I last restarted them, so you'd be seeing either latency at another layer for me to chase down or the actual restart I did. Was that connection started after 15:55 UTC? If so, then I have something else to ponder about.

Never mind, just reproduced your error. I'll dig.

Yes, the connection was started after your "Tue, Feb 2, 3:53 PM" comment.

Yeah. So I have to find which layer hung up. It wasn't the front proxy. It was also quick. I'd expect that from the query killer or something, but there's no way that was it.

Change 661158 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas-proxy: tune the main haproxy config for databases

https://gerrit.wikimedia.org/r/661158

Change 661158 merged by Bstorm:
[operations/puppet@production] wikireplicas-proxy: tune the main haproxy config for databases

https://gerrit.wikimedia.org/r/661158

@Zache So far, I think that last tweak may have fixed the problem. I'm running a long query, and it is holding.

Also:

MariaDB [wikidatawiki_p]> select max(rev_id) from revision where  rev_timestamp<20200005080436;
+-------------+
| max(rev_id) |
+-------------+
|  1348244083 |
+-------------+
1 row in set (11 min 37.10 sec)

I think that genuinely fixed the problems. Thanks again for finding these issues.

For PHP/Symfony users, the ToolforgeBundle has been updated to include a replicas connection manager. It uses the dblists at noc.wikimedia.org to ensure your app has no more open connections than it needs. It also has a simple command (php bin/console toolforge:ssh) to open an SSH tunnel for easier development on local environments. See the docs at https://github.com/wikimedia/ToolforgeBundle#replicas-connection-manager

There is an up-to-date meta_p database on s7, it's totally correct right now *on the new replicas only*. It has been rebuilt but it could experience drift again in the future. Do not rely solely on it and consider falling back to parsing dblists from noc if there are issues.

I'm not sure how to interpret this. If meta_p can't be trusted to be up to date, why would we use it at all? Grabbing from noc seems a bit hacky; are there bugs about the meta_p issues that we could work on to make it reliable?

I'm not sure how to interpret this. If meta_p can't be trusted to be up to date, why would we use it at all? Grabbing from noc seems a bit hacky; are there bugs about the meta_p issues that we could work on to make it reliable?

Meta_p is updated when a wiki is added; it is not normally updated later. So if a wiki is moved to a new section, the move may not be captured. It is therefore usually a reasonable place to check, but I caution people against trusting it blindly. Also, simply checking DNS is much faster than either option: plenty of libraries (and dig) can find what a DNS name is a CNAME of in milliseconds. It just matters which question you are asking.

If you are asking "which section is this database in", DNS is the fastest way, honestly. If you are asking "what databases are in this section", then noc or meta_p is likely faster.

To be clear, moving wikis to new sections isn't something that happens often. However, DNS is updated when we do related operations, based on the content on noc. It is possible to update meta_p during wiki moves we are informed about, but this was not done in the past on legacy replicas.
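The DNS approach above works because each per-wiki replica name is a CNAME to its section's name. Assuming you resolve the CNAME with `dig +short <wiki>.analytics.db.svc.wikimedia.cloud CNAME` or a DNS library, extracting the section is trivial; this parsing helper (and the CNAME layout it assumes) is ours:

```python
def section_from_cname(cname: str) -> str:
    """Extract the section ("s1".."s8") from a resolved replica CNAME.

    Assumes the CNAME looks like "s7.analytics.db.svc.wikimedia.cloud."
    (with or without the trailing dot), as described in the comments above.
    """
    return cname.rstrip(".").split(".", 1)[0]

# section_from_cname("s7.analytics.db.svc.wikimedia.cloud.") -> "s7"
```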

Hello! On which section does the centralauth_p database live? It does not appear to be on any of the db lists at noc.wikimedia.org. Perhaps a view needs to be created for it, too?

I see from the description that meta_p lives on s7. Can we depend on it to always live on that section? It also is not listed at https://noc.wikimedia.org/conf/dblists/s7.dblist

It is on s7 as well, however as you correctly guessed, the views aren't there, so it needs creation /cc @Bstorm

Eeek! It's supposed to be on s7. I'll try to find out what's up with that today.

@Marostegui I found the bug in my scripts and am patching it. However, while testing on clouddb1014, I found that on this version of MariaDB I still cannot create the grant 'GRANT SELECT, SHOW VIEW ON centralauth\_p.* TO 'labsdbuser';' in the script, which makes me sad. I'd hoped that had been fixed in the course of upgrades.

pymysql.err.OperationalError: (1044, "Access denied for user 'maintainviews'@'localhost' to database 'centralauth\\_p'")

I created the db and grant manually on clouddb1014, and the script ran. I'll get the patch up and fix it on clouddb1018.

Change 664860 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: fix the centralauth management bit of the view scripts

https://gerrit.wikimedia.org/r/664860

Change 664860 merged by Bstorm:
[operations/puppet@production] wikireplicas: fix the centralauth management bit of the view scripts

https://gerrit.wikimedia.org/r/664860

Ok, centralauth is up as well. Thanks for the report @MusikAnimal

Ok, centralauth is up as well. Thanks for the report @MusikAnimal

Thanks for the quick fix!

Is there a guarantee that meta_p and centralauth_p will stay on s7? If not, how can I programmatically locate which section they are on?

Also, is the timeline at https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timeline up to date? It seems to imply the switchover will happen by March. I believe it's highly unlikely that my team will be able to fix all of our tools by then, and I haven't gotten around to fixing my bots :( Is there any chance we could get a few more weeks of grace time?

centralauth is on s7 universally, I think? It just doesn't get listed in the dblists, so it needs specific special cases in the scripts; that special case was a bit mixed up in my first pass. It's on s7 on the replicas because it is upstream, and it would only change if the Foundation's upstream database design changed.

Meta_p is specific to the wikireplicas, so I can definitely guarantee it will stay on s7 as long as the Foundation's databases are using the s1-s8 section architecture.

I'll have to get back to you on the deadline and specifics around that. The feedback there helps, too. We were just discussing that this morning.

There are no plans at all to move centralauth out of s7 in the production database. Moving wikis is very hard, and it is only done in very unique situations.

@MusikAnimal, I clarified the timeline on the wiki to address your question. In short, barring unforeseen implementation issues, the new cluster will transition to be the default in March. You should expect more communication and announcements as this happens. I encourage you to keep planning your migration and utilizing the new cluster for testing.

The ability to stretch the timeline for removing the old cluster, however, is limited by how long the old cluster can remain operational and replicating. Ideally all existing bots and services will be migrated before the old cluster is shut off, but there is no technical guarantee it will continue to replicate properly and have up-to-date data. So flexibility exists, but it is limited. For this reason I encourage you to act now. Please keep in touch (as you have) on progress and concerns as the transition happens. Thanks!

I've updated these pages in the docs to point to the new names:

Will be sending an email with the refreshed timeline and more info tomorrow.