
When switching DCs, update pc hosts in tendril
Closed, Declined · Public

Description

@Marostegui points out that after I ran the 08-update-tendril cookbook, the entries for pcNNNN hosts still needed to be updated by hand. They should be included as part of that automation.

In order to compute the Tendril update, the script iterates with for section in mysql_legacy.CORE_SECTIONS, which doesn't include pc1, pc2, or pc3 -- but I don't think the right fix is to add them there. That would also mean we operate on them in other steps, e.g. while setting the DBs read-only and then read-write again. With nothing but optimism in my heart, I'm assigning this task over to the Touched It Last owner of the mysql_legacy module, to figure out what needs to happen here. (A reasonable answer is "we'll cover this with the new mysql module" -- this task is just to make sure the cookbook does the right thing, one way or the other.)
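
As a rough illustration only (the section lists and helper names below are assumptions, not the actual cookbook or mysql_legacy code), the idea would be to keep two separate section lists, so the tendril step covers parsercache while the read-only toggles do not:

    # Hypothetical sketch, not the real spicerack/mysql_legacy API.
    CORE_SECTIONS = ("s1", "s2", "s3", "x1")      # illustrative subset only
    PARSERCACHE_SECTIONS = ("pc1", "pc2", "pc3")  # kept separate on purpose

    def sections_for_tendril_update():
        """The tendril-update step should cover core *and* parsercache sections."""
        return CORE_SECTIONS + PARSERCACHE_SECTIONS

    def sections_for_read_only_toggle():
        """The read-only/read-write steps keep operating on core sections only."""
        return CORE_SECTIONS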

Note also that the Cumin query used by mysql_legacy in get_core_dbs() to identify the DB masters won't work without modification, since it starts with "A:db-core and..." and that alias doesn't include the pcNNNN hosts. Very reasonably, since parsercache isn't part of the core DBs.
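
For example, a per-section query could be built from the aliases used in the cumin example further down (the helper name here is made up; only the aliases come from this task):

    # Hypothetical helper; the Cumin aliases are the ones shown in the example below.
    def parsercache_master_query(datacenter: str, section: str) -> str:
        """Build a Cumin query for a parsercache section's master in one DC."""
        return f"A:{datacenter} and A:db-section-{section} and A:db-role-master"

    # parsercache_master_query("eqiad", "pc1")
    # -> "A:eqiad and A:db-section-pc1 and A:db-role-master"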

And one last thing, in passing:

rzl@cumin1001:~$ sudo cumin 'A:eqiad and A:db-section-pc1 and A:db-role-master'
2 hosts will be targeted:
pc[1007,1010].eqiad.wmnet

That seems like it ought to only be pc1007, right? Same in codfw.

Event Timeline

RLazarus triaged this task as Medium priority. Oct 28 2020, 10:23 PM
RLazarus updated the task description.

Thanks @RLazarus.
pc1 is a bit different from the rest, as it has 2 hosts per DC rather than the normal single pair, like pc1008 -> pc2008 (pc2) or pc1009 -> pc2009 (pc3).

pc1 has the following topology:

[Attached screenshot: Captura de pantalla 2020-10-29 a las 13.25.01.png, showing the pc1 replication topology]

We have pc1010 and pc2010 there just for HA purposes. They are "floating" hosts that can be moved to any other section (pc2 or pc3) if needed. And for the sake of having them replicate from somewhere so they are not fully cold, we placed them on pc1, but they could just as well have been on pc2 or pc3.
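
To summarize the layout described above in one place (section membership is taken from this task; the exact per-DC pairing for pc1 is my reading of it, so treat it as an assumption):

    # Illustrative only: parsercache section membership as described in this task.
    PARSERCACHE_HOSTS = {
        "pc1": {"eqiad": ["pc1007", "pc1010"], "codfw": ["pc2007", "pc2010"]},
        "pc2": {"eqiad": ["pc1008"], "codfw": ["pc2008"]},
        "pc3": {"eqiad": ["pc1009"], "codfw": ["pc2009"]},
    }
    # pc1010 / pc2010 are the floating spares, parked on pc1 but movable to pc2 or pc3.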

Just for the record, this is what I did to "fix" the tendril issue:

root@db1115.eqiad.wmnet[tendril]> select id,host from servers where host like 'pc100%';
+------+--------------------+
| id   | host               |
+------+--------------------+
| 1719 | pc1007.eqiad.wmnet |
| 1698 | pc1008.eqiad.wmnet |
| 1743 | pc1009.eqiad.wmnet |
+------+--------------------+
3 rows in set (0.001 sec)

And then:

update shards set master_id=1719 where name='pc1' limit 1;
update shards set master_id=1698 where name='pc2' limit 1;
update shards set master_id=1743 where name='pc3' limit 1;
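
If this ever did get automated, a rough sketch of the same fix could look like the following (the servers/shards schema is as shown above; the client library, connection details, and host list are assumptions):

    # Hypothetical sketch only -- not part of the existing cookbooks.
    import pymysql  # assumption: any MySQL client library would do

    # Assumed eqiad masters per pc section, matching the manual fix above.
    PC_MASTERS = {
        "pc1": "pc1007.eqiad.wmnet",
        "pc2": "pc1008.eqiad.wmnet",
        "pc3": "pc1009.eqiad.wmnet",
    }

    def update_tendril_pc_masters(conn):
        """Point each pc section's shards.master_id at its master's servers.id."""
        with conn.cursor() as cur:
            for shard, host in PC_MASTERS.items():
                cur.execute("SELECT id FROM servers WHERE host = %s", (host,))
                (master_id,) = cur.fetchone()
                cur.execute(
                    "UPDATE shards SET master_id = %s WHERE name = %s LIMIT 1",
                    (master_id, shard),
                )
        conn.commit()

    # Usage (credentials omitted):
    # conn = pymysql.connect(host="db1115.eqiad.wmnet", database="tendril")
    # update_tendril_pc_masters(conn)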

To provide a bit more background: under normal circumstances, pc* hosts are active-active, and no change should happen on them (no read-only changes, etc.). This was solved in zarcillo by setting masters per datacenter, so no change has to happen there. Because zarcillo never replaced tendril, the issue is not so much with the switchover scripts as with the tendril model, which can only set up one master per global replica set, not one per datacenter. For convenience, the "masters" on tendril are considered to be the ones in the active DC, but that doesn't accurately reflect reality.

I think when tendril disappears most of these issues will too.

This is probably not worth the effort if we are expecting to drop tendril "soon". We can update these manually for the next switch (and switch back), and hopefully by the one after that tendril will be gone. Up to @LSobanski and @Kormat.

I discussed this with @RLazarus back in October, and we agreed it's not worth the effort given the impending any-day-now™ tendril decomm. (I forgot to update the task with that, apologies.)