Set up replication for zarcillo
Closed, Resolved (Public)

Description

Currently zarcillo is single-homed on db2093. We should set up replication between it and db1115, where tendril is hosted.

Also add a CNAME to point to the zarcillo master.

Question: given that eqiad is currently the master DC, should the master zarcillo instance be on db1115?

Event Timeline

Marostegui triaged this task as Medium priority. Jul 13 2020, 1:24 PM
Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui subscribed.

zarcillo was moved to db2093 because db1115 was not stable for a few weeks, but since T252331, T231165 and T231182 were solved, we've not had any issues anymore. We still have T231769, but that's "fine", I think.
I think we can revert and let db1115 be the master.

we've not had any issues anymore

I honestly don't trust tendril: we have said many times "Issues seem now fixed/mitigated" and they end up coming back. I think for performance reasons we should have zarcillo close to the primary datacenter, but not until tendril really disappears. OOMs will happen, and having to retry backups is a big tax for me specifically, even if it only happens once a year.

As a compromise between both viewpoints, I think a circular replication topology could work in this case, and would avoid accidental data loss due to "writing to the wrong server" until we have a more stable setup.

Alternatively, we can request virtual machines on production for both dc instances and that way we can easily separate both services so they don't interact, until tendril goes away.

we've not had any issues anymore

I honestly don't trust tendril: we have said many times "Issues seem now fixed/mitigated" and they end up coming back. I think for performance reasons we should have zarcillo close to the primary datacenter, but not until tendril really disappears. OOMs will happen, and having to retry backups is a big tax for me specifically, even if it only happens once a year.

If having to retry backups every time the HW or the DB fails is costly in human time, we should probably invest time in easing that a bit, or even automating it as much as possible. HW issues, occasional maintenance and network blips will always be there.
Having zarcillo in a different DC on a writable database, where everything else is set to RO, is already confusing and a snowflake, and it has bitten us a few times until we got used to it.

I don't expect tendril to be out in the next 6 months at the very least, so if the reason not to have zarcillo on the same tendril host is the cost of having to retry backups, let's tackle that I would say.
We cannot be 100% sure that wherever we host zarcillo will always be up, especially if shared with more stuff.

We cannot be 100% sure that wherever we host zarcillo will always be up, especially if shared with more stuff.

Hence see my last comment.

That also means introducing even more infra, which would also be different from the rest (VMs). Why not try to make the retrying process a bit easier or auto-healing (maybe even an OKR for next Q)?

let's tackle that I would say

I am sorry that hosting the backup logs database was such an overhead; I honestly thought it was much less resource intensive for DBAs. I will ask for resources elsewhere and split that into my own task so I don't bother you more here.

I have never said it is an overhead, and you know very well it is not resource intensive. My point is: let's try not to have more special cases, and let's try to keep things as consistent as possible (i.e. always having the same DC as writable); if we do need special snowflakes, let's make sure it is for a good reason and very well documented. Again, moving it to codfw (because db1115 was crashing all the time) bit us many times lately, until we got used to having a writable database in a different DC.

Moving the database away isn't the solution, as you'd have the same problem: occasional issues wherever it is hosted. That's why I am suggesting that we work (in future quarters) on making the retry of a failed backup (whatever the reason for the failure) not a costly task for you, or at least a less costly one.

Zarcillo isn't only the backup logs database; it is also our asset, and it is widely used for schema changes, switchovers and so forth. So I think it should be hosted along with the rest of the infra.
There is no need to ask for other resources or isolate it I think.
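The suggestion above, making a failed backup cheaper to retry, could be sketched roughly as follows. This is only an illustration under stated assumptions: `run_with_retries` and the `task` callable are hypothetical names, not part of any existing backup tooling, and the retry policy (exponential backoff, three attempts) is an arbitrary choice.

```python
import time


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or `attempts` tries have been used.

    Hypothetical helper: `task` stands in for whatever launches a backup.
    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            # Broad catch is deliberate in this sketch; real tooling would
            # likely only retry on errors known to be transient.
            if attempt == attempts:
                raise  # out of tries: surface the last error to a human
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the behaviour can be exercised without actually waiting.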

I've created T258045 for the backups database. You can freely decide about zarcillo now.

So it turns out that work on T257816 revealed a lot of hardcoded endpoints, which made that task not just an option but a requirement to achieve this one. More work will be needed, but at least now things are not hardcoded in the check script and can be updated in puppet when this happens.
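As a rough illustration of the difference between a hardcoded endpoint and a configurable one, a script could resolve the zarcillo host from its environment instead of a literal. This is a sketch only: the `ZARCILLO_HOST` and `ZARCILLO_PORT` variable names and the default hostname are hypothetical and are not what the check script or puppet actually use.

```python
import os

# Hypothetical placeholder; the real endpoint is not stated in this task.
DEFAULT_ZARCILLO_HOST = "zarcillo-master.example.internal"


def zarcillo_endpoint(env=None):
    """Return (host, port) for zarcillo from configuration, not a literal.

    `env` defaults to the process environment; a plain dict can be passed
    instead, which keeps the lookup easy to test.
    """
    env = os.environ if env is None else env
    host = env.get("ZARCILLO_HOST", DEFAULT_ZARCILLO_HOST)
    port = int(env.get("ZARCILLO_PORT", "3306"))
    return host, port
```

With something like this, moving the master only requires changing the configured value (or the name it points at), not every script that connects to it.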

Change 613136 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Enable binlogs for zarcillo

https://gerrit.wikimedia.org/r/613136

Change 613136 merged by Kormat:
[operations/puppet@production] mariadb: Enable binlogs for zarcillo

https://gerrit.wikimedia.org/r/613136

Mentioned in SAL (#wikimedia-operations) [2020-07-16T13:04:37Z] <kormat> restarting tendril to pick up new mariadb config T257816

Change 613149 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] section,report_users: Change active zarcillo host

https://gerrit.wikimedia.org/r/613149

Change 613149 abandoned by Marostegui:
[operations/software@master] section,report_users: Change active zarcillo host

Reason:

https://gerrit.wikimedia.org/r/613149

Change 613150 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] section,report_users: Change active zarcillo host

https://gerrit.wikimedia.org/r/613150

Change 613152 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Update zarcillo location to db1115

https://gerrit.wikimedia.org/r/613152

Change 613150 merged by Marostegui:
[operations/software@master] section,report_users: Change active zarcillo host

https://gerrit.wikimedia.org/r/613150

Change 613152 merged by Kormat:
[operations/puppet@production] mariadb: Update zarcillo location to db1115

https://gerrit.wikimedia.org/r/613152

Change 613157 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/dns@master] wmnet: Add zarcillo-master CNAME.

https://gerrit.wikimedia.org/r/613157

Change 613157 merged by Kormat:
[operations/dns@master] wmnet: Add zarcillo-master CNAME.

https://gerrit.wikimedia.org/r/613157

Change 613158 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/software/wmfmariadbpy@master] switchover: update zarcillo db location

https://gerrit.wikimedia.org/r/613158

Change 613158 merged by jenkins-bot:
[operations/software/wmfmariadbpy@master] switchover: update zarcillo db location

https://gerrit.wikimedia.org/r/613158

Replication is in place, but monitoring is not.
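For context, a replication check typically looks at the replica's `SHOW SLAVE STATUS` output. The sketch below is not the check that was later added in puppet; it only assumes the row has already been fetched into a dict keyed by the standard column names (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`). The function name, thresholds and exit codes are illustrative.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit codes


def check_replication(status, max_lag_seconds=60):
    """Evaluate one SHOW SLAVE STATUS row given as a dict (or None).

    Returns a (code, message) tuple. Hypothetical helper for illustration.
    """
    if not status:
        return CRITICAL, "replication is not configured"
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, f"threads not running (IO={io_running}, SQL={sql_running})"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # The server reports NULL here when lag cannot be determined.
        return CRITICAL, "replication lag is unknown"
    if int(lag) > max_lag_seconds:
        return WARNING, f"replication lag is {int(lag)}s"
    return OK, f"replication running, lag {int(lag)}s"
```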

Change 614747 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add replication monitoring for zarcillo

https://gerrit.wikimedia.org/r/614747

Change 614747 merged by Kormat:
[operations/puppet@production] mariadb: Add replication monitoring for zarcillo

https://gerrit.wikimedia.org/r/614747

Monitoring is not properly in place yet, but I'm going to track that in T258566.