User Details
- User Since
- May 11 2015, 8:31 AM (474 w, 22 h)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF)
Fri, Jun 7
This is ready for dc-ops.
This is ready for dc ops.
Wed, May 29
I will migrate the backups to 10.6 without removing yet the 10.4 backup sources.
@Volans not Amir, but re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple in the short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded; it is something that could be improved later on. For example, I am personally interested in having a queryable service/API for backup checks later, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as each step is an improvement 0:-D.
Thu, May 23
Wed, May 22
Followup to T361087.
I did a disk stress test for an hour or so and saw no media errors, SMART errors, or RAID controller weirdness.
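The SMART part of a check like the one above can be automated by parsing `smartctl -H` output. A minimal sketch (the health-line formats shown are what smartctl prints for ATA and SAS drives; running it still requires smartmontools and root):

```python
def smart_health_ok(smartctl_output: str) -> bool:
    """Parse `smartctl -H` text output and report the overall verdict.

    ATA drives print a line like:
      SMART overall-health self-assessment test result: PASSED
    while SAS drives print:
      SMART Health Status: OK
    """
    for line in smartctl_output.splitlines():
        line = line.strip()
        if line.startswith("SMART overall-health self-assessment test result:"):
            return line.endswith("PASSED")
        if line.startswith("SMART Health Status:"):
            return line.endswith("OK")
    # No health line found: treat as not healthy rather than guessing.
    return False
```

Feeding it the captured output of `smartctl -H /dev/sda` from a cron job would let an alert fire on any non-PASSED verdict.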
Resolving for now. A disk was rebuilt on the 17th of May:
Tue, May 21
- Stop es4 and es5 backups
- Generate a full clusterX and clusterY last backup
- Archive it into long term backups
- Remove dump user
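The checklist above could be scripted roughly as follows. This only builds the list of steps; the section names (`clusterX`/`clusterY` are placeholders from the checklist itself), paths, and commands are illustrative, not the actual WMF backup tooling:

```python
def decommission_plan(sections, archive_dir="/srv/long-term-backups"):
    """Return the checklist steps as a list of commands (a dry-run plan).

    Paths and the archive directory are hypothetical examples.
    """
    steps = []
    for s in sections:
        dump = f"/srv/backups/{s}-final.sql.gz"
        # Generate one last full logical dump of the section...
        steps.append(f"mariadb-dump --all-databases | gzip > {dump}")
        # ...and move it into long-term archive storage.
        steps.append(f"mv {dump} {archive_dir}/")
    # Finally, drop the dump user (SQL, to be run on each host).
    steps.append("DROP USER IF EXISTS 'dump'@'%';")
    return steps
```

Emitting a plan first (rather than executing directly) makes the irreversible steps easy to review before running them.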
Tue, May 14
Mon, May 13
Thanks, the upgrade is no issue, but the data will have a lot of backup errors due to not being depooled before maintenance, and will need some work.
May 9 2024
All backups will now be generated from 10.6 servers, with the exception of s1. Leaving a couple of hosts on 10.4 before upgrading or decommissioning them.
@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:
May 7 2024
May 6 2024
It was failing back in 2021:
Here are the two file versions (the hashes confirm they are the same files):
Apr 30 2024
Apr 25 2024
In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.
It booted into bullseye.
Booting failed (PXE):
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Apr 24 2024
Will reimage soon.
Apr 23 2024
Looking good now:
Hi, backups of the matomo database failed with:
Update: on both eqiad and codfw we are generating dumps and snapshots on 10.6 for x1, s2, s6, s5 and s3.
Apr 18 2024
Hi, after 73470d0dca68abee0, ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056) a restart is pending. Can it be restarted, or should it be kept with the old config for a while and the alert acked?
Apr 17 2024
Apr 16 2024
Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing, and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1
@Marostegui Update: backups for x1, s2, s6, s5 and s3 are currently generating dumps and snapshots with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and kept only just in case; they are not active and will eventually be upgraded or discarded.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Apr 15 2024
Thank you very much, everybody!
CC @ABran-WMF in case I missed something.
Apr 12 2024
I think we can resolve this and track that at T358741, as long as everybody is aware.
This is now done, although it depends on the definition of "productionize": some of the backup sources have the exact same data and config as the original ones but have not yet taken over the service, and some backups still use the old hosts.
Hi, we cannot SSH into dbprov1006.eqiad.wmnet
Apr 11 2024
No need. I just wanted to warn the DBAs, although you may find it interesting, as the last issue was with wikireplicas. No need to change anything at the moment (the actual data and names), only the current grants providing access.
Please see my last comment. Other than that, my work is done.