User Details
- User Since: Aug 29 2023, 8:30 AM (43 w, 16 h)
- Availability: Available
- LDAP User: Arnaudb
- MediaWiki User: ABran-WMF
Yesterday
Mon, Jun 24
Fri, Jun 21
ah indeed, my bad
private data has been sanitized
view database has been created with the proper accounting
private data has been trimmed, btmwiki_p database created with labsdb grants
I depooled the host by reflex; it's currently repooling
Thu, Jun 20
fyi https://gerrit.wikimedia.org/r/1048006 and https://gerrit.wikimedia.org/r/1047983 are bound together, related to pt-heartbeat monitoring
server is repooling
I completely agree with you; we'll take care to avoid such regressions and approximations during the migration! The good news is that we can iterate on and compose our alert thresholds as much as we need before declaring this migration "done".
The first patch I've sent:
this task's scheduling is swapped with T365987
this task's scheduling is swapped with T365986
This will be run from cumin2002 as 1002 has to be rebooted soon.
thanks @Ladsgroup @jcrespo for those considerations. They help a lot in defining alerting thresholds. I was unaware of T253120 and T252952 in that context. I find it relevant to first test a more vanilla approach then.
Wed, Jun 19
Here is the implementation then
@jcrespo Thank you for the clarification, I clearly see the point you were making! Indeed, I was missing the metric aggregation part. I think the best angle will then be to enable this probe on the exporter and also to implement the query as we have it in check_mariadb.pl. @Ladsgroup @Marostegui feel free to challenge this idea as well.
Correct me if I'm wrong but the heartbeat updates are coming from this script which is called by that service.
So, that would not change at all (or at least not in this iteration, nor in that group of tasks 😄), and the pt-heartbeat ↔ mediawiki relationship would stay 100% the same. What I'm aiming at here is the way we're alerting on those metrics: if you check mysqld-exporter's code, it does not update that ts at all, as that's not its job. I think there was some confusion around my intentions in that comment; I hope I'm a bit clearer now.
If you check that screengrab of the query, afaict we're seeing the same info added to the metric, but from the config standpoint. That's why I was a bit perplexed!
@Ladsgroup had a neat suggestion: we just try the current exporter. I think it'll save a few customizations (if not all) down the line. https://grafana.wikimedia.org/goto/-AH4B_8SR?orgId=1 shows pt-heartbeat as seen from the exporter's point of view. I don't see a clear difference with the current icinga/perl implementation. We could maybe add the section label directly through the monitoring config script to keep the exporter's config as generic as possible.
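For readers following along, here is a rough sketch of the general idea behind alerting on a heartbeat timestamp: the lag is simply "now" minus the last `ts` written on the primary, compared against thresholds. This is not the check_mariadb.pl query nor mysqld-exporter's code; the function names, the assumption that `ts` is a naive UTC ISO-8601 string, and the 5 s / 10 s thresholds are all hypothetical and only there for illustration.

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Return the lag in seconds between `now` and the last heartbeat.

    Assumption: `heartbeat_ts` is an ISO-8601 string in UTC without an
    offset, e.g. "2024-06-19T10:05:14.500000".
    """
    ts = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    # Clamp at zero so small clock skew never yields a negative lag.
    return max(0.0, (now - ts).total_seconds())


def classify(lag: float, warn: float = 5.0, crit: float = 10.0) -> str:
    """Map a lag value to an alert state (thresholds are made up)."""
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    now = datetime(2024, 6, 19, 10, 5, 20, tzinfo=timezone.utc)
    lag = heartbeat_lag_seconds("2024-06-19T10:05:14.500000", now)
    print(lag, classify(lag))
```

Keeping the lag computation separate from the threshold mapping means the thresholds can be iterated on without touching how the metric itself is derived.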
Tue, Jun 18
Mon, Jun 17
taking note of that paste, thanks! :)
Fri, Jun 14
Thu, Jun 13
will suggest a hierarchy and ask for validation @jcrespo @Marostegui @Ladsgroup → let's try to keep a good signal/noise ratio
this error popped today:
10:05:14 <+icinga-wm_> PROBLEM - MariaDB Replica SQL: s2 on db2125 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
Wed, Jun 12
Scaffolding started here: https://gitlab.wikimedia.org/repos/sre/wmf-mariadb-exporter
dupes T315866
my pleasure!
As for the time: indeed, but it's in my timezone, so please adjust to the proper timestamp if you want to take that as a reference. It's a quick copy/paste from IRC to make sure this was not forgotten.