
Marostegui (Manuel Aróstegui)
Staff Database Administrator

User Details

User Since
Sep 1 2016, 6:48 AM (408 w, 8 h)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Today

Marostegui added a comment to T365453: Bring an-redacteddb1001 into service to replace clouddb1021.

@BTullis please address T368354 when you can, otherwise we are sort of blind from cumin hosts

Thu, Jun 27, 10:11 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)

Yesterday

Marostegui added a comment to T367778: [wikireplicas] frequent replag spikes in clouddb hosts.

That query is pretty expensive: it is basically scanning 146M rows, so it is unlikely to finish before the 10800 mark.

A long-running query should not automatically cause replication lag, though. Why is this query causing the lag? Is it locking some tables and the replication thread has to wait? Replication is not entirely stuck, it's just slower apparently.

Wed, Jun 26, 7:20 PM · Data-Services, cloud-services-team (FY2023/2024-Q3-Q4)
Marostegui added a comment to T365805: Test MariaDB 10.11.

I've installed 10.11 on db2136 (s4) for now. I've pooled it in production for a couple of hours to capture queries that take longer than 10 seconds to run. For now the host is depooled again and will only be pooled during certain working hours.

Wed, Jun 26, 9:05 AM · DBA
Marostegui added a comment to T367778: [wikireplicas] frequent replag spikes in clouddb hosts.

Just for the record, I have been investigating the current lag on clouddb1019:3314 - it is because of this:

root@clouddb1019.eqiad.wmnet[commonswiki]> show explain for 3790361;
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
| id   | select_type        | table         | type  | possible_keys                                         | key                | key_len | ref                          | rows      | Extra                    |
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
|    1 | PRIMARY            | page          | ALL   | page_name_title,page_redirect_namespace_len           | NULL               | NULL    | NULL                         | 146173850 | Using where              |
|    3 | MATERIALIZED       | categorylinks | range | PRIMARY,cl_timestamp,cl_sortkey                       | cl_timestamp       | 257     | NULL                         |         2 | Using where; Using index |
|    2 | MATERIALIZED       | linktarget    | range | PRIMARY,lt_namespace_title                            | lt_namespace_title | 261     | NULL                         |        12 | Using where; Using index |
|    2 | MATERIALIZED       | templatelinks | ref   | PRIMARY,tl_target_id,tl_backlinks_namespace_target_id | tl_target_id       | 8       | commonswiki.linktarget.lt_id |        92 | Using index              |
|    8 | DEPENDENT SUBQUERY | pagelinks     | ref   | pl_target_id                                          | pl_target_id       | 8       | commonswiki.linktarget.lt_id |         6 | Using index              |
|    7 | DEPENDENT SUBQUERY | templatelinks | ref   | tl_target_id                                          | tl_target_id       | 8       | commonswiki.linktarget.lt_id |        92 | Using index              |
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
6 rows in set, 1 warning (0.030 sec)
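
A plan like the one above can also be checked mechanically. The following is a minimal sketch, not an existing tool: it parses the ASCII table printed by the client and flags rows with type ALL and a large row estimate. The function names, the 1,000,000-row threshold and the trimmed sample table are assumptions made for illustration.

```python
def _split_cells(line):
    """Split one '| a | b |' table line into stripped cell values."""
    return [cell.strip() for cell in line.strip()[1:-1].split('|')]


def parse_mysql_table(text):
    """Parse the ASCII table printed by the mysql/mariadb client into dicts."""
    rows = [line for line in text.splitlines() if line.startswith('|')]
    if not rows:
        return []
    header = _split_cells(rows[0])
    return [dict(zip(header, _split_cells(line))) for line in rows[1:]]


def full_scans(plan, min_rows=1_000_000):
    """Return plan rows that read a whole table (type ALL) above min_rows."""
    return [r for r in plan if r['type'] == 'ALL' and int(r['rows']) >= min_rows]


# Trimmed copy of the plan above (fewer columns, same values).
SAMPLE = """
+------+--------------+---------------+-------+--------------+-----------+
| id   | select_type  | table         | type  | key          | rows      |
+------+--------------+---------------+-------+--------------+-----------+
|    1 | PRIMARY      | page          | ALL   | NULL         | 146173850 |
|    3 | MATERIALIZED | categorylinks | range | cl_timestamp |         2 |
+------+--------------+---------------+-------+--------------+-----------+
"""

for row in full_scans(parse_mysql_table(SAMPLE)):
    print(f"full scan on {row['table']}: ~{row['rows']} rows")
```

The parser assumes that no cell value contains a `|` character, which holds for the plan shown here.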
Wed, Jun 26, 6:29 AM · Data-Services, cloud-services-team (FY2023/2024-Q3-Q4)
Marostegui updated subscribers of T368136: [wikireplicas] Make sure there is no sensitive data in clouddb hosts.

So in terms of data, my recap is:

  • the root password is different from production
  • the data present there is sanitized, and nothing there cannot already be queried publicly. Some of the data is filtered via the views and not only via sanitarium, but I guess that's fine
  • I don't think the replication user password is such a big deal: the risk of someone setting up a new replica directly from production is minimal, and lots of other things would need to be done for that to succeed.
  • wikiuser and wikiadmin are no longer there.
  • non-public data (such as suppressed edits or bans) is possibly available, but I don't know enough MW to say whether this is still doable or not. @Ladsgroup would you know?
Wed, Jun 26, 6:20 AM · SRE, Data-Services, cloud-services-team
Marostegui updated the task description for T367632: Drop ipblocks in production.
Wed, Jun 26, 6:01 AM · DBA
Marostegui added a comment to T367632: Drop ipblocks in production.

Dropping it in s3, which will take around 8 hours.

Wed, Jun 26, 6:00 AM · DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Wed, Jun 26, 5:57 AM · DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Wed, Jun 26, 5:56 AM · DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Wed, Jun 26, 5:01 AM · DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Wed, Jun 26, 4:51 AM · DBA
Marostegui added a comment to T368494: Switchover m2 master db1195 -> db1228.

Change #1049664 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] m2 proxies: Test db1228

https://gerrit.wikimedia.org/r/1049664

Wed, Jun 26, 4:47 AM · DBA
Marostegui updated the task description for T368494: Switchover m2 master db1195 -> db1228.
Wed, Jun 26, 4:39 AM · DBA
Marostegui updated the task description for T368494: Switchover m2 master db1195 -> db1228.
Wed, Jun 26, 4:36 AM · DBA
Marostegui updated the task description for T368494: Switchover m2 master db1195 -> db1228.
Wed, Jun 26, 4:30 AM · DBA
Marostegui moved T368494: Switchover m2 master db1195 -> db1228 from Triage to In progress on the DBA board.
Wed, Jun 26, 4:29 AM · DBA
Marostegui triaged T368494: Switchover m2 master db1195 -> db1228 as Medium priority.
Wed, Jun 26, 4:29 AM · DBA
Marostegui created T368494: Switchover m2 master db1195 -> db1228.
Wed, Jun 26, 4:29 AM · DBA
Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Wed, Jun 26, 4:22 AM · DBA
Marostegui closed T368374: Move one host temporarily to m2, a subtask of T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad , as Resolved.
Wed, Jun 26, 4:21 AM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
Marostegui closed T368374: Move one host temporarily to m2 as Resolved.
Wed, Jun 26, 4:21 AM · DBA
Marostegui added a comment to T368374: Move one host temporarily to m2.

This is done. I will track the master switchover in a different task.

Wed, Jun 26, 4:21 AM · DBA

Tue, Jun 25

Marostegui added a comment to T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad.

I will try - but just in case @ABran-WMF please take some notes!

Tue, Jun 25, 10:56 AM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
Marostegui added a comment to T368136: [wikireplicas] Make sure there is no sensitive data in clouddb hosts.

What sort of data are y'all concerned about exposing to new roots on the replica db hosting nodes themselves? These boxes already expose less data than is exposed to those in the deployment, restricted, or analytics-privatedata-users groups, by virtue of the Sanitarium sanitization steps that are semi-obviously not present on the un-redacted source hosts those users can reach with various query mechanisms.

Tue, Jun 25, 10:39 AM · SRE, Data-Services, cloud-services-team
Marostegui added a comment to T368374: Move one host temporarily to m2.

*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1195.eqiad.wmnet
Master_User: repl2024
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1195-bin.000533
Read_Master_Log_Pos: 451132773
Relay_Log_File: db1217-relay-bin.000097
Relay_Log_Pos: 451133073
Relay_Master_Log_File: db1195-bin.000533
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 451132773
Relay_Log_Space: 451133431
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 172001292
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 171970595-171970595-143176442,0-171970569-1006906062,171970573-171970573-2371267,171970746-171970746-135654591,171978772-171978772-139568865,171966512-171966512-126314222,171970569-171970569-156638323,171966678-171966678-232525439,171970778-171970778-108155596,171970636-171970636-23122305,172001292-172001292-344164559
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 48583481
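
Status output in this vertical format is easy to turn into a dictionary, for example to confirm that replication really is stopped before moving a host. This is a rough sketch only: the helper names are hypothetical and the sample is a shortened copy of the output above.

```python
def parse_vertical_status(text):
    """Parse vertical-format SHOW SLAVE STATUS output into a dict of strings."""
    status = {}
    for line in text.splitlines():
        # Skip blank lines and the '**** 1. row ****' separator.
        if ':' not in line or line.lstrip().startswith('*'):
            continue
        key, _, value = line.partition(':')
        status[key.strip()] = value.strip()
    return status


def replication_running(status):
    """True only when both the IO and SQL threads report 'Yes'."""
    return (status.get('Slave_IO_Running') == 'Yes'
            and status.get('Slave_SQL_Running') == 'Yes')


# Shortened copy of the status output above.
SAMPLE = """
*************************** 1. row ***************************
Master_Host: db1195.eqiad.wmnet
Slave_IO_Running: No
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
Using_Gtid: Slave_Pos
"""

status = parse_vertical_status(SAMPLE)
print(status['Master_Host'], replication_running(status))
```

Values are kept as strings, so `NULL` and empty fields are left for the caller to interpret.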

Tue, Jun 25, 9:46 AM · DBA
Marostegui added parent tasks for T368374: Move one host temporarily to m2: Unknown Object (Task), Unknown Object (Task).
Tue, Jun 25, 9:29 AM · DBA
Marostegui triaged T368374: Move one host temporarily to m2 as High priority.
Tue, Jun 25, 9:28 AM · DBA
Marostegui created T368374: Move one host temporarily to m2.
Tue, Jun 25, 9:28 AM · DBA
Marostegui updated the task description for T368371: Switchover s8 master (db1192 -> db1209).
Tue, Jun 25, 9:23 AM · Patch-For-Review, DBA
Marostegui added a parent task for T368371: Switchover s8 master (db1192 -> db1209): T364069: Rebuild pagelinks tables.
Tue, Jun 25, 9:22 AM · Patch-For-Review, DBA
Marostegui added a subtask for T364069: Rebuild pagelinks tables: T368371: Switchover s8 master (db1192 -> db1209).
Tue, Jun 25, 9:22 AM · DBA
Marostegui claimed T368371: Switchover s8 master (db1192 -> db1209).
Tue, Jun 25, 9:21 AM · Patch-For-Review, DBA
Marostegui triaged T368371: Switchover s8 master (db1192 -> db1209) as Medium priority.
Tue, Jun 25, 9:21 AM · Patch-For-Review, DBA
Marostegui added a parent task for T368371: Switchover s8 master (db1192 -> db1209): T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad.
Tue, Jun 25, 9:19 AM · Patch-For-Review, DBA
Marostegui added a subtask for T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad: T368371: Switchover s8 master (db1192 -> db1209).
Tue, Jun 25, 9:19 AM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
Marostegui added a comment to T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad.

backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night, so just monitoring that it came back and that new backups happen normally would be enough.

db1204 is the main mediabackups metadata db. Ideally, media backups are stopped while maintenance happens and restored afterwards to avoid errors.

Tue, Jun 25, 9:18 AM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Tue, Jun 25, 7:15 AM · DBA
Marostegui updated the task description for T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 7:04 AM · DBA
Marostegui closed T368355: Switchover s8 master (db2165 -> db2161), a subtask of T364069: Rebuild pagelinks tables, as Resolved.
Tue, Jun 25, 7:04 AM · DBA
Marostegui closed T368355: Switchover s8 master (db2165 -> db2161) as Resolved.
Tue, Jun 25, 7:04 AM · DBA
Marostegui updated the task description for T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 7:00 AM · DBA
Marostegui updated the task description for T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 6:41 AM · DBA
Marostegui added a comment to T368355: Switchover s8 master (db2165 -> db2161).

old 300

Tue, Jun 25, 6:39 AM · DBA
Marostegui updated the task description for T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 6:28 AM · DBA
Marostegui claimed T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 6:27 AM · DBA
Marostegui updated the task description for T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 6:26 AM · DBA
Marostegui added a parent task for T368355: Switchover s8 master (db2165 -> db2161): T364069: Rebuild pagelinks tables.
Tue, Jun 25, 6:25 AM · DBA
Marostegui added a subtask for T364069: Rebuild pagelinks tables: T368355: Switchover s8 master (db2165 -> db2161).
Tue, Jun 25, 6:25 AM · DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Tue, Jun 25, 6:21 AM · DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Tue, Jun 25, 6:05 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui updated the task description for T367632: Drop ipblocks in production.
Tue, Jun 25, 6:04 AM · DBA
Marostegui added a comment to T367632: Drop ipblocks in production.

No errors in db1169 for a week

Tue, Jun 25, 6:02 AM · DBA
Marostegui created T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts.
Tue, Jun 25, 5:57 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Tue, Jun 25, 5:32 AM · DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Tue, Jun 25, 5:21 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA

Sun, Jun 23

Marostegui added a comment to T368189: db2197 rebooted itself.

If it was stopped correctly it should be fine to start again and resume replication too

Sun, Jun 23, 8:07 PM · Data-Persistence-Backup
Marostegui added a comment to T368189: db2197 rebooted itself.

I didn't bring anything back up as I wasn't aware of what was going on

Sun, Jun 23, 7:55 PM · Data-Persistence-Backup
Marostegui created T368220: Remove AAAA records from an-redacteddb1001 and allow connection from cumin.
Sun, Jun 23, 6:11 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Services, Data-Persistence
Marostegui updated the task description for T367856: Cleanup revision table schema.
Sun, Jun 23, 6:03 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA

Sat, Jun 22

Marostegui updated subscribers of T368189: db2197 rebooted itself.

From what I can see this was part of T367648

Sat, Jun 22, 12:30 PM · Data-Persistence-Backup
Marostegui added a comment to T368189: db2197 rebooted itself.

The downtime message above is wrong, it was downtimed cause it crashed :)

Sat, Jun 22, 5:06 AM · Data-Persistence-Backup
Marostegui updated the task description for T368189: db2197 rebooted itself.
Sat, Jun 22, 5:04 AM · Data-Persistence-Backup
Marostegui created T368189: db2197 rebooted itself.
Sat, Jun 22, 5:04 AM · Data-Persistence-Backup

Fri, Jun 21

Marostegui added a comment to T365453: Bring an-redacteddb1001 into service to replace clouddb1021.

@BTullis can you double check why an-redacteddb1001 isn't having check_private_data runs every day like clouddb1021 has? I detected it because it doesn't have the logs:

Fri, Jun 21, 5:07 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
Marostegui added a comment to T344599: wikireplicas root access.

I think that members of wmcs-roots can now circumvent this by using the cloudcumin hosts, and run a command as root through Cumin.

I also discovered that a new group wikireplica-roots was created recently by @MoritzMuehlenhoff. At the moment it only includes @joanna_borun but it can be used if somebody else needs root access to clouddb hosts.

I would still like to give root access to clouddb* to everybody in wmcs-roots (which at the moment only includes WMF staff but could potentially include volunteers). In this way we can envision a future where WMCS SREs don't need global root at all.

I'll try recapping the concerns that were listed in the discussion above:

Fri, Jun 21, 4:56 AM · Data-Services, cloud-services-team, Infrastructure Security
Marostegui added a comment to T196366: Implement (or refactor) a script to move slaves when the master is not available.

@Marostegui To get the list of direct replicas, something like this would work in cumin:

import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument('section', help='Must be the section name in orchestrator')
args = parser.parse_args()
# Fetch every instance orchestrator knows about for this section.
data_ = requests.get(
    'https://orchestrator.wikimedia.org/api/cluster/alias/' +
    args.section).json()
db_data = []
for db in data_:
    if db['MasterKey']['Hostname'] + ':' + \
            str(db['MasterKey']['Port']) != db['ClusterName']:
        # not a direct replica
        continue
    db_data.append(db['Key']['Hostname'] + ':' + str(db['Key']['Port']))

print('direct replicas')
for db in db_data:
    print(json.dumps(db))

Which outputs something like this:

ladsgroup@cumin1002:~/ladsgroup/software2/dbtools$ python3 direct_replicas.py s2
direct replicas
"db1156.eqiad.wmnet:3306"
"db1182.eqiad.wmnet:3306"
"db1188.eqiad.wmnet:3306"
"db1197.eqiad.wmnet:3306"
"db1222.eqiad.wmnet:3306"
"db1225.eqiad.wmnet:3312"
"db1229.eqiad.wmnet:3306"
"db1233.eqiad.wmnet:3306"
"db1239.eqiad.wmnet:3312"
"db1246.eqiad.wmnet:3306"
"db2207.codfw.wmnet:3306"
"dbstore1007.eqiad.wmnet:3312"

(it only works from cumin)
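
For trying the filtering logic away from cumin, the same check can be pulled into a function and fed hand-written data. This is only a sketch: the field names (Key, MasterKey, ClusterName) are the ones the script above reads, while the function name and the example hostnames are made up for illustration.

```python
def direct_replicas(instances):
    """Return 'host:port' for every instance whose master is the cluster master."""
    replicas = []
    for db in instances:
        master = f"{db['MasterKey']['Hostname']}:{db['MasterKey']['Port']}"
        if master != db['ClusterName']:
            continue  # the master itself, or a replica of a replica
        replicas.append(f"{db['Key']['Hostname']}:{db['Key']['Port']}")
    return replicas


# Hand-written sample in the shape the script above reads; hostnames are made up.
SAMPLE = [
    {'Key': {'Hostname': 'master.example', 'Port': 3306},
     'MasterKey': {'Hostname': 'other-dc.example', 'Port': 3306},
     'ClusterName': 'master.example:3306'},
    {'Key': {'Hostname': 'replica1.example', 'Port': 3306},
     'MasterKey': {'Hostname': 'master.example', 'Port': 3306},
     'ClusterName': 'master.example:3306'},
    {'Key': {'Hostname': 'replica2.example', 'Port': 3312},
     'MasterKey': {'Hostname': 'replica1.example', 'Port': 3306},
     'ClusterName': 'master.example:3306'},
]

print(direct_replicas(SAMPLE))
```

Keeping the filter separate from the HTTP call would also make it easier to reuse from db-switchover, if that refactor goes ahead.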

We can make it find direct replicas for secondary dc too. The hardest part is to refactor db-switchover to take that list (in itself it's not hard, it's that it assumes in many places that the old master is reachable which is a good assumption for the main usecase). Maybe I should copy paste that into a new file and see what happens.

Fri, Jun 21, 4:43 AM · Patch-For-Review, SRE-Sprint-Week-Sustainability-March2023, User-Ladsgroup, Sustainability (Incident Followup), DBA
Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Fri, Jun 21, 4:36 AM · DBA

Thu, Jun 20

Marostegui added a comment to T367162: db1240.s3 index issues.

There was nothing other than rebuilds/optimizes

Thu, Jun 20, 11:36 AM · Patch-For-Review, Data-Persistence-Backup
Marostegui added a comment to T367162: db1240.s3 index issues.

I don't think we've had any since that but I am going to double check

Thu, Jun 20, 9:25 AM · Patch-For-Review, Data-Persistence-Backup
Marostegui updated the task description for T364299: Make rc_id a bigint.
Thu, Jun 20, 6:57 AM · Schema-change-in-production, DBA
Marostegui closed T367857: Switchover s7 master (db1236 -> db1181) as Resolved.

This is done

Thu, Jun 20, 5:25 AM · DBA
Marostegui closed T367857: Switchover s7 master (db1236 -> db1181), a subtask of T364299: Make rc_id a bigint, as Resolved.
Thu, Jun 20, 5:24 AM · Schema-change-in-production, DBA
Marostegui updated the task description for T367857: Switchover s7 master (db1236 -> db1181).
Thu, Jun 20, 5:24 AM · DBA
Marostegui updated the task description for T367857: Switchover s7 master (db1236 -> db1181).
Thu, Jun 20, 5:22 AM · DBA
Marostegui updated the task description for T367857: Switchover s7 master (db1236 -> db1181).
Thu, Jun 20, 5:07 AM · DBA
Marostegui updated the task description for T367857: Switchover s7 master (db1236 -> db1181).
Thu, Jun 20, 5:06 AM · DBA
Marostegui added a comment to T367857: Switchover s7 master (db1236 -> db1181).

Old 400

Thu, Jun 20, 5:04 AM · DBA
Marostegui moved T367857: Switchover s7 master (db1236 -> db1181) from Ready to In progress on the DBA board.
Thu, Jun 20, 5:04 AM · DBA

Wed, Jun 19

Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Wed, Jun 19, 2:42 PM · DBA
Marostegui updated the task description for T364069: Rebuild pagelinks tables.
Wed, Jun 19, 2:36 PM · DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Wed, Jun 19, 11:01 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui moved T367632: Drop ipblocks in production from Ready to In progress on the DBA board.
Wed, Jun 19, 11:01 AM · DBA
Marostegui moved T367856: Cleanup revision table schema from Ready to In progress on the DBA board.
Wed, Jun 19, 11:01 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui added a comment to T367778: [wikireplicas] frequent replag spikes in clouddb hosts.

It makes sense that it has more load and hence the queries can take longer, as analytics hosts have larger queries in general, which pile up until they finish or get killed. It is not something strange to see. However, I want to reiterate that when the service was first set up it was agreed that it was best effort, and it was never guaranteed the hosts would have 0 lag.

Wed, Jun 19, 10:22 AM · Data-Services, cloud-services-team (FY2023/2024-Q3-Q4)
Marostegui closed T367851: Include an-redacteddb1001 in zarcillo as Resolved.

This has been done - @BTullis let me know when clouddb1021 is decommissioned so I can remove it from zarcillo

Wed, Jun 19, 9:54 AM · DBA
Marostegui closed T367851: Include an-redacteddb1001 in zarcillo, a subtask of T365453: Bring an-redacteddb1001 into service to replace clouddb1021, as Resolved.
Wed, Jun 19, 9:53 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
Marostegui reopened T365453: Bring an-redacteddb1001 into service to replace clouddb1021 as "Open".

Sorry wrong task

Wed, Jun 19, 9:53 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
Marostegui reopened T365453: Bring an-redacteddb1001 into service to replace clouddb1021, a subtask of T355571: Q#:rack/setup/install an-redacteddb1001, as Open.
Wed, Jun 19, 9:52 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), SRE, ops-eqiad, DC-Ops
Marostegui closed T365453: Bring an-redacteddb1001 into service to replace clouddb1021 as Resolved.
Wed, Jun 19, 9:52 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
Marostegui closed T365453: Bring an-redacteddb1001 into service to replace clouddb1021, a subtask of T355571: Q#:rack/setup/install an-redacteddb1001, as Resolved.
Wed, Jun 19, 9:52 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), SRE, ops-eqiad, DC-Ops
Marostegui added a comment to T367778: [wikireplicas] frequent replag spikes in clouddb hosts.

As suggested by @taavi I tried depooling s1 on clouddb1017, so that all s1 wikireplica traffic will go to the other host (clouddb1013).

Since I made this change, only 1 query was killed by wmf-pt-kill, so my hypothesis is that clouddb1017 is indeed slower than clouddb1013 and both the increased replag and the additional amount of queries being killed that were recorded since June 13th are caused by this slowness and not by specific queries.

I rebooted and upgraded clouddb1017 from mariadb 10.6.16 to mariadb 10.6.17 on June 11th, which I suspect could be related to this issue. clouddb1013 was also rebooted but is still running mariadb 10.6.16.

Wed, Jun 19, 9:47 AM · Data-Services, cloud-services-team (FY2023/2024-Q3-Q4)
Marostegui added a comment to T367278: Migrate mysql icinga alerts to alert manager - pt-heartbeat + scaffolding.

I don't see a clear difference with the current icinga/perl implementation.

In the past there were 2 additional fields for the section and datacenter, so there were 2 rows being written at the same time from both masters. I don't know if that changed for orchestrator setup needs; I haven't been involved since its original setup.

That hasn't changed - both masters write their own heartbeats and they get replicated.

If eqiad is primary:
eqiad hosts get eqiad's heartbeat
codfw hosts get eqiad's and codfw's heartbeats

If codfw is primary:
eqiad hosts get eqiad's and codfw's heartbeats
codfw hosts get codfw's heartbeat

Based on the above, I would add another edge case:

When round replication is set up, just before and after a DC switchover, both DCs will get the heartbeats from both.
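
The rules above, including the switchover edge case, can be written down as a small lookup that an alerting check could compare against. This is a sketch under assumptions: the function name and the `circular_replication` flag are hypothetical, and it only encodes the cases listed in this comment.

```python
DATACENTERS = ('eqiad', 'codfw')


def expected_heartbeats(host_dc, primary_dc, circular_replication=False):
    """Return the set of datacenters whose heartbeat rows a host should see."""
    if host_dc not in DATACENTERS or primary_dc not in DATACENTERS:
        raise ValueError(f'unknown datacenter: {host_dc!r} / {primary_dc!r}')
    # Hosts in the secondary DC see both heartbeats; during circular
    # replication (around a DC switchover) every host sees both.
    if circular_replication or host_dc != primary_dc:
        return set(DATACENTERS)
    # Hosts in the primary DC only see their own master's heartbeat.
    return {primary_dc}


for primary in DATACENTERS:
    for host in DATACENTERS:
        seen = sorted(expected_heartbeats(host, primary))
        print(f'primary={primary} host={host} expects {seen}')
```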

Wed, Jun 19, 9:40 AM · Patch-For-Review, DBA
Marostegui added a comment to T367278: Migrate mysql icinga alerts to alert manager - pt-heartbeat + scaffolding.

I don't see a clear difference with the current icinga/perl implementation.

In the past there were 2 additional fields for the section and datacenter, so there were 2 rows being written at the same time from both masters. I don't know if that changed for orchestrator setup needs; I haven't been involved since its original setup.

Wed, Jun 19, 9:26 AM · Patch-For-Review, DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Wed, Jun 19, 7:36 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Wed, Jun 19, 7:34 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui updated the task description for T367856: Cleanup revision table schema.
Wed, Jun 19, 7:33 AM · Schema-change-in-production, Data-Engineering, Data Products, DBA
Marostegui closed T366982: Simple filter restrictions on meta causes Recent Changes to time out as Resolved.

eqiad fixed.

Wed, Jun 19, 7:26 AM · DBA, mariadb-optimizer-bug, Wikimedia-Slow-DB-Query, Growth-Team, MediaWiki-Recent-changes
Marostegui added a comment to T367778: [wikireplicas] frequent replag spikes in clouddb hosts.

As suggested by @taavi I tried depooling s1 on clouddb1017, so that all s1 wikireplica traffic will go to the other host (clouddb1013).

I also temporarily increased --busy-time on clouddb1013 to 10800 (the value it has on clouddb1017, which is the "analytics" host, while clouddb1013 is the "web" host with a lower value of 300). To achieve this I did:

# edit the value in the config file
vi /etc/default/wmf-pt-kill
# restart the unit
systemctl restart wmf-pt-kill@s1
Wed, Jun 19, 7:16 AM · Data-Services, cloud-services-team (FY2023/2024-Q3-Q4)
Marostegui added a comment to T364069: Rebuild pagelinks tables.

s2 pending: db1162

Wed, Jun 19, 7:10 AM · DBA
Marostegui added a project to T367919: Avoid error logging while searching configs during normal operation: Data-Persistence.

Thank you!

Wed, Jun 19, 7:04 AM · Data-Persistence, conftool