1G is fine, thanks!
Thu, Jun 27
Wed, Jun 26
In T367778#9925706, @fnegri wrote: That query is pretty expensive and it is basically scanning 146M rows, which makes it unlikely it'll finish before the 10800 mark.
A long-running query should not automatically cause replication lag, though. Why is this query causing the lag? Is it locking some tables so that the replication thread has to wait? Replication is not entirely stuck; it's just slower, apparently.
I've installed 10.11 on db2136 (s4) for now. I pooled it in production for a couple of hours to capture queries that take longer than 10 seconds to run. For now the host is depooled again and will only be pooled during certain working hours.
Just for the record, I have been investigating the current lag on clouddb1019:3314 - it is because of this:
root@clouddb1019.eqiad.wmnet[commonswiki]> show explain for 3790361;
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
| id   | select_type        | table         | type  | possible_keys                                         | key                | key_len | ref                          | rows      | Extra                    |
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
|    1 | PRIMARY            | page          | ALL   | page_name_title,page_redirect_namespace_len           | NULL               | NULL    | NULL                         | 146173850 | Using where              |
|    3 | MATERIALIZED       | categorylinks | range | PRIMARY,cl_timestamp,cl_sortkey                       | cl_timestamp       | 257     | NULL                         |         2 | Using where; Using index |
|    2 | MATERIALIZED       | linktarget    | range | PRIMARY,lt_namespace_title                            | lt_namespace_title | 261     | NULL                         |        12 | Using where; Using index |
|    2 | MATERIALIZED       | templatelinks | ref   | PRIMARY,tl_target_id,tl_backlinks_namespace_target_id | tl_target_id       | 8       | commonswiki.linktarget.lt_id |        92 | Using index              |
|    8 | DEPENDENT SUBQUERY | pagelinks     | ref   | pl_target_id                                          | pl_target_id       | 8       | commonswiki.linktarget.lt_id |         6 | Using index              |
|    7 | DEPENDENT SUBQUERY | templatelinks | ref   | tl_target_id                                          | tl_target_id       | 8       | commonswiki.linktarget.lt_id |        92 | Using index              |
+------+--------------------+---------------+-------+-------------------------------------------------------+--------------------+---------+------------------------------+-----------+--------------------------+
6 rows in set, 1 warning (0.030 sec)
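As an aside, plans like this can be caught while the query is still running: MariaDB's SHOW EXPLAIN FOR works against any live connection. A minimal sketch, assuming pymysql; the host, account, and 60-second threshold are illustrative, not the actual WMF tooling:

import pymysql

# Illustrative host and credentials, not the real WMF setup.
conn = pymysql.connect(host='clouddb1019.eqiad.wmnet', user='watcher',
                       password='REDACTED',
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # Find statements that have been running for more than 60 seconds.
    cur.execute("SELECT ID AS id, TIME AS time, LEFT(INFO, 80) AS query "
                "FROM information_schema.processlist "
                "WHERE COMMAND = 'Query' AND TIME > 60")
    for row in cur.fetchall():
        print(f"thread {row['id']} running for {row['time']}s: {row['query']}")
        try:
            # MariaDB-specific: explain the plan of the running statement.
            cur.execute(f"SHOW EXPLAIN FOR {int(row['id'])}")
            for plan_row in cur.fetchall():
                print('    ', plan_row)
        except pymysql.err.OperationalError:
            pass  # the statement finished between the two calls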
So in terms of data, my recap is:
- the root password is different from production
- the data that is present there is sanitized and there is no data there that cannot be queried publicly. There is some data that we filter via the views and not only via sanitarium, but I guess that's fine
- the replication user password: I don't think it is such a big deal, as the risk of someone setting up a new replica directly from production is minimal, and lots of other things would need to go right for that to succeed.
- wikiuser and wikiadmin are no longer there.
- non-public data (such as suppressed edits or bans) is possibly available - but I don't know enough MW to be able to say if this is still doable or not (one way to test it is sketched below). @Ladsgroup would you know?
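A minimal way to probe that last point, assuming the standard MediaWiki rev_deleted bitfield (bit 8, DELETED_RESTRICTED, marks suppression) and that the filtering happens in the replica views - the host, account, and database below are illustrative, not the real setup:

import pymysql

SUPPRESSED = 8  # MediaWiki's DELETED_RESTRICTED bit in rev_deleted

# Illustrative connection; real access would go through the _p views.
conn = pymysql.connect(host='clouddb1019.eqiad.wmnet', user='labsdbuser',
                       password='REDACTED', database='commonswiki_p')

with conn.cursor() as cur:
    # If sanitization holds, suppressed rows should be absent from the views
    # (or have their sensitive columns blanked out).
    cur.execute("SELECT rev_id, rev_deleted FROM revision "
                "WHERE rev_deleted & %s <> 0 LIMIT 5", (SUPPRESSED,))
    rows = cur.fetchall()
    print('exposed suppressed revisions:', rows or 'none')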
Dropping it in s3, which will take around 8 hours.
In T368494#9924824, @gerritbot wrote: Change #1049664 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] m2 proxies: Test db1228
This is done. I will track the master switchover in a separate task.
Tue, Jun 25
I will try - but just in case @ABran-WMF please take some notes!
In T368136#9919314, @bd808 wrote: What sort of data are y'all concerned about exposing to new roots on the replica db hosting nodes themselves? These boxes already expose less data than is exposed to those in the deployment, restricted, or analytics-privatedata-users groups, by virtue of the Sanitarium sanitization steps that are semi-obviously not present on the un-redacted source hosts those users can reach with various query mechanisms.
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: db1195.eqiad.wmnet
                   Master_User: repl2024
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1195-bin.000533
           Read_Master_Log_Pos: 451132773
                Relay_Log_File: db1217-relay-bin.000097
                 Relay_Log_Pos: 451133073
         Relay_Master_Log_File: db1195-bin.000533
              Slave_IO_Running: No
             Slave_SQL_Running: No
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 451132773
               Relay_Log_Space: 451133431
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 172001292
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 171970595-171970595-143176442,0-171970569-1006906062,171970573-171970573-2371267,171970746-171970746-135654591,171978772-171978772-139568865,171966512-171966512-126314222,171970569-171970569-156638323,171966678-171966678-232525439,171970778-171970778-108155596,171970636-171970636-23122305,172001292-172001292-344164559
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State:
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 48583481
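The Slave_IO_Running: No / Slave_SQL_Running: No pair together with Seconds_Behind_Master: NULL is the signature of fully stopped replication rather than mere lag. A minimal health check for that state, as a sketch assuming pymysql (host and credentials are illustrative):

import pymysql

# Illustrative connection parameters.
conn = pymysql.connect(host='db1217.eqiad.wmnet', user='monitor',
                       password='REDACTED',
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute('SHOW SLAVE STATUS')
    status = cur.fetchone()

if status is None:
    print('not configured as a replica')
elif status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
    # Seconds_Behind_Master is NULL whenever either thread is stopped.
    print('replication stopped:',
          status['Last_IO_Error'] or status['Last_SQL_Error'] or 'no error recorded')
else:
    print('lag:', status['Seconds_Behind_Master'], 'seconds')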
In T365995#9883497, @jcrespo wrote: backup1009 is the main backup node for bacula in eqiad. Most backups happen during the night, so just monitoring that it came back and that new backups happen normally would be enough.
db1204 is the main mediabackups metadata db; ideally, media backups are stopped while maintenance happens and resumed afterwards to avoid errors.
old 300
No errors in db1169 for a week
Sun, Jun 23
If it was stopped correctly, it should be fine to start it again and resume replication too
I didn't bring anything back up as I wasn't aware of what was going on
Sat, Jun 22
From what I can see this was part of T367648
The downtime message above is wrong - it was downtimed because it crashed :)
Fri, Jun 21
@BTullis can you double-check why an-redacteddb1001 isn't getting daily check_private_data runs like clouddb1021 does? I detected it because it doesn't have the logs:
In T344599#9910036, @fnegri wrote: I think that members of wmcs-roots can now circumvent this by using the cloudcumin hosts and running a command as root through Cumin.
I also discovered that a new group, wikireplica-roots, was created recently by @MoritzMuehlenhoff. At the moment it only includes @joanna_borun, but it can be used if somebody else needs root access to clouddb hosts.
I would still like to give root access to clouddb* to everybody in wmcs-roots (which at the moment only includes WMF staff, but could potentially include volunteers). That way we can envision a future where WMCS SREs don't need global root at all.
I'll try recapping the concerns that were listed in the discussion above:
In T196366#9911494, @Ladsgroup wrote: @Marostegui To get the list of direct replicas, something like this would work in cumin:
import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument('section', help='Must be the section name in orchestrator')
args = parser.parse_args()

# Ask orchestrator for every host in the cluster behind this section alias.
data_ = requests.get(
    'https://orchestrator.wikimedia.org/api/cluster/alias/' +
    args.section).json()

db_data = []
for db in data_:
    if db['MasterKey']['Hostname'] + ':' + \
            str(db['MasterKey']['Port']) != db['ClusterName']:
        # not a direct replica
        continue
    db_data.append(db['Key']['Hostname'] + ':' + str(db['Key']['Port']))

print('direct replicas')
for db in db_data:
    print(json.dumps(db))

Which outputs something like this:
ladsgroup@cumin1002:~/ladsgroup/software2/dbtools$ python3 direct_replicas.py s2
direct replicas
"db1156.eqiad.wmnet:3306"
"db1182.eqiad.wmnet:3306"
"db1188.eqiad.wmnet:3306"
"db1197.eqiad.wmnet:3306"
"db1222.eqiad.wmnet:3306"
"db1225.eqiad.wmnet:3312"
"db1229.eqiad.wmnet:3306"
"db1233.eqiad.wmnet:3306"
"db1239.eqiad.wmnet:3312"
"db1246.eqiad.wmnet:3306"
"db2207.codfw.wmnet:3306"
"dbstore1007.eqiad.wmnet:3312"
(it only works from cumin)
We can make it find direct replicas for the secondary dc too (see the sketch below). The hardest part is refactoring db-switchover to take that list (in itself it's not hard; it's that it assumes in many places that the old master is reachable, which is a good assumption for the main use case). Maybe I should copy-paste that into a new file and see what happens.
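For the secondary-dc case, one possible approach with the same orchestrator payload is to treat the direct replica living in the other datacenter (db2207 in the output above) as an intermediate master and list the hosts replicating from it. The helpers below are hypothetical, distilled from the script above, not existing tooling:

import requests


def fetch_cluster(section):
    """Return orchestrator's host list for a section alias."""
    return requests.get(
        'https://orchestrator.wikimedia.org/api/cluster/alias/' + section).json()


def replicas_of(data, master):
    """Hosts whose direct master is `master` (a 'host:port' string)."""
    return [
        db['Key']['Hostname'] + ':' + str(db['Key']['Port'])
        for db in data
        if db['MasterKey']['Hostname'] + ':' + str(db['MasterKey']['Port']) == master
    ]


data = fetch_cluster('s2')
# Every entry carries ClusterName, which is the primary master's host:port.
primary_replicas = replicas_of(data, data[0]['ClusterName'])
# The intermediate master in the other dc is the direct replica in codfw;
# its own replicas are the secondary-dc direct replicas.
codfw_master = next((h for h in primary_replicas if '.codfw.' in h), None)
if codfw_master:
    print('secondary dc direct replicas:', replicas_of(data, codfw_master))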
Thu, Jun 20
There was nothing other than rebuilds/optimizes
I don't think we've had any since then, but I am going to double-check
This is done
Old 400
Wed, Jun 19
It makes sense that it has more load and hence queries can take longer, as analytics hosts have larger queries in general, which pile up until they finish or get killed. It is not strange to see. However, I want to reiterate that when the service was first set up it was agreed that it was best effort, and it was never guaranteed the hosts would have 0 lag.
This has been done - @BTullis let me know when clouddb1021 is decommissioned so I can remove it from zarcillo
Sorry, wrong task
In T367778#9906394, @fnegri wrote: As suggested by @taavi, I tried depooling s1 on clouddb1017, so that all s1 wikireplica traffic goes to the other host (clouddb1013).
Since I made this change, only 1 query has been killed by wmf-pt-kill, so my hypothesis is that clouddb1017 is indeed slower than clouddb1013, and that both the increased replag and the additional number of killed queries recorded since June 13th are caused by this slowness and not by specific queries.
I rebooted and upgraded clouddb1017 from MariaDB 10.6.16 to 10.6.17 on June 11th, which I suspect could be related to this issue. clouddb1013 was also rebooted but is still running MariaDB 10.6.16.
In T367278#9906348, @jcrespo wrote: In T367278#9906323, @Marostegui wrote: In T367278#9906299, @jcrespo wrote: I don't see a clear difference with the current icinga/perl implementation.
In the past there were 2 additional fields for the section and datacenter, so there were 2 rows being written at the same time from both masters. I don't know if that changed for the orchestrator setup's needs; I wasn't involved since its original setup.
That hasn't changed - both masters write their own heartbeats and they get replicated.
If eqiad is primary:
- eqiad hosts get eqiad's heartbeat
- codfw hosts get eqiad's and codfw's heartbeats
If codfw is primary:
- eqiad hosts get eqiad's and codfw's heartbeats
- codfw hosts get codfw's heartbeats
Based on the above, I would add another edge case: when circular replication is set up, just before and after a dc switchover, both dcs will get the heartbeats from both masters.
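A sketch of how lag can be derived from those rows, assuming the WMF pt-heartbeat layout where heartbeat.heartbeat carries the extra shard and datacenter fields mentioned above (connection details are illustrative):

import pymysql

# Illustrative connection parameters.
conn = pymysql.connect(host='db1217.eqiad.wmnet', user='monitor',
                       password='REDACTED',
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # One row per writing master; the freshest row per datacenter/section
    # tells us how far behind this replica is with respect to that master.
    cur.execute(
        "SELECT datacenter, shard, "
        "TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) AS lag_us "
        "FROM heartbeat.heartbeat GROUP BY datacenter, shard")
    for row in cur.fetchall():
        print(f"{row['datacenter']}/{row['shard']}: {row['lag_us'] / 1e6:.3f}s behind")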
In T367278#9906299, @jcrespo wrote: I don't see a clear difference with the current icinga/perl implementation.
In the past there were 2 additional fields for the section and datacenter, so there were 2 rows being written at the same time from both masters. I don't know if that changed for the orchestrator setup's needs; I wasn't involved since its original setup.
eqiad fixed.
In T367778#9904187, @fnegri wrote: As suggested by @taavi, I tried depooling s1 on clouddb1017, so that all s1 wikireplica traffic goes to the other host (clouddb1013).
I also temporarily increased --busy-time on clouddb1013 to 10800 (the value it has on clouddb1017, which is the "analytics" host, while clouddb1013 is the "web" host with a lower value of 300). To achieve this I did:
# edit the value in the config file
vi /etc/default/wmf-pt-kill
# restart the unit
systemctl restart wmf-pt-kill@s1
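The --busy-time rule itself boils down to: kill any user query whose runtime exceeds the limit. The sketch below illustrates that rule only; it is not wmf-pt-kill's actual implementation, and the host, account, and exclusion list are made up:

import pymysql

BUSY_TIME = 10800  # seconds, mirroring the value above
EXCLUDED_USERS = {'root', 'system user', 'event_scheduler'}  # never kill these

conn = pymysql.connect(host='clouddb1013.eqiad.wmnet', user='admin',
                       password='REDACTED',
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SELECT ID AS id, USER AS user, TIME AS time "
                "FROM information_schema.processlist "
                "WHERE COMMAND = 'Query' AND TIME > %s", (BUSY_TIME,))
    for row in cur.fetchall():
        if row['user'] in EXCLUDED_USERS:
            continue
        print(f"killing thread {row['id']} ({row['user']}, {row['time']}s)")
        cur.execute(f"KILL {int(row['id'])}")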
s2 pending: db1162