Page MenuHomePhabricator

Investigate if maintain-replica-indexes is still needed
Closed, ResolvedPublic

Description

After the breakage of clouddb* (T337446) we had to rebuild all the data.
Per @RhinosF1 comment at T337446#8887153 there's this step at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Step_5:_setting_indexes

I tried to run that on clouddb1021:s5 but it failed (I tried --database and --mysql-socket options). I have no context if that is still needed or if it is something old. Can someone from either cloud-services-team or Data-Engineering investigate, make a call and run it if if is needed?

Progress:
s1:

  • clouddb1013
  • clouddb1017
  • clouddb1021

s2:

  • clouddb1014
  • clouddb1018
  • clouddb1021

s3:

  • clouddb1013
  • clouddb1017
  • clouddb1021

s4:

  • clouddb1015
  • clouddb1019
  • clouddb1021

s5:

  • clouddb1016
  • clouddb1020
  • clouddb1021

s6:

  • clouddb1015
  • clouddb1019
  • clouddb1021

s7:

  • clouddb1014
  • clouddb1018
  • clouddb1021

s8:

  • clouddb1016
  • clouddb1020
  • clouddb1021

Event Timeline

Marostegui added a project: DBA.

I am running this on clouddb1021 for now.

I would argue this is high priority if not UBN (but acknowledging that Manuel isn't familiar with this please don't take this as any sort of pressure on you!)

T337446: Rebuild sanitarium hosts is nearly resolved, but for the sections that are rebuilt, we're still finding our tools to be extremely slow. This apparently is because the indexed views like revision_userindex and logging_logindex do not actually have the indexes. This makes tools like XTools Edit Counter more or less useless. I do not know which columns the indexes actually were on, because they never showed up with EXPLAIN or SHOW CREATE TABLE, etc. I hope they are documented somewhere.

Also note the specialized views like actor_revision that are also used for improved performance. I do not know if those need tending to separate from running the maintain-replica-indexes script.

I can take care of it. I think it's not hard to fix. Probably some schema updates have not been reflected on it. Let me talk and see what's going on.

So it actually works:

ladsgroup@clouddb1018:~$ sudo /usr/local/sbin/maintain-replica-indexes --database fawiki --dry-run --mysql-socket /run/mysqld/mysqld.s7.sock
2023-05-31 10:35:42,472 INFO 
    ALTER TABLE fawiki.page_props ADD KEY pp_value_prefix (pp_value(767))
    
2023-05-31 10:35:42,474 INFO 
    ALTER TABLE fawiki.logging ADD KEY log_comment_type (log_comment_id, log_type, log_deleted)
    
2023-05-31 10:35:42,476 INFO 
    ALTER TABLE fawiki.protected_titles ADD KEY pt_reason_id (pt_reason_id)
    
2023-05-31 10:35:42,477 INFO 
    ALTER TABLE fawiki.oldimage ADD KEY oi_description_deleted (oi_description_id, oi_deleted)
    
2023-05-31 10:35:42,479 INFO 
    ALTER TABLE fawiki.ipblocks ADD KEY ipb_reason_deleted (ipb_reason_id, ipb_deleted)
    
2023-05-31 10:35:42,481 INFO 
    ALTER TABLE fawiki.recentchanges ADD KEY rc_comment_deleted (rc_comment_id, rc_deleted)
    
2023-05-31 10:35:42,483 INFO 
    ALTER TABLE fawiki.filearchive ADD KEY fa_description_deleted (fa_description_id, fa_deleted)
    
2023-05-31 10:35:42,484 INFO 
    ALTER TABLE fawiki.filearchive ADD KEY fa_reason (fa_deleted_reason_id)
    
2023-05-31 10:35:42,486 INFO 
    ALTER TABLE fawiki.revision_comment_temp ADD KEY revcomment_comment_id (revcomment_comment_id)
    
2023-05-31 10:35:42,488 INFO 
    ALTER TABLE fawiki.image ADD KEY img_description_id (img_description_id)
    
2023-05-31 10:35:42,490 INFO 
    ALTER TABLE fawiki.slots ADD KEY slot_content_id (slot_content_id)
    
2023-05-31 10:35:42,492 INFO 
    ALTER TABLE fawiki.archive ADD KEY ar_actor_deleted (ar_actor, ar_deleted)
    
2023-05-31 10:35:42,493 INFO 
    ALTER TABLE fawiki.ipblocks ADD KEY ipb_actor_deleted (ipb_by_actor, ipb_deleted)
    
2023-05-31 10:35:42,495 INFO 
    ALTER TABLE fawiki.oldimage ADD KEY oi_actor_deleted (oi_actor, oi_deleted)
    
2023-05-31 10:35:42,498 INFO 
    ALTER TABLE fawiki.filearchive ADD KEY fa_actor_deleted (fa_actor, fa_deleted)
    
2023-05-31 10:35:42,500 INFO 
    ALTER TABLE fawiki.recentchanges ADD KEY rc_actor_deleted (rc_actor, rc_deleted)
    
2023-05-31 10:35:42,502 INFO 
    ALTER TABLE fawiki.logging ADD KEY log_actor_deleted (log_actor, log_deleted)

but it has two major issues:

When you call it with --help or anything that it can't understand, it breaks in a spectacular fashion:

ladsgroup@clouddb1018:~$ sudo python3 /usr/local/sbin/maintain-replica-indexes --help
Traceback (most recent call last):
  File "/usr/local/sbin/maintain-replica-indexes", line 210, in <module>
    main()
  File "/usr/local/sbin/maintain-replica-indexes", line 132, in main
    args = parser.parse_args()
  File "/usr/lib/python3.9/argparse.py", line 1830, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 1863, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python3.9/argparse.py", line 2072, in _parse_known_args
    start_index = consume_optional(start_index)
  File "/usr/lib/python3.9/argparse.py", line 2012, in consume_optional
    take_action(action, args, option_string)
  File "/usr/lib/python3.9/argparse.py", line 1940, in take_action
    action(self, namespace, argument_values, option_string)
  File "/usr/lib/python3.9/argparse.py", line 1104, in __call__
    parser.print_help()
  File "/usr/lib/python3.9/argparse.py", line 2560, in print_help
    self._print_message(self.format_help(), file)
  File "/usr/lib/python3.9/argparse.py", line 2544, in format_help
    return formatter.format_help()
  File "/usr/lib/python3.9/argparse.py", line 295, in format_help
    help = self._root_section.format_help()
  File "/usr/lib/python3.9/argparse.py", line 226, in format_help
    item_help = join([func(*args) for func, args in self.items])
  File "/usr/lib/python3.9/argparse.py", line 226, in <listcomp>
    item_help = join([func(*args) for func, args in self.items])
  File "/usr/lib/python3.9/argparse.py", line 226, in format_help
    item_help = join([func(*args) for func, args in self.items])
  File "/usr/lib/python3.9/argparse.py", line 226, in <listcomp>
    item_help = join([func(*args) for func, args in self.items])
  File "/usr/lib/python3.9/argparse.py", line 542, in _format_action
    help_text = self._expand_help(action)
  File "/usr/lib/python3.9/argparse.py", line 638, in _expand_help
    return self._get_help_string(action) % params
ValueError: unsupported format character 'w' (0x77) at index 41

and when you run it with dry-run, it's quite quiet:

ladsgroup@clouddb1018:~$ sudo /usr/local/sbin/maintain-replica-indexes --database fawiki --mysql-socket /run/mysqld/mysqld.s7.sock
ladsgroup@clouddb1018:~$

But it works?

root@clouddb1018:s7[fawiki]> show indexes from logging;
+---------+------------+---------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name            | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+---------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| logging |          0 | PRIMARY             |            1 | log_id         | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_page_id_time    |            1 | log_page       | A         |     5926222 |     NULL | NULL   | YES  | BTREE      |         |               |
| logging |          1 | log_page_id_time    |            2 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_type_time |            1 | log_actor      | A         |     2963111 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_type_time |            2 | log_type       | A         |     2963111 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_type_time |            3 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_type_action     |            1 | log_type       | A         |        7933 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_type_action     |            2 | log_action     | A         |       12827 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_type_action     |            3 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_type_time       |            1 | log_type       | A         |        8208 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_type_time       |            2 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_time      |            1 | log_actor      | A         |     2963111 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_time      |            2 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_page_time       |            1 | log_namespace  | A         |        3272 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_page_time       |            2 | log_title      | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_page_time       |            3 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_times           |            1 | log_timestamp  | A         |     5926222 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_comment_type    |            1 | log_comment_id | A         |     1481555 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_comment_type    |            2 | log_type       | A         |     1481555 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_comment_type    |            3 | log_deleted    | A         |     1481555 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_deleted   |            1 | log_actor      | A         |     2963111 |     NULL | NULL   |      | BTREE      |         |               |
| logging |          1 | log_actor_deleted   |            2 | log_deleted    | A         |     2963111 |     NULL | NULL   |      | BTREE      |         |               |
+---------+------------+---------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
22 rows in set (0.000 sec)

I probably need to make this also idempotent

I was running it but it just broke, I am not sure if by running it twice there was a race condition but my execution broke @Ladsgroup

root@clouddb1021:~#  /usr/local/sbin/maintain-replica-indexes
Traceback (most recent call last):
  File "/usr/local/sbin/maintain-replica-indexes", line 210, in <module>
    main()
  File "/usr/local/sbin/maintain-replica-indexes", line 182, in main
    write_index(
  File "/usr/local/sbin/maintain-replica-indexes", line 94, in write_index
    cursor.execute(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 684, in _read_packet
    packet.check_error()
  File "/usr/lib/python3/dist-packages/pymysql/protocol.py", line 220, in check_error
    err.raise_mysql_exception(self._data)
  File "/usr/lib/python3/dist-packages/pymysql/err.py", line 109, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.InternalError: (1072, "Key column 'log_comment_id' doesn't exist in table")

I was running it on clouddb1018 so it should not affect you. I take a look.

I would like to see involvement from either cloud-services-team and/or Data-Engineering as they are likely to know more things than we do, and we are sort of doing some ad-hoc reverse engineering.
For now, let's stick to clouddb1021 and let's leave the rest of the host aside until we are sure about how this works and how long this take (it is likely to take several hours per run)

Marostegui raised the priority of this task from Medium to High.May 31 2023, 10:47 AM

I have a rather educated guess why it's erroring out. The way we do schema changes is by looking up dblists per section (e.g. https://noc.wikimedia.org/conf/dblists/s7.dblist) but the code here tries to get any database that has the given table (by querying information_schema). The problem is that we have a couple of "deleted" wikis (not closed, deleted) that their database is still sitting somewhere lingering around and they are not getting any schema changes. See T297297: Investigate the unusual dbs in s3

Two ways to fix this:

  • Make the update index lookup dblist from noc, that's rather easy to do
  • Drop databases that are lingering from sanitarium. (I prefer dropping them from everywhere but I understand why that could cause issues).

These are not mutually exclusive though. I work on the first one now.

Change 924923 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] wikireplicas: Only try to update section databases in maintain_replica_indexes.py

https://gerrit.wikimedia.org/r/924923

Yup, with the patch works like a charm:

...    
2023-05-31 12:57:55,295 DEBUG SHOW INDEX from afwiki.revision_comment_temp
2023-05-31 12:57:55,296 INFO 
    ALTER TABLE afwiki.revision_comment_temp ADD KEY revcomment_comment_id (revcomment_comment_id)
...

Change 924923 merged by Ladsgroup:

[operations/puppet@production] wikireplicas: Only add index in dbs defined in sections dblist

https://gerrit.wikimedia.org/r/924923

Since it's fixed, running it on clouddb1013 now

And it's failing waiting for table lock. Someone need to depool clouddb1013

Change 924995 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] Depool half of wikireplicas

https://gerrit.wikimedia.org/r/924995

Change 924995 merged by Ladsgroup:

[operations/puppet@production] Depool half of wikireplicas

https://gerrit.wikimedia.org/r/924995

Aaand it had another issue making it really hard to do the schema changes: It doesn't stop replication and thus it really can't get a lock in busy wikis such as enwiki.

@Ladsgroup clouddb1013:3311 has replication stopped, was that you I guess? Just double checking

yeah, the ALTERs couldn't go through because of innodb lock the script sets. I stopped replication and alters are happening, once done, I clean it up.

Ladsgroup updated the task description. (Show Details)

had to the same on clouddb1015:3314 now

Need to re-run it again to take https://gerrit.wikimedia.org/r/c/operations/puppet/+/925547 into account

Ongoing in clouddb1015 and clouddb1021

And clouddb1013

and clouddb1014

and clouddb1016

This is going to alter revision table in enwiki and wikidatawiki. It'll take a while to finish.

Wikidatawiki shouldn't be touched, s8 has not been part of the breakage. So we can skip it

Wikidatawiki shouldn't be touched, s8 has not been part of the breakage. So we can skip it

Yeah but as part of T215466: Remove revision_comment_temp and revision_actor_temp we need to redo the views and add an extra index everywhere. So while half of the cluster is depooled, let's do it to avoid needing to do the whole thing again next week.

Change 925750 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] Repool half of wikireplicas, depool the other half

https://gerrit.wikimedia.org/r/925750

Change 925750 merged by Ladsgroup:

[operations/puppet@production] Repool half of wikireplicas, depool the other half

https://gerrit.wikimedia.org/r/925750

So I repooled the ones that have been maintained and have the indexes (and no view on rev_comment) and depooled the rest to do the maint work. This means things should be fixed for users now. @MusikAnimal can you confirm?

I just tested my Quarry queries that were failing or taking hours when I last tested 3 hours ago and now all complete sub minute as expected. Also https://xtools.wmcloud.org/ec is now back working.

Not sure if you are aware, but CopyPatrol is still not working.

https://copypatrol.toolforge.org/en

So I repooled the ones that have been maintained and have the indexes (and no view on rev_comment) and depooled the rest to do the maint work. This means things should be fixed for users now. @MusikAnimal can you confirm?

Yes! Things seem to be working great again, and I see the flood of errors from XTools has more or less stopped. Thank you @Ladsgroup and @Marostegui!!

Everything is done, tomorrow I will repool them back.

Change 926453 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] wikireplicas: Repool clouddb10[17-20]

https://gerrit.wikimedia.org/r/926453

So I repooled the ones that have been maintained and have the indexes (and no view on rev_comment) and depooled the rest to do the maint work. This means things should be fixed for users now. @MusikAnimal can you confirm?

Yes! Things seem to be working great again, and I see the flood of errors from XTools has more or less stopped. Thank you @Ladsgroup and @Marostegui!!

Awesome. Glad it's back. I'm repooling the other half now. Let me know if things break.

Change 926453 merged by Ladsgroup:

[operations/puppet@production] wikireplicas: Repool clouddb10[17-20]

https://gerrit.wikimedia.org/r/926453

Ladsgroup claimed this task.
Ladsgroup moved this task from In progress to Done on the DBA board.