Page MenuHomePhabricator

Convert unique keys into primary keys for some wiki tables on s7
Closed, ResolvedPublic

Description

Pending hosts:

codfw - entire DC

./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=categorylinks "add primary key (cl_from,cl_to)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=categorylinks "drop key cl_from" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=templatelinks "add primary key (tl_from,tl_namespace,tl_title)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=pagelinks "add primary key (pl_from,pl_namespace,pl_title)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=text "drop key old_id, add primary key (old_id)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=imagelinks "drop key il_from, add primary key (il_from,il_to)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=iwlinks "drop key iwl_from, add primary key (iwl_from,iwl_prefix,iwl_title)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=langlinks "drop key ll_from, add primary key (ll_from,ll_lang)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=log_search "drop key ls_field_val, add primary key (ls_field,ls_value,ls_log_id)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=math "drop key math_inputhash, add primary key (math_inputhash)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=module_deps "drop key md_module_skin, add primary key (md_module,md_skin)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=objectcache "drop key keyname, add primary key (keyname)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=querycache_info "drop key  qci_type, add primary key (qci_type)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=site_stats "drop key ss_row_id, add primary key (ss_row_id)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=transcache "drop key tc_url_idx, add primary key (tc_url)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=user_former_groups "drop key ufg_user_group, add primary key (ufg_user,ufg_group)" --method=ddl --no-replicate
./software/dbtools/osc_host.sh --host=$host.eqiad.wmnet --dblist=$shard.dblist --table=user_properties "drop key user_properties_user_property, add primary key (up_user,up_property)" --method=ddl --no-replicate

eqiad: all hosts except db1062 (master) and db1094

  • labsdb1001
  • labsdb1003
  • db1069 (Will not be done as this will be decommissioned soon)
  • dbstore1001
  • dbstore1002
  • db1028
  • db1033
  • db1034
  • db1039
  • db1041
  • db1079
  • db1086
  • db1094
  • db1062

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-06-21T13:22:21Z] <marostegui> Deploy alter table on s7 - directly on codfw master (db2029) - this will generate lag on codfw - T166208

Change 361028 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add coment to db1041 running alter

https://gerrit.wikimedia.org/r/361028

Change 361028 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add coment to db1041 running alter

https://gerrit.wikimedia.org/r/361028

Mentioned in SAL (#wikimedia-operations) [2017-06-23T06:15:18Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add comments to db1041 long running alter status - T166208 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-06-23T06:17:22Z] <marostegui> Deploy alter table on db1041 - s7 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-26T06:26:04Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove comments from db1041 long running alter status - T166208 (duration: 00m 47s)

Change 361399 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1086

https://gerrit.wikimedia.org/r/361399

Change 361399 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1086

https://gerrit.wikimedia.org/r/361399

Mentioned in SAL (#wikimedia-operations) [2017-06-26T06:35:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1086 - T166208 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-06-26T06:36:54Z] <marostegui> Deploy alter table s7 - db1086 - T166208

Change 361452 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add comments to db1033

https://gerrit.wikimedia.org/r/361452

Change 361452 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add comments to db1033

https://gerrit.wikimedia.org/r/361452

Mentioned in SAL (#wikimedia-operations) [2017-06-26T13:48:29Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add comments to db1033 status - T166208 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2017-06-26T13:49:55Z] <marostegui> Deploy alter table s7 - db1033 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-26T15:33:51Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1086 - T166208 (duration: 00m 46s)

Change 361472 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/361472

Change 361472 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/361472

Mentioned in SAL (#wikimedia-operations) [2017-06-26T15:38:01Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T166208 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-06-26T15:41:10Z] <marostegui> Deploy alter table s7 - db1079 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-27T05:04:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 - T166208 (duration: 00m 43s)

Change 361627 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1034

https://gerrit.wikimedia.org/r/361627

Change 361627 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1034

https://gerrit.wikimedia.org/r/361627

Mentioned in SAL (#wikimedia-operations) [2017-06-27T06:47:49Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1034 - T166208 (duration: 00m 43s)

cloud-services-team I am going to start the alter table on labsdb1001, so s7 will get delayed around a day.

Mentioned in SAL (#wikimedia-operations) [2017-06-27T06:48:07Z] <marostegui> Deploy alter table s7 on labsdb1001 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-27T07:11:54Z] <marostegui> Deploy alter table db1034 - T166208

db1034 seems to be having physical disk issues-lots of errors (I have not created a ticket yet). Consider not pooling it back again, or at least not without a second look and maybe converting an alternative host.

Oh - thanks for the heads up. I will take a look tomorrow

Mentioned in SAL (#wikimedia-operations) [2017-06-28T05:24:48Z] <marostegui> Stop MySQL and reboot db1034 for maintenance - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-28T05:27:35Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove comments from db1033 status - T166208 (duration: 00m 47s)

I have rebooted db1034 and it came back fine, so did MySQL.
RAID is fine and HW logs only show some disk errors from a month ago (I assume the timestamp is correct):

		CreationTimestamp = 20160525140208.000000+000
		ElementName = System Event Log Entry
		RecordData = *2*Storage Drive 5: Drive Slot sensor for Storage, drive presence was asserted

I have started the ALTER tables again and this is the current status after the reboot:

root@db1034:~# megacli -PDList -aALL | grep Error
Media Error Count: 0
Other Error Count: 1
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0

We will see if it increases.

Enclosure Device ID: 32
Slot Number: 5
Drive's position: DiskGroup: 0, Span: 2, Arm: 1
Enclosure position: N/A
Device Id: 5
WWN: 5000C5003240B064
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 68328

PD Type: SAS

Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Sector Size: 0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5003240b065
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3300657SS ES646SJ0GQZJ
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :37C (98.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes

However there is another disk having errors, and not that one:

root@db1034:~# megacli -PDList -aALL | grep Error
Media Error Count: 0
Other Error Count: 65
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 0
WWN: 5000C5003240E284
Sequence Number: 2
Media Error Count: 0
Other Error Count: 65
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Interesting that the errors have not increased for the disk reported on: T166208#3385167 maybe those were generated at: 20160525140208.000000+000 (as per the HW logs)

Enclosure Device ID: 32
Slot Number: 5
Drive's position: DiskGroup: 0, Span: 2, Arm: 1
Enclosure position: N/A
Device Id: 5
WWN: 5000C5003240B064
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 68328
PD Type: SAS

My report came after the RAID check timed out, which made me worry- check history of alerts on icinga.

Unfortunately we do not have any history from yesterday, but the host was indeed having issues as can be seen on the log until MySQL crashed at midnight:

InnoDB: Warning: a long semaphore wait:
--Thread 140537676445440 has waited at fsp0fsp.cc line 2862 for 241.00 seconds the semaphore:
Mutex at 0x7fdef6bfe8e8 '&dict_sys->mutex', lock var 1
waiters flag 1
InnoDB: Warning: semaphore wait:

<snip>

And at 00:35 it crashed:

----------------------------
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2017-06-28 00:35:42 7fd209ffd700  InnoDB: Assertion failure in thread 140540087621376 in file srv0srv.cc line 2200
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170628  0:35:42 [ERROR] mysqld got signal 6 ;

Change 362139 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1028

https://gerrit.wikimedia.org/r/362139

Change 362139 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1028

https://gerrit.wikimedia.org/r/362139

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:10:07Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1028 - T166208 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:13:45Z] <marostegui> Deploy alter table on s7 - db1028 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-29T14:08:51Z] <marostegui> Deploy alter table on s7 on dbstore1001 - T166208

Mentioned in SAL (#wikimedia-operations) [2017-06-30T13:39:56Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1028 - T166208 (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-07-03T07:35:37Z] <marostegui> Drop alter table s7 - labsdb1003 - T166208

cloud-services-team I have started the alter table on labsdb1003, so s7 replication thread will be delayed for around 48h.

Mentioned in SAL (#wikimedia-operations) [2017-07-03T07:40:52Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1034 - T166208 (duration: 03m 00s)

Mentioned in SAL (#wikimedia-operations) [2017-07-03T07:51:36Z] <marostegui> Deploy alter table db1039 - s7 - T166208

Change 362947 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add comments to db1039 status

https://gerrit.wikimedia.org/r/362947

Change 362947 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add comments to db1039 status

https://gerrit.wikimedia.org/r/362947

Mentioned in SAL (#wikimedia-operations) [2017-07-03T08:02:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add comments about db1039 status - T166208 (duration: 02m 49s)

Mentioned in SAL (#wikimedia-operations) [2017-07-04T11:47:52Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove comments from db1039 status - T166208 (duration: 02m 50s)

Marostegui updated the task description. (Show Details)

Everything is done