The RAID is back to Optimal!
root@db2058:~# hpssacli controller all show config
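For reference, a quick way to double-check the array state after this kind of event (just a sketch; I'm assuming the controller sits in slot 0, adjust to whatever "show config" reports):

hpssacli controller all show status
hpssacli controller slot=0 logicaldrive all show status
hpssacli controller slot=0 physicaldrive all show status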
@Papaul has removed and re-inserted the disk, and it is rebuilding again.
Let's see if it goes fine this time or whether we have to replace it completely.
root@db2043:~# hpssacli controller all show config
Duplicate of T225889
The disk has failed - can we try a different one?
root@db2043:~# hpssacli controller all show config
Sorry, this ^ was for db2043
The disk failed, can we try another one? Thanks!
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Failed)
Thanks!
It is rebuilding
root@db2058:~# hpssacli controller all show config
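To keep an eye on the rebuild progress, something like this should do (sketch only: slot 0 is assumed, and the drive address is the one from the failed-drive output above):

hpssacli controller slot=0 physicaldrive 1I:1:3 show detail

While rebuilding, the physical drive should report a Rebuilding status, and the logical drive should go back to OK once it finishes.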
All the fishbowl wikis are done:
for i in `cat s3_fishbowl | awk -F "." '{print $1}'`; do echo $i; mysql.py -hdb1123 $i -e "show create table oathauth_users\G" | egrep "module|data"; done
amwikimedia `module` varbinary(255) NOT NULL, `data` blob,
cnwikimedia `module` varbinary(255) NOT NULL, `data` blob,
donatewiki `module` varbinary(255) NOT NULL, `data` blob,
fixcopyrightwiki `module` varbinary(255) NOT NULL, `data` blob,
foundationwiki `module` varbinary(255) NOT NULL, `data` blob,
hiwikimedia `module` varbinary(255) NOT NULL, `data` blob,
idwikimedia `module` varbinary(255) NOT NULL, `data` blob,
maiwikimedia `module` varbinary(255) NOT NULL, `data` blob,
nostalgiawiki `module` varbinary(255) NOT NULL, `data` blob,
punjabiwikimedia `module` varbinary(255) NOT NULL, `data` blob,
romdwikimedia `module` varbinary(255) NOT NULL, `data` blob,
rswikimedia `module` varbinary(255) NOT NULL, `data` blob,
votewiki `module` varbinary(255) NOT NULL, `data` blob,
wbwikimedia `module` varbinary(255) NOT NULL, `data` blob,
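For the record, the same check against a single wiki looks like this (donatewiki picked from the list above just as an example):

mysql.py -hdb1123 donatewiki -e "show create table oathauth_users\G" | egrep "module|data"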
db1077 is now replicating from db1111 in the test-s4 cluster.
The temporary data has also been removed from dbprov1001.
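A quick way to double-check the new db1077 ⇒ db1111 topology, just as a sketch using the same mysql.py wrapper as in the other comments (I'm assuming it passes the arguments straight to the client):

mysql.py -hdb1077 -e "show slave status\G" | egrep "Master_Host|Seconds_Behind_Master"

Master_Host should point to db1111 and Seconds_Behind_Master should stay close to 0.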
Please mark disk #3 as broken so it doesn't get re-used. T226155: Degraded RAID on db2039
This host is scheduled for decommissioning (T225988), so no need to act on it. Just label the disk as broken so it doesn't get re-used.
And after the reboot the battery fully failed T226154:
Battery/Capacitor Count: 0
This is a known BBU issue: T225981 T225391#5261662
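For completeness, this is roughly how the BBU state can be checked on these controllers (a sketch; the grep pattern is based on the Battery/Capacitor line shown above):

hpssacli controller all show detail | grep -iE "battery|capacitor|cache status"

A healthy controller should report a non-zero Battery/Capacitor Count and the cache as enabled.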
db1112 is now the sanitarium master for s3.
centralauth has been done:
root@cumin1001:/home/marostegui# mysql.py -hdb1090:3317 centralauth -e "show create table oathauth_users\G" | egrep "module|data"
  `module` varbinary(255) NOT NULL,
  `data` blob,
Great news! Thanks a lot!!
Thanks for confirming, I just wanted to make sure the planning didn't change, as there have been many migration subtasks and it was hard to keep up with all the plans :-)
I will try to get the db master change scheduled for July.
All the private wikis have been altered:
advisorswiki `module` varbinary(255) NOT NULL, `data` blob,
arbcom_cswiki `module` varbinary(255) NOT NULL, `data` blob,
arbcom_dewiki `module` varbinary(255) NOT NULL, `data` blob,
arbcom_enwiki `module` varbinary(255) NOT NULL, `data` blob,
arbcom_fiwiki `module` varbinary(255) NOT NULL, `data` blob,
arbcom_nlwiki `module` varbinary(255) NOT NULL, `data` blob,
auditcomwiki `module` varbinary(255) NOT NULL, `data` blob,
boardgovcomwiki `module` varbinary(255) NOT NULL, `data` blob,
boardwiki `module` varbinary(255) NOT NULL, `data` blob,
chairwiki `module` varbinary(255) NOT NULL, `data` blob,
chapcomwiki `module` varbinary(255) NOT NULL, `data` blob,
checkuserwiki `module` varbinary(255) NOT NULL, `data` blob,
collabwiki `module` varbinary(255) NOT NULL, `data` blob,
ecwikimedia `module` varbinary(255) NOT NULL, `data` blob,
electcomwiki `module` varbinary(255) NOT NULL, `data` blob,
execwiki `module` varbinary(255) NOT NULL, `data` blob,
fdcwiki `module` varbinary(255) NOT NULL, `data` blob,
grantswiki `module` varbinary(255) NOT NULL, `data` blob,
id_internalwikimedia `module` varbinary(255) NOT NULL, `data` blob,
iegcomwiki `module` varbinary(255) NOT NULL, `data` blob,
ilwikimedia `module` varbinary(255) NOT NULL, `data` blob,
internalwiki `module` varbinary(255) NOT NULL, `data` blob,
legalteamwiki `module` varbinary(255) NOT NULL, `data` blob,
movementroleswiki `module` varbinary(255) NOT NULL, `data` blob,
noboard_chapterswikimedia `module` varbinary(255) NOT NULL, `data` blob,
officewiki `module` varbinary(255) NOT NULL, `data` blob,
ombudsmenwiki `module` varbinary(255) NOT NULL, `data` blob,
otrs_wikiwiki `module` varbinary(255) NOT NULL, `data` blob,
projectcomwiki `module` varbinary(255) NOT NULL, `data` blob,
searchcomwiki `module` varbinary(255) NOT NULL, `data` blob,
spcomwiki `module` varbinary(255) NOT NULL, `data` blob,
stewardwiki `module` varbinary(255) NOT NULL, `data` blob,
techconductwiki `module` varbinary(255) NOT NULL, `data` blob,
transitionteamwiki `module` varbinary(255) NOT NULL, `data` blob,
wg_enwiki `module` varbinary(255) NOT NULL, `data` blob,
wikimaniateamwiki `module` varbinary(255) NOT NULL, `data` blob,
zerowiki `module` varbinary(255) NOT NULL, `data` blob,
As per my chat with @Reedy, the code is merged, he's done some testing, and it looks good, so I will try to get this schema change done this week.
This host is no longer a master and will be decommissioned in a few days
db1112 is now cloned from db1077. I am going to let it replicate for 24h before changing sanitarium to replicate from it and to pool it in s3.
Looks fixed then
Yeah, it got some delay:
Wed, Jun 19, 2019 at 10:27 AM (Delivered after 1752 seconds)
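(1752 seconds is roughly 29 minutes, so the email arrived about half an hour late.)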
Thanks for the update @Lea_Lacroix_WMDE - I remember I talked to @alaa_wmde about waiting until we had the new db primary master in place for wikidata on some tasks. I have only been able to find this T219145#5088395 but I reckon we spoke about it somewhere else too.
From reading T221765 I guess this first migration will only cover 1% of all the items, and the rest are still marked as TBD, right?
Once we have swapped the current db master, we can proceed with the rest of the migration, is that still the plan?
Regardless of the query... the email hasn't arrived yet and the script didn't show any errors. So some debugging is probably needed to check what is happening to the email. //cc @herron
From the exim logs I see the email being sent correctly though.
I have tested sending an email to myself from the CLI and it arrived correctly; however, the script doesn't seem to be working correctly, as the email isn't reaching either wikitech-l or me (I modified the script to send it to me).
Exim still logs it as being sent correctly.
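In case it helps with the debugging, this is roughly how the Exim side can be checked (sketch only; the mainlog path is the standard Debian exim4 one and may differ on these hosts):

# look for the recipient / message id in the main log
grep -i "wikitech-l" /var/log/exim4/mainlog
# check whether the message is still sitting in the queue
mailq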
So if we want the ones from May, we need to modify all the queries on that script to make them pick the right range; not sure if it is worth the time?
I just ran it, but I think it gave the delta between today and 1 month ago as most of the queries are:
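Just to illustrate the kind of change I mean (everything here is hypothetical, it is not the script's actual query): the relative range would have to be turned into an explicit one, e.g.

# hypothetical sketch only: host, database, table and column names are made up
mysql.py -hdbXXXX somedb -e "SELECT COUNT(*) FROM some_table WHERE some_ts >= '2019-05-01' AND some_ts < '2019-06-01'"
# instead of the current relative form, roughly: WHERE some_ts >= NOW() - INTERVAL 1 MONTH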
So this is the change I will push the 25th of June to change the last key: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/517807/
I will follow the same procedure that was followed for the previous two keys.
The failover was done, so we can probably keep compressing tables.
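Assuming the table compression here refers to InnoDB compressed row format, a sketch of the kind of statement involved (host and table name are only placeholders):

# hypothetical host and table, shown only as an illustration of the compression step
mysql.py -hdbXXXX somewiki -e "ALTER TABLE some_table ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8"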
@jcrespo let me know if you would like to handle this yourself or whether you want me to take over so you can focus on backups :)
So far everything looks good, so closing this.
This happened successfully.
Read only times (UTC):
Thanks for all the checks!
I will depool db1081 early in the morning, good idea :)
This recovered itself - no more issues since 13th May
Not yet. I haven't seen more errors, but I want to wait until the Icinga alert clears up; let's give it another 24h.
This host is ready for DC-Ops to take over and decommission
It is temporary and won't last more than 2 days, but OK.
Assigning this to myself to indicate that I am using dbstore1001 for a few days to temporarily store the content of db1112 (test cluster data). Once I have finished, I will reassign it back to Rob.
test-cluster users have been notified that on Thursday the replica will go offline to be replaced by db1077.
In T225704#5264276, @ayounsi wrote:No differences other than the physical row they're in. They will be able to reach the same resources.
In T225704#5264272, @ayounsi wrote:dbproxy1001-1008 are in the private vlans across row A-B-C, none in D. Is row D private fine for dbproxy1020/1021 or should they be in private-A/B/C ?
dbproxy1020/1021 can go in the same vlans as dbproxy1001-1008, as they will be replacing some of them.
So, to be clear from my side:
I have deployed this change on db1073 for labswiki and labtestwiki just to have it done there in advance to check if something breaks in the next few days.
Great!
So what do you have in mind?
pc1008 tables optimization finished:
root@pc1008:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.0T  2.4T  46% /srv
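In case this needs to be repeated on the other parsercache hosts, a minimal sketch of a per-table optimization run (I'm assuming the pcNNN table naming and a plain OPTIMIZE TABLE; the actual procedure may have been different):

# repeat for each pcNNN table on the host; pc000 is just an example
mysql.py -hpc1008 parsercache -e "OPTIMIZE TABLE pc000"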
1018 and 1019 are ok to go to cloud VLAN from my side (as they are in row C)
We just need two hosts on that vlan
Yep! Not a problem, I don't mind which hosts as long as we have two on that VLAN, whichever ones work best for you
@Cmjohnson which ones will go in the cloud vlan finally?
1018 and 1019 or 1020 and 1021?
I'm fine either way but I'm confused with your last comment :)
In T225704#5262204, @jcrespo wrote:No name change? I do not mind, just want to make sure it is a conscious decision.
@Cmjohnson I have updated the task with the racking proposal at the beginning.
Thanks!
Thanks! I will update the task accordingly to reflect this discussion at the top so it is easier for Chris.
In T225704#5262147, @jcrespo wrote:Thinking more, as toolsdb was cannibalized by openstack, maybe its potential proxies should be too. I guess 2/2 is the safe option right now. Sorry, but I didn't think too much about this in advance. Cloud input would be nice for future service expansion and general load balancing/failover needs.
m5 doesn't currently use the proxies (I know it should, but they are not being used at the moment) (T202367#5252689)
Assigning this to myself to let Chris know that this is still blocked on DBAs to decide.
So for now 2 of them will go to replace 1010 and 1011 for sure.
In T225391#5261673, @jcrespo wrote:note db1114 was a host we removed from production because it was unstable. I would vote for another. Did you try depooling and forcing a learning cycle?
Also, db1114 (test-s1) could be placed instead of db1077, and db1077 could then become test-s1?
db1077 has had its BBU in charging status for around 30h now. I have taken a look at the HW logs and:
Host rebooted. No new logs on HW side.
Some more errors from yesterday evening:
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: event severity: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: Error 0, type: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: fru_text: A1
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: section_type: memory error
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: error_status: 0x0000000000000400
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: physical_address: 0x0000003cb28d7e40
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 504
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: error_type: 2, single-bit ECC
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: TSC 119d3ab72dafe6
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: ADDR 3cb28d7e40
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560720886 SOCKET 0 APIC 0
[Sun Jun 16 21:33:58 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28d7 offset:0xe40 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 21:35:44 2019] mce: [Hardware Error]: Machine check events logged
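To keep an eye on whether these corrected errors keep accumulating, something along these lines should be enough (just a sketch; edac-util comes from the edac-utils package and may not be installed on this host):

# corrected hardware error lines logged since boot
dmesg | grep -c "Hardware Error"
# per-DIMM corrected/uncorrected counters from EDAC
edac-util -v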