Jun 20 2019

Marostegui closed T225902: Degraded RAID on db2058 as Resolved.

The RAID is back to Optimal!

root@db2058:~# hpssacli controller all show config
Jun 20 2019, 3:26 PM · DBA, SRE, ops-codfw
Marostegui added a comment to T225889: Degraded RAID on db2043.

@Papaul has removed and inserted back the disk and it is rebuilding again.
Let's see if it goes fine this time or if we have to replace it completely.

root@db2043:~#  hpssacli controller all show config
Jun 20 2019, 2:19 PM · DBA, SRE, ops-codfw
Marostegui closed T226186: Degraded RAID on db2043 as Declined.

Duplicate of T225889

Jun 20 2019, 2:09 PM · SRE, ops-codfw
Marostegui reassigned T225889: Degraded RAID on db2043 from Marostegui to Papaul.

The disk has failed - can we try a different one?

root@db2043:~#  hpssacli controller all show config
Jun 20 2019, 2:07 PM · DBA, SRE, ops-codfw
Marostegui claimed T225902: Degraded RAID on db2058.
Jun 20 2019, 2:07 PM · DBA, SRE, ops-codfw
Marostegui added a comment to T225902: Degraded RAID on db2058.

Sorry, this ^ was for db2043

Jun 20 2019, 2:06 PM · DBA, SRE, ops-codfw
Marostegui reassigned T225902: Degraded RAID on db2058 from Marostegui to Papaul.

The disk failed, can we try another one? Thanks!

physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Failed)
Jun 20 2019, 2:05 PM · DBA, SRE, ops-codfw
Marostegui added a comment to T225902: Degraded RAID on db2058.

Thanks!
It is rebuilding

root@db2058:~# hpssacli controller all show config
Jun 20 2019, 2:04 PM · DBA, SRE, ops-codfw
Marostegui closed T225643: Schema change to oathauth_users as Resolved.

All the fishbowl wikis are done:

for i in `cat s3_fishbowl  | awk -F "." '{print $1}'`; do echo $i; mysql.py -hdb1123 $i -e "show create table oathauth_users\G" | egrep "module|data";done
amwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
cnwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
donatewiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
fixcopyrightwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
foundationwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
hiwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
idwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
maiwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
nostalgiawiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
punjabiwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
romdwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
rswikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
votewiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
wbwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
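
A check like the one above can also be done programmatically. This is a minimal sketch in Python, assuming the output format shown above (a wiki name on its own line, followed by the matching column definitions). The function name `wikis_missing_columns` and the `EXPECTED` set are illustrative and not part of the original tooling.

```python
# Column definitions expected after the schema change, as seen in the output above.
EXPECTED = {
    "`module` varbinary(255) NOT NULL,",
    "`data` blob,",
}


def wikis_missing_columns(output, expected=EXPECTED):
    """Return the wikis whose column lines do not include every expected definition."""
    seen = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("`"):
            # Column definition: attach it to the most recent wiki name.
            if current is not None:
                seen[current].add(line)
        else:
            # Any other non-empty line is treated as a wiki (database) name.
            current = line
            seen[current] = set()
    return sorted(wiki for wiki, cols in seen.items() if not expected <= cols)
```

Feeding it the captured loop output should return an empty list when every wiki shows both columns; any wiki name it returns would need a manual look.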
Jun 20 2019, 10:40 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui updated the task description for T225643: Schema change to oathauth_users.
Jun 20 2019, 10:39 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui closed T225981: Replace db1077 with db1112 as Resolved.

db1077 is now replicating from db1111 in the test-s4 cluster.
The temporary data has also been removed from dbprov1001.

Jun 20 2019, 9:28 AM · DBA
Marostegui added a comment to T225988: decommission db2039.

Please mark disk #3 as broken so it doesn't get re-used (T226155: Degraded RAID on db2039).

Jun 20 2019, 7:46 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui updated the task description for T225988: decommission db2039.
Jun 20 2019, 7:46 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui closed T226155: Degraded RAID on db2039 as Declined.

This host is scheduled for decommissioning (T225988), so no need to act on it. Just label the disk as broken so it doesn't get re-used.

Jun 20 2019, 7:44 AM · SRE, ops-codfw
Marostegui added a comment to T225981: Replace db1077 with db1112.

And after the reboot the battery fully failed T226154:

Battery/Capacitor Count: 0
Jun 20 2019, 7:41 AM · DBA
Marostegui added a comment to T225391: db1077 crashed.

And after the reboot the battery fully failed T226154:

Battery/Capacitor Count: 0
Jun 20 2019, 7:41 AM · ops-eqiad, DBA, SRE
Marostegui closed T226154: Degraded RAID on db1077 as Declined.

This is a known BBU issue: T225981 T225391#5261662

Jun 20 2019, 7:38 AM · ops-eqiad, SRE
Marostegui added a comment to T225981: Replace db1077 with db1112.

db1112 is now the sanitarium master for s3.

Jun 20 2019, 6:10 AM · DBA
Marostegui updated the task description for T225643: Schema change to oathauth_users.
Jun 20 2019, 5:39 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui added a comment to T225643: Schema change to oathauth_users.

centralauth has been done:

root@cumin1001:/home/marostegui# mysql.py -hdb1090:3317 centralauth -e "show create table oathauth_users\G" | egrep "module|data"
  `module` varbinary(255) NOT NULL,
  `data` blob,
Jun 20 2019, 5:39 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui triaged T179884: Files occasionally getting uploaded to Commons without file pages. as High priority.
Jun 20 2019, 5:33 AM · Multimedia, UploadWizard, SRE-swift-storage, Commons

Jun 19 2019

Marostegui added a comment to T222731: Storage problems with new host db1133.

Great news! Thanks a lot!!

Jun 19 2019, 4:44 PM · ops-eqiad, SRE
Marostegui added a comment to T221764: Overview of wb_terms redesign.

Thanks for confirming, I just wanted to make sure the planning didn't change, as there have been many migration subtasks and it was hard to keep up with all the plans :-)
I will try to get the db master change scheduled for July.

Jun 19 2019, 3:49 PM · User-Addshore, Wikidata, wb_terms - Tool Builders Migration
Marostegui added a comment to T225643: Schema change to oathauth_users.

All the private wikis have been altered:

advisorswiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
arbcom_cswiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
arbcom_dewiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
arbcom_enwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
arbcom_fiwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
arbcom_nlwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
auditcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
boardgovcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
boardwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
chairwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
chapcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
checkuserwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
collabwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
ecwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
electcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
execwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
fdcwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
grantswiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
id_internalwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
iegcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
ilwikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
internalwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
legalteamwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
movementroleswiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
noboard_chapterswikimedia
  `module` varbinary(255) NOT NULL,
  `data` blob,
officewiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
ombudsmenwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
otrs_wikiwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
projectcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
searchcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
spcomwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
stewardwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
techconductwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
transitionteamwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
wg_enwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
wikimaniateamwiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
zerowiki
  `module` varbinary(255) NOT NULL,
  `data` blob,
Jun 19 2019, 1:31 PM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui updated the task description for T225643: Schema change to oathauth_users.
Jun 19 2019, 1:30 PM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui added a comment to T225643: Schema change to oathauth_users.

As per my chat with @Reedy the code is merged and he's done some testing and it looks good, so I will try to get this schema change done during this week.

Jun 19 2019, 12:47 PM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui added a comment to T213664: correctable memory errors db1068 (commons primary master database).

This host is no longer a master and will be decommissioned in a few days

Jun 19 2019, 10:23 AM · Patch-For-Review, DBA, SRE
Marostegui changed the status of T186188: Failover DB masters in row D, a subtask of T172459: eqiad row D switch upgrade, from Stalled to Open.
Jun 19 2019, 10:23 AM · Infrastructure-Foundations, Patch-For-Review, SRE, netops, Traffic
Marostegui changed the status of T186188: Failover DB masters in row D from Stalled to Open.
Jun 19 2019, 10:23 AM · DBA
Marostegui updated the task description for T186188: Failover DB masters in row D.
Jun 19 2019, 10:22 AM · DBA
Marostegui added a comment to T225981: Replace db1077 with db1112.

db1112 is now cloned from db1077. I am going to let it replicate for 24h before changing sanitarium to replicate from it and to pool it in s3.

Jun 19 2019, 9:20 AM · DBA
Aklapper awarded T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 a Love token.
Jun 19 2019, 9:14 AM · SRE, Regression, Mail, Phabricator
Marostegui closed T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 as Resolved.

Looks fixed then

Jun 19 2019, 9:07 AM · SRE, Regression, Mail, Phabricator
Marostegui added a comment to T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019.

Yeah, it got some delay:

Wed, Jun 19, 2019 at 10:27 AM (Delivered after 1752 seconds)
Jun 19 2019, 9:01 AM · SRE, Regression, Mail, Phabricator
Marostegui added a comment to T221764: Overview of wb_terms redesign.

Thanks for the update @Lea_Lacroix_WMDE - I remember I talked to @alaa_wmde about waiting until we had the new db primary master in place for wikidata on some tasks. I have only been able to find this T219145#5088395 but I reckon we spoke about it somewhere else too.
From reading T221765, I guess this first migration will cover only 1% of all the items, and the rest are still marked as TBD, right?
Once we have swapped the current db master, we can proceed with the rest of the migration, is that still the plan?

Jun 19 2019, 8:57 AM · User-Addshore, Wikidata, wb_terms - Tool Builders Migration
Marostegui updated subscribers of T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019.

Regardless of the query...the email hasn't arrived yet and the script didn't show any errors. So probably some debugging is needed to check what the email is doing //cc @herron
From the exim logs I see the email being sent correctly though.
I have tested sending an email to myself from the CLI and that arrived correctly. However, the script doesn't seem to be working correctly, as the email isn't arriving at either wikitech-l or my own address (I modified the script to send it to me).
Exim still marks it as correctly sent in the logs.

Jun 19 2019, 8:42 AM · SRE, Regression, Mail, Phabricator
Marostegui added a comment to T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019.

So if we want the ones from May, we need to modify all the queries in that script to make them pick the right range. Not sure if it is worth the time?
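
The difference between a rolling "one month ago" delta and a fixed calendar month can be sketched as follows. This is a hypothetical Python helper (the name `previous_month_range` is illustrative); the statistics script and its queries are not shown here, so this only illustrates how the range boundaries could be computed.

```python
from datetime import date


def previous_month_range(today):
    """Return (start, end) for the calendar month before `today`; `end` is exclusive."""
    # First day of the current month is the exclusive upper bound.
    end = today.replace(day=1)
    if end.month == 1:
        # January wraps back to December of the previous year.
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end
```

Run on 19 June 2019, this gives the range from 1 May 2019 up to (not including) 1 June 2019, rather than 19 May to 19 June.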

Jun 19 2019, 8:30 AM · SRE, Regression, Mail, Phabricator
Marostegui added a comment to T224804: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019.

I just ran it, but I think it gave the delta between today and 1 month ago as most of the queries are:

Jun 19 2019, 8:28 AM · SRE, Regression, Mail, Phabricator
Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

So this is the change I will push on the 25th of June to change the last key: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/517807/
I will follow the same procedure that was followed for the previous two keys.

Jun 19 2019, 7:54 AM · MediaWiki-libs-BagOStuff, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a comment to P8630 db1077 position.

This is for T225981

Jun 19 2019, 6:58 AM
Marostegui created P8630 db1077 position.
Jun 19 2019, 6:58 AM
Marostegui added a comment to T222978: Compress and defragment tables on labsdb hosts.

The failover was done, so we can probably keep compressing tables.
@jcrespo let me know if you would like to handle this yourself or if you want me to take over so you can focus on backups :)

Jun 19 2019, 5:56 AM · Data-Services, DBA
Marostegui closed T224852: Failover s4 primary master: db1068 to db1081 as Resolved.

So far everything looks good, so closing this.

Jun 19 2019, 5:23 AM · SRE, DBA
Marostegui closed T224852: Failover s4 primary master: db1068 to db1081, a subtask of T186188: Failover DB masters in row D, as Resolved.
Jun 19 2019, 5:23 AM · DBA
Marostegui closed T224516: Database primary master failover on s4 (commonswiki) as Resolved.

This happened successfully.
Read only times (UTC):

Jun 19 2019, 5:15 AM · User-notice-archive, User-Johan, Commons, MoveComms-Support (Apr-Jun-2019)
Marostegui closed T224516: Database primary master failover on s4 (commonswiki), a subtask of T224852: Failover s4 primary master: db1068 to db1081, as Resolved.
Jun 19 2019, 5:15 AM · SRE, DBA
Marostegui added a comment to T224852: Failover s4 primary master: db1068 to db1081.

This happened successfully.
Read only times (UTC):

Jun 19 2019, 5:14 AM · SRE, DBA

Jun 18 2019

Marostegui added a comment to T224852: Failover s4 primary master: db1068 to db1081.

Thanks for all the checks!
I will depool db1081 early in the morning, good idea :)

Jun 18 2019, 6:21 PM · SRE, DBA
Marostegui closed T194249: kafka1023 correctable memory errors as Resolved.

This recovered itself - no more issues since 13th May

Jun 18 2019, 1:48 PM · SRE, ops-eqiad
Marostegui claimed T225884: db2084 temporary correctable hardware errors.
Jun 18 2019, 11:54 AM · SRE, DBA
Marostegui moved T225884: db2084 temporary correctable hardware errors from Backlog to Acknowledged on the SRE board.
Jun 18 2019, 10:35 AM · SRE, DBA
Marostegui added a comment to T225884: db2084 temporary correctable hardware errors.

Not yet, I haven't seen more errors but I want to wait until icinga alert clears up, let's give it another 24h

Jun 18 2019, 9:57 AM · SRE, DBA
Marostegui updated the task description for T221533: Decommission old coredb machines (<=db2042).
Jun 18 2019, 8:26 AM · DBA
Marostegui added a project to T225988: decommission db2039: DC-Ops.

This host is ready for DC-Ops to take over and decommission

Jun 18 2019, 8:26 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui reassigned T225988: decommission db2039 from Marostegui to RobH.
Jun 18 2019, 8:25 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui added a parent task for T225988: decommission db2039: T221533: Decommission old coredb machines (<=db2042).
Jun 18 2019, 7:07 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui added a subtask for T221533: Decommission old coredb machines (<=db2042): T225988: decommission db2039.
Jun 18 2019, 7:07 AM · DBA
Marostegui updated the task description for T225988: decommission db2039.
Jun 18 2019, 7:07 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui claimed T225988: decommission db2039.
Jun 18 2019, 7:06 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui created T225988: decommission db2039.
Jun 18 2019, 7:06 AM · Patch-For-Review, DC-Ops, ops-codfw, decommission-hardware, SRE
Marostegui reassigned T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet from Marostegui to RobH.
Jun 18 2019, 6:20 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
Marostegui added a comment to T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet.

it is temporary and it won't last more than 2 days, but ok

Jun 18 2019, 6:18 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
Marostegui added a comment to P8622 db1112 position.

This is part of T225981

Jun 18 2019, 5:58 AM
Marostegui created P8622 db1112 position.
Jun 18 2019, 5:57 AM
Marostegui claimed T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet.
Jun 18 2019, 5:56 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
Marostegui added a comment to T220002: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet.

Assigning this to myself to indicate that I am using dbstore1001 for a few days to temporarily store the content of db1112 (test cluster data). Once I have finished, I will reassign it back to Rob.

Jun 18 2019, 5:56 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
Marostegui triaged T225981: Replace db1077 with db1112 as Medium priority.
Jun 18 2019, 5:41 AM · DBA
Marostegui moved T225981: Replace db1077 with db1112 from Triage to In progress on the DBA board.

test-cluster users have been notified that on Thursday the replica will go offline to be replaced by db1077.

Jun 18 2019, 5:41 AM · DBA
Marostegui created T225981: Replace db1077 with db1112.
Jun 18 2019, 5:40 AM · DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

No differences other than the physical row they're in. They will be able to reach the same resources.

Jun 18 2019, 5:28 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

dbproxy1001-1008 are in the private vlans across row A-B-C, none in D. Is row D private fine for dbproxy1020/1021 or should they be in private-A/B/C ?

Jun 18 2019, 5:23 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

dbproxy1020/1021 can go in the same VLANs as dbproxy1001-1008, as they will be replacing some of those.

Jun 18 2019, 5:18 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

So, to be clear from my side:

Jun 18 2019, 5:12 AM · Patch-For-Review, SRE, DBA
Marostegui moved T225643: Schema change to oathauth_users from Blocked external/Not db team to In progress on the DBA board.
Jun 18 2019, 4:45 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui triaged T225643: Schema change to oathauth_users as Medium priority.
Jun 18 2019, 4:45 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui added a comment to T225643: Schema change to oathauth_users.

I have deployed this change on db1073 for labswiki and labtestwiki just to have it done there in advance to check if something breaks in the next few days.

Jun 18 2019, 4:44 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui claimed T225643: Schema change to oathauth_users.
Jun 18 2019, 4:42 AM · MediaWiki-libs-Rdbms, DBA, MediaWiki-extensions-OATHAuth
Marostegui triaged T222731: Storage problems with new host db1133 as High priority.
Jun 18 2019, 4:41 AM · ops-eqiad, SRE
Marostegui added a comment to T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08.

Great!
So what do you have in mind?

Jun 18 2019, 4:24 AM · SRE, DBA

Jun 17 2019

Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

pc1008 tables optimization finished:

root@pc1008:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.0T  2.4T  46% /srv
Jun 17 2019, 7:49 PM · MediaWiki-libs-BagOStuff, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

1018 and 1019 are ok to go to cloud VLAN from my side (as they are in row C)
We just need two hosts on that vlan

Jun 17 2019, 6:22 PM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

Yep! Not a problem, I don't mind which hosts as long as we have two on that VLAN, whichever ones work best for you

Jun 17 2019, 6:19 PM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

@Cmjohnson which ones will go in the cloud vlan finally?
1018 and 1019 or 1020 and 1021?
I'm fine either way but I'm confused with your last comment :)

Jun 17 2019, 6:13 PM · Patch-For-Review, SRE, DBA
Marostegui updated the task description for T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4].
Jun 17 2019, 1:23 PM · Patch-For-Review, DBA
Marostegui removed projects from T196055: Remove table `math` from the database: DBA, SRE.

Removing DBA and SRE, please add the tags back once this is ready to go.

Jun 17 2019, 1:18 PM · Patch-For-Review, DBA, Math
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

No name change? I do not mind, just want to make sure it is a conscious decision.

Jun 17 2019, 10:38 AM · Patch-For-Review, SRE, DBA
Marostegui reassigned T225704: eqiad: rack/setup/install (4) dbproxy systems. from Marostegui to Cmjohnson.

@Cmjohnson I have updated the task with the racking proposal at the beginning.
Thanks!

Jun 17 2019, 10:37 AM · Patch-For-Review, SRE, DBA
Marostegui updated the task description for T225704: eqiad: rack/setup/install (4) dbproxy systems..
Jun 17 2019, 10:36 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

Thanks! I will update the task accordingly to reflect this discussion on top so it is easier for Chris

Jun 17 2019, 10:32 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

Thinking more, as toolsdb was cannibalized by openstack, maybe its potential proxies should be too. I guess 2/2 is the safe option right now. Sorry, but I didn't think too much about this in advance. Cloud input would be nice on future service expansion and general load balancing/failover needs.

Jun 17 2019, 10:27 AM · Patch-For-Review, SRE, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Jun 17 2019, 10:25 AM · SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

m5 at the moment doesn't use the proxies (I know it should but they are not being used at the moment) (T202367#5252689)

Jun 17 2019, 9:49 AM · Patch-For-Review, SRE, DBA
Marostegui claimed T225704: eqiad: rack/setup/install (4) dbproxy systems..

Assigning this to myself to let Chris know that this is still blocked on DBAs to decide.
So for now 2 of them will go to replace 1010 and 1011 for sure.

Jun 17 2019, 9:45 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225704: eqiad: rack/setup/install (4) dbproxy systems..

So, 2 of these should go to replace dbproxy1010 and dbproxy1011, right?
If so, we can rack 2 of them in the same racks as those (C5) and put them on that same VLAN to do a 1:1 replacement.
@jcrespo what do you think?

Jun 17 2019, 9:42 AM · Patch-For-Review, SRE, DBA
Marostegui added a comment to T225391: db1077 crashed.

Note that db1114 was a host we removed from production because it was unstable. I would vote for another. Did you try depooling and forcing a learning cycle?

Jun 17 2019, 7:25 AM · ops-eqiad, DBA, SRE
Marostegui added a comment to T225391: db1077 crashed.

Also, db1114 (test-s1) could take the place of db1077, and db1077 could move to be test-s1?

Jun 17 2019, 7:21 AM · ops-eqiad, DBA, SRE
Marostegui added a comment to T225391: db1077 crashed.

db1077 has had its BBU in charging status for around 30h now. I have taken a look at the HW logs and:

Jun 17 2019, 7:17 AM · ops-eqiad, DBA, SRE
Marostegui added a comment to T225884: db2084 temporary correctable hardware errors.

Host rebooted. No new logs on HW side.

Jun 17 2019, 6:20 AM · SRE, DBA
Marostegui added a comment to T225884: db2084 temporary correctable hardware errors.

Some more errors from yesterday evening:

[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: event severity: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:  fru_text: A1
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   section_type: memory error
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   physical_address: 0x0000003cb28d7e40
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 504
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: TSC 119d3ab72dafe6
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: ADDR 3cb28d7e40
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560720886 SOCKET 0 APIC 0
[Sun Jun 16 21:33:58 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28d7 offset:0xe40 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 21:35:44 2019] mce: [Hardware Error]: Machine check events logged
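
To see whether the corrected errors keep hitting the same module, the kernel log can be summarised per DIMM. This is a minimal sketch in Python, assuming the `EDAC MC<n>: <count> CE ... on <label> (` line format seen above; the function name `corrected_errors_by_dimm` is illustrative, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches the EDAC summary line for a corrected error (CE) and captures the DIMM label.
CE_LINE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (?P<dimm>\S+) \(")


def corrected_errors_by_dimm(log_text):
    """Count corrected-error (CE) summary lines per DIMM label in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            counts[match.group("dimm")] += 1
    return dict(counts)
```

The text passed in could be saved `dmesg -T` output; lines from the `EDAC sbridge` driver are ignored because they do not match the pattern.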
Jun 17 2019, 6:04 AM · SRE, DBA
Marostegui created P8617 (An Untitled Masterwork).
Jun 17 2019, 5:17 AM