Marostegui (Manuel Aróstegui)
User

User Details

User Since
Sep 1 2016, 6:48 AM (89 w, 3 d)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF)

TZ: UTC +1/+2

Recent Activity

Today

Marostegui triaged T195193: Schema change for ct_tag_id field to change_tag as Normal priority.
Sun, May 20, 12:03 PM · Blocked-on-schema-change, Wikidata-Ministry-Of-Magic, MediaWiki-Database, MediaWiki-Change-tagging

Fri, May 18

Marostegui closed T194955: Degraded RAID on db1066 as Resolved.
Fri, May 18, 6:57 PM · DBA, ops-eqiad, Operations
Marostegui added a comment to T194955: Degraded RAID on db1066.

This is all good now

root@db1066:~# megacli -LDPDInfo -aAll
Fri, May 18, 6:56 PM · DBA, ops-eqiad, Operations
Marostegui added a comment to T194955: Degraded RAID on db1066.

Thanks Chris

root@db1066:~# megacli -PDRbld -ShowProg -PhysDrv [32:6] -aALL
Fri, May 18, 5:13 PM · DBA, ops-eqiad, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

For the record, after the reboot:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature
Fri, May 18, 3:52 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a parent task for T194781: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion): T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy.
Fri, May 18, 3:50 PM · Patch-For-Review, ops-codfw, DBA, Operations
Marostegui added a subtask for T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy: T194781: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion).
Fri, May 18, 3:49 PM · Patch-For-Review, Operations, Goal, DBA
Marostegui added a parent task for T194780: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion): T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy.
Fri, May 18, 3:49 PM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui added a subtask for T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy: T194780: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion).
Fri, May 18, 3:49 PM · Patch-For-Review, Operations, Goal, DBA
Marostegui updated the task description for T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy.
Fri, May 18, 3:49 PM · Patch-For-Review, Operations, Goal, DBA
Marostegui updated the task description for T194870: Failover s2 primary master.
Fri, May 18, 3:45 PM · DBA
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

Still looking good after 10 hours:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK
Fri, May 18, 3:44 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui moved T194955: Degraded RAID on db1066 from Triage to In progress on the DBA board.
Fri, May 18, 2:29 PM · DBA, ops-eqiad, Operations
Marostegui assigned T194955: Degraded RAID on db1066 to Cmjohnson.

Already talked to @Cmjohnson - he will replace it today.
I manually failed it.

Fri, May 18, 2:29 PM · DBA, ops-eqiad, Operations
Marostegui added a comment to T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy.

We are all set for doing the copies to the new hardware once it arrives.

Fri, May 18, 1:50 PM · Patch-For-Review, Operations, Goal, DBA
Marostegui closed T194103: Degraded RAID on db2067 as Resolved.
Fri, May 18, 5:43 AM · DBA, Operations, ops-codfw
Marostegui added a comment to T194634: Decommission db1053.

Let's make sure we label this disk, somehow, as broken when we decommission this host - so it is not reused in the future to replace other disks:

Enclosure Device ID: 32
			Slot Number: 10
Fri, May 18, 5:42 AM · Patch-For-Review, DBA
Marostegui created P7137 (An Untitled Masterwork).
Fri, May 18, 5:37 AM
Marostegui updated the task description for T188299: Schema change for refactored actor storage.
Fri, May 18, 5:31 AM · Patch-For-Review, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Data-Services, Blocked-on-schema-change, DBA
Marostegui updated the task description for T191519: Schema change for rc_namespace_title_timestamp index.
Fri, May 18, 5:31 AM · Patch-For-Review, Blocked-on-schema-change, DBA, Wikidata-Ministry-Of-Magic, User-Ladsgroup
Marostegui updated the task description for T190148: Change DEFAULT 0 for rev_text_id on production DBs.
Fri, May 18, 5:31 AM · Patch-For-Review, User-Addshore, Multi-Content-Revisions, Blocked-on-schema-change, DBA
Marostegui added a comment to T190148: Change DEFAULT 0 for rev_text_id on production DBs.

s3 eqiad progress

Fri, May 18, 5:30 AM · Patch-For-Review, User-Addshore, Multi-Content-Revisions, Blocked-on-schema-change, DBA
Marostegui added a comment to T191519: Schema change for rc_namespace_title_timestamp index.

s3 eqiad progress

Fri, May 18, 5:30 AM · Patch-For-Review, Blocked-on-schema-change, DBA, Wikidata-Ministry-Of-Magic, User-Ladsgroup
Marostegui added a comment to T188299: Schema change for refactored actor storage.

s3 eqiad progress

Fri, May 18, 5:30 AM · Patch-For-Review, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Data-Services, Blocked-on-schema-change, DBA
Marostegui updated the task description for T188299: Schema change for refactored actor storage.
Fri, May 18, 5:28 AM · Patch-For-Review, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Data-Services, Blocked-on-schema-change, DBA
Marostegui updated the task description for T191519: Schema change for rc_namespace_title_timestamp index.
Fri, May 18, 5:28 AM · Patch-For-Review, Blocked-on-schema-change, DBA, Wikidata-Ministry-Of-Magic, User-Ladsgroup
Marostegui updated the task description for T190148: Change DEFAULT 0 for rev_text_id on production DBs.
Fri, May 18, 5:28 AM · Patch-For-Review, User-Addshore, Multi-Content-Revisions, Blocked-on-schema-change, DBA
Marostegui closed T193847: Move db1066 to row A as Resolved.

Server repooled
Thanks Chris for getting this done!

Fri, May 18, 5:27 AM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui closed T193847: Move db1066 to row A, a subtask of T194870: Failover s2 primary master, as Resolved.
Fri, May 18, 5:27 AM · DBA
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

After reboot:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature                             : OK
Fri, May 18, 5:26 AM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

Looks like it was a one-time thing:

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 47 C
  Temperature                             : OK
Fri, May 18, 5:18 AM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194103: Degraded RAID on db2067.

This time it worked

logicaldrive 1 (3.3 TB, RAID 1+0, OK)
Fri, May 18, 5:17 AM · DBA, Operations, ops-codfw
Marostegui reassigned T194341: SELECT query on page table appears to also reference revision table from Marostegui to Bstorm.
Fri, May 18, 5:04 AM · Data-Services

Thu, May 17

Marostegui closed T194885: Degraded RAID on db1064 as Resolved.

This is now fixed. I am going to fail the other disk, and a new task will be created.

Thu, May 17, 3:28 PM · DBA, ops-eqiad, Operations
Marostegui updated the task description for T194870: Failover s2 primary master.
Thu, May 17, 3:09 PM · DBA
Marostegui added a comment to T194103: Degraded RAID on db2067.

Cross your fingers!

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding)
Thu, May 17, 2:55 PM · DBA, Operations, ops-codfw
Marostegui merged T194886: Degraded RAID on db2067 into T194103: Degraded RAID on db2067.
Thu, May 17, 2:44 PM · DBA, Operations, ops-codfw
Marostegui merged task T194886: Degraded RAID on db2067 into T194103: Degraded RAID on db2067.
Thu, May 17, 2:44 PM · Operations, ops-codfw
Marostegui reassigned T194103: Degraded RAID on db2067 from Marostegui to Papaul.

That disk has failed :(

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)
Thu, May 17, 2:42 PM · DBA, Operations, ops-codfw
Marostegui triaged T194885: Degraded RAID on db1064 as Normal priority.

Disk #2 was manually failed to get it replaced.
It has been swapped and it is rebuilding:

Thu, May 17, 2:40 PM · DBA, ops-eqiad, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

This is still working fine. Maybe it was a one-time thing?

root@db1067:~#  megacli -AdpBbuCmd -a0  | grep Temper
Temperature: 48 C
  Temperature                             : OK
Thu, May 17, 2:29 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui updated the task description for T188299: Schema change for refactored actor storage.
Thu, May 17, 2:27 PM · Patch-For-Review, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Data-Services, Blocked-on-schema-change, DBA
Marostegui updated the task description for T191519: Schema change for rc_namespace_title_timestamp index.
Thu, May 17, 2:27 PM · Patch-For-Review, Blocked-on-schema-change, DBA, Wikidata-Ministry-Of-Magic, User-Ladsgroup
Marostegui updated the task description for T190148: Change DEFAULT 0 for rev_text_id on production DBs.
Thu, May 17, 2:27 PM · Patch-For-Review, User-Addshore, Multi-Content-Revisions, Blocked-on-schema-change, DBA
Marostegui added a comment to T193847: Move db1066 to row A.

This has been moved.
So far no BBU issues or anything related.
I am waiting for MySQL to catch up and the DNS to fully propagate before repooling this host.

Thu, May 17, 2:08 PM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

I have set the default policy back to WriteBack, with WriteThru if the BBU is not present or broken. So the host is as it was before all the issues.

Thu, May 17, 12:55 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

After 10 minutes the temperature reached 45C

Thu, May 17, 12:52 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

For the record, after having the server powered off for 1 hour:

Thu, May 17, 12:41 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a subtask for T194870: Failover s2 primary master: T194867: BBU issues on db1054 (s2 primary master).
Thu, May 17, 9:00 AM · DBA
Marostegui added a parent task for T194867: BBU issues on db1054 (s2 primary master): T194870: Failover s2 primary master.
Thu, May 17, 9:00 AM · DBA
Marostegui added a parent task for T193847: Move db1066 to row A: T194870: Failover s2 primary master.
Thu, May 17, 9:00 AM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui added a subtask for T194870: Failover s2 primary master: T193847: Move db1066 to row A.
Thu, May 17, 9:00 AM · DBA
Marostegui updated the task description for T194870: Failover s2 primary master.
Thu, May 17, 8:48 AM · DBA
Marostegui raised the priority of T193847: Move db1066 to row A from Normal to High.

Given that db1054 (the primary master) is having BBU issues, we should move this host to the new rack ASAP and prepare for a failover.
@Cmjohnson can we do this move today?

Thu, May 17, 8:48 AM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui moved T194870: Failover s2 primary master from Triage to Next on the DBA board.
Thu, May 17, 8:47 AM · DBA
Marostegui triaged T194870: Failover s2 primary master as High priority.
Thu, May 17, 8:47 AM · DBA
Marostegui added a parent task for T194867: BBU issues on db1054 (s2 primary master): T186320: Decommission db1051-db1060 (DBA tracking).
Thu, May 17, 8:29 AM · DBA
Marostegui added a subtask for T186320: Decommission db1051-db1060 (DBA tracking): T194867: BBU issues on db1054 (s2 primary master).
Thu, May 17, 8:29 AM · Patch-For-Review, DBA
Marostegui moved T194867: BBU issues on db1054 (s2 primary master) from Triage to In progress on the DBA board.
Thu, May 17, 6:16 AM · DBA
Marostegui added a comment to T194867: BBU issues on db1054 (s2 primary master).

After forcing a Re-learn cycle:

˜/icinga-wm 7:54> RECOVERY - MegaRAID on db1054 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
Thu, May 17, 5:55 AM · DBA
Marostegui updated the task description for T192979: Productionize 8 eqiad hosts.
Thu, May 17, 5:54 AM · DBA
Marostegui added a comment to T192979: Productionize 8 eqiad hosts.

This has been disabled everywhere:

db1116
  Auto-Learn Mode: Disabled
Thu, May 17, 5:54 AM · DBA
Marostegui added a comment to T194867: BBU issues on db1054 (s2 primary master).

This is the battery status

BBU status for Adapter: 0
Thu, May 17, 5:19 AM · DBA
Marostegui created T194867: BBU issues on db1054 (s2 primary master).
Thu, May 17, 5:18 AM · DBA

Wed, May 16

Marostegui added a comment to T194852: Possibly BBU issues on db1067.

These are the logs from the BBU after the reboot for the rack change

Wed, May 16, 8:20 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

I have manually set the policy to WriteBack so at least the server can catch up and not lag forever:

root@db1067:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll
Wed, May 16, 8:14 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T194852: Possibly BBU issues on db1067.

The BBU is definitely having some issues; I cannot even force a relearn:

root@db1067:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog
Wed, May 16, 8:11 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui triaged T194852: Possibly BBU issues on db1067 as High priority.
Wed, May 16, 8:08 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui created T194852: Possibly BBU issues on db1067.
Wed, May 16, 8:08 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui closed T193835: Move db1067 to row C as Resolved.

As discussed with @Cmjohnson, I am closing this task and will create a new one for the BBU issues, as it will be easier to find in the future with a specific task.

Wed, May 16, 8:04 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T193835: Move db1067 to row C.

The temperature of the BBU is very high compared to other hosts, so I think we should probably replace it with another one. As this is the candidate master for s1, it is better to be on the safe side and replace the BBU now, while it is not a master yet.

Wed, May 16, 7:57 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T193835: Move db1067 to row C.

I am investigating why it has the cache policy set to WriteThru.

Wed, May 16, 7:56 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T193835: Move db1067 to row C.

This has been successfully moved.
MySQL is back up. I am waiting for the DNS to fully propagate before repooling and closing this task.

Wed, May 16, 4:39 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui created P7135 (An Untitled Masterwork).
Wed, May 16, 3:31 PM
Marostegui moved T193835: Move db1067 to row C from Next to In progress on the DBA board.
Wed, May 16, 10:32 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui moved T194273: Clean up indexes of wb_terms table from Backlog to Next on the DBA board.
Wed, May 16, 10:27 AM · MW-1.32-release-notes (WMF-deploy-2018-05-22 (1.32.0-wmf.5)), Patch-For-Review, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata
Marostegui moved T194270: Drop 'tmp1' index from wb_terms table in production from Backlog to Next on the DBA board.
Wed, May 16, 10:27 AM · DBA, MediaWiki-extensions-WikibaseRepository, Wikidata
Marostegui updated the task description for T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking).
Wed, May 16, 10:27 AM · Epic, DBA, Tracking
Marostegui closed T194663: Drop unused tables: msg_resource msg_resource_links as Resolved.

Dropped everywhere

Wed, May 16, 10:26 AM · DBA
Marostegui closed T194663: Drop unused tables: msg_resource msg_resource_links, a subtask of T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking), as Resolved.
Wed, May 16, 10:25 AM · Epic, DBA, Tracking
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 10:25 AM · DBA
Marostegui moved T194634: Decommission db1053 from Triage to In progress on the DBA board.
Wed, May 16, 10:16 AM · Patch-For-Review, DBA
Marostegui moved T194780: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) from Triage to Next on the DBA board.
Wed, May 16, 10:16 AM · Patch-For-Review, ops-eqiad, Operations, DBA
Marostegui moved T194781: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) from Triage to Next on the DBA board.
Wed, May 16, 10:16 AM · Patch-For-Review, ops-codfw, DBA, Operations
Marostegui updated the task description for T190148: Change DEFAULT 0 for rev_text_id on production DBs.
Wed, May 16, 9:42 AM · Patch-For-Review, User-Addshore, Multi-Content-Revisions, Blocked-on-schema-change, DBA
Marostegui updated the task description for T191519: Schema change for rc_namespace_title_timestamp index.
Wed, May 16, 9:42 AM · Patch-For-Review, Blocked-on-schema-change, DBA, Wikidata-Ministry-Of-Magic, User-Ladsgroup
Marostegui updated the task description for T188299: Schema change for refactored actor storage.
Wed, May 16, 9:42 AM · Patch-For-Review, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Data-Services, Blocked-on-schema-change, DBA
Marostegui created P7133 (An Untitled Masterwork).
Wed, May 16, 8:21 AM
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 7:26 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 7:21 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:34 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:20 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:17 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:16 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:11 AM · DBA
Marostegui updated the task description for T194663: Drop unused tables: msg_resource msg_resource_links.
Wed, May 16, 6:10 AM · DBA
Marostegui added a comment to T190425: GlobalPreferences deploy caused a significant increase in reads on s3.

It is looking a lot better now. I would suggest waiting 24h, then deploying to some more wikis to see how it goes.

Wed, May 16, 6:01 AM · Community-Tech-Sprint, MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Patch-For-Review, MediaWiki-extensions-GlobalPreferences
Marostegui updated the task description for T192979: Productionize 8 eqiad hosts.
Wed, May 16, 5:49 AM · DBA
Marostegui added a comment to T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy.

eqiad is now ready with all the data on multi-instance, so as soon as the final HW arrives we can just stop the hosts and clone them.

Wed, May 16, 5:49 AM · Patch-For-Review, Operations, Goal, DBA
Marostegui reassigned T194103: Degraded RAID on db2067 from jcrespo to Papaul.

It is indeed on predictive failure:

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
Wed, May 16, 5:21 AM · DBA, Operations, ops-codfw